Abstract

Dynamic changes in biological systems can be captured by measuring molecular expression at different levels (e.g., genes and proteins) across time. Integration of such data aims to identify molecules that show similar expression changes over time; such molecules may be co-regulated and thus involved in similar biological processes. Combining data sources presents a systematic approach to study molecular behaviour. It can compensate for missing data in one source, and can reduce false positives when multiple sources highlight the same pathways. However, integrative approaches must accommodate the challenges inherent in ‘omics’ data, including high dimensionality, noise, and timing differences in expression. As current methods for identification of co-expression cannot cope with this level of complexity, we developed a novel algorithm called DynOmics. DynOmics is based on the fast Fourier transform, from which the difference in expression initiation between trajectories can be estimated. This delay can then be used to realign the trajectories and identify those which show a high degree of correlation. Through extensive simulations, we demonstrate that DynOmics is efficient and accurate compared to existing approaches. We consider two case studies highlighting its application: identifying regulatory relationships across ‘omics’ data within an organism, and comparative gene expression analysis across organisms.

High-throughput ‘omics’ platforms such as transcriptomics, proteomics, and metabolomics enable the simultaneous monitoring of thousands of biological molecules (transcripts, proteins, and metabolites), typically through a single static experiment^1. The recent decrease in cost of such technological platforms has made it possible to study dynamic biological processes by instead quantifying molecules at several time points.
This allows deeper insight into the behaviour of the molecules in situations ranging from developmental processes to drug response. These time course ‘omics’ experiments enable the identification of regulators, and may give a better understanding of the structure and dynamics of biological systems. The statistical analysis of dynamic ‘omics’ experiments is difficult. Applying traditional statistical methods designed for static experiments is limiting, since each time point is treated as independent, ignoring potentially important correlations between sampling times. Indeed, realising the potential power offered by time course studies to investigate a wide variety of changes is nontrivial. Analytical challenges are further complicated by noise, small sample sizes per time point, and few sampled time points. In the past decade, several methods have been proposed to analyse time course ‘omics’ data, with a particular focus on microarray and RNA-Seq data. These methods perform differential expression analysis using spline fitting^2,3, Bayesian methods^4,5,6,7, Gaussian processes^8,9,10, and a two-step regression approach (maSigPro^11). Other methods focus on clustering expression profiles to identify co-expressed trajectories, i.e., a subset of molecules for which expression changes occur simultaneously across time^3,7,12,13,14,15,16. Targeted co-expression analysis can also be performed using various model-based applications to retrieve data sets from databases given specific query data^17,18,19. Finally, a third category of methods was proposed based on biological pathway analysis^20,21; see the detailed review of Spies et al.^22.
Co-expression analysis can provide valuable insight into the role of molecules during biological processes^23,24,25, but faces significant challenges in dealing with different types of ‘omics’ data and their variation in molecular response times. These timing differences, or delays, in the initiation or suppression of molecule expression are a common phenomenon in biology and occur across both different molecular levels and organisms. For example, the study of regulatory processes after environmental changes has revealed that there is often a measurable delay from the time of signal introduction to molecular response^23,24,26,27,28. This can result from differences in the reaction kinetics between an enzyme and its substrate, the presence of an inhibitor, or altered binding affinities of transcription factors. Such processes can be studied through time course miRNA and mRNA data, since miRNAs play an important role in gene translation regulation in many organisms, through either mRNA translation inhibition or mRNA degradation^29. This ability of miRNAs to fine-tune gene expression and translation in a broad range of important biological processes is of broad interest in medicine^30,31,32. While correlation analysis is frequently used to analyse miRNA-mRNA time course data^33,34, it may have limited power in situations where delayed dynamic expression changes of miRNAs relative to mRNAs have been observed^30,35. Delays can also hinder gene expression comparisons across organisms, since even highly conserved processes may vary in timing. Pre-implantation embryonic development (PED) is a highly conserved process across mammals, reflected in the progression through the same morphologic stages^36. Nevertheless, attempts to compare PED in mammals based on gene expression data have faced challenges due to differences in timing of genome activation and regulatory processes^37.
Hence, ignoring these delays in co-expression analysis can mask true associations; the first step should instead be to detect and quantify the time delay between molecules. This enables identification of functionally related molecules regardless of differences in the timing of expression changes, as well as more detailed quantification of similarities and differences between the observed responses. To date, very few methods for time course ‘omics’ data account for time delay between molecule expression levels. Aijo et al.^10 recently proposed DynB, a set of methods based on Gaussian processes, to quantify RNA-Seq gene expression dynamics. This allows rescaling of time profiles, but only between replicates (i.e., at the sample level) rather than at the molecule expression level. The most commonly used approach for molecules is the Pearson correlation^34,38, despite its obvious limitations for detecting co-expressed molecules when their expression changes occur at different time points. Lagged Pearson correlation, also known as Pearson cross-correlation for lagged time series, circumvents this limitation by introducing artificial delays or lags in the time expression profiles for every possible time shift. The method then applies the delay that maximises the correlation with the original profile, but can be prone to overestimation of the delay. More sophisticated approaches for time course ‘omics’ data come at the expense of computational cost. Shi et al.^39 proposed a probabilistic model based on tabular combinations of multiple datasets to identify pairwise transcription factor and gene (TF-G) relationships under different experimental conditions. This approach has been shown to reduce false positive predictions but requires a time-consuming learning step on existing and known TF-G pair data^39.
Dynamic Time Warping (DTW)^40,41,42 is an algorithm that aligns the time points of two trajectories to minimise the distance between them. It can therefore identify similarities between trajectories which may vary in phase and speed. One variation, DTW4Omics^24, identifies co-expressed molecules with a permutation test, but this can be computationally expensive. An alternative approach^25 utilises a combined statistic based on Hidden Markov Models (HMM) and Pearson correlation. HMMs are trained on a set of trajectories where a distribution of values is considered for each time point. This yields the probability of observing a trajectory under the trained model, which can tolerate small delays. While promising, this approach cannot detect large delays. Additionally, both it and DTW4Omics can only identify positively correlated trajectories, requiring heavier computational costs to exhaustively explore potential associations. While integrating time course experiments from different ‘omics’ functional levels is the key to identifying dynamic molecular interactions, its challenging nature has thus far prevented much methodological development. Difficulties lie not only in the computation required by complex algorithms, but also in variation in types of correlation, levels of noise in the expression profiles, and the delays themselves. We present DynOmics, a novel algorithm to detect, estimate, and account for delays between ‘omics’ time expression profiles. The algorithm is based on the fast Fourier transform (FFT)^43, which has already been shown to successfully detect periodically expressed genes in transcriptomics experiments^44,45,46. By combining the FFT angular difference between reference and query trajectories with lagged Pearson correlation, we are able to characterise the direction and magnitude of delay, whether the reference and query are positively or negatively correlated.
After accounting for the estimated delays, similar profiles can be clustered for further insight. Simulation results show that DynOmics outperforms current methods in detecting time shifts, both in terms of sensitivity and specificity. We apply it to two biological case studies: one focusing on the integration of miRNAs and mRNAs in mouse lung development, and one on the conservation of gene molecular processes across multiple organisms (mouse, bovine, human) during PED. In both cases, DynOmics is able to unravel timing differences between ‘omics’ functional levels, demonstrating its wide applicability. DynOmics is implemented in the open source programming language R and is freely available via CRAN^47. Our repository and user manual are also available at https://bitbucket.org/Jasmin87/dynomics.

Material and Methods

The expression changes of molecules monitored in time course experiments often form simple transient, sustained, or cyclic patterns that can be modelled as mixtures of oscillating/cyclic patterns using the discrete Fourier transform (FT)^48. We introduce DynOmics, a novel method that first converts trajectories to the frequency domain using the FFT, from which it extracts the frequency of the main cyclic pattern. Condensing the trajectory to information at the main frequency is then used to identify whether two trajectories are related or associated, while ignoring the noise in each time expression profile.

Fourier Transform

For a given time series x = (x[1], …, x[t], …, x[T]), measured at time points t = 1, …, T, the FT decomposes x into circular components or cyclic patterns for each frequency k = 1, …, T − 1 as:

X_k = \sum_{t=1}^{T} x_t \, e^{-2\pi i k (t-1)/T}    (1)

As the amplitude at frequency k = 0 simply describes the y-axis offset (i.e., the global differences of expression levels), this frequency is not included in our analysis.
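As a minimal, self-contained illustration of Equation (1) (a sketch in Python, not the DynOmics R implementation), the transform can be computed directly from the definition. Note that the sequence is 0-indexed in code, so position t corresponds to time point t + 1 and the exponent uses kt/T in place of k(t − 1)/T:

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform of a time series, following Equation (1).

    Returns the complex coefficients X_k for k = 0, ..., T-1.
    """
    T = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / T) for t in range(T))
            for k in range(T)]

# A noise-free trajectory completing one full cycle over T = 8 time points:
# the whole signal sits in the k = 1 component (mirrored at k = 7 for a
# real-valued series), while k = 0, which only encodes the overall offset,
# is zero here and is excluded from the analysis in any case.
x = [math.cos(2 * math.pi * t / 8) for t in range(8)]
amplitudes = [abs(Xk) for Xk in dft(x)]
```

In practice the same decomposition is obtained far more efficiently with an FFT routine (as DynOmics does via R's fft()); the direct sum is shown only because it matches Equation (1) term by term.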
Equation (1) can be written in terms of its real part a and imaginary part b as X[k] = a[k] + b[k]i. For each frequency k = 1, …, T − 1 we can calculate the amplitude r[k] of the component as

r_k = \sqrt{a_k^2 + b_k^2}

The amplitude reflects the contribution of the k^th cyclic pattern to the overall trajectory, and the pattern with maximum amplitude r[K] = max_k r[k] describes the main shape of the time series. The argument Arg(X[k]) is the offset of the cyclic pattern, defined as:

\mathrm{Arg}(X_k) = \mathrm{atan2}(b_k, a_k)

We can transform the argument to the phase angle (delay) ϕ[k] in degrees by:

\phi_k = \mathrm{Arg}(X_k) \cdot 180/\pi

Together, the amplitude and phase angle describe each frequency component, and the set of these quantities is known as the frequency domain representation.

DynOmics

We describe DynOmics, a novel method to estimate delays between a reference x and a query y given in frequency domain representation. First, we identify K as the frequency of the pattern with maximum amplitude for x, i.e., the main reference pattern frequency. Then, for both x and y, we extract the phase angles at this frequency, ϕ^x = ϕ[xK] and ϕ^y = ϕ[yK], and define Δ[xy] = ϕ^x − ϕ^y as the difference between the phase angles. In the FFT literature, Δ[xy] is often expressed in the range [−180, 180]. To simplify representation in DynOmics, when Δ[xy] < 0 we add 360 so that Δ[xy] lies in the range 0 to 359. Δ[xy] indicates both the sign of the correlation between x and y and the sign of the delay, as seen in Fig. 1. The trajectories x and y can be either positively (Fig. 1a,b,f) or negatively correlated (Fig. 1c,d,e), with a delay that we refer to as negative, i.e., the reference x is prior to the query y (Fig. 1b,e), or positive, i.e., the reference x is delayed with respect to the query y (Fig. 1c,f). Specific angular difference cases include when Δ[xy] = 0 (positive correlation, but no delay, Fig.
1a) and when Δ[xy] = 180 (negative correlation, no delay, Fig. 1d).

Figure 1. Relationship between angular differences, correlation and delay for a reference trajectory x (red dots) and a query trajectory y (green line). The trajectories are (a) positively correlated with no delay (Δ[xy] = 0); (b) positively correlated with negative delay (0 < Δ[xy] ≤ 90); (c) negatively correlated with positive delay (90 < Δ[xy] < 180); (d) negatively correlated with no delay (Δ[xy] = 180); (e) negatively correlated with negative delay (180 < Δ[xy] < 270); (f) positively correlated with positive delay (270 ≤ Δ[xy] < 360).

We can estimate the delay between two trajectories based on the FT frequency K, the length of the time series T, and Δ[xy] as

\delta_{xy} = \frac{\Delta_{xy}}{360} \cdot \frac{T}{K}

where δ[xy] ranges from 0 to T/K. In order to keep delay estimates, and hence signal shifts, as small as possible, we collapse these values to the range [−T/(2K), T/(2K)] by setting δ[xy] = δ[xy] − T/K when 270 ≤ Δ[xy] ≤ 359, and δ[xy] = δ[xy] − T/(2K) when 90 ≤ Δ[xy] < 270. We note that for query profiles either positively or negatively correlated with the reference, this means that δ[xy] < 0 represents positive delay, and δ[xy] > 0 represents negative delay.

Using lagged Pearson correlation to increase accuracy in delay estimation

The delay estimate based on the angular difference presented above approximates both the reference and the query by the pattern at the main frequency of the reference. This approximation works well when the signals from both query and reference are dominated by that main pattern and are relatively ‘noise-free’. However, when multiple frequencies contribute substantially to the overall signal, we can improve the estimate by maximising the lagged Pearson correlation coefficient over small perturbations of the delay. Specifically, let δ[0] denote our initial delay estimate, rounded to the closest integer.
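The pipeline from phase angles at the reference's main frequency K to the collapsed delay estimate can be sketched as follows. This is our own Python reconstruction for illustration (function names and the treatment of the boundary Δ = 270 are our choices), not the DynOmics R code:

```python
import cmath
import math

def spectrum(x):
    """DFT coefficients X_k, k = 0, ..., T-1, following Equation (1)."""
    T = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / T) for t in range(T))
            for k in range(T)]

def main_frequency(x):
    """Frequency K with maximum amplitude r_K, over k = 1, ..., T//2
    (the upper half of the spectrum mirrors the lower half for real series)."""
    X = spectrum(x)
    return max(range(1, len(x) // 2 + 1), key=lambda k: abs(X[k]))

def phase_deg(x, K):
    """Phase angle phi_K of trajectory x at frequency K, in degrees."""
    return math.degrees(cmath.phase(spectrum(x)[K]))

def estimate_delay(x, y):
    """Initial delay estimate delta_0 between reference x and query y."""
    T, K = len(x), main_frequency(x)
    d = phase_deg(x, K) - phase_deg(y, K)
    if d < 0:
        d += 360.0                 # map Delta_xy into [0, 360)
    delay = d / 360.0 * T / K      # ranges from 0 to T/K
    if d >= 270:                   # collapse to [-T/(2K), T/(2K)]:
        delay -= T / K             # positively correlated, positive delay
    elif d >= 90:                  # negatively correlated cases
        delay -= T / (2 * K)
    return delay

# The query y is the reference x shifted later by one time point (T = 8,
# K = 1), so the estimate is +1: a "negative delay" in the convention above,
# i.e., the reference precedes the query.
x = [math.cos(2 * math.pi * t / 8) for t in range(8)]
y = [math.cos(2 * math.pi * (t - 1) / 8) for t in range(8)]
```

A sign-reversed query (y = −x) gives Δ[xy] = 180 and an estimated delay of zero, matching the "negatively correlated, no delay" case of Fig. 1d.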
Let \mathcal{L} be a set of lags {l} representing perturbations of this initial delay estimate. For each lag, we construct trajectories x[l] and y[l] by shifting the original trajectories so that if l < 0, x[l] = (x[1+|l|], …, x[T]) and y[l] = (y[1], …, y[T−|l|]); if l > 0, conversely, x[l] = (x[1], …, x[T−|l|]) and y[l] = (y[1+|l|], …, y[T]). The (lagged) Pearson correlation coefficient between the two trajectories x[l] and y[l] is defined as:

\rho_l = \frac{\sum_{t}(x_{l,t} - \bar{x}_l)(y_{l,t} - \bar{y}_l)}{\sqrt{\sum_{t}(x_{l,t} - \bar{x}_l)^2}\,\sqrt{\sum_{t}(y_{l,t} - \bar{y}_l)^2}}

where \bar{x}_l (\bar{y}_l) is the sample mean across time points for each trajectory. We determine the optimal delay for a given set of lags as that for which the absolute Pearson correlation coefficient is maximised:

\delta^* = \mathrm{argmax}_{l \in \mathcal{L}} \, |\rho_l|

with \rho_{\delta^*} then used to assess the strength and direction of the association. We consider two sets of lags which represent perturbations of δ[0]:

\mathcal{L}_1 = \{\delta_0, -\delta_0\}, \qquad \mathcal{L}_2 = \{\delta_1 - 1, \delta_1, \delta_1 + 1\}

where δ[1] is the result of the optimisation over \mathcal{L}_1. These two optimisations thus allow us to compare the initial estimate with that in the opposite direction, and then with delays in a local neighbourhood. While the optimisations do increase the computation required, our restriction to local perturbations minimises the additional computation while improving the estimate in the presence of noise.

Sensitivity and specificity of DynOmics compared to other methods

We compared the performance of DynOmics with currently available methods (DTW4Omics^24, Pearson and lagged Pearson correlation) using measures of sensitivity and specificity when identifying associations in simulated data. The thresholds were a correlation >0.9 and p-values < 0.05 after FDR correction for multiple testing. The simulated data were generated based on scenarios similar to^25, with different parameters.
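The shift-and-correlate refinement described in the previous section can be sketched as follows (a self-contained illustration with our own helper names, using the shifting convention defined there):

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = (math.sqrt(sum((ai - ma) ** 2 for ai in a))
           * math.sqrt(sum((bi - mb) ** 2 for bi in b)))
    return num / den

def lagged_pearson(x, y, l):
    """rho_l: correlation after shifting x and y against each other by lag l.

    l < 0 drops the head of x and the tail of y; l > 0 does the converse,
    following the construction of x[l] and y[l] in the text.
    """
    if l < 0:
        return pearson(x[-l:], y[:len(y) + l])
    if l > 0:
        return pearson(x[:len(x) - l], y[l:])
    return pearson(x, y)

def best_lag(x, y, lags):
    """delta*: the lag maximising |rho_l| over a candidate set of lags."""
    return max(lags, key=lambda l: abs(lagged_pearson(x, y, l)))

# The query y trails the reference x by one time point, so among candidate
# lags around an initial estimate of zero, l = 1 aligns the two trajectories
# perfectly.
x = [0, 1, 4, 9, 4, 1, 0, 1]
y = [1, 0, 1, 4, 9, 4, 1, 0]
```

Restricting the candidate set to a few perturbations of the initial FFT-based estimate, rather than scanning all possible lags, is what keeps this refinement cheap.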
We simulated different expression patterns with different delays (−2, −1, 0, 1, 2) as well as different noise levels. We also simulated different numbers of time points (7 and 14). Seven time points is characteristic of conventional ‘omics’ time course experiments. We observed that for the simulated scenario with 7 time points, DynOmics increased sensitivity compared to commonly used methods by at least 8%, while remaining highly specific (>0.9). With 14 time points, all methods performed similarly, including DynOmics. Pearson correlation, which does not take time delays into account, performed the worst in terms of sensitivity in all scenarios, demonstrating that ordinary correlation measures are not sufficient to detect associations when trajectories are delayed. A detailed description of the simulation study and its results is provided in the Supporting Material, Section A.

Implementation and computation time

DynOmics is implemented in R and uses the FFT implemented in the function fft() from the stats R package^47 for the decomposition of the time series. DynOmics utilises the R package parallel to perform calculations on CPUs in parallel where possible. DynOmics’ computation time was tested and compared to DTW4Omics on simulated datasets with seven time points and on the Lung Organogenesis study described below with 14 time points. On simulated data with one reference and 100 queries DynOmics required two seconds, while DTW4Omics required 30 seconds. On the Lung Organogenesis study the association of 50 references and 50