Abstract

Dynamic changes in biological systems can be captured by measuring molecular expression at different levels (e.g., genes and proteins) across time. Integration of such data aims to identify molecules that show similar expression changes over time; such molecules may be co-regulated and thus involved in similar biological processes. Combining data sources presents a systematic approach to study molecular behaviour. It can compensate for missing data in one source, and can reduce false positives when multiple sources highlight the same pathways. However, integrative approaches must accommodate the challenges inherent in ‘omics’ data, including high dimensionality, noise, and timing differences in expression. As current methods for identification of co-expression cannot cope with this level of complexity, we developed a novel algorithm called DynOmics. DynOmics is based on the fast Fourier transform, from which the difference in expression initiation between trajectories can be estimated. This delay can then be used to realign the trajectories and identify those which show a high degree of correlation. Through extensive simulations, we demonstrate that DynOmics is efficient and accurate compared to existing approaches. We consider two case studies highlighting its application: identifying regulatory relationships across ‘omics’ data within an organism, and comparative gene expression analysis across organisms.

High-throughput ‘omics’ platforms such as transcriptomics, proteomics, and metabolomics enable the simultaneous monitoring of thousands of biological molecules (transcripts, proteins, and metabolites), typically through a single static experiment^1. The recent decrease in cost of such technological platforms has made it possible to study dynamic biological processes by instead quantifying molecules at several time points.
This allows deeper insight into the behaviour of the molecules in situations ranging from developmental processes to drug response. These time course ‘omics’ experiments enable the identification of regulators, and may give a better understanding of the structure and dynamics of biological systems. The statistical analysis of dynamic ‘omics’ experiments is difficult. Applying traditional statistical methods designed for static experiments is limiting, since each time point is treated as independent, ignoring potentially important correlations between sampling times. Indeed, realising the potential power offered by time course studies to investigate a wide variety of changes is nontrivial. Analytical challenges are further complicated by noise, small sample sizes per time point, and few sampled time points. In the past decade, several methods have been proposed to analyse time course ‘omics’ data, with a particular focus on microarray and RNA-Seq data. These methods perform differential expression analysis using spline fitting^2,3, Bayesian methods^4,5,6,7, Gaussian processes^8,9,10, and a two-step regression approach (maSigPro^11). Other methods focus on clustering expression profiles to identify co-expressed trajectories, i.e., a subset of molecules for which expression changes occur simultaneously across time^3,7,12,13,14,15,16. Targeted co-expression analysis can also be performed using various model-based applications to retrieve data sets from databases given specific query data^17,18,19. Finally, a third category of methods was proposed based on biological pathway analysis^20,21; see the detailed review of Spies et al.^22.
Co-expression analysis can provide valuable insight into the role of molecules during biological processes^23,24,25, but faces significant challenges in dealing with different types of ‘omics’ data and their variation in molecular response times. These timing differences, or delays, in the initiation or suppression of molecule expression are a common phenomenon in biology and occur across both different molecular levels and organisms. For example, the study of regulatory processes after environmental changes has revealed that there is often a measurable delay from the time of signal introduction to molecular response^23,24,26,27,28. This can result from differences in the reaction kinetics between an enzyme and its substrate, the presence of an inhibitor, or altered binding affinities of transcription factors. Such processes can be studied through time course miRNA and mRNA data, since miRNAs play an important role in gene translation regulation in many organisms, through either mRNA translation inhibition or mRNA degradation^29. This ability of miRNAs to fine-tune gene expression and translation in a broad range of important biological processes is of broad interest in medicine^30,31,32. While correlation analysis is frequently used to analyse miRNA-mRNA time course data^33,34, it may have limited power in situations where delayed dynamic expression changes of miRNAs relative to mRNAs have been observed^30,35. Delays can also hinder gene expression comparisons across organisms, since even highly conserved processes may vary in timing. Pre-implantation embryonic development (PED) is a highly conserved process across mammals, reflected in the progression through the same morphologic stages^36. Nevertheless, attempts to compare PED in mammals based on gene expression data have faced challenges due to differences in timing of genome activation and regulatory processes^37.
Hence, ignoring these delays in co-expression analysis can mask true associations; the first step should instead be to detect and quantify the time delay between molecules. This enables identification of functionally related molecules regardless of differences in the timing of expression changes, as well as more detailed quantification of similarities and differences between the observed responses. To date, very few methods for time course ‘omics’ data account for time delay between molecule expression levels. Aijo et al.^10 recently proposed DynB, a set of methods based on Gaussian processes, to quantify RNA-Seq gene expression dynamics. This allows rescaling of time profiles, but only between replicates (i.e., at the sample level) rather than at the molecule expression level. The most commonly used approach for molecules is the Pearson correlation^34,38, despite its obvious limitations for detecting co-expressed molecules when their expression changes occur at different time points. Lagged Pearson correlation, also known as Pearson cross-correlation for lagged time series, circumvents this limitation by introducing artificial delays or lags in the time expression profiles for every possible time shift. The method then applies the delay that maximises the correlation with the original profile, but can be prone to overestimation of the delay. More sophisticated approaches for time course ‘omics’ data come at the expense of computational cost. Shi et al.^39 proposed a probabilistic model based on tabular combinations of multiple datasets to identify pairwise transcription factor and gene (TF-G) relationships under different experimental conditions. This approach has been shown to reduce false positive predictions but requires a time-consuming learning step on existing and known TF-G pair data^39.
Dynamic Time Warping (DTW)^40,41,42 is an algorithm that aligns the time points of two trajectories to minimise the distance between them. It can therefore identify similarities between trajectories which may vary in phase and speed. One variation, DTW4Omics^24, identifies co-expressed molecules with a permutation test, but this can be computationally expensive. An alternative approach^25 utilises a combined statistic based on Hidden Markov Models (HMM) and Pearson correlation. HMMs are trained on a set of trajectories where a distribution of values is considered for each time point. This yields the probability of observing a trajectory under the trained model, which can tolerate small delays. While promising, this approach cannot detect large delays. Additionally, both it and DTW4Omics can only identify positively correlated trajectories, requiring heavier computational costs to exhaustively explore potential associations. While integrating time course experiments from different ‘omics’ functional levels is the key to identifying dynamic molecular interactions, its challenging nature has thus far prevented much methodological development. Difficulties lie not only in the computation required by complex algorithms, but also in variation in types of correlation, levels of noise in the expression profiles, and the delays themselves. We present DynOmics, a novel algorithm to detect, estimate, and account for delays between ‘omics’ time expression profiles. The algorithm is based on the fast Fourier transform (FFT)^43, which has already been shown to successfully detect periodically expressed genes in transcriptomics experiments^44,45,46. By combining the FFT angular difference between reference and query trajectories with lagged Pearson correlation, we are able to characterise the direction and magnitude of delay, whether the reference and query are positively or negatively correlated.
After accounting for the estimated delays, similar profiles can be clustered for further insight. Simulation results show that DynOmics outperforms current methods in detecting time shifts, both in terms of sensitivity and specificity. We apply it to two biological case studies: one focusing on the integration of miRNAs and mRNAs in mouse lung development, and one on the conservation of gene molecular processes across multiple organisms (mouse, bovine, human) during PED. In both cases, DynOmics is able to unravel timing differences between ‘omics’ functional levels, demonstrating its wide applicability. DynOmics is implemented in the open source programming language R and is freely available via CRAN^47. Our repository and user manual are also available at https://bitbucket.org/Jasmin87/dynomics.

Material and Methods

The expression changes of molecules monitored in time course experiments often form simple transient, sustained, or cyclic patterns that can be modelled as mixtures of oscillating/cyclic patterns using the discrete Fourier transform (FT)^48. We introduce DynOmics, a novel method that first converts trajectories to the frequency domain using the FFT, from which it extracts the frequency of the main cyclic pattern. Condensing the trajectory to information at the main frequency is then used to identify whether two trajectories are related or associated, while ignoring the noise in each time expression profile.

Fourier Transform

For a given time series x = (x[1], …, x[t], …, x[T]), measured at time points t = 1, …, T, the FT decomposes x into circular components or cyclic patterns for each frequency k = 1, …, T − 1 as:

X_k = \sum_{t=1}^{T} x_t \, e^{-2\pi i k (t-1)/T}    (1)

As the amplitude at frequency k = 0 simply describes the y-axis offset (i.e., the global differences of expression levels), this frequency is not included in our analysis.
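As a minimal, self-contained illustration of Equation (1) (a sketch in Python, not the DynOmics R implementation), the transform can be computed directly from the definition. Note that the sequence is 0-indexed in code, so position t corresponds to time point t + 1 and the exponent uses kt/T in place of k(t − 1)/T:

```python
import cmath
import math

def dft(x):
    """Discrete Fourier transform of a time series, following Equation (1).

    Returns the complex coefficients X_k for k = 0, ..., T-1.
    """
    T = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / T) for t in range(T))
            for k in range(T)]

# A noise-free trajectory completing one full cycle over T = 8 time points:
# the whole signal sits in the k = 1 component (mirrored at k = 7 for a
# real-valued series), while k = 0, which only encodes the overall offset,
# is zero here and is excluded from the analysis in any case.
x = [math.cos(2 * math.pi * t / 8) for t in range(8)]
amplitudes = [abs(Xk) for Xk in dft(x)]
```

In practice the same decomposition is obtained far more efficiently with an FFT routine (as DynOmics does via R's fft()); the direct sum is shown only because it matches Equation (1) term by term.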
Equation (1) can be written in terms of its real part a and imaginary part b as X[k] = a[k] + b[k]i. For each frequency k = 1, …, T − 1 we can calculate the amplitude r[k] of the component as

r_k = \sqrt{a_k^2 + b_k^2}

The amplitude reflects the contribution of the k^th cyclic pattern to the overall trajectory, and the pattern with maximum amplitude r[K] = max_k r[k] describes the main shape of the time series. The argument Arg(X[k]) is the offset of the cyclic pattern, defined as:

\mathrm{Arg}(X_k) = \mathrm{atan2}(b_k, a_k)

We can transform the argument to the phase angle (delay) ϕ[k] in degrees by:

\phi_k = \mathrm{Arg}(X_k) \cdot 180/\pi

Together, the amplitude and phase angle describe each frequency component, and the set of these quantities is known as the frequency domain representation.

DynOmics

We describe DynOmics, a novel method to estimate delays between a reference x and a query y given in frequency domain representation. First, we identify K as the frequency of the pattern with maximum amplitude for x, i.e., the main reference pattern frequency. Then, for both x and y, we extract the phase angles at this frequency, ϕ^x = ϕ[xK] and ϕ^y = ϕ[yK], and define Δ[xy] = ϕ^x − ϕ^y as the difference between the phase angles. In the FFT literature, Δ[xy] is often expressed in the range [−180, 180]. To simplify representation in DynOmics, when Δ[xy] < 0 we add 360 so that Δ[xy] lies in the range 0 to 359. Δ[xy] indicates both the sign of the correlation between x and y and the sign of the delay, as seen in Fig. 1. The trajectories x and y can be either positively (Fig. 1a,b,f) or negatively correlated (Fig. 1c,d,e), with a delay that we refer to as negative, i.e., the reference x is prior to the query y (Fig. 1b,e), or positive, i.e., the reference x is delayed with respect to the query y (Fig. 1c,f). Specific angular difference cases include when Δ[xy] = 0 (positive correlation, but no delay, Fig.
1a) and when Δ[xy] = 180 (negative correlation, no delay, Fig. 1d).

Figure 1. Relationship between angular differences, correlation and delay for a reference trajectory x (red dots) and a query trajectory y (green line). The trajectories are (a) positively correlated with no delay (Δ[xy] = 0); (b) positively correlated with negative delay (0 < Δ[xy] ≤ 90); (c) negatively correlated with positive delay (90 < Δ[xy] < 180); (d) negatively correlated with no delay (Δ[xy] = 180); (e) negatively correlated with negative delay (180 < Δ[xy] < 270); (f) positively correlated with positive delay (270 ≤ Δ[xy] < 360).

We can estimate the delay between two trajectories based on the FT frequency K, the length of the time series T, and Δ[xy] as

\delta_{xy} = \frac{\Delta_{xy}}{360} \cdot \frac{T}{K}

where δ[xy] ranges from 0 to T/K. In order to keep delay estimates, and hence signal shifts, as small as possible, we collapse these values to the range [−T/(2K), T/(2K)] by setting δ[xy] = δ[xy] − T/K when 270 ≤ Δ[xy] ≤ 359, and δ[xy] = δ[xy] − T/(2K) when 90 ≤ Δ[xy] < 270. We note that for query profiles either positively or negatively correlated with the reference, this means that δ[xy] < 0 represents positive delay, and δ[xy] > 0 represents negative delay.

Using lagged Pearson correlation to increase accuracy in delay estimation

The delay estimate based on the angular difference presented above approximates both the reference and the query by the pattern at the main frequency of the reference. This approximation works well when the signals from both query and reference are dominated by that main pattern and are relatively ‘noise-free’. However, when multiple frequencies contribute substantially to the overall signal, we can improve the estimate by maximising the lagged Pearson correlation coefficient over small perturbations of the delay. Specifically, let δ[0] denote our initial delay estimate, rounded to the closest integer.
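The pipeline from phase angles at the reference's main frequency K to the collapsed delay estimate can be sketched as follows. This is our own Python reconstruction for illustration (function names and the treatment of the boundary Δ = 270 are our choices), not the DynOmics R code:

```python
import cmath
import math

def spectrum(x):
    """DFT coefficients X_k, k = 0, ..., T-1, following Equation (1)."""
    T = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / T) for t in range(T))
            for k in range(T)]

def main_frequency(x):
    """Frequency K with maximum amplitude r_K, over k = 1, ..., T//2
    (the upper half of the spectrum mirrors the lower half for real series)."""
    X = spectrum(x)
    return max(range(1, len(x) // 2 + 1), key=lambda k: abs(X[k]))

def phase_deg(x, K):
    """Phase angle phi_K of trajectory x at frequency K, in degrees."""
    return math.degrees(cmath.phase(spectrum(x)[K]))

def estimate_delay(x, y):
    """Initial delay estimate delta_0 between reference x and query y."""
    T, K = len(x), main_frequency(x)
    d = phase_deg(x, K) - phase_deg(y, K)
    if d < 0:
        d += 360.0                 # map Delta_xy into [0, 360)
    delay = d / 360.0 * T / K      # ranges from 0 to T/K
    if d >= 270:                   # collapse to [-T/(2K), T/(2K)]:
        delay -= T / K             # positively correlated, positive delay
    elif d >= 90:                  # negatively correlated cases
        delay -= T / (2 * K)
    return delay

# The query y is the reference x shifted later by one time point (T = 8,
# K = 1), so the estimate is +1: a "negative delay" in the convention above,
# i.e., the reference precedes the query.
x = [math.cos(2 * math.pi * t / 8) for t in range(8)]
y = [math.cos(2 * math.pi * (t - 1) / 8) for t in range(8)]
```

A sign-reversed query (y = −x) gives Δ[xy] = 180 and an estimated delay of zero, matching the "negatively correlated, no delay" case of Fig. 1d.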
Let \mathcal{L} be a set of lags {l} representing perturbations of this initial delay estimate. For each lag, we construct trajectories x[l] and y[l] by shifting the original trajectories so that if l < 0, x[l] = (x[1+|l|], …, x[T]) and y[l] = (y[1], …, y[T−|l|]); if l > 0, conversely, x[l] = (x[1], …, x[T−|l|]) and y[l] = (y[1+|l|], …, y[T]). The (lagged) Pearson correlation coefficient between the two trajectories x[l] and y[l] is defined as:

\rho_l = \frac{\sum_{t}(x_{l,t} - \bar{x}_l)(y_{l,t} - \bar{y}_l)}{\sqrt{\sum_{t}(x_{l,t} - \bar{x}_l)^2}\,\sqrt{\sum_{t}(y_{l,t} - \bar{y}_l)^2}}

where \bar{x}_l (\bar{y}_l) is the sample mean across time points for each trajectory. We determine the optimal delay for a given set of lags as that for which the absolute Pearson correlation coefficient is maximised:

\delta^* = \mathrm{argmax}_{l \in \mathcal{L}} \, |\rho_l|

with \rho_{\delta^*} then used to assess the strength and direction of the association. We consider two sets of lags which represent perturbations of δ[0]:

\mathcal{L}_1 = \{\delta_0, -\delta_0\}, \qquad \mathcal{L}_2 = \{\delta_1 - 1, \delta_1, \delta_1 + 1\}

where δ[1] is the result of the optimisation over \mathcal{L}_1. These two optimisations thus allow us to compare the initial estimate with that in the opposite direction, and then with delays in a local neighbourhood. While the optimisations do increase the computation required, our restriction to local perturbations minimises the additional computation while improving the estimate in the presence of noise.

Sensitivity and specificity of DynOmics compared to other methods

We compared the performance of DynOmics with currently available methods (DTW4Omics^24, Pearson and lagged Pearson correlation) using measures of sensitivity and specificity when identifying associations in simulated data. The thresholds were a correlation >0.9 and p-values < 0.05 after FDR correction for multiple testing. The simulated data were generated based on scenarios similar to^25, with different parameters.
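The shift-and-correlate refinement described in the previous section can be sketched as follows (a self-contained illustration with our own helper names, using the shifting convention defined there):

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    den = (math.sqrt(sum((ai - ma) ** 2 for ai in a))
           * math.sqrt(sum((bi - mb) ** 2 for bi in b)))
    return num / den

def lagged_pearson(x, y, l):
    """rho_l: correlation after shifting x and y against each other by lag l.

    l < 0 drops the head of x and the tail of y; l > 0 does the converse,
    following the construction of x[l] and y[l] in the text.
    """
    if l < 0:
        return pearson(x[-l:], y[:len(y) + l])
    if l > 0:
        return pearson(x[:len(x) - l], y[l:])
    return pearson(x, y)

def best_lag(x, y, lags):
    """delta*: the lag maximising |rho_l| over a candidate set of lags."""
    return max(lags, key=lambda l: abs(lagged_pearson(x, y, l)))

# The query y trails the reference x by one time point, so among candidate
# lags around an initial estimate of zero, l = 1 aligns the two trajectories
# perfectly.
x = [0, 1, 4, 9, 4, 1, 0, 1]
y = [1, 0, 1, 4, 9, 4, 1, 0]
```

Restricting the candidate set to a few perturbations of the initial FFT-based estimate, rather than scanning all possible lags, is what keeps this refinement cheap.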
We simulated different expression patterns with different delays (−2, −1, 0, 1, 2) as well as different noise levels. We also simulated different numbers of time points (7 and 14). Seven time points is characteristic of conventional ‘omics’ time course experiments. We observed that for the simulated scenario with 7 time points, DynOmics increased sensitivity compared to commonly used methods by at least 8%, while remaining highly specific (>0.9). With 14 time points, all methods performed similarly, including DynOmics. Pearson correlation, which does not take time delays into account, performed the worst in terms of sensitivity in all scenarios, demonstrating that ordinary correlation measures are not sufficient to detect associations when trajectories are delayed. A detailed description of the simulation study and its results is provided in the Supporting Material, Section A.

Implementation and computation time

DynOmics is implemented in R and uses the FFT implemented in the function fft() from the stats R package^47 for the decomposition of the time series. DynOmics utilises the R package parallel to perform calculations on CPUs in parallel where possible. DynOmics’ computation time was tested and compared to DTW4Omics on simulated datasets with seven time points and on the Lung Organogenesis study described below with 14 time points. On simulated data with one reference and 100 queries DynOmics required two seconds, while DTW4Omics required 30 seconds. On the Lung Organogenesis study the association of 50 references and 50