Abstract

   Pseudotime algorithms can be employed to extract latent temporal
   information from cross-sectional data sets allowing dynamic biological
   processes to be studied in situations where the collection of time
   series data is challenging or prohibitive. Computational techniques
   have arisen from single-cell ‘omics and cancer modelling where
   pseudotime can be used to learn about cellular differentiation or
   tumour progression. However, methods to date typically implicitly
   assume homogeneous genetic, phenotypic or environmental backgrounds,
   which becomes limiting as data sets grow in size and complexity. We
   describe a novel statistical framework that learns how pseudotime
   trajectories can be modulated through covariates that encode such
   factors. We apply this model to both single-cell and bulk gene
   expression data sets and show that the approach can recover known and
   novel covariate-pseudotime interaction effects. This hybrid
   regression-latent variable model framework extends pseudotemporal
   modelling from its most prevalent area of single cell genomics to wider
   applications.
     __________________________________________________________________

   Cross-sectional omic data often have non-homogeneous genetic,
   phenotypic, or environmental backgrounds. Here, the authors develop a
   statistical framework to infer pseudotime trajectories in the presence
   of such factors as well as their interactions in both single-cell and
   bulk gene expression analysis

Introduction

   Dynamic or progressive biological behaviour are ideally studied within
   a longitudinal framework that allows for monitoring of individuals over
   time leading to direct time course data. However, longitudinal studies
   are often challenging to conduct and cohort sizes limited by logistical
   and resource availability. In contrast, cross-sectional surveys of a
   population are relatively easier to conduct in large numbers and more
   prevalent for molecular ‘omics based studies. Cross-sectional studies
   do not directly capture the changes in disease characteristics in
   patients but it may be possible to recapitulate aspects of temporal
   variation by applying “pseudotime” computational analysis.

   The objective of pseudotime analysis is to take a collection of
   high-dimensional molecular data from a cross-sectional cohort of
   individuals and to map these on to a series of one-dimensional
   quantities, called pseudotimes. These pseudotimes measure the relative
   progression of each of the individuals along the biological process of
   interest, e.g., disease progression, cellular development, etc.,
   allowing us to understand the (pseudo)temporal behaviour of measured
   features without explicit time series data (Fig. [26]1a). This analysis
   is possible when individuals in the cross-sectional cohort behave
   asynchronously and each is at a different stage of progression.
   Therefore, by creating a relative ordering of the individuals, we can
   define a series of molecular states that constitute a trajectory for
   the process of interest.

Fig. 1.

   [27]Fig. 1
   [28]Open in a new tab

   An illustration of pseudotemporal analysis. a High-dimensional
   molecular data from a cross-sectional cohort is mapped on to a
   one-dimensional pseudotemporal progression scale allowing
   pseudotemporal behaviour of individual features to be analysed. b If
   the cohort contains sub-populations we may want each sub-population to
   be associated to distinct trajectories. c PhenoPath models observed
   expression as a combination of standard differential expression and
   pseudotime/pathway effects, including covariate-pathway interactions. d
   PCA representation of a simulated data set coloured by pseudotime shows
   a clear splitting of trajectories between covariate status x = (−1, 1).
   e Median Kendall-τ correlation to true pseudotime across varying
   fractions of genes exhibiting covariate-trajectory interactions on
   simulated data. PhenoPath is the only algorithm for which the accuracy
   of inference is independent of the (unknown) fraction. f Median AUCs
   measuring the accuracy of different approaches to detecting
   covariate-trajectory interactions using Limma Voom for differential
   expression analysis. As before, PhenoPath is the only algorithm for
   which the accuracy is independent of the underlying fraction of genes
   exhibiting covariate-trajectory interactions

   Pseudotime methods generally rely on the assumption that any two
   individuals with similar observations should carry correspondingly
   similar pseudotimes and algorithms will attempt to find some ordering
   of the individuals that satisfies some overall global measure that best
   adheres to this assumption (Fig. [29]1a). Exact implementations and
   specifications differ between pseudotime approaches particularly in the
   way “similarity” is defined. When applied to molecular data, pseudotime
   analysis typically captures some dominant mode of variation that
   corresponds to the continuous (de)activation of a set of biological
   pathways^[30]1.

   Pseudotime analysis has gained particular popularity in the domain of
   single-cell gene expression analysis (where each “individual” is now a
   single cell) in which it has been applied to model the differentiation
   of single-cells^[31]2–[32]9 (a comprehensive catalogue of single-cell
   pseudotime algorithms can be obtained from
   [33]https://github.com/agitter/single-cell-pseudotime). Using advanced
   machine learning techniques, these methods can be applied to
   characterise complex, nonlinear behaviours, such as cell cycle, and
   modelling branching behaviours to allow, for example, the possibility
   of cell fate decision making and lineage reconstruction. However, these
   single-cell applications were pre-dated by more general applications in
   modeling cancer progression^[34]10–[35]12, as well as other progressive
   diseases^[36]13–[37]16. Examples of such work provided early
   inspiration for single-cell pseudotime methods, e.g., Monocle^[38]2. To
   date, there has been little cross-over between these distinct
   application domains in terms of methodological development due to the
   different contexts in which methods are applied. However, there are
   interesting possibilities that could arise by translating recent
   advances in single-cell pseudotime modelling and applying these to
   tackling related problems in disease progression modelling. This is the
   topic of the work presented here.

   We focus on a variant of pseudotime analysis that has previously been
   unexplored. While recent single-cell pseudotime approaches provide
   powerful means for unsupervised identification of single or multiple,
   branching pseudotime trajectories, these can only be retrospectively
   examined for their association with prior factors of interest. We
   sought to develop a statistical model in which these factors could be
   explicitly incorporated into pseudotime analysis. This capability is
   important as it would provide a mechanism to account for known genetic,
   phenotypic or environmental factors allowing gene expression
   variability to be decomposed into different contributory factors. Doing
   so would allow us to answer questions related to the interaction
   between heterogeneity in these external factors and biological
   progression. For example, how does cellular development differ when
   cells are exposed to different stimuli? Does the evolution of
   transcriptional programming in cancer depend on the histopathological
   classification of the tumours?

   We describe a novel Bayesian statistical framework for pseudotime
   trajectory modelling that allows explicit inclusion of prior factors of
   interest. Our approach allows us to incorporate information in the form
   of covariates that can modulate the pseudotemporal progression allowing
   sub-groups within the cross-sectional population to each develop their
   own trajectory (Fig. [39]1b). Our approach combines linear regression
   and latent variable modelling and allows for interactions between the
   covariates and temporally driven components of the model. We believe
   our method to be the first integrated statistical approach to allow for
   modelling pseudotime trajectories on heterogeneous backgrounds allowing
   its utility in both single and non-single cell applications.

Results

A Bayesian approach for pseudotemporal learning with covariates

   We first give an overview of our statistical method which we call
   “PhenoPath”. For simplicity, our descriptions will assume that the
   observed data are high-dimensional gene expression measurements which
   are used throughout our empirical experiments but we stress that the
   model would be applicable to a wider range of data modalities.

   The objective of PhenoPath is to provide a probabilistic ordering of
   high-dimensional gene expression measurements across objects (e.g.,
   cells, tumours, patients, etc) (see Fig. [40]1a). This is achieved by
   compressing the information contained within the data on to a
   unidimensional axis. Our aim is to construct an axis such that relative
   positions along the axis correspond to some meaningful biological or
   disease progression. The novelty of PhenoPath is to introduce the
   notion that our objects may have different labels (covariates) attached
   to them corresponding to different innate properties or exposure to
   external stimuli. These factors might cause the objects to evolve over
   (pseudo)time differently (Fig. [41]1b). The result is that PhenoPath
   simultaneously learns a pseudotemporal axis that is common to the
   different object labels, while decomposing gene expression variability
   into static and dynamic components.

   More specifically, PhenoPath uses a Bayesian statistical framework that
   integrates linear regression and latent variable modelling. The
   observed data (y[n]) for the nth individual is a linear function of
   both measured covariates (x[n]) and an unobserved latent variable
   (z[n]) corresponding to latent progression that we will term
   pseudotime.

   A schematic relating the parameters in the overall model is shown in
   Fig. [42]1c. In PhenoPath, the model involves three components: (i)
   gene expression is determined by a static component based on your
   covariate status (
   [MATH: <mi>A</mi><msubsup><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi
   mathvariant="normal">T</mi></mrow></msubsup> :MATH]
   ), (ii) a dynamic component related to how far along the biological
   process you are (λz[n]) and the main novelty (iii) an interaction
   component which allows your covariate status to change the direction of
   the dynamic component of the gene expression (
   [MATH: <mi>B</mi><msubsup><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>n</mi></mrow><mrow><mi
   mathvariant="normal">T</mi></mrow></msubsup><msub><mrow><mi>z</mi></mro
   w><mrow><mi>n</mi></mrow></msub> :MATH]
   ). PhenoPath reduces down to linear regression based differential
   expression analysis or factor analysis based pseudotime analysis if
   only the first or second components are used respectively. Standard
   models are therefore nested within PhenoPath.

   In our investigations, the covariates will be binary quantities but
   this is not a necessary restriction and in practice any arbitrary
   design matrix that can be used for standard regression may be used for
   x (Supplementary Results). Sparse Bayesian prior probability
   distributions are used to constrain the parameters (A, B, λ) so that
   covariates only drive the emergence of distinct trajectories if there
   is sufficient information within the data to do so. Computational
   inference within PhenoPath is handled by a fast and highly scalable
   variational Bayesian inference framework that can handle thousands of
   features and samples in minutes using a standard personal computer
   making it readily applicable to large data sets without the use of
   high-performance computing (see Methods section for details). Though
   variational inference of such hierarchical Bayesian models can be
   sensitive to hyperparameters values and parameter initialisation we
   found PhenoPath to be robust to such choices by fitting on over 80
   combinations of (hyper)parameter initialisation (Supplementary
   Results).

Simulation study

   We first developed a simulation study to assess the performance of our
   model relative to existing approaches for pseudotime estimation and
   differential expression analyses for situations in which pseudotime
   trajectories are modulated by covariate status. To do this we simulated
   pseudotemporally regulated RNA-seq data from a nonlinear mean function
   with a negative binomial noise distribution. This is an entirely
   different generative process to that assumed by PhenoPath and designed
   to test for robustness to model misspecification. We generated
   simulated data sets containing gene sets involving 5, 10, 20, …, 50% of
   genes with covariate-trajectory interactions. An example is shown in
   Fig. [43]1d where the direction of the pseudotime trajectory depends on
   whether the artificial covariate x = −1 or x = 1. This was repeated for
   data sets involving 200 and 500 samples, for high and low noise regimes
   with 40 replicates per condition, giving 960 distinct data sets
   (see Supplementary Methods and Results).

   We applied PhenoPath and four state-of-the-art pseudotime algorithms:
   Monocle 2^[44]8, Diffusion Pseudotime (DPT^[45]5), Wishbone^[46]7 and
   TSCAN^[47]3. We measured the median Kendall-τ correlation between the
   inferred pseudotimes and the true pseudotimes used in the simulations
   (Fig. [48]1e). Our results showed that when the fraction of genes
   exhibiting covariate-trajectory interactions is small (5%), all
   approaches perform well. However, as expected, as this fraction
   increased (>10%), the performance of PhenoPath remains consistent while
   the others diminished rapidly since the latter do not account for such
   interaction effects.

   Next, for each data set and each pseudotime analysis, we performed
   differential expression analysis testing for covariate-trajectory
   interactions using Limma Voom^[49]17, DESeq2^[50]18, MAST^[51]19 and
   Monocle 2^[52]8. This gave a total of 35,520 distinct differential
   expression workflows (full details in Supplementary Results). The
   accuracy of each method to identify interactions was assessed using the
   area under the receiver-operator curve (AUC). Again, when the fraction
   of genes exhibiting covariate-trajectory interactions is small (5–10%)
   then all algorithms perform well at identifying interactions with high
   AUCs (Fig. [53]1f). However, as this fraction increases, the AUC of all
   algorithms other than PhenoPath rapidly decreases, while PhenoPath
   maintains the ability to detect interactions.

   Overall, our simulations showed that if pseudotime trajectories are
   modulated by covariate status, then the application of standard
   pseudotime algorithms may be sub-optimal if there are a number of such
   interactions. For real data sets, where the underlying fraction of
   covariate-trajectory interactions would be unknown a priori, the
   uniformity of PhenoPath performance in these simulations is
   advantageous. Furthermore, our integrated model is more powerful than a
   two-stage procedure in which pseudotime is fitted first and then
   standard differential analysis applied since if pseudotime is
   incorrectly estimated at the first stage, covariate-trajectory
   interactions will not be identified correctly at the second stage
   (Supplementary Results).

   An alternative analysis strategy is to fit pseudotime to subsets of the
   data—one subset for each covariate value. This approach would only be
   applicable for discrete covariates where there are sufficient numbers
   of samples per covariate level but not continuous covariates (PhenoPath
   can also use continuous covariates). However, pseudotimes would have to
   be fitted to every combination of the factor levels, resulting in an
   exponentially increasing number of groups for pseudotime inference and
   downstream analysis. Furthermore, while this could enable accurate
   pseudotime estimation for each covariate group, it would be necessary
   to align the pseudotime trajectories between the groups leading to
   further algorithmic design and implementation choices. PhenoPath
   circumvents all of these issues by providing an integrated model for
   deriving a single universal pseudotime trajectory which is locally
   modulated for features that vary by covariate status alleviating the
   requirement to align multiple trajectories. Further discussion of this
   strategy is explored in Supplementary Results.

Single-cell RNA-seq perturbation analysis

   We next examined a time-series single-cell RNA-seq (scRNA-seq) data set
   of bone marrow derived dendritic cells responding to particular
   stimuli^[54]20. Cells were exposed to LPS, a component of Gram-negative
   bacteria, and PAM, a synthetic mimic of bacterial lipopeptides, and
   scRNA-seq performed at 1, 2, 4 and 6 h after stimulation. Using the
   capture time information, the original study was able to study
   single-cell gene expression dynamics under the two exposures. However,
   although capture times were measured, previous analyses have suggested
   this data set is more suited to a “pseudotime” analysis as the cells
   respond asynchronously and heterogeneity exists within the cellular
   populations at each time point^[55]4.

   We conducted an analysis using PhenoPath where we encoded the stimulant
   to which the cells were exposed as a binary covariate. Each gene was
   therefore modelled as a combination of static effects due to LPS/PAM
   exposure, dynamic effects due to temporal variation (independent of
   stimulant type) and dynamic effects that were modulated by the
   stimulant. We applied PhenoPath to 820 cells using the 7500 most
   variable genes from a recent re-quantification of the original data
   set^[56]21 using Salmon^[57]22. The capture times were not used for
   PhenoPath analysis (details of quality control and data filtering are
   given in Supplementary Methods).

   A principal components analysis (PCA) representation of the PhenoPath
   pseudotime fit is shown in Fig. [58]2a. Distinct response trajectories
   of the dendritic cells under either LPS or PAM stimulation are evident,
   with a common cell state at the beginning of pseudotime diverging under
   LPS and PAM stimulation. Despite capture times not being used as an
   input, the PhenoPath pseudotime trajectory recapitulates the physical
   time progression of the cells with an R^2 = 0.68 (Fig. [59]2b) with
   7500 highly variable genes as input. We compared the ability of
   PhenoPath to recapitulate the physical progression of the cells through
   pseudotime inference to three state-of-the-art pseudotime algorithms
   (Monocle 2^[60]8, DPT^[61]5 and TSCAN^[62]3) across a wide range of
   gene set sizes. We found that for every input gene set size PhenoPath
   reported a higher correlation with capture time (Fig. [63]2c) than
   other methods tested. Figure [64]2d depicts the gene expression
   behaviour for four selected genes based on the original physical
   capture times that display apparent time-dependent behaviour that
   depends also on the stimulation applied. PhenoPath trajectories enhance
   our ability to resolve these trends by aligning the cells under LPS and
   PAM on to a common pseudotemporal scale without the need to compute
   separate trajectories (Fig. [65]2e).

Fig. 2.

   [66]Fig. 2
   [67]Open in a new tab

   PhenoPath pseudotime analysis of single-cell RNA-sequencing data under
   perturbation. a PCA representation of the PhenoPath pseudotime fit
   shows distinct trajectories of the dendritic cells under LPS or PAM
   stimulation, with an initial common cell state before transcriptomic
   divergence as the different stimulants are applied. b The inferred
   pseudotimes recapitulate the physical progression of the cells with
   R^2 = 0.68 to the capture times. c The pseudotimes inferred by
   PhenoPath recapitulate the physical progression of the cells more
   accurately than three state-of-the-art pseudotime algorithms across a
   range of input genes. d Expression of Tnf, Rasgefb1, Tnfaip3, and
   Malt1—the four genes with the largest interaction effects—as a function
   of physical capture time stratified by applied stimulant. e The same
   four genes from (d) as a function of physical capture time. Strikingly,
   Tnf is upregulated under PAM exposure yet downregulated under LPS
   stimulation

   We next examined the genes whose behaviour over pseudotime were most
   perturbed by LPS or PAM stimulation. We uncovered a landscape of
   interactions where the (pseudo)-temporal behaviour of expressed genes
   depended on whether the cells were exposed to LPS or PAM (Fig. [68]3a).
   Figure [69]2e illustrates four such genes. Most notably, the tumour
   necrosis factor Tnf had around twice the interaction effect size of any
   other gene, and its expression decreases under LPS stimulation but
   increases under PAM. Further genes exhibit differential regulation
   according to stimulant, such as Mef2c that has constant expression over
   pseudotime under LPS stimulation yet shows downregulation under PAM
   stimulation.

Fig. 3.

   [70]Fig. 3
   [71]Open in a new tab

   PhenoPath identifies genes differentially modulated over pseudotime
   using single-cell RNA-sequencing data under perturbation. a PhenoPath
   applied to the Shalek et al. data set uncovers a landscape of genes
   differentially regulated along pseudotime depending on the stimulant
   (LPS or PAM) applied. b A comparison of p-values obtained through a
   statistical test for differential expression between LPS and PAM
   stimulation shows no particular relation with the interaction
   parameters β inferred with PhenoPath. c A GO enrichment analysis of the
   genes upregulated along pseudotime whose upregulation was increased by
   LPS stimulation showed enrichment for immune system processes. d
   Relationship between PhenoPath pseudotime and antiviral score^[72]20

   To find out whether these interacting genes would have been identified
   using a simple differential expression (DE) analysis, we used Limma
   Voom^[73]17 to test for stimulant-dependent differences in expression
   and compared this to the interaction coefficients (β) inferred using
   PhenoPath (Fig. [74]3b). We found that while some genes that exhibit
   stimulant-pseudotime interactions can be identified as differentially
   expressed genes, the majority require the explicit PhenoPath model to
   resolve the relative contributions of the static and dynamic expression
   components.

   To investigate which biological pathways are perturbed as the cells
   progress under the different stimulants we performed a Gene Ontology
   enrichment analysis^[75]23. Genes whose upregulation was increased over
   (pseudo-)time by LPS exposure were highly enriched for immune response
   (Fig. [76]3c), consistent with previous results^[77]4,[78]20 that
   suggest a “core” module of antiviral genes upregulated at later
   timepoints in LPS cells, though discovered through an entirely
   unsupervised and integrative methodology. We confirmed this by
   comparing the inferred pseudotimes of the cells to the antiviral score
   based on the Id gene set from the original publication, finding a
   strong relationship for both cells stimulated under LPS and PAM
   (Fig. [79]3d). Furthermore, of the top 30 significant interactions with
   negative β values (indicating stronger downregulation under LPS) 40%
   were present in the IIIc peaked inflammatory module identified in the
   original publication, including Tnf and Malt1. In our analysis,
   PhenoPath was able to successfully recapitulate previous results
   (obtained through clustering and manual annotation) in an unsupervised
   manner without knowledge of the capture times.

Pseudotemporal modelling in colorectal cancer

   We next applied our model in a non-single-cell setting by examining RNA
   sequencing gene expression data from the TCGA colorectal adenocarcinoma
   (COAD) cohort^[80]24 using microsatellite instability (MSI) status as a
   phenotypic covariate. MSI is genetic hypermutability that is present in
   ~10–15% of colorectal tumours and is associated with differential
   response to chemotherapeutics and marginally improved prognosis^[81]25.
   Pseudotime inference using PhenoPath was applied to 4801 highly
   variable genes across 284 COAD samples (details of quality control and
   data filtering are given in Supplementary Methods).

   Using PhenoPath we identified a common pseudotemporal scale but
   distinct development trajectories for MSI-high and MSI-low tumours
   (Fig. [82]4a). We observed that the expression of T-regulatory cell
   (Tregs) immune markers (Fig. [83]4b) was increased along the trajectory
   and found, in a Gene Ontology (GO) analysis, an enrichment of
   immune-related pathways (Fig. [84]4c). This suggested that PhenoPath
   has ordered the tumours according to levels of tumour immunogenicity
   and Tregs infiltration of the tumours. This is consistent with Tregs
   acting as potent immunosuppressive cells of the immune system and
   promote progression of cancer through their ability to limit
   anti-tumour immunity^[85]26. To corroborate this proposition, we used
   an bulk RNA sequencing deconvolution tool, quanTIseq^[86]27, which uses
   transcriptomic profiles of immune cells to estimate immune cell content
   of each tumour (Fig. [87]4d). We found that tumours identified by
   quanTIseq as having high regulatory T cell or immune cell content
   scores were most correlated with PhenoPath pseudotime implying that
   PhenoPath had unbiasedly identified an immunogenic contribution to
   colorectal cancer progression through unsupervised analysis.

Fig. 4.

   [88]Fig. 4
   [89]Open in a new tab

   PhenoPath analysis of colorectal adenocarcinoma progression. a PCA
   representation of the trajectory inferred by PhenoPath stratified by
   microsatellite instability status. b Heatmap of expression of immune
   response genes shows upregulation over pseudotime. c A GO enrichment
   analysis of upregulated genes confirms the latent trajectory encodes
   immune pathway activation in each tumour. d Correlation of quanTIseq
   tumour immune content to PhenoPath pseudotime

   We next examined 92 putative covariate-pseudotime interactions
   including known tumour suppressor genes (Fig. [90]5a). Importantly,
   PhenoPath identified the MLH1 gene whose interaction effect size was
   far larger than any other gene. This association provides an important
   positive control since MLH1 is a well-known DNA mismatch repair gene.
   Germline mutations in MLH1 are causal for hereditary non-polyposis
   colorectal cancer^[91]28,[92]29 while epigenetic silencing in sporadic
   CRCs is associated with MSI.

Fig. 5.

   [93]Fig. 5
   [94]Open in a new tab

   Immune-microsatellite instability interactions in colorectal
   adenocarcinoma. a PhenoPath applied to colorectal adenocarcinoma (COAD)
   RNA-seq expression data uncovers a landscape of interactions between
   the inferred immune trajectory and microsatellite instability status
   (MSI). b A comparison to the FDR-corrected q-values reported by Limma
   Voom demonstrates genes found interacting with MSI status and the
   immune pathway are found to be both DE and non-DE in standard analyses.
   c Standard differential expression analysis masks the relationship
   between immune response and microsatellite instability in the
   expression of MLH1 and TGFBR2. d Using PhenoPath pseudotime the
   expression of MLH1 and TGFBR2 were identified as being significantly
   perturbed along the immune trajectory by MSI status. MLH1 shows no
   interaction with immune pathway activation in the MSI-low regime yet is
   highly correlated with immune pathway activation in the MSI-high regime

   We performed a standard differential expression analysis to determine
   differences between MSI groups using limma voom^[95]17 (Fig. [96]5b).
   Whilst many of these 92 genes are differentially expressed between MSI
   groups, including MLH2 and TGFBR2 (Fig. [97]5c), PhenoPath is able to
   resolve the dynamic contribution to these expression differences
   (Fig. [98]5d). In this case, while the expression of these genes in
   MSI-low tumours is relatively constant, in MSI-high tumours, there is a
   spectrum of expression levels that linearly changes over pseudotime
   following the increasing immune cell infiltration in the MSI-high
   tumours.

   We next sought to uncover whether the other genes exhibiting
   interactions between the immune response and microsatellite instability
   displayed a concerted action in any cancer-related pathways. We took
   the top 20 genes by interaction effect size and performed an
   unsupervised pathway enrichment analysis using Reactome^[99]30. At an
   FDR <5% we found these genes were enriched for RUNX1/RUNX2 regulates
   genes involved in differentiation of myeloid cells. This enrichment was
   due to the presence of the gene LGALS3^[100]31 that was found to
   exhibit interactions by PhenoPath. The protein Galactin-3 is encoded by
   LGALS3 and altered expression of galectins in human gastrointestinal
   tissues as being implicated in colorectal cancer progression^[101]32.

Pseudotemporal modelling in breast cancer

   We finally performed a pseudotemporal analysis of the TCGA breast
   cancer cohort using estrogen receptor (ER) status as a phenotypic
   covariate. Approximately 60% of breast cancers are estrogen receptor
   positive^[102]33, which is typically associated with improved prognosis
   and a longer time to recurrence^[103]34.

   We applied PhenoPath to 1135 breast cancers over 4579 highly variable
   genes and identified distinct ER status specific pseudotemporal
   trajectories (Fig. [104]6a). Details of quality control and data
   filtering are given in Supplementary Methods. We found that markers of
   vascular growth pathways or angiogenesis—a well-known and
   uncontroversial hallmark of cancer development^[105]35,[106]36 - showed
   common pseudotemporal progression independent of ER status. This
   included fibroblast growth factor-2 (FGF2) and vascular endothelial
   growth factors C and D (VEGFC/VEGFD) (Fig. [107]6b). A GO enrichment
   analysis indicated that the genes driving the inferred pseudotemporal
   trajectory were indeed enriched for vascular growth pathways
   (Fig. [108]6c). Through unsupervised analysis, PhenoPath had ordered
   the breast tumours and measured breast tumour progression in terms of
   angiogenic development. Survival analysis using stratified (by ER
   status) Cox proportional hazards modelling with covariates suggested
   that the pseudotime covariate was significant (p = 0.0032). This gave
   evidence that increasing pseudotemporal progression in these breast
   tumours conferred reduced overall survival rates (Supplementary
   Results; Supplementary Fig. [109]17).

Fig. 6.

   [110]Fig. 6
   [111]Open in a new tab

   Vascular growth trajectories uncovered by PhenoPath in breast cancer. a
   Principal components visualisation of ER− dependent pseudotemporal
   trajectories. b Expression of a number of vascular growth markers shows
   upregulation over the inferred pseudotime trajectory. c A GO enrichment
   analysis of upregulated genes confirms the latent trajectory encodes
   angiogenesis pathway activation in each tumour

   In order to understand how angiogenic development differs by ER status,
   we examined the landscape of genes exhibiting covariate-pseudotime
   interactions (Fig. [112]7a). We identified 1932 genes (42%) affected by
   an interaction between the pseudotemporal trajectory and ER receptor
   status. The large percentage was expected given the heterogeneity of
   breast cancers and the strong stratification power of ER status in
   breast cancer subtyping^[113]24. Encouragingly (and to be expected),
   the Estrogen Receptor 1 (ESR1) was identified as one such gene. This
   positive control provided reassuring evidence that PhenoPath was
   discovering real interactions. Furthermore, the expression of
   fructose-1,6-biphosphatase (FBP1) and forkhead transcription factor C
   FOXC1 also showed pseudotemporal dependence that was dependent on ER
   status (Figs. [114]6a and [115]7d). In the ER− regime, FBP1 is
   upregulated along the trajectory while in the ER+ regime it is
   downregulated. Intriguingly, FBP1 has been identified as a marker to
   distinguish ER+ from ER− subtypes and its expression has been shown to
   be negatively correlated with SNAIL as the Snail-G9a-Dnmt1 complex, is
   critical for E-cadherin promoter silencing, and required for the
   promoter methylation of FBP1 in basal-like breast cancer^[116]37
   (Supplementary Fig. [117]18). Similarly, FOXC1 which is known to be
   involved with ERα mediated action in breast cancer^[118]38 shows no
   regulation in the ER− regime yet is strongly upregulated in the ER+
   case.

Fig. 7.

   [119]Fig. 7
   [120]Open in a new tab

   Vascular growth-ER status interactions uncovered by PhenoPath in breast
   cancer. a PhenoPath applied to Breast Cancer (BRCA) RNA-seq expression
   data uncovers a landscape of interactions between the inferred
   angiogenesis trajectory and estrogen receptor (ER) status. b A
   comparison to the FDR-corrected q-values reported by Limma Voom
   identifies a significant number of DE genes display an interaction with
   ER status and the anthropogenic pathway. c A histogram of the
   cross-over points of all genes whose trajectory-covariate interactions
   were significant. The vast majority of cross-over points are at the end
   of the trajectory (around 0.8, where the “middle” pathway score is 0)
   implying a convergence of gene expression as the trajectory progresses.
   d Three example genes ESR1, FBP1, and (FOXC1) were identified by
   PhenoPath as significantly perturbed along the angiogenesis trajectory
   by ER status. The vertical dashed line signifies the calculated
   cross-over point, demonstrating the expression profiles of these genes
   converge towards the end of the trajectory

   To complement this analysis, we performed a pathway enrichment analysis
   using Reactome^[121]30 to discover whether any of the top 20
   interacting genes (by β value) converge on a cancer-related pathway. We
   found (at a FDR <5%) enrichment for Unfolded protein response and ATF6α
   activating chaperone genes. Previous studies have shown that knockouts
   of ATF6α blocked estrogen induction of the antiapoptotic chaperone BiP,
   which in turn inhibited ER-stimulated cell proliferation^[122]39.
   Therefore PhenoPath analysis suggests a relationship between the ER
   status of the tumour to the (vascular) growth via pathway-specific
   action mediated by ATF6α. The interaction gene set was further enriched
   for TFAP2 family regulates transcription of growth factors and their
   receptors. TFAP2 has previously been shown to directly interact with an
   estrogen receptor promoter^[123]40 and provides one of the key
   regulators of hormone responsiveness in breast cancers^[124]41. In
   particular, TFAP2 has been shown experimentally to regulate some of the
   key genes we find as significant interactions, including ESR1 and
   FOXA1^[125]42.

   Many of these genes exhibit a convergence—they have markedly different
   expression at the beginning of the trajectory based on ER status yet
   converge towards the end. We derived a mathematical formula to infer
   such convergence points and calculated these for all genes showing
   significant interactions (see Supplementary Results for details).
   Remarkably, the vast majority converge towards the end of the
   trajectory (Fig. [126]7c), implying a common end-point in vascular
   development for both ER+ and ER− cancer subtypes. This effect can be
   seen in the trajectory plots in Fig. [127]6a, where the ER+ and ER−
   tumours converge at the end of their trajectories. This suggests that
   while there exist low levels of angiogenesis pathway activation, ER
   status dominates gene expression while as angiogenesis pathway
   activation increases it comes to dominate expression patterns over ER
   status. This finding might have implications for the application of
   angiogenesis inhibitors in breast cancer treatment.

Discussion

   PhenoPath provides a novel contribution to the existing arsenal of
   pseudotemporal analysis algorithms developed across a range of
   application areas including single cell ‘omics and cancer. Using a
   statistical model that allows for covariate-modulated pseudotemporal
   trajectories, PhenoPath generalises pseudotime analysis to a wider
   range of applications where genetic, phenotypic or environmental
   contexts may vary between samples and be influential in the
   trajectories. We have demonstrated its utility in an application to
   single-cell transcriptomics involving external stimuli and there is
   potential usage in high-throughput single-cell CRISPR experiments that
   are as yet unexplored^[128]43,[129]44. We also demonstrated
   applications to The Cancer Genome Atlas using PhenoPath to model
   disease trajectories in colorectal and breast cancer. The trajectories
   identified were consistent with pre-existing knowledge concerning
   tumorigenesis in these diseases. Importantly, PhenoPath was able to
   identify covariate-pathway interactions that might be driving specific
   trajectory differences recovering known associations as well as novel
   genes. We showed that these behaviours cannot be readily determined
   with standard differential expression analyses without taking into
   account the latent disease progression. An assumption made by PhenoPath
   is that features evolve linearly with respect to pseudotime. We tested
   this assumption in a number of single cell data sets and found this
   approximation to be surprisingly accurate (Supplementary Results).
   However, it cannot be discounted that nonlinear effects may occur and
   checks should be conducted to verify that PhenoPath model fits are
   consistent with the data. Gaussian Processes offer a means of providing
   a more flexible nonlinear framework and further work in this area is
   anticipated. In summary, PhenoPath provides a powerful and scalable
   pseudotime analysis framework for modelling latent progression in a
   variety of experimental settings. Future work will expand the ability
   of PhenoPath to handle complex mixtures of continuous and discrete
   covariates in high-dimensional settings.

Methods

   We summarise the model specification and inference algorithms below.
   Further details are reported in Supplementary Methods.

Statistical model

   We begin with an N × G data matrix Y where y[ng] denotes the nth entry
   in the gth column for n ∈ 1, …, N samples and g ∈ 1, …, G features.
   Such a matrix would correspond to the measurement of a dynamic
   molecular process that we might reasonably expect to show continuous
   evolution such as gene expression corresponding to a particular
   pathway. It is then trivial to learn a one-dimensional linear embedding
   that would be our “best guess” of such progression via a factor
   analysis model:
   [MATH:
   <msub><mrow><mi>y</mi></mrow><mrow><mi>n</mi><mi>g</mi></mrow></msub><m
   o>=</mo><msub><mrow><mi>λ</mi></mrow><mrow><mi>g</mi></mrow></msub><msu
   b><mrow><mi>z</mi></mrow><mrow><mi>n</mi></mrow></msub><mo>+</mo><msub>
   <mrow><mi>ϵ</mi></mrow><mrow><mi>n</mi><mi>g</mi></mrow></msub><mo>,</m
   o><msub><mrow><mi>ϵ</mi></mrow><mrow><mi>n</mi><mi>g</mi></mrow></msub>
   <mo>~</mo><mi mathvariant="normal">N</mi><mfenced close=")" open="("
   separators=""><mrow><mn>0</mn><mo>,</mo><msubsup><mrow><mi>τ</mi></mrow
   ><mrow><mi>g</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></msubsup></mr
   ow></mfenced><mo>,</mo> :MATH]
   1

   where z[n] is the latent measure of progression for sample n and λ[g]
   is the factor loading for feature g which essentially describes the
   evolution of g along the trajectory.

   However, it is conceivable that the evolution of feature g along the
   trajectory is not identical for all samples but is instead affected by
   a set of external covariates. Note that we expect such features to be
   “static” and should not correlate with the trajectory itself.

   Introducing the N × P covariate matrix X with the entry in the nth row
   and pth column given by x[np], we allow such measurements to perturb
   the factor loading matrix
   [MATH:
   <msub><mrow><mi>λ</mi></mrow><mrow><mi>g</mi></mrow></msub><mo>→</mo><m
   sub><mrow><mi>λ</mi></mrow><mrow><mi>n</mi><mi>g</mi></mrow></msub><mo>
   =</mo><msub><mrow><mi>λ</mi></mrow><mrow><mi>g</mi></mrow></msub><mo>+<
   /mo><munderover accent="false" accentunder="false"><mrow><mo>
   ∑</mo></mrow><mrow><mi>p</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>P</mi
   ></mrow></munderover><msub><mrow><mi>β</mi></mrow><mrow><mi>p</mi><mi>g
   </mi></mrow></msub><msub><mrow><mi>x</mi></mrow><mrow><mi>n</mi><mi>p</
   mi></mrow></msub><mo>,</mo> :MATH]
   2

   where β[pg] quantifies the effect of covariate p on the evolution of
   feature g. Despite Y being column-centred we need to reintroduce gene
   and covariate-specific intercepts to satisfy the model assumptions,
   giving a generative model of the form
   [MATH:
   <msub><mrow><mi>y</mi></mrow><mrow><mi>n</mi><mi>g</mi></mrow></msub><m
   o>=</mo><msub><mrow><mi>η</mi></mrow><mrow><mi>g</mi></mrow></msub><mo>
   +</mo><munderover accent="false" accentunder="false"><mrow><mo>
   ∑</mo></mrow><mrow><mi>p</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>P</mi
   ></mrow></munderover><msub><mrow><mi>α</mi></mrow><mrow><mi>p</mi><mi>g
   </mi></mrow></msub><msub><mrow><mi>x</mi></mrow><mrow><mi>n</mi><mi>p</
   mi></mrow></msub><mo>+</mo><mfenced close=")" open="("
   separators=""><mrow><msub><mrow><mi>λ</mi></mrow><mrow><mi>g</mi></mrow
   ></msub><mo>+</mo><munderover accent="false"
   accentunder="false"><mrow><mo>
   ∑</mo></mrow><mrow><mi>p</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>P</mi
   ></mrow></munderover><msub><mrow><mi>β</mi></mrow><mrow><mi>p</mi><mi>g
   </mi></mrow></msub><msub><mrow><mi>x</mi></mrow><mrow><mi>n</mi><mi>p</
   mi></mrow></msub></mrow></mfenced><msub><mrow><mi>z</mi></mrow><mrow><m
   i>n</mi></mrow></msub><mo>+</mo><msub><mrow><mi>ϵ</mi></mrow><mrow><mi>
   n</mi><mi>g</mi></mrow></msub><mo>,</mo><msub><mrow><mi>ϵ</mi></mrow><m
   row><mi>n</mi><mi>g</mi></mrow></msub><mo>~</mo><mi
   mathvariant="normal">N</mi><mfenced close=")" open="("
   separators=""><mrow><mn>0</mn><mo>,</mo><msubsup><mrow><mi>τ</mi></mrow
   ><mrow><mi>g</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></msubsup></mr
   ow></mfenced><mo>.</mo> :MATH]
   3

   Our goal is inference of z[n] that encodes progression along with β[pg]
   which is informative of novel interactions between continuous
   trajectories and external covariates. Consequently, we place a sparse
   Bayesian prior on β[pg] of the form
   [MATH:
   <msub><mrow><mi>β</mi></mrow><mrow><mi>p</mi><mi>g</mi></mrow></msub><m
   o>~</mo><mi mathvariant="normal">N</mi><mfenced close=")" open="("
   separators=""><mrow><mn>0</mn><mo>,</mo><msubsup><mrow><mi>χ</mi></mrow
   ><mrow><mi>p</mi><mi>g</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></ms
   ubsup></mrow></mfenced> :MATH]
   where the posterior of χ[pg] is informative of the model’s belief that
   β[pg] is non-zero. The complete generative model is therefore given by
   [MATH: <mtable><mtr><mtd
   columnalign="left"><msub><mrow><mi>α</mi></mrow><mrow><mi>p</mi><mi>g</
   mi></mrow></msub><mo>~</mo><mi mathvariant="normal"> N</mi><mfenced
   close=")" open="("
   separators=""><mrow><mn>0</mn><mo>,</mo><msubsup><mrow><mi>τ</mi></mrow
   ><mrow><mi>α</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></msubsup></mr
   ow></mfenced></mtd></mtr><mtr><mtd
   columnalign="left"><msub><mrow><mi>λ</mi></mrow><mrow><mi>g</mi></mrow>
   </msub><mo>~</mo><mi mathvariant="normal"> N</mi><mfenced close=")"
   open="("
   separators=""><mrow><mn>0</mn><mo>,</mo><msubsup><mrow><mi>τ</mi></mrow
   ><mrow><mi>λ</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></msubsup></mr
   ow></mfenced></mtd></mtr><mtr><mtd
   columnalign="left"><msub><mrow><mi>z</mi></mrow><mrow><mi>n</mi></mrow>
   </msub><mo>~</mo><mi mathvariant="normal"> N</mi><mfenced close=")"
   open="("
   separators=""><mrow><msub><mrow><mi>q</mi></mrow><mrow><mi>n</mi></mrow
   ></msub><mo>,</mo><msubsup><mrow><mi>τ</mi></mrow><mrow><mi>q</mi></mro
   w><mrow><mo>-</mo><mn>1</mn></mrow></msubsup></mrow></mfenced></mtd></m
   tr><mtr><mtd
   columnalign="left"><msub><mrow><mi>β</mi></mrow><mrow><mi>p</mi><mi>g</
   mi></mrow></msub><mo>~</mo><mi mathvariant="normal"> N</mi><mfenced
   close=")" open="("
   separators=""><mrow><mn>0</mn><mo>,</mo><msubsup><mrow><mi>χ</mi></mrow
   ><mrow><mi>p</mi><mi>g</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></ms
   ubsup></mrow></mfenced></mtd></mtr><mtr><mtd
   columnalign="left"><msubsup><mrow><mi>χ</mi></mrow><mrow><mi>p</mi><mi>
   g</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></msubsup><mo>~</mo><mi
   mathvariant="normal">
   Gamma</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>a</mi></mrow><mrow><mi>
   β</mi></mrow></msub><mo>,</mo><msub><mrow><mi>b</mi></mrow><mrow><mi>β<
   /mi></mrow></msub></mrow><mo>)</mo></mrow></mtd></mtr><mtr><mtd
   columnalign="left"><msubsup><mrow><mi>τ</mi></mrow><mrow><mi>g</mi></mr
   ow><mrow><mo>-</mo><mn>1</mn></mrow></msubsup><mo>~</mo><mi
   mathvariant="normal">
   Gamma</mi><mrow><mo>(</mo><mrow><mi>a</mi><mo>,</mo><mi>b</mi></mrow><m
   o>)</mo></mrow></mtd></mtr><mtr><mtd
   columnalign="left"><msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi></mrow>
   </msub><mo>~</mo><mi mathvariant="normal"> N</mi><mfenced close=")"
   open="("
   separators=""><mrow><mn>0</mn><mo>,</mo><msubsup><mrow><mi>τ</mi></mrow
   ><mrow><mi>μ</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></msubsup></mr
   ow></mfenced></mtd></mtr><mtr><mtd
   columnalign="left"><msub><mrow><mi>ϵ</mi></mrow><mrow><mi>n</mi><mi>g</
   mi></mrow></msub><mo>~</mo><mi mathvariant="normal"> N</mi><mfenced
   close=")" open="("
   separators=""><mrow><mn>0</mn><mo>,</mo><msubsup><mrow><mi>τ</mi></mrow
   ><mrow><mi>g</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></msubsup></mr
   ow></mfenced></mtd></mtr><mtr><mtd
   columnalign="left"><msub><mrow><mi>y</mi></mrow><mrow><mi>n</mi><mi>g</
   mi></mrow></msub><mo>=</mo><msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi
   ></mrow></msub><mo>+</mo><munder><mrow><mo>
   ∑</mo></mrow><mrow><mi>p</mi></mrow></munder><msub><mrow><mi>α</mi></mr
   ow><mrow><mi>p</mi><mi>g</mi></mrow></msub><msub><mrow><mi>x</mi></mrow
   ><mrow><mi>n</mi><mi>p</mi></mrow></msub><mo>+</mo><mfenced close=")"
   open="("
   separators=""><mrow><msub><mrow><mi>λ</mi></mrow><mrow><mi>g</mi></mrow
   ></msub><mo>+</mo><munder><mrow><mo>
   ∑</mo></mrow><mrow><mi>p</mi></mrow></munder><msub><mrow><mi>β</mi></mr
   ow><mrow><mi>p</mi><mi>g</mi></mrow></msub><msub><mrow><mi>x</mi></mrow
   ><mrow><mi>n</mi><mi>p</mi></mrow></msub></mrow></mfenced><msub><mrow><
   mi>z</mi></mrow><mrow><mi>n</mi></mrow></msub><mo>+</mo><msub><mrow><mi
   >ϵ</mi></mrow><mrow><mi>n</mi><mi>g</mi></mrow></msub><mo>,</mo></mtd><
   /mtr></mtable> :MATH]
   4

   where τ[α], τ[λ], a, b, a[β], b[β], τ[q] are fixed hyperparameters and
   q[n] encodes prior information about z[n] if available but typically
   q[n] = 0 ∀i in the uninformative case.

Inference

   We perform co-ordinate ascent mean field variational inference (see
   ref. ^[130]45) with an approximating distribution of the form
   [MATH: <mtable><mtr><mtd columnalign="left"><mi>q</mi><mfenced close=""
   open="(" separators=""><mrow><msubsup><mrow><mfenced close="}" open="{"
   separators=""><mrow><msub><mrow><mi>z</mi></mrow><mrow><mi>n</mi></mrow
   ></msub></mrow></mfenced></mrow><mrow><mi>n</mi><mo>=</mo><mn>1</mn></m
   row><mrow><mi>N</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mrow><mo
   >{</mo><mrow><msub><mrow><mi>μ</mi></mrow><mrow><mi>g</mi></mrow></msub
   ></mrow><mo>}</mo></mrow></mrow><mrow><mi>g</mi><mo>=</mo><mn>1</mn></m
   row><mrow><mi>G</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mrow><mo
   >{</mo><mrow><msub><mrow><mi>τ</mi></mrow><mrow><mi>g</mi></mrow></msub
   ></mrow><mo>}</mo></mrow></mrow><mrow><mi>g</mi><mo>=</mo><mn>1</mn></m
   row><mrow><mi>G</mi></mrow></msubsup><mo>,</mo><msubsup><mrow><mrow><mo
   >{</mo><mrow><msub><mrow><mi>λ</mi></mrow><mrow><mi>g</mi></mrow></msub
   ></mrow><mo>}</mo></mrow></mrow><mrow><mi>g</mi><mo>=</mo><mn>1</mn></m
   row><mrow><mi>G</mi></mrow></msubsup><mo>,</mo></mrow></mfenced><mfence
   d close=")" open=""
   separators=""><mrow><msubsup><mrow><mrow><mo>{</mo><mrow><msub><mrow><m
   i>α</mi></mrow><mrow><mi>p</mi><mi>g</mi></mrow></msub></mrow><mo>}</mo
   ></mrow></mrow><mrow><mi>g</mi><mo>=</mo><mn>1</mn><mo>,</mo><mi>p</mi>
   <mo>=</mo><mn>1</mn></mrow><mrow><mi>G</mi><mo>,</mo><mi>P</mi></mrow><
   /msubsup><msubsup><mrow><mrow><mo>{</mo><mrow><msub><mrow><mi>β</mi></m
   row><mrow><mi>p</mi><mi>g</mi></mrow></msub></mrow><mo>}</mo></mrow></m
   row><mrow><mi>g</mi><mo>=</mo><mn>1</mn><mo>,</mo><mi>p</mi><mo>=</mo><
   mn>1</mn></mrow><mrow><mi>G</mi><mo>,</mo><mi>P</mi></mrow></msubsup><m
   subsup><mrow><mrow><mo>{</mo><mrow><msub><mrow><mi>χ</mi></mrow><mrow><
   mi>p</mi><mi>g</mi></mrow></msub></mrow><mo>}</mo></mrow></mrow><mrow><
   mi>g</mi><mo>=</mo><mn>1</mn><mo>,</mo><mi>p</mi><mo>=</mo><mn>1</mn></
   mrow><mrow><mi>G</mi><mo>,</mo><mi>P</mi></mrow></msubsup></mrow></mfen
   ced></mtd></mtr><mtr><mtd columnalign="left"><mo>=</mo><munderover
   accent="false" accentunder="false"><mrow><mo>
   ∏</mo></mrow><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</mi
   ></mrow></munderover><msub><mrow><munder
   accentunder="true"><mrow><msub><mrow><mi>q</mi></mrow><mrow><mi>z</mi><
   /mrow></msub><mfenced close=")" open="("
   separators=""><mrow><msub><mrow><mi>z</mi></mrow><mrow><mi>n</mi></mrow
   ></msub></mrow></mfenced></mrow><mrow><mo
   stretchy="true">⏟</mo></mrow></munder></mrow><mrow><mi
   mathvariant="normal">Normal</mi></mrow></msub><munderover
   accent="false" accentunder="false"><mrow><mo>
   ∏</mo></mrow><mrow><mi>g</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>G</mi
   ></mrow></munderover><msub><mrow><munder
   accentunder="true"><mrow><msub><mrow><mi>q</mi></mrow><mrow><mi>μ</mi><
   /mrow></msub><mrow><mo>(</mo><mrow><msub><mrow><mi>μ</mi></mrow><mrow><
   mi>g</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><mrow><mo
   stretchy="true">⏟</mo></mrow></munder></mrow><mrow><mi
   mathvariant="normal">Normal</mi></mrow></msub><msub><mrow><munder
   accentunder="true"><mrow><msub><mrow><mi>q</mi></mrow><mrow><mi>τ</mi><
   /mrow></msub><mrow><mo>(</mo><mrow><msub><mrow><mi>τ</mi></mrow><mrow><
   mi>g</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><mrow><mo
   stretchy="true">⏟</mo></mrow></munder></mrow><mrow><mi
   mathvariant="normal">Gamma</mi></mrow></msub><msub><mrow><munder
   accentunder="true"><mrow><msub><mrow><mi>q</mi></mrow><mrow><mi>λ</mi><
   /mrow></msub><mrow><mo>(</mo><mrow><msub><mrow><mi>λ</mi></mrow><mrow><
   mi>g</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><mrow><mo
   stretchy="true">⏟</mo></mrow></munder></mrow><mrow><mi
   mathvariant="normal">Normal</mi></mrow></msub><munderover
   accent="false" accentunder="false"><mrow><mo>
   ∏</mo></mrow><mrow><mi>p</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>P</mi
   ></mrow></munderover><msub><mrow><munder
   accentunder="true"><mrow><msub><mrow><mi>q</mi></mrow><mrow><mi>α</mi><
   /mrow></msub><mrow><mo>(</mo><mrow><msub><mrow><mi>α</mi></mrow><mrow><
   mi>p</mi><mi>g</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><mrow><
   mo stretchy="true">⏟</mo></mrow></munder></mrow><mrow><mi
   mathvariant="normal">Normal</mi></mrow></msub><msub><mrow><munder
   accentunder="true"><mrow><msub><mrow><mi>q</mi></mrow><mrow><mi>β</mi><
   /mrow></msub><mrow><mo>(</mo><mrow><msub><mrow><mi>β</mi></mrow><mrow><
   mi>p</mi><mi>g</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><mrow><
   mo stretchy="true">⏟</mo></mrow></munder></mrow><mrow><mi
   mathvariant="normal">Normal</mi></mrow></msub><msub><mrow><munder
   accentunder="true"><mrow><msub><mrow><mi>q</mi></mrow><mrow><mi>χ</mi><
   /mrow></msub><mrow><mo>(</mo><mrow><msub><mrow><mi>χ</mi></mrow><mrow><
   mi>p</mi><mi>g</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><mrow><
   mo stretchy="true">⏟</mo></mrow></munder></mrow><mrow><mi
   mathvariant="normal">Gamma</mi></mrow></msub></mtd></mtr></mtable><mo>.
   </mo> :MATH]
   5

   Due to the model’s conjugacy the optimal update for each parameter θ[j]
   given all other parameters θ[−j] can easily be computed via
   [MATH:
   <msubsup><mrow><mi>q</mi></mrow><mrow><mi>j</mi></mrow><mrow><mo>*</mo>
   </mrow></msubsup><mrow><mo>(</mo><mrow><msub><mrow><mi>θ</mi></mrow><mr
   ow><mi>j</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>∝</mo><mi
   mathvariant="normal">exp</mi><mfenced close="}" open="{"
   separators=""><mrow><msub><mrow><mi
   mathvariant="bold">E</mi></mrow><mrow><mo>-</mo><mi>j</mi></mrow></msub
   ><mrow><mo>[</mo><mrow><mi
   mathvariant="normal">log</mi><mi>p</mi><mrow><mo>(</mo><mrow><msub><mro
   w><mi>θ</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>∣</mo><msub><mrow>
   <mi
   mathvariant="bold-italic">θ</mi></mrow><mrow><mo>-</mo><mi>j</mi></mrow
   ></msub><mo>,</mo><mi
   mathvariant="bold">X</mi><mo>,</mo><mi>Y</mi></mrow><mo>)</mo></mrow></
   mrow><mo>]</mo></mrow></mrow></mfenced><mo>,</mo> :MATH]
   6

   where the expectation is taken with respect to the variational density
   over θ[−j]. The precise form of the variational updates can be found
   in Supplementary Text.

Ranking covariate-pathway interactions

   For each gene g and covariate p we have β[pg] that encodes the effect
   of p on the evolution of g along the trajectory z. We would like to
   identify interesting interactions for further analysis and follow-up.
   The variational approximation for β[pg] is given by
   [MATH:
   <msub><mrow><mi>q</mi></mrow><mrow><msub><mrow><mi>β</mi></mrow><mrow><
   mi>p</mi><mi>g</mi></mrow></msub></mrow></msub><mo>~</mo><mi
   mathvariant="normal">N</mi><mrow><mo>(</mo><mrow><msub><mrow><mi>m</mi>
   </mrow><mrow><msub><mrow><mi>β</mi></mrow><mrow><mi>p</mi><mi>g</mi></m
   row></msub></mrow></msub><mo>,</mo><msub><mrow><mi>s</mi></mrow><mrow><
   msub><mrow><mi>β</mi></mrow><mrow><mi>p</mi><mi>g</mi></mrow></msub></m
   row></msub></mrow><mo>)</mo></mrow><mo>.</mo> :MATH]
   7

   which after (approximately) maximising the ELBO will give estimates
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>m</mi></mrow><mo>^</mo></mover></mrow><mrow><ms
   ub><mrow><mi>β</mi></mrow><mrow><mi>p</mi><mi>g</mi></mrow></msub></mro
   w></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>ŝ</mi></mrow><mrow><msub><mrow><mi>β</mi></mrow><mrow><
   mi>p</mi><mi>g</mi></mrow></msub></mrow></msub> :MATH]
   for every gene and covariate. We classify or label an interaction as of
   interest if
   [MATH: <mfrac><mrow><mfenced close="∣" open="∣"
   separators=""><mrow><msub><mrow><mover
   accent="true"><mrow><mi>m</mi></mrow><mo>^</mo></mover></mrow><mrow><ms
   ub><mrow><mi>β</mi></mrow><mrow><mi>p</mi><mi>g</mi></mrow></msub></mro
   w></msub></mrow></mfenced></mrow><mrow><msub><mrow><mi>ŝ</mi></mrow><mr
   ow><msub><mrow><mi>β</mi></mrow><mrow><mi>p</mi><mi>g</mi></mrow></msub
   ></mrow></msub></mrow></mfrac><mo>></mo><mspace
   width="0.3em"></mspace><mi>k</mi><mo>,</mo> :MATH]
   8

   where k is a positive constant. In other words, the interaction is not
   of interest if β[pg] = 0 falls within k posterior standard deviations
   of the posterior estimate of the mean of the interaction. This is
   equivalent to a decision theoretic loss criteria governing whether the
   true value for β lies in the tails of the posterior marginal or not.

Data availability

   We provide an R implementation of our method PhenoPath at
   [131]https://bioconductor.org/packages/release/bioc/html/phenopath.html
   .

Electronic supplementary material

   [132]Supplementary Information^ (5.3MB, pdf)

Acknowledgements