Abstract
RNA velocity is closely related with cell fate and is an important
indicator for the prediction of cell states with elegant physical
explanation derived from single-cell RNA-seq data. Most existing RNA
velocity models aim to extract dynamics from the phase delay between
unspliced and spliced mRNA for each individual gene. However,
unspliced/spliced mRNA abundance may not provide sufficient signal for
dynamic modeling, leading to poor fit in phase portraits. Motivated by
the idea that RNA velocity could be driven by the transcriptional
regulation, we propose TFvelo, which expands RNA velocity concept to
various single-cell datasets without relying on splicing information,
by introducing gene regulatory information. Our experiments on
synthetic data and multiple scRNA-Seq datasets show that TFvelo can
accurately fit genes dynamics on phase portraits, and effectively infer
cell pseudo-time and trajectory from RNA abundance data. TFvelo opens a
robust and accurate avenue for modeling RNA velocity for single cell
data.
Subject terms: Computational models, Genome informatics, RNA
sequencing, Data processing, Gene regulatory networks
__________________________________________________________________
Most RNA velocity models extract dynamics from the phase delay between
unspliced and spliced mRNA for each gene. Here, authors propose TFvelo,
broadening RNA velocity beyond splicing information to include gene
regulation. TFvelo accurately models genes dynamics and infers cell
pseudo-time from RNA abundance data.
Introduction
Single-cell RNA sequencing (scRNA-seq)^[30]1–[31]3 provides a wealth of
information about the gene expression profile of individual cells. To
expand the scope of scRNA-seq analysis beyond a static snapshot and
capture cellular dynamics without tracking alive individual cells over
time, many trajectory inference (TI) approaches, like PAGA^[32]4,
Monocle^[33]5, Slingshot^[34]6 and Palantir^[35]7 have been developed.
While these TI methods enable pseudotime analysis at both the cell and
gene levels, they typically require the annotation of initial
cells^[36]8. In recent years, RNA velocity^[37]9 theory was proposed,
which describes the time derivative of gene abundance in a
physical-informed approach, by modeling the relationship between the
unspliced (immature) and spliced (mature) mRNAs. Combining velocities
across multiple genes, velocity-based pseudotime and cell fate can be
inferred from the transition probability between cells. To estimate the
RNA velocity, velocyto^[38]9 is introduced as the vanilla approach with
a steady state assumption. ScVelo^[39]10 models the dynamics without
the steady state assumption and employs an Expectation-Maximization
(EM) approach for better reconstructing the underlying kinetics.
More recently, several approaches have been developed to improve the
estimation of RNA velocity. For instance, VeloAE^[40]11 utilized an
auto-encoder and low-dimensional space to smooth RNA velocity. Still in
the unspliced/spliced space, DeepCycle^[41]12 used an autoencoder to
map cell cycle-related genes. UniTVelo^[42]13 directly designed a
function of time to model the spliced RNA level. In contrast to
deterministic models, VeloVI^[43]8, Pyrovelocity^[44]14 and
veloVAE^[45]15 estimated RNA velocity using Bayesian inference
frameworks. LatentVelo^[46]16 embeds the unspliced and spliced
expression into the latent space with a variational auto-encoder, so
that a low-dimensional representation of RNA velocity could be
obtained. CellDancer^[47]17 models the transcription, splicing and
degradation rate of a gene as a function of its unspliced and spliced
counts with a neural network. Furthermore, apart from the
unspliced/spliced data, dynamics can also be modeled with additional
information. For instance, Dynamo^[48]18 improve the RNA velocity model
by using the new/total labeled RNA-seq^[49]19. Taking the advantage of
single-cell multi-omics technology^[50]20,[51]21, RNA velocity analysis
can be expanded to protein abundances (protaccel^[52]22) and
single-cell ATAC-seq (MultiVelo^[53]23) datasets. However, these
additional omics data or labeling may be cost or even not available in
most single cell datasets.
While RNA velocity theory has improved the inference of single-cell
trajectories, pseudo-time, and gene regulation inference in numerous
studies^[54]24, current RNA velocity models still face severe
challenges. Firstly, RNA velocity models always treat each gene
independently and do not take the underlying regulation into
consideration^[55]25, although it is expected to be helpful for cell
fate inference by integrating gene regulatory mechanism analysis.
Secondly, these models can not fit the gene dynamics well on most
genes, which might be because that the transcriptional dynamics of mRNA
splicing may not provide sufficient signal in single cell
resolution^[56]25. Most conventional approaches assume that the joint
plot between unspliced and spliced expression should form a clockwise
curve on the phase portrait because of the phase delay within
them^[57]9,[58]10,[59]13,[60]26, which is, nevertheless, rarely
observed from the data. The high sparsity and noise nature of unspliced
mRNA counts^[61]27,[62]28 could be one significant reason. The short
time scale of splicing process could also make it hard to extract
dynamics from the delay between unspliced and spliced RNA. Last but not
least, the RNA velocity models can only be applied to scRNA-seq
datasets with unspliced/spliced or new/total labels. However, the
abundance of unspliced/spliced data may not be available in certain
datasets, such as those obtained through Fluorescence In Situ
Hybridization (FISH) technologies^[63]29, and some human sequencing
data due to privacy restrictions. These challenges motivate us to
construct the gene dynamics based on regulation, instead of only
relying on the unspliced/spliced counts.
The gene regulation has been explored a lot in the field of single cell
research^[64]30–[65]34. In this study, we aim at inferring the gene
dynamics and cell fate based on the underlying regulatory among genes.
To this end, we investigate the relationship between gene regulatory
patterns and RNA velocity, which reveals that the RNA velocity can be
approximately estimated as a linear combination of the expression
levels of transcription factors (TFs). Notably, the TFs with the
non-zero weights are significantly enriched with TF set functionally
linked to the target gene. In addition, we find that the joint
distribution of expressions between a TF and its target could form a
clockwise curve on the co-expression plot^[66]35, indicating the phase
delay between them. Those findings imply that the abundance of TFs
could also be used to construct dynamics model of RNA velocity.
In this study, we propose TFvelo to model the RNA velocity based on the
expression levels of TFs, where the velocity refers to the changing
rate of RNA abundance. The computational framework of TFvelo relies on
a generalized EM algorithm, which iteratively updates the learned
representation of TFs, the latent time of cells, and the parameters in
the dynamic equation. TFvelo can accomplish analysis performed by
previous RNA velocity studies, such as gene-specific phase portrait
fitting, velocity modeling, inferring pseudotime without annotation of
initial states, and visualizing cell fate with the velocity-based
transition stream plot. Unlike methods that depend on splicing
information, TFvelo can robustly work on genes with sparse unspliced
counts and even datasets without splicing information. Using both
synthetic data and multiple real scRNA-seq datasets, we show that the
TFvelo model can capture the dynamics on phase portraits of TFs-target,
accurately infer the pseudo-time and the developmental trajectories of
cells and help explore biological findings.
Results
Findings: RNA velocity can be approximately estimated with TFs’ abundance
The fact that RNA velocity can be modelled by the transcriptional
regulation is supported by the following two findings. We find that the
RNA velocity model by previous approach can be approximately estimated
as a linear combination of TFs’ abundance. On the scRNA-Seq pancreas
dataset^[67]9,[68]36, given a target gene, Least Absolute Shrinkage and
Selection Operator (LASSO) regression is applied to predict RNA
velocity modelled by scVelo, based on the expression level of TFs and
the target gene itself (Fig. S[69]1a). Here we run scVelo pipeline on
pancreas dataset with the default parameters to get the velocities, and
only apply LASSO regression to those genes which are modelled by
scVelo. We feed all TFs^[70]37 into LASSO, and get a sparse
weights-vector where only some TFs have non-zero weights. As a result,
the RNA velocity of gene
[MATH: g :MATH]
can be estimated as:
[MATH:
vg=∑TFi∈Sgw
mrow>g,TF<
mrow>ieTFi
msub>+wg
mi>,geg :MATH]
1
where
[MATH: Sg
:MATH]
represents the set of TFs,
[MATH:
eTFi :MATH]
and
[MATH: eg
:MATH]
are the expression level of
[MATH:
TFi
:MATH]
and gene
[MATH: g :MATH]
respectively,
[MATH:
wg,
TFi
:MATH]
and
[MATH:
wg,g :MATH]
are the weights of
[MATH:
TFi
:MATH]
and the target gene itself.
Cells are separated into training and test sets to evaluate the LASSO
model. The regression results on the test set shows a high correlation
between the predicted and labeled velocity, as demonstrated in Fig.
S[71]1b,c, which indicates that velocity of the target gene can be
predicted based on the expressions of its TFs. In addition, as shown in
Table [72]S1, the TFs with non-zero weights are significantly enriched
in TFs set functionally linked to the target gene (according to the
ENCODE TF-target database^[73]38) with p value of 6.66e-06 (one-sided
t-test), further suggesting that velocity could be modelled by TFs.
Secondly, due to regulatory relationship between TF and target, the
expression levels of them change asynchronously, which could result in
a clockwise curve on the phase portraits. And as expected, we find the
desired clockwise curve on the co-expression plots between some
TF-target pairs^[74]35 (Fig. S[75]2), which is similar to the
theoretical prediction of existing velocity models on unspliced-spliced
space. These findings suggest that it is possible to construct a novel
RNA velocity model based on the expression of TFs, instead of using
unspliced/spliced RNA counts or any additional experimental omics data
and labels.
Estimate RNA velocity with TFvelo
Here we report TFvelo to estimate the RNA velocity of a target gene
using the abundance of the target gene itself and its potential TFs.
Figure [76]1a provides an example of phase portrait fitting on a gene
from pancreas dataset, to explain how TFvelo outperforms splicing-based
methods. Previous methods focus on modeling the dynamics between the
spliced and unspliced RNAs, while transcriptional dynamics of RNA
splicing may not provide sufficient signal^[77]25, and the 2D
unspliced-spliced RNA level could be too noisy or sparse to fit. By
comparison, TFvelo considers the relationship between the target gene
and multiple TFs, by learning a representation of regulation, allowing
for more robust and accurate modeling based on gene regulatory network.
Fig. 1. The overview of TFvelo.
[78]Fig. 1
[79]Open in a new tab
a The comparison among constructed dynamics on LITAF gene in Pancreas
dataset by TFvelo and previous approaches. Cells are colored in the
same way as those in Fig. [80]2a. b The workflow of TFvelo. c–e The
phase portrait plot on WX-y space, the plot of WX with respect to
pseudotime, and the plot of y with respect to pseudotime. The color bar
indicating the RNA velocity is share by the three panels. For
simplification,
[MATH:
Wg,Xg,yg :MATH]
and
[MATH: tg
:MATH]
, are written as
[MATH: W,X,y
:MATH]
and
[MATH: t :MATH]
, respectively. f–h The steps within each iteration of the generalized
EM algorithm. Here the purple line represents the current prediction by
the model. f Assignment of latent time to each cell (in green) to find
the corresponding target point (shown in red), which is the nearest
point to the cell on the purple line. g Optimization of the weights
[MATH: W :MATH]
to move all cells along the
[MATH: WX :MATH]
axis, minimizing the mean loss over all cells. h Optimization of the
parameters in the dynamical function, to move the target point on the
curve closer to each cell. i–k Visualization and analysis of TFvelo’s
results, including phase portrait fitting pseudo-time inference, and
cell trajectory prediction. l–o Simulation on synthetic dataset. l An
example of the generated synthetic dynamics. m Reconstructed dynamics
by TFvelo of the sample shown in (l) with MSE loss. n The joint plot of
ground truth velocity and inferred velocity with their spearman
correlation. The red reference line refers to that the ground truth is
equal to the inferred value. o The joint plot of ground truth weights
and inferred weights with their spearman correlation. Source data are
provided as a Source Data file.
For a target gene
[MATH: g :MATH]
, TFvelo explores the relationships between its expression level
[MATH: yg
:MATH]
and the regulation of TFs on
[MATH: g :MATH]
. In TFvelo, the RNA velocity
[MATH:
dygtdt<
/mi> :MATH]
, which is defined as the time derivative of RNA abundance, is
determined by the abundance of involved TFs
[MATH: Xg :MATH]
and itself,
[MATH:
dygtdt<
/mi>=hXgt;Ψg :MATH]
2
Using a top-down strategy, which can relax the gene dynamics to more
flexible profiles^[81]13, TFvelo directly designs a profile function of
target gene’s expression level,
[MATH:
ygt=ft;Φg :MATH]
3
where
[MATH: Ψg
:MATH]
and
[MATH: Φg
:MATH]
are two sets of gene-specific time-invariant parameters, which describe
the influence of TFs on the target gene, and the shape of phase
portraits, respectively. Considering that linear models have been
employed to represent the gene regulatory in previous
studies^[82]34,[83]39–[84]41,
[MATH: h(Xgt,y
mi>gt;Ψ
mi>g) :MATH]
is implemented with a linear model
[MATH:
dygtdt<
/mi>=WgXgt−γ
mi>gygt :MATH]
. The profile function of
[MATH:
ygt :MATH]
can be chosen flexibly from a series of second-order differentiable
functions^[85]13, which is designed as a
function in implementation (Methods). This is because sine functions
can model the nonmonotonic dynamics start from both up-regulation and
down-regulation (e.g., gene MGST3 in Fig. S[86]8), which is not
feasible for those methods relying on one switching time
point^[87]9,[88]10,[89]18.
For optimization, all learnable parameters are divided into three
groups, which are cell-specific latent time
[MATH: t :MATH]
, TFs’ weights
[MATH: Wg :MATH]
and the shape parameters of phase portraits
[MATH: ψg :MATH]
, in which
[MATH:
ncell<
/mi> :MATH]
and
[MATH:
nTF
:MATH]
represent number of cells and number of genes, respectively. A
generalized EM^[90]42,[91]43 algorithm is adopted to minimize the loss
function in each iteration, by updating the three groups of parameters
alternately (Fig. [92]1f–h). Based on the optimized models on all
genes, we can obtain a transition matrix and calculate the pseudotime
and velocity stream for cell trajectory analysis. Please find the
details in Methods.
TFvelo can reconstruct the dynamic model and detect the regulation
relationship on synthetic datasets
To validate that TFvelo can detect the underlying dynamic from data, we
firstly test it on a synthetic dataset (see Methods), where a target
gene is regulated by 10 TFs, and the corresponding weights are either
positive or negative, representing that TF may up-regulate or
down-regulate the target gene, respectively. We randomly generate 200
synthetic gene dynamics on 1000 cells with different ground truth
parameters. For each of them, the TFvelo reconstructs the dynamic
function and estimates the weight of each TF, based on the simulated
expressions of TFs and target.
The performance of TFvelo was evaluated using the spearman correlation
between the ground truth and the reconstructed values. Under 20
iterations for optimization, the spearman correlation coefficient
between the ground truth weights and inferred weights of TFs is 0.823,
and that for velocity estimation is 0.894 (Fig. [93]1n, o). The high
consistence shows that TFvelo can effectively reconstruct the
underlying dynamics. In addition, the high F1 score (0.944) and high
ratio of area under the receiver operating characteristic (ROC) curve
(0.962) also demonstrate that TFvelo can correctly recognize whether
the weight is positive or negative. In addition, we employ a vanilla EM
approach as baseline here, by removing the optimization step on TF
weight, as well as the strategy of multiple points initialization.
After 20 optimization iterations, the performance of this baseline is
much poorer than that of TFvelo (Fig. S[94]4). Please see Supplementary
Information Section 3 for details on the synthetic dataset.
TFvelo can model cell differentiation process on Pancreas dataset
To evaluate TFvelo on real scRNA-seq data, we first apply it to the
dataset of E15.5 mouse pancreas^[95]9, which has been widely adopted in
previous RNA velocity studies. The pseudo-time and trajectory predicted
by TFvelo can identify the differentiation process (Fig. [96]2a, b).
Compared with previous methods, TFvelo shows advantage on phase
portrait fitting. Example genes are shown in Figs. [97]2c, S[98]7.
Figure [99]2c compares the phase portrait fitting of different methods,
and Fig. S[100]7 provides the gene expression dynamics resolved along
gene-specific latent time obtained by scVelo and TFvelo. The expression
of H19, which plays an important role in early development^[101]44, is
found to increase at the beginning and then decrease at the stage of
Ngn3 high EP by TFvelo. For MAML3, only TFvelo can correctly detect the
process that starts with up-regulation then turns to down-regulation.
Other methods can not correctly detect the order between pre-endocrine
cells (in green) and those endocrine populations, including alpha,
beta, delta and epsilon cells (in blue, light blue, purple and pink).
Fig. 2. Results of TFvelo on pancreas dataset.
[102]Fig. 2
[103]Open in a new tab
a Pseudo-time inferred by TFvelo in the UMAP-based embedding space. b
The stream plot projected to UMAP. The Ngn3 low/high EP denotes the
neurogenin3 low/high epithelial cells. c Two example genes for
illustrating the dynamics fitting in phase portrait. The cells are
colored in the same way as panel (b). d Comparison on the intra class
distance in phase portrait. Two-sided t-test is applied without
adjustment. e Comparison on the inter class distance in phase portrait.
Two-sided t-test is applied without adjustment. f Comparison on the
inter-cluster coherence. g The cosine similarity between the velocity
vectors obtained by scVelo and TFvelo. For the boxplots in panels (d,
e, f, h) the down, central and up hinges correspond to the first
quartile, median value and third quartile, respectively. The whiskers
extend to 1.5× the interquartile range of the distribution from the
hinge. The number of samples is 405 for boxplots in panel d, e and f.
The number of samples for each boxplot (from top to down) in panel g is
916, 262, 642, 592, 591, 481, 70 and 142, respectively. Source data are
provided as a Source Data file.
To quantitatively evaluate the phase portrait fitting, we propose three
metrics, including: (1) The intra-class distance on phase portrait,
which reflects how cells within the same type gather on the phase
portrait (the lower the better). (2) The inter-class distance on phase
portrait, which reflects how cells from different types are separated
well on the phase portrait (the higher the better). (3) The fitting
error on phase portrait, which measures the normalized distance between
each cell to the constructed model on the phase portrait. We calculate
these metrics based on the genes that can be commonly modeled by all
methods. For the “intra-class distance” and “inter class distance”, we
take the Paired Samples T Test between the value obtained on the same
gene by TFvelo’s phase portrait and the un/spliced phase portrait,
which is shared by all baseline approaches, including scVelo, Dynamo,
UniTVelo and cellDancer. The TFvelo results (Fig. [104]2d, e) show
advantage with extreme high significance (p = 4.23e-51 and
p = 5.65e-117). Compared with those un/spliced based approaches, TFvelo
can achieve lower intra-class distance, higher inter-class distance and
lower fitting error (Fig. S[105]5h). To draw a conclusion, TFvelo can
address the current issue caused by noisy un/spliced data for RNA
velocity modeling, by using the learned feature with transcription
regulation.
To quantitatively evaluate the learned streams, we employ two metrics
adopted in VeloAE^[106]11 and UniTVelo^[107]13, which are “Cross
Boundary Direction Correctness (CBDir)” and “In Cluster Coherence
(ICCoh)”. CBDir assesses the correctness of transitions from one cell
type to the next, utilizing boundary cells with ground truth
annotations. ICCoh is computed using cosine similarity among cells
within the same cluster, measuring the smoothness of velocity in
high-dimensional space within clusters. As shown in Fig. [108]2f,
S[109]5i, TFvelo achieves similar median value compared to UniTVelo,
and significantly outperforms the other methods. Also TFvelo can even
achieve a high velocity consistency (Fig. S[110]5j), which indicates a
higher consistency of RNA velocities across neighboring cells in the
UMAP space^[111]10. The direct reason for the higher consistency scores
achieved by TFvelo and UniTVelo is that they can generate smoother
velocity streams across neighboring cells (Fig. S[112]5b, d). This
might be potentially attributed to the similarity in the models of
TFvelo and UniTVelo. Both methods directly model gene abundance as
high-order differentiable functions with respect to time, without
assuming a switching point that could disrupt the smoothness of the
model. In addition, to show how the velocity inferred by TFvelo and
scVelo correlated with each other, we calculate the cosine similarity
of velocity vectors obtained through both methods, using the commonly
modeled genes. The distribution of cosine similarity on each cell type
is shown in Fig. [113]2g.
Then we conduct KEGG pathways enrichment analysis based on the best
fitted genes, and observe a strong enrichment associated with insulin
secretion and the glucagon signaling pathway (Fig. [114]3a).
Additionally, we find that REST and HMGN3 consistently exhibit high
absolute weights on modeling most target genes (Fig. [115]3b). When
examining the UMAP distribution, it is clear that REST expression
decreases during differentiation, while HMGN3 expression increases
(Fig. [116]3c). Previous research has established REST as a key
negative regulator of endocrine differentiation during pancreas
organogenesis^[117]45,[118]46. Earlier studies have also report HMGN3
to be a regulator for insulin secretion in pancreatic cells^[119]46. We
further analyze the weights of these two TFs on modeling genes within
the insulin secretion pathway. REST consistently has a negative weight,
while HMGN3 consistently exhibits a positive weight (Fig. [120]3d),
which is consistent with the previous findings.
Fig. 3. Biological analysis based on the results of TFvelo on pancreas
dataset.
[121]Fig. 3
[122]Open in a new tab
a The KEGG pathway enrichment from the genes best fitted by TFvelo on
pancreas dataset. The adjusted p-value is computed using the Fisher
exact test. b List of TFs having high weights on multiple target genes.
The counts mean the number of target genes where the TF have a
normalized weight whose absolute value is larger than 0.5. c The
distribution of REST and HMGN3 on UMAP. d The weights distribution of
TFs REST and HMGN3 on modeling in the genes that in the insulin
secretion pathway. The down, central and up hinges correspond to the
first quartile, median value and third quartile, respectively. The
whiskers extend to 1.5× the interquartile range of the distribution
from the hinge. The numbers of samples for the boxplots corresponding
to REST and HMGN3 are 7 and 5, respectively. e Phase analysis on two
example genes. On each row, from the left to the right: the phase
portrait fitting, the distribution of abundance on UMAP, the value of
learned TFs representation (WX) and the target gene along with
pseudotime, and the abundance of a TF and the target gene along with
pseudotime. f In simulation, the visualization of two variables where
the time delay between them is very short or relatively long. The
colorbar reflects pseudotime. g Comparison of the phase portrait
fitting and expression level along pseudotime obtained by scVelo and
TFvelo on two example genes. Source data are provided as a Source Data
file.
The comparison of phase between TFs and target are shown in
Fig. [123]3e, which shows clear phase delay between the learned
representation of TFs (
[MATH: WX :MATH]
) and target gene’s expression. We also analyze the phase delay between
a TF and the individual target. Among the two genes, HMGN3 has a
positive weight on modeling SURF, where a phase delay between them can
be observed. On the contrary, REST has a negative one on modeling ECE1,
where their changing shows negative correlation.
In addition to the sparsity and noise in unspliced data (Fig. S[124]6),
the time scale of splicing dynamics might also make it hard to extract
dynamics from the delay between unspliced and spliced RNA. It has been
reported that splicing can be accomplished within 30 s^[125]47, a
duration much shorter than the timescale of the entire differentiation
process described in scRNA-seq datasets. Consequently, the phase delay
between the unspliced and spliced mRNA might be too brief to be
captured in the phase portrait. This might be one reason why the
theoretical curve cannot be observed in the unspliced-spliced phase
portrait (Fig. [126]3f, g). Figure [127]3f provide simulations to
illustrate that, even with the same level of noise, a shorter phase
delay between variables could make fitting the phase portrait more
challenging. We also observed phenomena from the scRNA-seq dataset
supporting our hypothesis. Figure [128]3g shows that the spliced and
unspliced changes almost synchronously on some genes, while TFvelo
could detect a clearer phase delay for dynamic modeling.
TFvelo can model cell differentiation process on gastrulation erythroid
dataset
The Gastrulation Erythroid dataset is selected from the transcriptional
profile for mouse embryos^[129]48, and has be adopted in RNA velocity
studies^[130]13. Using TFvelo, the pseudo-time and cell development
flow visualized by streamlines are consistent with the development in
the erythroid lineage, from blood progenitors to erythroid cells
(Figs. [131]4a, b). By comparison, several of the previous approaches
cannot capture such developmental dynamics, as shown in Fig. [132]4d.
Fig. 4. Results of TFvelo on gastrulation erythroid dataset.
[133]Fig. 4
[134]Open in a new tab
a Pseudo-time inferenced by TFvelo projected into a UMAP-based
embedding. b The stream plot obtained by TFvelo. c The sparsity for
unspliced, spliced and total mRNA counts. The sparsity of a gene is
defined as:
[MATH: Sparsity=ThenumberofcellsthatthecountofthisgeneiszeroThetotalnumberofcells :MATH]
. d The stream plot obtained by baseline approaches. Each plot is drawn
according to the method’s tutorial. e The phase portrait fitting on
genes with sparse unspliced counts. f The phase portrait fitting on
TACC1 and HSP90AB1 with sparse unspliced counts. g The GO term
biological process enrichment based on the best fitted genes. The
adjusted p-value is computed using the Fisher exact test. h The number
of target genes where the absolute value of the weight larger than 0.5
for each TF. Source data are provided as a Source Data file.
The sparsity in data is a common challenge for scRNA-seq studies.
Considering that only about 20% of reads contained unspliced intronic
sequences^[135]9, the sparse unspliced abundance is a large obstacle
for dynamics fitting in phase portrait. Our quantitative analysis of
the sparsity in unspliced, spliced and total mRNA counts is shown in
Fig. [136]4c, which verifies the high sparsity in unspliced counts.
Figure [137]4e shows the comparison on two genes with very sparse
unspliced counts to illustrate the advantage of TFvelo for addressing
the issue of sparsity. Although these genes can pass the filtering and
selection during preprocessing, they are still too sparse to provide
sufficient information for fitting the spiced-based models well.
Figure [138]4f provides more example genes of the comparison between
TFvelo and baselines. TACC1 is annotated as a gene involved in
promoting cell division prior to the formation of differentiated
tissues^[139]49. In our experiment, the expression of TACC1 initially
increases, reaching its highest value in blood progenitors 2, which is
just before differentiation into erythroid cells. Subsequently, TACC1’s
expression decreases with pseudo-time from erythroid cells 1 to
erythroid cells 3. By comparison, previous approaches fail to detect a
dynamic starting from blood progenitors 1 (in red). As for HSP90AB1,
only TFvelo can capture the correct dynamics from the learned phase
portrait.
We next take the GO term enrichment analysis based on the best fitted
genes, and find that the most significant GO terms include
porphyrin-containing compound biosynthetic process (GO:0006779) and
heme biosynthetic process (GO:0006783), which are directly related to
erythroid development (Fig. [140]4g). In addition, the TFs having high
weights on most target genes are GATA1, GATA2 and LMO2 (Fig. [141]4h),
all of which has been reported to be involved in erythroid
differentiation. GATA1 plays a significant role in regulating the
transcriptional aspects of erythroid maturation and function^[142]50.
GATA2 is verified to regulates hematopoietic stem cell proliferation
and differentiation^[143]51. LMO2 is a bridging molecule assembling an
erythroid, DNA-binding complex^[144]52. These findings demonstrate the
potential of TFvelo in detecting important regulatory genes from single
cell data.
TFvelo can achieve competitive performance compared to Multivelo using only
scRNA-seq data
Next we apply TFvelo to a 10x multi-omics embryonic mouse brain
dataset, where both Assay for Transposase-Accessible Chromatin with
sequencing (ATAC-Seq)^[145]53 and scRNA-seq are available. This dataset
was employed in Multivelo study, which is an RNA velocity model
designed for multi-omics datasets, capturing dynamics between chromatin
accessibility and unspliced mRNA^[146]12. On this dataset, as described
in the Multivelo study, Radial glia (RG) cells in the outer
subventricular zone can generate neurons, astrocytes, and
oligodendrocytes. The development of cortical layers follows an
inside-out pattern during neuronal migration, with younger cells
ascending to upper layers while older cells remain in deeper layers. RG
cells can divide, producing intermediate progenitor cells (IPC) that
act as neural stem cells and give rise to various mature excitatory
neurons in cortical layers.
The pseudotime inferred by TFvelo closely resemble Multivelo’s results
(Fig. [147]5a, b), and both methods correctly identify the
differentiation direction in most cells. Notably, TFvelo achieves this
performance without using ATAC-seq data. Figure [148]5c shows the phase
portrait fitting of both methods, where Multivelo could show an
additional c-u phase portrait, reflecting the joint plot between
ATAC-seq and unspliced mRNA.
Fig. 5. Results of TFvelo on 10x multi-omics embryonic mouse brain dataset.
[149]Fig. 5
[150]Open in a new tab
a Pseudo-time and trajectory inferenced by Multivelo projected into a
UMAP-based embedding. b Pseudo-time and trajectory inferenced by TFvelo
projected into a UMAP-based embedding. c The comparison between
Multivelo and TFvelo on three example genes, which are AHI1, NTRK2 and
GRIN2B, respectively. For Multivelo plot, (c) means the chromatin
accessibility in ATAC-seq.
Regarding the AHI1 gene, Multivelo mistakenly constructs a cyclic
dynamic, in contrast, TFvelo correctly captures the dynamic, with
expression increasing consistently from V-SVZ cells (in green). For the
NTRK2 gene, TFvelo captures such dynamics correctly, while Multivelo
misses the decreasing process at the beginning. On the GRIN2B gene,
both TFvelo and Multivelo detect the same dynamics, with expression
increasing from the beginning to the terminal stage. These results
suggest that by extracting features from multiple TFs, TFvelo can
achieve a better fit for the dynamics of individual genes than
Multivelo, without using ATAC-seq.
Although ATAC-seq could provide additional information for RNA dynamics
modeling, it could be challenging due to multiple issues. For instance,
the high sparse and near binary ATAC-seq data^[151]54 could lead to the
mixing of cells from different types, posing a challenge for phase
portrait fitting. As shown in the first column at Fig. [152]5c, even
the ATAC/unspliced/spliced 3D phase portrait can not provide enough
information for dynamic modeling. Specifically, the c-axis (ATAC) could
not help separate cells of different types (second column in
Fig. [153]5c). Additionally, peaks obtained by ATAC are more
challenging to directly link to a specific gene. By comparison, TFvelo
can directly link multiple TFs into modeling a target gene. As a
result, TFvelo could extract dynamics that align with the
differentiation process well in TFs-target phase portraits (fourth
column in Fig. [154]5c).
TFvelo can predict the cell fate on dataset without splicing information
Next, TFvelo was applied to another single-cell RNA-seq dataset
comprising 1,529 cells obtained from 88 human preimplantation embryos
ranging from embryonic day 3 to 7^[155]55. Since the raw sequencing
file is not provided on the online dataset server (see Data
Availability Statement), those RNA velocity methods relying on the
spliced/unspliced information are not feasible. By contrast, TFvelo
still successfully identifies the pseudo-time and developmental
trajectory of the cells, as depicted in Fig. [156]6a, b. Additionally,
dynamic models successfully fit the expression patterns of target
genes, which are illustrated in Fig. [157]6d as examples. In details,
the dynamics predicted on PPP1R14A reveals a clockwise curve on
TFs-target plot, and finds the changing point between the increasing
and decreasing of expression level at day 5 post-fertilization.
Consistent with the TFvelo results, it has been reported that the
expression of PPP1R14A can be detected with high dynamics that it
increases at the beginning and declines throughout the development in
the early stage of zebrafish embryos^[158]56. TFvelo finds that the
expression of RGS2 is the highest at the beginning, and decreases with
the pseudo-time, which can be supported by the early findings that RGS2
plays a critical role in early embryo development^[159]57.
Fig. 6. Results of TFvelo on dataset without splicing information.
[160]Fig. 6
[161]Open in a new tab
a Pseudo-time inferenced by TFvelo projected into a UMAP-based
embedding. b The streamplot of UMAP. c The heatmap which shows the
dynamics of gene expression resolved along pseudo-time in the top 300
likelihood-ranked genes. d The constructed dynamic models on three
example genes with high likelihoods, which are PPP1R14A, RGS2 and
GSTP1, respectively. From the left to the right within each row, the
three subfigures show phase portrait fitting, distribution of velocity
on UMAP, distribution of expression on UMAP, and the dynamics of
expression resolved along with pseudo-time, respectively.
Discussion
The recent development in RNA velocity approaches provide new insights
for temporal analysis of single cell data. However, the existing
methods often fails to fit the cell dynamics well in the
unspliced/spliced space, which is partially because of the insufficient
signal and high sparsity and noise. Motivated by the insight that the
phase delay between TFs and target can also provide temporal
information, we propose TFvelo to estimate the gene dynamics based on
TFs abundance. Different from previous approaches which require the
splicing information from the data, TFvelo models the dynamics of a
target gene with the expression level of TFs and the target gene
itself. In the computation framework, a generalized EM algorithm is
adopted, so that the latent-time, the weight of TFs and those learnable
parameters in the dynamic function can be updated alternately in each
iteration.
Experiments on a synthetic dataset validate that TFvelo could detect
the underlying dynamics from the data. Results on various scRNA-seq
datasets and a 10× multi-omics dataset further demonstrate that TFvelo
can fit the gene dynamics well, showing a desired clockwise curve on
the phase portrait between the TFs representation and the target.
Furthermore, TFvelo can be used to infer the pseudo time, cell
trajectory and detect key TFs which play important role in the
differentiation. In addition, compared with the previous RNA velocity
model relying on splicing information, TFvelo can achieve better
performance on those scRNA-seq datasets, and be flexibly applied to
more categories of datasets without splicing information.
Due to the requirement of learning a representation of TFs, the time
efficiency of TFvelo is lower than baseline approaches in the same
framework, e.g., scVelo. This could be a weakness when being applied to
the large-scale datasets. Performing TFvelo with more CPUs in parallel
could reduce the running time. Meanwhile, TFvelo still cannot fit the
dynamics of some genes well (Fig. S[162]14). Improving the modeling for
more genes remains a challenging task that requires further
exploration. In the future, more complex models may be explored to
replace the current linear model in TFvelo for estimating the weight of
each TFs. The employed sine function for describing expression dynamics
could also be replaced by other functional forms flexibly. In addition,
those recently proposed ideas for improving unspliced/spliced-based RNA
velocity models can also be adopted in the further development of
TFvelo, including stochastics modeling with Bayesian inference^[163]14,
multi-omics integration^[164]22 and universe time inference^[165]13.
Methods
Data preprocessing
We utilize the data preparation procedure with the default setting in
scVelo^[166]10, where the difference is that TFvelo is applied to the
total mRNA counts, which is the sum of unspliced and spliced counts.
Additionally, after filtering out those genes that can be detected from
fewer than 2% of cells, and selecting the top 2000 highly variable
genes (HVGs), we normalize the expression profile of these HVGs by the
total count in each cell. A nearest-neighbor graph (with 30 neighbors
by default) was calculated based on Euclidean distances in principal
component analysis space (with 30 principal components by default) on
log-transformed spliced counts. After that, first-order and
second-order moments (means and uncentered variances) are computed for
each cell across its nearest neighbors. These steps are implemented in
the same way as scv.pp.filter_and_normalize() and scv.pp.moments().
Additionally, to find potential transcriptional relationship within the
selected 2000 highly variable genes, we refer to the TF-target pairs
annotated in ENCODE TF-target dataset^[167]38 and ChEA TF-target
dataset^[168]58. If the regulatory relationship between a TF-target
pair is labeled in at least one of these two datasets, the TF will be
included in the involved TFs set when modeling the dynamic of the
target gene.
The computational framework of TFvelo
Given a gene
[MATH: g :MATH]
, TFvelo assumes that the RNA velocity
[MATH:
dygtdt<
/mi> :MATH]
is determined by the expression level of involved TFs
[MATH: Xg :MATH]
and the target gene itself
[MATH: yg
:MATH]
, which is
[MATH:
dygtdt<
/mi>=h(Xgt,y
mi>gt;Ψ
mi>g) :MATH]
. Using a top-down strategy^[169]13, TFvelo directly designs a profile
function of target gene’s expression level to relax the gene dynamics
to more flexible profiles, that is
[MATH:
ygt=f(t;Φg) :MATH]
.
[MATH: Ψg
:MATH]
and
[MATH: Φg
:MATH]
are two sets of gene-specific, time-invariant parameters, which control
the influence of TFs on the target gene, and the shape of target gene’s
phase portraits, respectively. Our findings and early studies^[170]39
have shown that it is feasible to infer the regulation relationship
according to the learned parameter in linear regression models based on
expression data. Additionally, linear models always have high
interpretability, and rarely suffer from over-fitting. As a result, to
show the capability of modeling RNA velocity based on gene regulatory
relationship, we simply adopt a linear model as
[MATH: h(Xgt,y
mi>gt;Ψ
mi>g) :MATH]
, which can map the expression levels of multiple TFs
[MATH: Xg∈RnTF<
/mi> :MATH]
from
[MATH:
nTF
:MATH]
-dimensional space to 1-dimensional representation
[MATH: WgXg :MATH]
with a weight vector
[MATH: Wg :MATH]
, where
[MATH:
nTF
:MATH]
represents the number of involved TFs. As a result, the dynamic
equation between TFs and the target gene
[MATH: g :MATH]
can be written as
[MATH:
dygtdt<
/mi>=hXgt,y
mi>gt;Ψg=WgXgt−γ
mi>gygt, :MATH]
4
where
[MATH: t∈[0, 1) :MATH]
is the latent time assigned to each cell,
[MATH: γg
:MATH]
is the degradation rate.
Besides, The profile function of
[MATH:
ygt :MATH]
can be chosen flexibly from those second-order differentiable
functions^[171]13. For simplification,
[MATH:
ygt :MATH]
is designed with a sine function in our implementation, which is
high-order differentiable and suitable for capture the curves on phase
portrait. In addition, different with those models which assume the
gene expression should follow a pattern of initial improvement followed
by a decrease (or part of this whole process), TFvelo is more flexible
and can also model gene dynamics where expression initially decreases
and later increases with the sine function.
[MATH:
ygt=ft;Φg=α
gsinωgt+θg+βg, :MATH]
5
where
[MATH:
αg,βg,θg :MATH]
are gene-specific parameters to be learned.
[MATH: ωg
:MATH]
is fixed as
[MATH: 2π :MATH]
, so that the
[MATH:
ygt :MATH]
is unimodal within
[MATH: t∈[0, 1) :MATH]
, which follows the common assumption shared by most current
approaches. By employing the
[MATH: sin :MATH]
function, TFvelo can model the dynamics of each gene without defining a
switching time point between the of increasing and decreasing
periods^[172]9,[173]10,[174]14,[175]18. To optimize the parameters in
TFvelo, a generalized EM^[176]42,[177]43 algorithm is adopted, as
discussed in the following.
Optimization of generalized EM algorithm
To simplify the mathematical representation, here we denote the
[MATH: Wg,Xg,yg,α
g,β
mi>g,θg,γ<
/mrow>g :MATH]
[MATH: and :MATH]
[MATH: ωg
:MATH]
as
[MATH: W,X,y,α,<
/mo>β,θ,γ :MATH]
and
[MATH: ω :MATH]
, respectively. As a result, the Eqs. ([178]5, [179]4) can be written
as
[MATH: yt=αsin
mi>2πt+θ+β, :MATH]
6
[MATH: dytdt<
/mi>=WX(t)<
/mo>−γy(t). :MATH]
7
From the dynamics model in Eqs. ([180]6, [181]7), we can derive that,
[MATH: WXt=α4π2+γ2
sin2πt+θ+<
/mo>ϕ+βγ,
ϕ=arctan2πγ
, :MATH]
8
For modeling a gene, suppose observations
[MATH: Xcobs :MATH]
and
[MATH:
yco
bs :MATH]
be the normalized expression levels of TFs and the target gene within a
cell
[MATH: c :MATH]
,
[MATH:
sco
bs :MATH]
be the state space representation of the cell
[MATH: c :MATH]
, and
[MATH: s^(tc
) :MATH]
be the modeled state at time
[MATH: tc
:MATH]
, we have
[MATH:
sco
bs=[WXcobs,ycobs
mi>] :MATH]
and
[MATH: s^tc=WXtc,ytc. :MATH]
Aiming at finding the phase trajectory
[MATH: s^(t) :MATH]
that best fits the observations for all cells, we define the loss
function as:
[MATH:
L=∑c<
mrow>||sc
obs−sˆtc||2
=∑c(WXcobs−WXtc)2+<
msup>(yc
obs−ytc)2
, :MATH]
9
The residuals of the observations are defined as
[MATH:
ec=sign(ycob<
mi>s−ytc)∣∣sc
obs−<
/mo>s^tc∣∣2 :MATH]
, which is assumed to follow a normal distribution
[MATH:
ec~N(0,σ2) :MATH]
, where the gene-specific
[MATH: σ :MATH]
is constant across cells, and the observations are independent and
identically distributed. Then we can arrive at the likelihood function,
[MATH: LW,α,β,<
/mo>θ,γ=12πσ*exp−12
σ2∑c∣∣sc
obs
(W)−
s^tc
mrow>W,α,β,<
/mo>θ,γ∣∣2
, :MATH]
10
Where
[MATH: s^tc
mrow>W,α,β,<
/mo>θ,γ :MATH]
represents
[MATH: s^tc;W,α,β,<
/mo>θ,γ :MATH]
. Finally, the optimization algorithm aims to minimize the negative
log-likelihood function:
[MATH: l(W,α,β,<
/mo>θ,γ)=−log12
mn>πσ<
mo>+12
σ2∑c∣∣sc
obs
(W)−
s^tc
mrow>W,α,β,<
/mo>θ,γ∣∣2,
mo> :MATH]
11
In standard EM algorithms, the E-step computes the expected value of
likelihood function given the observed data and the current estimated
parameters to update the latent variables, and the M-step consists of
maximizing that expectation computed in E step by optimizing the
parameters. While under some circumstances, there is an intractable
problem in M step, which could be addressed by generalized EM
algorithms. Expectation conditional maximization is one form of
generalized EM algorithms, which makes multiple constraint
optimizations within each M step^[182]42. In detail, the parameters can
be separated into more subsets, and the M step consists of multiple
steps as well, each of which optimizes one subset of parameters with
the reminder fixed^[183]43.
In TFvelo, all learnable parameters can be divided into three groups,
which are cell-specific latent time
[MATH:
t∈R
ncell :MATH]
, weight of each TF
[MATH: W∈R<
msub>nTF :MATH]
and scalars [
[MATH:
α,β,θ,γ
:MATH]
] respectively, where
[MATH:
ncell<
/mi> :MATH]
and
[MATH:
nTF
:MATH]
represent the number of cells and the number of involved TFs.
Generalized EM algorithm is adopted so that the three group of
parameters will be updated alternately to minimize the loss function in
each iteration:
1. Assign
[MATH: t :MATH]
within [0, 1] for each cell by grid search. For each cell, on the
phase plot, the closest point located on the curve described by the
dynamic model is defined as the target point. The step (b) and (c)
aim to minimize the average distance between cells and their
corresponding target points.
2. Update
[MATH: W :MATH]
by linear regression with bounds on these weights to minimize the
loss function. Trust region reflective algorithm^[184]59, which is
an interior-point-like method is adopted to solve this linear
least-squares problem with inequality constraint. In our
implementation, the weights are bounded within (−20, 20) by
default.
3. Update [
[MATH:
α,β,θ,γ
mi> :MATH]
] with constraints of
[MATH: α∈ :MATH]
(0, ∞),
[MATH: θ∈ :MATH]
(−π, π) and
[MATH: γ∈ :MATH]
(0, ∞) to minimize the loss function.
By iteratively running the three steps, parameters of TFvelo for each
target gene can be optimized. Furthermore, multiple downstream analysis
could be conducted by combining the learned dynamics of all target
genes.
Initialization of the generalized EM algorithm
Since EM algorithms could be trapped in local minimum, parameters are
initialized in the following steps. Firstly, those cells in which
expression of target gene is 0 are filtered out. Then
[MATH: X,y :MATH]
are normalized by
[MATH: X=XvarX,
y=y/var(y) :MATH]
, where
[MATH: var(⋅) :MATH]
represents the standard deviation function. The weights
[MATH: W :MATH]
is initialized with the value of spearman correlation of each TF-target
pair. For parameters
[MATH: α,β :MATH]
and
[MATH: γ :MATH]
, according to the steady-state assumption that
[MATH: y°∣y=max(y)=<
/mo>0 :MATH]
,
[MATH: γ :MATH]
can be initialized by
[MATH: γ=WXy∣y→ymax :MATH]
. At the points of maximal and minimal
[MATH: y :MATH]
value, we have
[MATH: ymax=α
+β :MATH]
and
[MATH: ymin=
−α+β
:MATH]
. As a result,
[MATH: α :MATH]
and
[MATH: β :MATH]
can be initialized by
[MATH:
α=ymax−ymin2 :MATH]
and
[MATH: β=ymax+
ymin2 :MATH]
, respectively. These assumptions are only used for initialization and
in the following optimization process, all parameters will be updated
iteratively. The hyper-parameter
[MATH: ω :MATH]
is set as
[MATH: 2π :MATH]
and parameter
[MATH: θ :MATH]
isintialized with various values, so that the generalized EM algorithm
will run at different start points. After that, the model with the
lowest loss is finally selected.
Inference of Pseudo-time
After constructing the dynamic model for each gene, TFvelo will infer a
gene-agnostic pseudo-time for each cell based on the combination of RNA
velocities of all genes. For this purpose, we first compute the cosine
similarities between velocities and potential cell state transitions to
get the transition matrix. Then the end points and root cells are
obtained as stationary states of the velocity-inferred transition
matrix and its transposed, respectively. After inferring the root
cells’ location on the embedding space, velocity pseudo-time, which
measures the average number of steps it takes to reach a cell after
start walking from the root cell, can be computed. The strategy of
calculating pseudo-time is the similar with that in scVelo^[185]10. The
different is that in scVelo, the transition matrix is obtained based on
the abundance and velocity of spliced mRNA, while in TFvelo that is
based on the abundance and velocity of total mRNA. Details are provided
in the Supplementary Information section 6 titled “The root and end
cells detection”.
Velocity streams on 2D embedding space
To illustrate cell transitions, we create stream plots in a 2D space
using UMAP visualization as the default. To this end, we select genes
whose loss is low and gene-specific time aligns with the pseudotime to
construct velocity streams.
For the single cell datasets, we first select genes with modeling error
in the bottom 50%. Since sine functions exhibit periodicity, we need to
determine the order of cells instead of using latent time directly. We
achieve this by analyzing the histogram of latent time. If there are
several consecutive blank bins in the cell distribution of latent time,
we set the minimum value of the next non-blank bin as the initial state
(normalized latent time = 0). Conversely, the maximum value of the last
non-blank bin will be set as the final state (normalized latent
time = 1). The normalized latent time of all other cells will be mapped
within the range of 0 to 1. Subsequently, we select genes where the
normalized latent time aligns with the pseudotime based on the Spearman
correlation coefficient between them.
Metrics for evaluating in phase portrait
We propose three metrics for phase portrait fitting.
1. The intra-class distance on phase portrait, which measures the
normalized distance between cells within the same type on the phase
portrait. This reflects whether cells within the same type gather
on the phase portrait, which is the lower the better. On
unspliced/spliced based methods, the intra-class distance is
defined as
[MATH:
Dist=∑
k∑j∈typ
ek<
mfenced close="]" open="[">u
j−uckstdu2+s
j−sckstds2
:MATH]
, where
[MATH:
(uc
mi>k,sck) :MATH]
means the center of cell type k, std(.) means the standard variance
and
[MATH: j :MATH]
means the index of a cell. For TFvelo, the distance can be defined
by substituting
[MATH: u :MATH]
with
[MATH: WX :MATH]
and
[MATH: s :MATH]
with
[MATH: y :MATH]
.
2. The inter-class distance on phase portrait, which measures the
normalized distance between the distribution centers of different
cell types on the phase portrait, which is defined as
[MATH:
Dist=∑
k1≠k2u<
mi>ck1−uc
k2st<
mi>du2+s<
mi>ck1−sc
k2st<
mi>ds2
:MATH]
. This reflects whether cells from different types are separated
well on the phase portrait, which is the higher the better.
3. The fitting error on phase portrait, which measures the normalized
distance between each cell to the constructed model on the phase
portrait. This reflects the fitting accuracy and calculated in the
following way. On the phase portrait of each selected gene,
calculate the normalized mean square error of the distance between
each cell to the inferred model curve. For the TFvelo data,
[MATH:
Err=∑
mrow>j=1NWXj−WXtj
std
WX
mfenced>2+y
j−ytj
std
y2, :MATH]
where
[MATH: j :MATH]
means the index of a cell, t[j] means the latent time modeled for
cell
[MATH: j :MATH]
and N is the total number of cells.
[MATH: WXj :MATH]
and
[MATH: yj
:MATH]
refer to the location of cell on the phase portrait.
[MATH: WX(t
mi>j)
:MATH]
and
[MATH:
y(t
j) :MATH]
refers to the model given by TFvelo. Similarly, the error for
un/spliced data-based methods is defined as
[MATH:
Err=∑
mrow>j=1Nu
j−utj
std
u2+s
j−stj
std
s2
:MATH]
.
Metrics for evaluating the velocity stream
1. Cross boundary direction correctness (CBDir) and within-cluster
velocity coherence (ICVCoh). According to the definition given in
VeloAE^[186]11 and UniTVelo^[187]13, CBDir evaluates the accuracy
of transitions from a source cluster to a target cluster based on
the information provided by boundary cells, which requires the
ground truth annotation. ICVCoh is computed with a cosine
similarity scoring function between cell velocities within the same
cluster. We directly run the functions provided in UniTVelo^[188]13
to obtain the of CBDir and ICVCoh.
2. Velocity consistency. We executed scVelo’s velocity_confidence()
function and interpreted the outcomes as velocity consistency. This
is because that it indeed measures the consistency of velocities
within neighboring cells, as opposed to the statistical definition
of “confidence”.
GO term and KEGG pathway enrichment analysis
To perform the GO term and KEGG pathway enrichment analysis, we
utilized the “gseapy.enrichr()” function in Python package “gseapy”
with using Fisher’s exact test by default. For each dataset, the
background gene list consists of all genes within the original dataset
before preprocessing.
Weights normalization and TFs selection
Considering that the abundance of different TFs could vary a lot, the
quantitative comparison between the weights of TFs is conducted after
normalizing them. For each TF on modeling a target, we multiply the
mean expression level by the learned weights to get the average
influence of the TF to the target gene. For instance, the average
influence of TF
[MATH: i :MATH]
is
[MATH:
wi*x¯i :MATH]
. Then the normalized weights are defined as:
[MATH: w~i=wi*x¯imax
i(∣wi*x¯i∣)
:MATH]
. The TFs with high weights are select according to these normalized
weights with default threshold of
[MATH: 0.5 :MATH]
.
Statistics & reproducibility
Four datasets are utilized in this manuscript, which is a proper number
compared with relevant RNA velocity studies. No statistical method was
used to predetermine sample size. No data were excluded from the
analyses. The experiments were not randomized. The Investigators were
not blinded to allocation during experiments and outcome assessment.
To reproduce the results shown in this paper, please use the demo code
with default parameters at the GitHub repository,
[189]https://github.com/xiaoyeye/TFvelo.
Reporting summary
Further information on research design is available in the [190]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[191]Supplementary Information^ (12MB, pdf)
[192]Peer Review File^ (3.6MB, pdf)
[193]Reporting Summary^ (234.3KB, pdf)
Source data
[194]Source Data^ (14.6MB, xlsx)
Acknowledgements