Abstract
Background: Single-cell studies have demonstrated the presence of
significant cell-to-cell heterogeneity in gene expression. Whether such
heterogeneity is only a bystander or has a functional role in the cell
differentiation process is still hotly debated.
Methods: In this study, we quantified and followed single-cell
transcriptional uncertainty – a measure of gene transcriptional
stochasticity in single cells – in 10 cell differentiation systems of
varying cell lineage progressions, from single to multi-branching
trajectories, using the stochastic two-state gene transcription model.
Results: By visualizing the transcriptional uncertainty as a landscape
over a two-dimensional representation of the single-cell gene
expression data, we observed universal features in the cell
differentiation trajectories that include: (i) a peak in single-cell
uncertainty during transition states, and in systems with bifurcating
differentiation trajectories, each branching point represents a state
of high transcriptional uncertainty; (ii) a positive correlation of
transcriptional uncertainty with transcriptional burst size and
frequency; (iii) an increase in RNA velocity preceding the increase in
the cell transcriptional uncertainty.
Conclusions: Our findings suggest a possible universal mechanism during
the cell differentiation process, in which stem cells engage stochastic
exploratory dynamics of gene expression at the start of the cell
differentiation by increasing gene transcriptional bursts, and
disengage such dynamics once cells have decided on a particular
terminal cell identity. Notably, the peak of single-cell
transcriptional uncertainty signifies the decision-making point in the
cell differentiation process.
Keywords: single cell, gene expression, cell differentiation,
transcriptional uncertainty, RNA velocity
Introduction
Cell differentiation is the process through which unspecialized stem
cells become more specialized. Because of its important roles in
development, cellular repair, and organismal homeostasis, the molecular
mechanisms of cell differentiation has been the subject of intense
scrutiny. Since roughly 50 years ago – along with the promulgation of
the central dogma of molecular biology by Francis Crick and the
characterization of the lactose operon by François Jacob and Jacques
Monod – the existence of a genetic program has become a prevailing
explanation for the cell differentiation process. Although the details
were originally not defined, at least not formally, such a genetic
program purports a constellation of master genes (i.e., transcription
factors) that orchestrate the transcription of downstream target genes
in a precise spatiotemporal fashion, resulting in long-lasting
alterations in the gene expression patterns ( [32]Herskowitz, 1989;
[33]Lewis, 1992; [34]Ohno et al., 1979). A notable experimental
evidence substantiating this view is the overexpression of myoD
inducing a myogenic phenotype in seemingly naive cells ( [35]Davis et
al., 1987). Over the past few decades, the repertoire of such master
genes across numerous stem cell systems, such as Nanog, Oct4, Sox2,
BATF and MyoD, has begun to coalesce ( [36]Papili Gao et al., 2017;
[37]Sartorelli & Puri, 2018; [38]Whyte et al., 2013).
Recent advances in single-cell technologies has revealed new aspects of
the cell differentiation that are incompatible with the idea of ordered
and programmed (i.e., deterministic) gene expression. More
specifically, single-cell data paint a stochastic differentiation
process that increases cell-to-cell variability of gene expression.
Such an observation has been made for a wide variety of cell
differentiation systems, including chicken erythroid progenitors (
[39]Richard et al., 2016), erythroid myeloid lymphoid (EML) cells (
[40]Mojtahedi et al., 2016), mouse embryonic stem cells (mESCs) (
[41]Semrau et al., 2017; [42]Stumpf et al., 2017), and human CD34+
cells ( [43]Moussy et al., 2017). Interestingly, a similar increase of
gene expression variation was also observed during the
de-differentiation of somatic cells into iPSCs ( [44]Buganim et al.,
2012). Stochastic gene expression also appears to have a functional
role beyond cell differentiation systems. For example, an increase in
cell-to-cell variability of gene expression has been reported during a
forced adaptation of budding yeast cells to unforeseen challenges (
[45]Braun, 2015).
In 1957, Conrad Waddington proposed the presently well-known epigenetic
landscape that likens the cell differentiation process to a ball
rolling on a downward sloping surface, starting from a state of high
cell potency and ending at one of possibly several states of low cell
potency. The landscape itself is shaped by the action of the genes and
gene network – depicted in the less-frequently-shown part B of
Waddington’s original figure as a network of ropes that are tied to the
surface, creating valleys and hills. Although the epigenetic landscape
was originally proposed only as a metaphor of how gene regulation
governs the cell differentiation process, this landscape has been
formalized within the framework of dynamical systems theory (
[46]Huang, 2009; [47]Saez et al., 2022). The valleys in the
Waddington’s epigenetic landscape are equated to stable states of a
dynamical system, called attractors, while the hills are often
interpreted as energetic barriers.
A number of recent studies provided a graphical representation of the
differentiation process based on single-cell transcriptomic data that
conforms with the Waddington’s epigenetic landscape ( [48]Fard et al.,
2016; [49]Guo & Zheng, 2017; [50]Shi et al., 2018; [51]Zhang & Zhou,
2018; [52]Zwiessele & Lawrence, 2017). More specifically, these studies
reconstructed the epigenetic landscape from single-cell gene expression
data using probabilistic and quasi-potential methods, for example by
applying Hopfield neural networks ( [53]Fard et al., 2016; [54]Guo &
Zheng, 2017), a cell-density based strategy ( [55]Zhang & Zhou, 2018),
network entropy measurements ( [56]Shi et al., 2018) or more recently
Large Deviation Theory ( [57]Lv et al., 2014). However, with the
exception of [58]Fard et al. (2016) and [59]Lv et al. (2014), the
aforementioned studies produced monotonic descent passages during cell
differentiation, mimicking closely the Waddington’s epigenetic
landscape metaphor (see for example ( [60]Bhattacharya et al., 2011;
[61]Shi et al., 2018)). Also, none of the above studies consider
directly the cellular mechanism that generates stochastic gene
transcriptional bursts.
In the present work, we aimed to shed light on the gene transcriptional
mechanism behind the rise-then-fall trajectory of cell-to-cell
variability in gene expression observed during the cellular
differentiation process ( [62]Richard et al., 2016). To this end, we
analyzed a collection of published single-cell transcriptomic datasets
from various cell differentiation systems, comprising both single-cell
RT-qPCR (scRT-qPCR) ( [63]Bargaje et al., 2017; [64]Guo et al., 2010;
[65]Moignard et al., 2013; [66]Moussy et al., 2017; [67]Richard et al.,
2016; [68]Stumpf et al., 2017) and single-cell RNA-sequencing
(scRNA-seq) ( [69]Nestorowa et al., 2016; [70]Treutlein et al., 2016).
We employed a likelihood-based analysis using a recent method CALISTA
(Clustering And Lineage Inference in Single-cell Transcriptomics
Analysis) ( [71]Papili Gao et al., 2020). The analysis relied on a
mechanistic model of the stochastic gene transcriptional bursts to
describe single-cell gene expression distribution. Specifically, we
introduced a new concept of transcriptional uncertainty at single cell
level, and by applying CALISTA, we reconstructed the transcriptional
uncertainty landscapes for the aforementioned cell differentiation
systems. Further, by leveraging the stochastic gene transcriptional
model behind CALISTA, we were able identify possible mechanisms behind
the overt trajectories of cell differentiation on the transcriptional
uncertainty landscapes ( [72]Coulon et al., 2010). For two additional
single-cell datasets, we also evaluated the single-cell RNA-velocity
using the recently published Velocyto method ( [73]La Manno et al.,
2018). The two-state model parameter analysis, combined with
RNA-velocities, provided insights into the mechanism regulating cell
fate decisions, specifically on the role of stochastic gene
transcriptions in the differentiation processes and on the possible
mechanism generating this stochasticity.
Methods
Main steps of CALISTA workflow
Herein, we briefly describe the main steps involved in the calculation
of single-cell transcriptional uncertainty using CALISTA ( [74]Papili
Gao et al., 2020).
Pre-processing. Given an
[MATH: N×G :MATH]
single-cell expression matrix
[MATH: M :MATH]
, where N denotes the number of cells and G the number of genes, the
pre-processing in CALISTA involves two steps: a normalization of the
expression data
[MATH:
mn,g
:MATH]
– i.e. the number of transcripts of gene g in the n-th cell, and a
selection of the most variable genes ( [75]Papili Gao et al., 2020).
Cell clustering. CALISTA clustering follows a two-step procedure. The
first step involves a greedy optimization strategy to find cell
clustering that maximizes the total cell likelihood, i.e. the sum of
the likelihood value for all cells. The single-cell likelihood value is
computed as the joint probability of the cell’s gene expression data,
which is set equal to the product of the probabilities of the mRNA
counts for the selected genes based on the mRNA distribution from the
two-state stochastic gene transcription model. To avoid issues with
numerical overflow, we use the logarithm of the cell likelihood. By
performing the greedy optimization multiple times, a consensus matrix
containing the number of times two cells in the dataset are put in the
same cluster, is generated. In the second and final step, CALISTA
generates the cell cluster assignments by using k-medoids clustering
based on the consensus matrix. The final outcome of CALISTA’s
clustering is the assignment of cells into K clusters and the optimal
model parameters for the two-state gene transcription model:
[MATH: θgk=θonθoffθt
mfenced>gk. :MATH]
for each gene g in cluster k ( [76]Papili Gao et al., 2020). In this
case,
[MATH: θon :MATH]
is the normalized rate of the promoter activation,
[MATH: θoff :MATH]
is the normalized rate of the promoter inactivation, and
[MATH: θt :MATH]
is the normalized rate of mRNA production when the promoter is active.
These parameters are normalized by the rate constant of mRNA
degradation
[MATH: θd :MATH]
, so that
[MATH: θd=1 :MATH]
.
Lineage progression inference. In CALISTA, cell lineage progression is
inferred based on cluster distances – a measure of dissimilarity
between two clusters. The cluster distance of two cell clusters is
defined as the average decrease in the cell likelihood value if the
cells from these two clusters are grouped as one cluster, as opposed to
the original clustering. The lineage progression graph is built by
adding transition edges between pairs of clusters in increasing
magnitude of cluster distance until all clusters are connected to at
least one other cluster, or based on user-specified criteria.
Single-cell transcriptional uncertainty. The last step in our analysis
is to compute the final single-cell likelihood. Briefly, for each cell,
we consider all edges in the lineage progression graph that are
adjacent to the cell’s respective cluster, i.e. edges that eminate from
or pointing to the cluster to which the cell belongs. The likelihood of
a cell along an edge is evaluated by interpolating the likelihood
values of the cell’s gene expression using the mRNA distributions from
the two adjacent clusters. Each cell is then assigned to the edge along
which its interpolated likelihood value is maximum, and the final cell
likelihood is set to this maximum value. The single-cell
transcriptional uncertainty is evaluated as the negative logarithm of
the cell likelihood value (NLL). Following the way single-cell
likelihood is computed (see Cell Clustering section above and
[77]Papili Gao et al., 2020), the NLL for each cell n, denoted by
[MATH: NLLn :MATH]
, is the sum of the NLL from every gene g for that cell, i.e.
[MATH:
NLLn=∑<
mrow>g=1NgNLLgn
:MATH]
where N [g ]denotes the number of genes.
Pseudotimes calculation. We can evaluate the pseudotimes for the cells
according to the following procedure. First, a pseudotime is given to
each cluster with a value between 0 (initial cell state) and 1 (final
cell fate). Subsequently, we determine the linear fractional position
of each cell along its respective edge at which its interpolated
likelihood value is maximum (see Single-cell transcriptional
uncertainty). The pseudotime of a cell is computed by a linear
interpolation of the pseudotimes of the two clusters adjacent to its
assigned edge according to the cell’s linear fractional position on
this edge.
Epigenetic landscape reconstruction. To visualize the 3D
transcriptional uncertainty landscape, we apply dimensional reduction
techniques such as principal component analysis (PCA) or t-SNE on the
z-scored expression data, to project the gene expression of each
individual cell on two dimensional axis, which gives the x-y axis of
the landscape plot. For the z axis, we plot the NLL values. The
transcriptional uncertainty landscape surface is reconstructed by
estimating local approximation of individual cell 3D coordinates on a
regular 30×30 grid by using a publicly available [78]Matlab (R2020a)
surface fitting package called [79]gridfit.
Pre-processing and analysis of single-cell expression datasets
Bargaje et al. scRT-qPCR dataset. The dataset includes the expression
profiles of 96 genes from 1896 single cells at eight different time
points (day 0, 1, 1.5, 2, 2.5, 3, 4, 5) during the differentiation of
human pluripotent stem cells (iPSCs) into either mesodermal (M) or
endodermal (En) fate ( [80]Bargaje et al., 2017). By employing CALISTA,
we obtained five cell clusters and detected a bifurcation event, which
gives rise to the two final cell fates. After lineage inference, we
pseudotemporally ordered cells along the inferred differentiation paths
(for more details, see ( [81]Papili Gao et al., 2020)).
Treutlein et al. scRNA-sequencing dataset. The dataset includes the
gene expression profiles of 405 cells during reprogramming of mouse
embryonic fibroblast (MEF) into a desired induced neural (iN) and an
alternative myogenic (M) cell fate ( [82]Treutlein et al., 2016). We
pre-processed the data using CALISTA to select the 40 most variable
genes (10% of the number of cells) for the transcriptional uncertainty
analysis. CALISTA identified four different subpopulations and
successfully recovered the bifurcation event (for more details, see (
[83]Papili Gao et al., 2020)).
Richard et al. scRT-qPCR dataset. The dataset contains the expression
profile of 91 genes measured from 389 cells at six distinct time points
(0, 8, 24, 33, 48, 72 h) during the differentiation of primary chicken
erythrocytic progenitor cells (T2EC) ( [84]Richard et al., 2016).
Following the CALISTA pre-processing step, we removed cells in which
less than 75% of the genes are expressed. Then, we selected the subset
of genes with at least one non-zero expression values. A total of 354
cells and 88 genes were considered in the transcriptional uncertainty
analysis. Based on eigengap heuristics ( [85]Papili Gao et al., 2020;
[86]von Luxburg, 2007), we grouped cells into six optimal clusters and
ordered cells along the inferred linear trajectory (see the Extended
data S8 Figure ( [87]Gao et al., 2023)).
Stumpf et al. scRT-qPCR dataset. The dataset comprises the single-cell
expression of 97 genes at seven time points (0, 24, 48, 72, 96, 120,
168 h) during neural differentiation of mouse embryonic stem cells (E14
cell line) ( [88]Stumpf et al., 2017). In the data pre-processing, we
excluded cells in which less than 70% of genes are expressed. Then, we
selected genes with at least one non-zero expression values. A total of
276 cells and 93 genes were considered for for the transcriptional
uncertainty analysis. Based on eigengap heuristics ( [89]Papili Gao et
al., 2020), we grouped cells into five optimal clusters and ordered
cells along the inferred linear trajectory ( Extended data S9 Figure (
[90]Gao et al., 2023)).
Moussy et al. scRT-qPCR dataset. The single-cell expression dataset
includes normalized Ct values for 91 genes in 435 cells captured at
five distinct time points (0, 24, 48, 72, 96 h) during human cord
blood-derived CD34+ differentiation ( [91]Moussy et al., 2017). We
employed CALISTA to group cells into seven clusters, reconstruct the
developmental trajectory and calculate pseudotimes ( Extended data, S10
Figure ( [92]Gao et al., 2023)).
Guo et al. scRT-qPCR dataset. The dataset comprises the single-cell
expression values of 48 genes from 387 individual cells isolated at
four distinct developmental cell stages, from 8-cell stage mouse
embryos to 64-blastocyst ( [93]Guo et al., 2010). By applying CALISTA,
we identified seven different subpopulations along the differentiation
process, and the inferred lineage hierarchy pinpointed two bifurcations
events at 32- and 64-cell stage ( Extended data, S11 Figure ( [94]Gao
et al., 2023)). The timing of the lineage bifurcations coincides with
two well-known branching points: one at 32-cell stage when totipotent
cells differentiate into trophectoderm (TE) and inner cell mass (ICM),
and another at 64-cell stage when ICM cells differentiate into
primitive endoderm (PE) and epiblast (E).
Nestorowa et al. scRNA-sequencing dataset. The dataset comprises
single-cell gene expression of 1656 cells from mouse hematopoietic stem
cell differentiation ( [95]Nestorowa et al., 2016). We pre-processed
the data by removing genes with non-zero values in less than 10% of the
cells. Then, we selected 433 most variable genes, which is 10% of the
number of genes after the previous pre-processing step, for the
transcriptional uncertainty analysis ( [96]Papili Gao et al., 2020). We
set the optimal number of clusters based on the original study (
[97]Nestorowa et al., 2016), which reported six different
subpopulations and two bifurcation events: the first one producing
common myeloid progenitor (CMP) from lymphoid-primed multipotent
progenitors (LMPP), and the second one generating granulocyte–monocyte
progenitors (GMP) from megakaryocyte-erythroid progenitors (MEP) (
Extended data, S12 Figure ( [98]Gao et al., 2023)).
Moignard et al. scRT-qPCR dataset. The dataset contains the single-cell
expression level of 18 transcription factors measured in a total of 597
mouse bone marrow cells during hematopoietic differentiation. By
applying CALISTA, we successfully identified the five subpopulations
and the two branching points detected in the original study (
[99]Moignard et al., 2013): long-term hematopoietic stem cells (HSC)
differentiating into megakaryocyte–erythroid progenitors (PreM) or
lymphoid-primed multipotent progenitors (LMPP); LMPP cells
differentiating into granulocyte–monocyte progenitors (GMP) and common
lymphoid progenitors (CLP) (for details see ( [100]Papili Gao et al.,
2020)).
Pairwise correlation analysis of transcriptional uncertainty and
transcriptional burst size and frequency
We define gene transcriptional burst size and burst frequency using the
two-state model parameters, as follows:
[MATH: Burst size
S=θtθ<
/mi>off :MATH]
(1)
[MATH: Burst frequency F=θon :MATH]
(2)
The burst size and burst frequency for each cluster k and gene g,
denoted by S [g, k ]and F [g, k ], respectively, are evaluated using
the cluster parameters
[MATH: θgk=θonθoffθt
mfenced>gk :MATH]
obtained from CALISTA single-cell clustering analysis. Meanwhile, the
average gene-wise NLL values for each single-cell cluster is computed
as follows:
[MATH: NLL¯g,k=∑n
=1NkNLLg,kn<
/mi>Nk :MATH]
(3)
where
[MATH:
NLLg,k<
mi>n :MATH]
is the negative log-likelihood of cell n based only on the expression
of gene g in cluster k, and N [k] is the total number of cells in
cluster k. Finally, the Pearson correlation coefficients between the
burst size S [g, k ]and
[MATH: NLL¯g,k :MATH]
and between the burst frequency F [g, k ]and
[MATH: NLL¯g,k :MATH]
are computed for each dataset to quantify the pairwise associations
between these variables. The statistical significance of the
correlation coefficients is determined using the t-test—specifically,
by evaluating the score
[MATH:
t=r(n−
2)/(1−r2) :MATH]
where r is the sample correlation coefficient.
RNA velocity analysis
Cells and genes were first filtered based on the pre-processing
strategy in the original publication by La Manno and colleagues (
[101]La Manno et al., 2018), which resulted in a total of 1720 cells
and 1448 genes from human glutamatergic neurogenesis, and a total of
18140 cells and 2141 genes from the mouse hippocampus dataset. We
further reduced the number of genes to only the top 500 highly variable
genes for the transcriptional uncertainty analysis. The cell cluster
assignments generated by Velocyto – the algorithm for computing RNA
velocity from the original publication ( [102]La Manno et al., 2018) –
were considered, instead of using CALISTA. Based on the clustering, we
employed CALISTA to generate the lineage progression and cell
pseudotimes ( Extended data, S13 Figure ( [103]Gao et al., 2023)). The
RNA velocity and transcriptional uncertainty values for the top 500
genes were calculated by employing Velocyto and CALISTA, respectively.
The cell-wise RNA velocity was set to the Euclidean norm of the vector
of RNA velocities for each cell, while the cell-wise NLLs was
[MATH:
NLLnk=∑g=1
500NLLgn
msubsup>500
:MATH]
(4)
Results
Single-cell transcriptional uncertainty landscape
In this work, we used CALISTA ( [104]Papili Gao et al., 2020), a
likelihood-based bioinformatics toolbox designed for an end-to-end
analysis of single-cell gene expression data, to evaluate the
transcriptional uncertainty of each individual cell based on its gene
expression data (see the Extended data, Supplementary Notes S1 (
[105]Gao et al., 2023)). CALISTA uses the two-state model of stochastic
gene transcription bursts to characterize the steady state distribution
of mRNA counts in individual cells ( [106]Peccoud & Ycart, 1995). In
the model, a gene promoter stochastically switches between the ON and
OFF state, and only in the ON state can gene transcription occur. The
distribution of mRNA depends on four model parameters: θ [on] (the rate
of promoter activation), θ [off] (the rate of promoter inactivation), θ
[t] (the rate of mRNA production when the promoter is in the ON state),
and θ [d] (the rate constant of mRNA degradation) ( [107]Herbach et
al., 2017; [108]Kim & Marioni, 2013) (see [109]Figure 1a). For example,
when θ [off] >> θ [on] and θ [off] >> θ [d] , keeping θ [t] / θ [off]
fixed, mRNA are produced through bursts of short but intense
transcription, which is a typical case observed for gene transcriptions
in single cells ( [110]Munsky et al., 2012) ( [111]Suter et al., 2011).
As the mRNA distribution is linked to mechanistically interpretable
parameters, CALISTA is able to give insights into the possible
mechanism driving the cell heterogeneity dynamics during cell
differentiation.
Figure 1. Single-cell transcriptional uncertainty landscape.
[112]Figure 1.
[113]Open in a new tab
(a) The illustration depicts the landscape of single-cell
transcriptional uncertainty during a differentiation process over the
(pseudo) time (from blue to yellow). Each dot corresponds to a cell in
the single-cell transcriptomic dataset. Cells start their journey from
a valley in the landscape, through a hill, before ending at one of the
final valleys/states. (b–f) Analysis of single-cell transcriptional
profiles during iPSC differentiation into cardiomyocytes ( [114]Bargaje
et al., 2017). (b) Boxplots of the negative log-likelihood (NLL) values
for each single-cell cluster. (c–d) Moving-window average NLL along (c)
endoderm and (d) mesoderm fate trajectory. (e) NLL of each gene and
cell for every single-cell cluster. (f) Protein-protein interaction
network of top variable genes inferred by STRING ( [115]Szklarczyk et
al., 2015). Blue nodes represent transcription factors, while red nodes
denote proteins involved in signal transduction. The width of the edges
denotes the confidence for the inferred relationship (thicker edge =
higher confidence).
CALISTA employs a maximum likelihood approach and assigns a likelihood
value to each cell based on its gene expression based on the mRNA
distribution governed by the two-state model of stochastic gene
transcription. The single-cell transcriptional uncertainty is evaluated
as the negative logarithm of the likelihood (NLL) value for a cell. The
single-cell likelihood value reflects the joint probability of its gene
expression repertoire. A cell with a low likelihood value may indicate
that the gene expression of the cell is different from its neighboring
cells, i.e. the cell is an outlier. But, more interestingly, a low
likelihood value may also correspond to a cell state of high
uncertainty in the gene expression. The group of cells in such a high
uncertainty state have gene expressions that are dissimilar to each
other, and thus, the gene expression distribution will have a high
entropy. By visualizing the single-cell transcriptional uncertainty
over the two-dimensional projection of the single-cell transcriptomics
data—for example, using the first two principal components from PCA—we
constructed a transcriptional uncertainty landscape in the form of a
surface plot of the NLL value. In this way, we studied the landscape of
transcriptional uncertainty during cell differentiation at single-cell
resolution. On such single-cell transcriptional uncertainty surface, an
aberrant cell can be easily distinguished from a cell of high
uncertainty state, since such an aberrant cell will appear isolated
from its neighboring cells with a high NLL.
Transcriptional uncertainty landscape of iPSC cell differentiation to
cardiomyocytes
In the following, we demonstrated an application of our procedure
described above to a single-cell transcriptional dataset from
cardiomyocytes differentiation from human induced pluripotent stem
cells (iPSCs) ( [116]Bargaje et al., 2017). The single-cell clustering
of CALISTA returned five clusters ( [117]Papili Gao et al., 2020) and
identified one bifurcation event in the lineage progression, which led
to two cell lineages ( [118]Bargaje et al., 2017), in good agreement
with the number of cell types reported in the original study. The
estimated uncertainty landscape shows cells exiting the initial
epiblast state that is characterized by a valley in the landscape,
passing through a hill of high transcriptional uncertainty
corresponding to primitive streak (PS)-like progenitor state, before
ending up at one of the low transcriptional uncertainty terminal states
corresponding to either mesodermal (desired) or endodermal (undesired)
fate (see the Extended data, S1 Figure ( [119]Gao et al., 2023)). As
depicted in [120]Figure 1b, the intermediate cell cluster (cluster 2)
comprising PS-like cells have higher cell uncertainty (lower
single-cell likelihood) than the other clusters. [121]Figure 1c and
[122]d give the moving-averaged uncertainty values for pseudotemporally
ordered cells using a moving window of 10% of the total cells for both
endodermal and mesodermal paths, respectively. The moving-averaged
transcriptional uncertainty for the two differentiation paths follows a
rise-then-fall trajectory where the peak of uncertainty coincides with
the lineage bifurcation event.
We explored whether the rise-then-fall in uncertainty is an artefact
from using the two-state model to evaluate the cell likelihood values.
To this end, we implemented a modified version of the algorithm for
ordering cells by calculating the cell likelihood values using the
empirical (observed) distribution, instead of the analytical
distribution from the two-state model. As shown in the Extended data S2
Figure ( [123]Gao et al., 2023), the transcriptional uncertainty
landscape from the modified implementation shows a strong resemblance
to the original one. We also investigated whether the number of
clusters may affect the landscape, in which using too few of the
clusters may artificially inflate the uncertainty due to the mixing of
cells from different states. We reran CALISTA by using a higher number
of clusters (set to nine based on the eigengap heuristic ( [124]von
Luxburg, 2007)). The hill in the uncertainty landscape is again seen
around the bifurcation event upon using a higher number of cell
clusters ( Extended data, S3 Figure ( [125]Gao et al., 2023)). Finally,
we used a different algorithm to cluster cells, specifically using a
Laplacian-based clustering algorithm called single-cell interpretation
via multikernel learning (SIMLR) ( [126]Wang et al., 2017), to test
whether the shape of the transcriptional uncertainty landscape changes
with the clustering algorithm. The single-cell clusters can be
interpreted as the transitional states that the differentiating cells
go through. Starting with the result of SIMLR cell clustering, we then
generated the lineage progression and estimated the cell likelihood
values using CALISTA. The transcriptional uncertainty landscape from
SIMLR cell clustering has the same shape as that in Extended data S1
Figure ( [127]Gao et al., 2023), demonstrating that the transcriptional
uncertainty landscape observed above is not dependent on using CALISTA
for cell clustering ( Extended data, S4 Figure ( [128]Gao et al.,
2023)).
To further elucidate the role of specific genes in shaping the
transcriptional uncertainty landscape, we looked at the transcriptional
uncertainty associated with individual genes. [129]Figure 1e depicts
the NLL distribution of each gene for the five single-cell clusters. As
expected, cells in cluster 2 have generally higher NLL than those in
the other clusters. [130]Figure 1e clearly illustrates that within
cluster 2, some genes show higher NLL values than the others ( Extended
data, S5 Figure ( [131]Gao et al., 2023)). To identify the important
genes related to transcriptional uncertainty, we identified genes with
NLL values exceeding a threshold
[MATH: δ :MATH]
for at least 30% of the cells in each cluster, where
[MATH: δ :MATH]
is set to 3 standard deviation above the overall mean NLL for all cells
and genes in the dataset (see Methods [132]equation (2)). None of the
genes in clusters 1, 4 and 5 have a NLL above the threshold. Meanwhile,
16 and eight genes in clusters 2 and 3, respectively, pass the above
criterion for high uncertainty with four common genes between the two
gene sets ( Extended data, S1 Table ( [133]Gao et al., 2023)). Genes
with high transcriptional uncertainty in cluster 2 may have functional
roles in cell fate determination. The gene set of cluster 2 includes
known genes upregulated only in the PS-like state (e.g. EOMES, GSC,
MESP1 and MIXL1), as well as markers of mesodermal and endodermal cells
(e.g. BMP4, HAND1, and SOX17) ( [134]Bargaje et al., 2017; [135]Papili
Gao et al., 2020) ( Extended data, S6 Figure ( [136]Gao et al., 2023)).
Meanwhile, the main contributors to cell uncertainty in cluster 3 (e.g.
BMP4 and MYL4 ( [137]Bargaje et al., 2017; [138]Papili Gao et al.,
2020)) are known transition genes between PS-like cells and the final
mesoderm fate ( Extended data, S7 Figure ( [139]Gao et al., 2023)).
[140]Figure 1f depicts the protein-protein interaction (PPI) network
related to the gene set of cluster 2 using STRING (minimum required
interaction score of 0.4) ( [141]Szklarczyk et al., 2015), indicating
that these genes form a strongly interconnected hub of known
transcription factors and molecules involved in the signal transduction
of embryonic development ( Extended data, Table S1 ( [142]Gao et al.,
2023)).
Transcriptional uncertainty landscapes of cell differentiation
We further applied the procedure above to seven additional single-cell
transcriptomic datasets that were generated using scRT-qPCR ( [143]Guo
et al., 2010; [144]Moignard et al., 2013; [145]Moussy et al., 2017;
[146]Richard et al., 2016; [147]Stumpf et al., 2017) and
scRNA-sequencing ( [148]Nestorowa et al., 2016; [149]Treutlein et al.,
2016), to assess the universality of the rise-then-fall feature of
single-cell transcriptional uncertainty landscape during cell
differentiation. The first of these datasets came from 405 cells during
mouse embryonic fibroblast (MEF) reprogramming into induced neural (iN)
and myogenic (M) cells ( [150]Treutlein et al., 2016). Like the iPSC
differentiation above, the lineage progression has a single bifurcation
point. As depicted in [151]Figure 2a, the single-cell transcriptional
uncertainty increases from the initial MEF state and reaches a peak
around the bifurcation before decreasing toward two end-point cell
fates. The rise-then-fall of transcriptional uncertainty in the MEF
reprogramming is in good agreement with what we observed in the iPSCs
differentiation above. Higher entropy of gene expression distribution
in a cell population has also been reported in the reprogramming of
iPSCs ( [152]Buganim et al., 2012).
Figure 2. CALISTA analysis of single-cell expression data.
[153]Figure 2.
[154]Open in a new tab
(a–c) Landscape plots (based on cell clusters and pseudotime) and
moving-averaged negative log-likelihood (NLL) values for each
differentiation path of (a) single-branching trajectory (
[155]Treutlein et al., 2016), (b) linear trajectories ( [156]Moussy et
al., 2017; [157]Richard et al., 2016; [158]Stumpf et al., 2017), (c)
multi-branching trajectories ( [159]Guo et al., 2010; [160]Moignard et
al., 2013; [161]Nestorowa et al., 2016). Green and red vertical arrows
in moving-averaged NLL plots indicate the first and second peak in cell
uncertainty, respectively. Abbreviations: (a) MEF: mouse embryonic
fibroblast, iN: induced neuronal, M: myocyte, (b) T2EC: chicken
erythocytic progenitor cell, ESC: embryonic stem cell, NPC:
neuroprogenitor cell, HSC: haematopoietic stem cell, CMP: common
lymphoid progenitor, (c) 8C: eighth cell stage, 16C: sixteenth cell
stage, ICM: inner cell mass, TE: trophectoderm, PE: primitive endoderm,
E: endoderm, MPP: multipotent progenitor, LMPP: lymphoid multipotent
progenitor, CMP: common myeloid progenitor, MEP/PreM:
megakaryocyte-erythrocyte progenitor, GMP: granulocyte-monocyte
progenitor, CLP: common lymphoid progenitor.
Next, we analyzed datasets from cell differentiation processes without
a lineage bifurcation and with multiple lineage bifurcations. Three
scRT-qPCR datasets came from differentiation systems without
bifurcation, including the Richard et al. study on chicken erythrocytic
differentiation of T2EC cells ( [162]Richard et al., 2016), the Stumpf
et al. study on differentiation of mouse embryonic stem cells (ESC) to
neural progenitor cells (NPC) ( [163]Stumpf et al., 2017), and the
Moussy et al. study during CD34+ cell differentiation ( [164]Moussy et
al., 2017). The single-cell clustering and lineage progression by
CALISTA produced the expected cell differentiation trajectory (see
Extended data, S8 to S10 Figures ( [165]Gao et al., 2023)). The
single-cell transcriptional uncertainty landscapes of these three
differentiation systems, as shown in [166]Figure 2b, exhibit a
rise-then-fall profile, creating a hill that the cells traverse through
in the differentiation process. A transitory increase in single-cell
gene expression uncertainty was reported either directly or indirectly
in the original publications. In [167]Richard et al. (2016) and
[168]Stumpf et al. (2017), the authors adopted the Shannon entropy to
quantify cell-to-cell variability (uncertainty), while [169]Moussy et
al. (2017) reported an unstable transition state with ‘hesitant cells’
flipping their morphology between polarized and round shapes before
committing to the common myeloid progenitors-like fate. Morphological
uncertainty therefore corresponded to a higher transcriptional
uncertainty. Note that the Moussy et al. study looked at only the
initial phase of the (hematopoietic) cell differentiation, and thus, it
is likely that the differentiation process had not completed for the
cells in the dataset.
The next set of single-cell gene expression data came from
differentiation systems with multi-branching lineage, including the Guo
et al. study during mouse embryo development from zygote to blastocyst
( [170]Guo et al., 2010), [171]Nestorowa et al. (2016) and
[172]Moignard et al. (2013) studies on hematopoietic stem cell
differentiation. [173]Figure 2c shows the single-cell transcriptional
landscape for each of the datasets. For the Guo et al. study, we
identified seven cell clusters and identified two bifurcations in the
lineage. Here, we observed two hills in the transcriptional uncertainty
landscape, each coinciding with a bifurcation event in the lineage
progression – one at 32-cell stage (cluster 2 to cluster 3 and 4) and
another at 64-cell stage (cluster 4 to cluster 6 and 7) (see the
Extended data, S11 Figure ( [174]Gao et al., 2023)). For the
[175]Nestorowa et al. (2016) ( Extended data, S12 Figure ( [176]Gao et
al., 2023)) and [177]Moignard et al. (2013) (see Methods and (
[178]Papili Gao et al., 2020)) datasets, we again observed peaks in the
transcriptional uncertainty landscape that colocalize with the
bifurcation points in the lineage progression.
The use of the two-state mechanistic gene transcriptional model within
CALISTA enabled us to probe into a mechanistic explanation for the
observed shape of the transcriptional uncertainty landscape. [179]Table
1 show the pairwise Pearson correlations between the cell-averaged NLL
of each cluster with two biologically interpretable model parameters,
namely transcriptional burst size (number of transcripts generated in
each burst) and burst frequency (occurrence of burst per unit time) (
[180]Nicolas et al., 2017) (see Methods). The Pearson correlations
indicate that the single-cell gene expression uncertainty increases
with higher burst size and burst frequency ( p-value ≤ 0.01). Higher
transcriptional burst size and frequency are associated with a lower θ
[off ]– a lower rate of promoter turning off – and a greater θ [on ]–
higher rate of promoter turning on. One possible explanation for such a
change in model parameters is a higher chromatin accessibility during
the transition period of cell differentiation. This finding is
consistent with the view that stem cells increase its gene expression
uncertainty or stochasticity by adopting a more open chromatin state to
enable the exploration of the gene expression space ( [181]Antolović et
al., 2017; [182]Fritzsch et al., 2018; [183]Nicolas et al., 2017;
[184]Zhang & Zhou, 2018).
Table 1. Pairwise correlation coefficients between transcriptional
uncertainty and transcriptional burst frequency/burst size.
Correlation with Transcriptional Uncertainty (p-value ≤ 0.01 in red
boldface):
Burst frequency Burst size
Bargaje et al. 0.74 0.70
Treutlein et al. 0.71 0.64
Richard et al. 0.68 0.55
Stumpf et al. 0.68 0.88
Moussy et al. 0.40 0.81
Guo et al. 0.75 0.82
Nestorowa et al. 0.73 0.71
Moignard et al. 0.04 0.83
La Manno et al. 0.78 0.55
Kreigstein et al. 0.77 0.32
[185]Open in a new tab
RNA velocity and transcriptional uncertainty
In a recent paper ( [186]La Manno et al., 2018), La Manno and
colleagues introduced the concept of RNA velocity, which involves
computing the rate of change of mRNA from the ratio of unspliced to
spliced mRNA. A positive RNA velocity indicates an induction of gene
expression, while a negative RNA velocity indicates a repression of
gene expression. La Manno et al. demonstrated that RNA velocities are
able to predict the trajectory of cells undergoing a dynamical
transition, such as in circadian rhythms or cell differentiation. In
the following, we explored the relationship between RNA velocities and
single-cell transcriptional uncertainty.
We evaluated the single-cell transcriptional uncertainty and RNA
velocity for two single-cell gene expression datasets that were
previously analyzed in [187]La Manno et al. (2018). The first dataset
came from human glutamatergic neurogenesis which has a linear
(non-bifurcating) lineage progression. [188]Figure 3 (top row) depicts
the cell clustering, single-cell transcriptional uncertainty, and RNA
velocities (see also Extended data, S13 Figure ( [189]Gao et al.,
2023)). The single-cell transcriptional uncertainty landscape again has
the rise-then-fall shape, as in the other cell differentiation systems
discussed above. Interestingly, the same rise-then-fall profile is also
seen in the RNA velocity. As illustrated in [190]Figure 3, the increase
and decrease of the RNA velocity preceed the transcriptional
uncertainty, and the peak of RNA velocity occurs prior to those of the
transcriptional uncertainty (see the Extended data S1 File for animated
illustration ( [191]Gao et al., 2023)). Furthermore, a gene-wise
cross-correlation analysis confirms a positive correlation between RNA
velocity and single-cell transcriptional uncertainty with a delay for
individual genes (see the Extended data, Figure S14 ( [192]Gao et al.,
2023)).
Figure 3. Comparison between RNA velocities estimated using Velocyto and
CALISTA negative log-likelihood (NLL) values.
[193]Figure 3.
[194]Open in a new tab
(Top row) Human glutamatergic neurogenesis and (Bottom row) mouse
hippocampal neurogenesis ( [195]La Manno et al., 2018). (First column)
Cell clustering assignments evaluated from Velocyto ( [196]La Manno et
al., 2018). Normalized values for Euclidean norm of RNA velocities (2
^nd column), CALISTA single-cell transcriptional uncertainty (NLL; 3
^rd column). The colors in the first column indicate the cell clusters,
and those in the second-third columns indicate the normalized cell-wise
RNA velocities and NLL values respectively.
We also compared RNA velocity and single-cell transcriptional
uncertainty for another dataset from mouse hippocampal neurogenesis
with a multi-branching lineage ( [197]La Manno et al., 2018).
[198]Figure 3 (bottom row) shows that like in the neurogenesis dataset
earlier, the RNA velocity increases and then decreases during cell
differentiation, and the change in the RNA precede that of the
transcriptional uncertainty (see the Extended data, S2 File for
animated illustration ( [199]Gao et al., 2023)). Also, the RNA velocity
peaks take place before the transcriptional uncertainty peaks. The
rise-then-fall dynamic of the RNA velocity seen in the two datasets
above is consistent with the view that cells engage in exploratory
stochastic dynamics as they leave the progenitor state, and disengage
this explorative mode as they reach toward the final cell state.
Discussion
Although Waddington’s epigenetic landscape was originally proposed only
as a metaphor, the landscape has helped stem cell researchers to
conceptualize the cell differentiation processes through canalization
of cell lineages. As mentioned earlier, much of the existing literature
on the analytical reconstruction of the epigenetic landscape relied on
either a dynamical system theory applied to a simple gene network, or a
thermodynamic interpretation based on the potential energy of a
reaction ( [200]Bhattacharya et al., 2011; [201]Rebhahn et al., 2014).
In this study, we did not make any prior assumptions on the gene
regulatory network driving the differentiation process nor on the
characteristics of the landscape, such as the existence of a stable
valley or that of an energetic barrier (hill). Rather, we assumed that
the gene transcription at the single-cell level occurs via stochastic
transcriptional bursts that described by a two-state stochastic gene
transcription model ( [202]Peccoud & Ycart, 1995). We defined
single-cell transcriptional uncertainty based on the likelihood of the
cell’s gene expression, computed using the steady-state mRNA
distribution from the stochastic transcriptional model above. While
high transcriptional uncertainty may reflect a cell with an aberrant
gene expression signature with respect to other cells of the same
state, such a cell will have little effect on the shape of the
transcriptional uncertainy landscape. More importantly, high
single-cell transcriptional uncertainty also reflects a cell state that
is characterized by high level of heterogeneity in gene transcription.
These cells together form the hill region of our transcriptional
uncertainty landscape. Thus, the transcriptional uncertainty landscape
in our study is a reflection of the dynamic trajectory of gene
transcriptional stochasticity during the cell differentiation process.
The two-state model used in CALISTA captures the essential features of
stochastic transcriptional bursts – an ON/OFF promoter state and an
mRNA transcription only during the ON state. The model is able to
reproduce the characteristic negative binomial distribution of mRNA
commonly observed in single-cell transcriptomic data. More detailed
modelling of gene transcriptional bursts that includes RNA polymerase
recruitment and paused release, maturation of nascent mRNA, and cell
divisions ( [203]Bartman et al., 2019; [204]Cao et al., 2020;
[205]Suter et al., 2011), demonstrates how various aspects of gene
transcription contribute to the overt cell-to-cell heterogeneity in
gene expression. Under conservative simplifying assumptions, the mRNA
distribution from the more detailed models can be reduced to that of
the two-state model. Thus, the parameters of the two-state model, for
example the rate constants of promoter activation (OFF-to-ON state) and
deactivation (ON-to-OFF), should be interpreted as effective constants
– i.e. not fundamental biophysical constants – that capture the
aggregate impact of various sources of gene transcriptional
stochasticity. Note that while we used the two-state model for
single-cell clustering and transcriptional uncertainty calculations, as
we demonstrated in the iPSC cell differentiation, the rise-then-fall of
the transcriptional uncertainty landscape is still valid when using
SIMLR clustering algorithm and when using the empirical distribution of
mRNA, rather than the two-state model distribution, for computing
single-cell transcriptional uncertainty.
The reconstruction of the transcriptional uncertainty landscapes from
10 single-cell transcriptomic datasets of various cell differentiation
processes in our study reveals a universal rise-then-fall trajectory in
which cells start from a high potency state with a uniform gene
expression pattern in the cell population, then progress through
transitional cell state(s) marked by increased transcriptional
uncertainty (i.e., higher cell-to-cell variability), and eventually
reach one of possibly several final cell states with again a uniform
gene expression pattern among the cells. Furthermore, the peaks of the
transcriptional uncertainty landscape colocalize with forks in the cell
lineage. The rise-then-fall in cell uncertainty agrees well with other
reports from different cell differentiation systems ( [206]Han et al.,
2020; [207]Mojtahedi et al., 2016; [208]Moussy et al., 2017;
[209]Richard et al., 2016; [210]Semrau et al., 2017; [211]Stumpf et
al., 2017), suggesting that stem cells go through a transition state of
high gene expression uncertainty before committing to a particular cell
fate. Notably, an increase of variability is a known early warning
signal associated with critical transitions in stochastic dynamical
systems that are driven by slow, monotonic change in the bifurcation
parameter ( [212]Kuehn, 2011; [213]Sarkar et al., 2019). While the
results of our analysis are consistent with critical transitions during
cell fate commitment in stem cells (see also ( [214]Mojtahedi et al.,
2016)), our analysis does not require nor imply this phenomenon. The
existence of a hill or barrier during the intermediate stage of cell
differentiation has also been proposed in previous studies (
[215]Braun, 2015; [216]Fard et al., 2016; [217]Moris et al., 2016). In
particular, Moris and colleagues compared this transition state to the
activation energy barrier in chemical reactions ( [218]Moris et al.,
2016). We noted however, that a hill in our transcriptional uncertainty
landscape is a reflection of a peak in the cell-to-cell gene expression
variability, and thus does not represent a resistance or barrier that a
cell has to overcome.
In the analysis of iPSCs differentiation into cardiomyocytes (
[219]Bargaje et al., 2017), the genes that contribute significantly to
the overall transcriptional uncertainty at or around the peak in the
landscape (clustes 2 and 3 in [220]Figure 1e) are known to regulate
cardiomyocyte differentiation ( [221]Bargaje et al., 2017) (see the
Extended data S1 Table for gene lists and S5 Figure and S6 Figure for
pathway enrichment analysis of these genes ( [222]Gao et al., 2023)),
supporting the idea that dynamic cell-to-cell variability has a
functional role in cell-fate decision making processes ( [223]Guillemin
et al., 2018; [224]Moris et al., 2018; [225]Rebhahn et al., 2014). Such
an idea would be in congruence with the recent demonstration that, in a
physiologically relevant cellular system, gene expression variability
is functionally linked to differentiation ( [226]Guillemin et al.,
2018; [227]Moris et al., 2018).
The rise-then-fall trajectory in the transcriptional uncertainty
landscape are more pronounced in some datatsets than in others. For
example, in Nestorowa ( [228]Nestorowa et al., 2016) and Moignard (
[229]Moignard et al., 2013) datasets (see [230]Figure 2c), peaks in the
transcriptional uncertainty landscape are less noticeable than in the
other differentiation systems. We noted that cells in the Nestorowa (
[231]Nestorowa et al., 2016) and Moignard ( [232]Moignard et al., 2013)
studies were pre-sorted by using flow cytometry based on the expression
of surface protein markers. We posited that at least some cells in the
transition state(s) might have been lost during the cell pre-sorting
since such cells might not express the chosen surface markers strongly.
Further, the correlation analysis between the cell transcriptional
uncertainty and biologically meaningful rates of the stochastic gene
transcription model showed strong positive correlations with
transcriptional burst size and frequency. Note that cellular processes
such as cell division can affect the heterogeneity of mRNA in a cell
population in a similar fashion as stochastic gene transcriptional
bursts ( [233]Cao & Grima, 2020; [234]Perez-Carrasco et al., 2020),
providing an alternate explanation for gene expression fluctuations.
But, several studies have reported an increase in gene transcriptional
bursts during transition states in cell differentiation and other
recent studies have suggested that both burst frequency and burst size
regulate gene expression levels ( [235]Antolović et al., 2017;
[236]Fritzsch et al., 2018; [237]Zhang & Zhou, 2018). Importantly, our
comparison of the single-cell transcriptional uncertainty and the
single-cell RNA velocity revealed that an increase (decrease) in RNA
velocity predicts an increase (decrease) in transcriptional uncertainty
after a short delay, and that a peak of RNA velocity preceeds that of
the transcriptional uncertainty.
The aforementioned observations, while correlative in nature, points to
possible biological mechanisms underlying the universal dynamic feature
of single-cell transcriptional uncertainty during cell differentiation.
At the start of the differentiation process, cells engage an
exploratory search dynamics in the gene expression space by increasing
stochastic transcriptional burst size and burst frequency. The putative
objective of such a stochastic search is to optimize the cell’s gene
expression pattern given its new environment. The engagement of this
stochastic exploratory mode is supported by the observed increased in
the overall RNA velocity and its expected-but-delayed effect in
elevating the cell-to-cell gene expression variability (i.e. higher
transcriptional uncertainty). Increased transcriptional burst size and
frequency are an indication of increased frequency of the promoter
turning ON (higher θ [on] and lower θ [off] ).
A possible mechanism behind this exploratory search dynamic is an
increase in chromatin mobility, driven by metabolic alterations in
early differentiation ( [238]Paldi, 2012). Multiple studies have
demonstrated that a mismatch between the intracellular state of stem
cells and their immediate environment can lead to metabolic
reorganization ( [239]Argüello-Miranda et al., 2018; [240]Folmes et
al., 2011; [241]Gu et al., 2016). More specifically, a change in the
balance between glycolysis and OXPHOS metabolism has been associated to
numerous differentiation processes (see ( [242]Richard et al., 2019)
and references therein). Furthermore, changes in the metabolic flux