Graphical abstract
graphic file with name fx1.jpg
[29]Open in a new tab
Highlights
* •
Fatecode is a computational method to predict cell fate regulators
from scRNA-seq data
* •
Fatecode uses autoencoder perturbation to identify genes that
influence cell populations
* •
Simulations and real scRNA-seq data show Fatecode detects cell fate
regulators
* •
Fatecode can accelerate discovery of cell fate regulators using
widely available data
Motivation
How stem and progenitor cells decide which cell types they will
generate via cell division is crucial for understanding tissue
development and engineering cell types for use in regenerative medicine
or cancer therapies. However, the identification of the regulators of
these cell fate decisions within the complex and dynamic system of
tissues is a major challenge. Experimental high-throughput perturbation
screens can help to dissect regulators, but these are not practical or
easy to implement in every context of interest. To address this
challenge, we developed a computational method, Fatecode, to predict
master regulators and key pathways controlling cell fate based on any
single-cell transcriptomics dataset.
__________________________________________________________________
Identifying the genes that control cell fate is essential for designing
cellular reprogramming strategies. To provide an accessible, in silico
complement to high throughput perturbation screens, Sadria et al.
develop Fatecode, a deep learning-based computational method that can
predict cell fate regulators from scRNA-seq data.
Introduction
In tissue development, specific gene regulators control how cells
change state and type to form a complete tissue.[30]^1 These gene
regulators are also important because they can be used to control cell
fate for multiple applications, including in regenerative medicine and
cancer.[31]^2 However, it remains a challenge to identify these
regulators within complex and dynamic tissue systems.[32]^1
Cell fate regulators can be identified using experimental methods such
as high-throughput genetic perturbation screens (e.g., CRISPR-based)
with single-cell gene expression (single-cell RNA sequencing
[scRNA-seq]) readouts.[33]^3^,[34]^4 However, these methods are
challenging to run on arbitrary biological systems. Computational
methods have been developed to predict gene expression programs that
explain the difference between perturbed and unperturbed
states[35]^5^,[36]^6^,[37]^7^,[38]^8 or to predict the linear effect of
perturbing a particular transcription factor.[39]^9 Also, computational
methods that determine the ordering of cell states along a trajectory,
based on their gene expression profiles using a pseudotime or actual
time approach,[40]^10^,[41]^11^,[42]^12^,[43]^13^,[44]^14 have been
used to examine the cell decision-making process by identifying genes
that are differentially expressed between trajectory branches. However,
these latter methods often have trouble identifying accurate
trajectories and branchpoints.[45]^15^,[46]^16 Furthermore, none of the
above methods are designed to identify cell fate regulators in normal
developmental processes.
We developed Fatecode, a computational method to predict important cell
fate regulator genes for cell types of interest. Fatecode predicts cell
fate regulators based only on scRNA-seq data covering a given range of
cell types to be analyzed. Fatecode learns a latent representation of
the scRNA-seq data using a deep learning-based
classification-supervised autoencoder[47]^17^,[48]^18 and then performs
in silico perturbation experiments on the latent representation to
predict genes that, when perturbed, would alter the original cell type
distribution to increase or decrease the population size of a cell type
of interest. Fatecode can be thought of as an in silico CRISPR
perturbation screen that identifies genes that influence cell fate,
based on a cell type readout. These genes can be traditional (e.g.,
transcription factors) or non-traditional regulators (any other genes).
We assessed Fatecode’s performance using simulated data produced by a
mechanistic model based on pre-defined gene-regulatory networks with
known cell fate regulators[49]^19 and tested it on scRNA-seq maps of
blood and developing brain from zebrafish and
mouse.[50]^20^,[51]^21^,[52]^22^,[53]^23
Results
Fatecode method overview
Fatecode uses a classification-supervised autoencoder to detect key
genes that can shift the cell type frequencies in an input scRNA-seq
dataset toward a desired distribution of cell types. Taking single-cell
gene expression profiles as input, the autoencoder learns a latent
space with reduced dimensions capturing the input information (reduce
gene dimension x cell matrix). A supervised cell type classifier is
included as part of the loss function to create a latent space composed
of features that support optimal cell type classification in addition
to input data reconstruction. Known cell type annotations in the input
data are used to train the classifier. This ensures that the latent
space is relevant for cell type classification used in later stages.
Each latent layer node of the autoencoder, which represents a reduced
dimension of the input, is systematically perturbed to simulate
altering key gene expression programs (sets of genes that are
correlated with each other that are represented by individual learned
latent layer dimension). Cell types are then reclassified to
characterize the effect of the perturbation, and the autoencoder’s
decoder uses the perturbed and unperturbed latent embeddings to
generate a gene-by-cell matrix of gene prioritization scores. This
matrix is used to identify genes important for the perturbation effect
([54]STAR Methods; [55]Figures 1 and [56]S1). Resulting cell type
distributions are generated for each possible perturbation and then
manually evaluated to identify those that increase or decrease
proportions of desired cell types. In this way, regulator genes are
identified to increase or decrease a given cell type proportion
relative to all other cell types, and these are predicted to be cell
fate regulators for the given cell type. An average of the cell fate
regulator prioritization scores across cells for the cell type is
computed to produce a final regulator list for each cell type.
Figure 1.
[57]Figure 1
[58]Open in a new tab
Fatecode workflow for in silico perturbation experiments and cell fate
regulator detection
The 3D model (top) represents a Waddington-like landscape depicting
cellular reprogramming processes. We seek to identify genes (question
marks) that regulate paths on this landscape (wavy lines) by
transitioning them to another path (red arrows). A
classification-supervised autoencoder learns a latent space
representing the original data, optimized for both input reconstruction
and cell type classification. The latent layer is systematically
perturbed, and by investigating all resulting perturbation-generated
cell type distributions, distributions with an increase or decrease in
a cell type of interest are identified. Perturbation output is
simulated by subtracting the perturbed from unperturbed latent layers
and feeding it to the decoder to identify a cell-by-gene matrix of
prioritization scores that can help us to prioritize genes predicted to
be important for achieving a desired cell population distribution. An
average of the cell fate regulator prioritization scores across cells
in each cell type is computed. By sorting these genes based on their
prioritization scores for a cell type of interest, the model predicts
genes that are important for regulating the levels of a given cell
type.
Our latent layer perturbation approach is inspired by latent vector
operations used in natural language processing and computer vision
applications to generate novel text and
images.[59]^24^,[60]^25^,[61]^26 In those applications, perturbation
operations performed on the latent layer generally yield superior
results compared to operations performed directly in the input space.
The classification component of Fatecode is used to exclude possible
latent space regions that do not conform to the overall structure of
the data. This helps in learning a model that is more representative of
the underlying data distribution.[62]^27
Optimizing model architecture and hyperparameters
Fatecode relies on the latent embedding of an autoencoder, but
different types of autoencoders may produce different
results, depending on the input data (see [63]supplemental
information).[64]^6^,[65]^28^,[66]^29^,[67]^30 To investigate this in
our problem context, we evaluated the performance of three common
autoencoder architectures: under-complete autoencoder (AE), variational
autoencoder (VAE), and conditional VAE (CVAE).[68]^31 The first step of
Fatecode evaluates these three autoencoder architectures and other
hyperparameters (see the [69]Hyperparameter search section) to find the
ones that best reconstruct the input data, measured by mean squared
error for reconstruction and cross-entropy for cell type
classification. To illustrate the importance of this step, we compared
how the choice of autoencoder affects learning the underlying
representation for two single-cell gene expression datasets in adult
zebrafish blood[70]^20 and murine pancreatic development.[71]^32 AE
produced the lowest reconstruction error for the zebrafish data
(averaged over cell types) ([72]Figures 2A and 2B). AE also produced a
latent layer that successfully reduces the dimension and cleanly
separates the five known cell types in the data ([73]Figure 2C), and
its cell type classifier yields a high accuracy ([74]Figure 2D).
However, for the mouse data, VAE achieved a higher accuracy compared to
the other autoencoders ([75]Figure S2).
Figure 2.
[76]Figure 2
[77]Open in a new tab
Comparison of autoencoder architectures for analyzing data for
hematopoiesis regulation in zebrafish blood
(A) Comparison of correlation between input and output of AE,
variational autoencoder (VAE), and conditional variational autoencoder
(CVAE).
(B) MSE between input and output of the three autoencoder
architectures, showing that AE produces the lowest error rate for this
dataset.
(C) Uniform Manifold Approximation and Projection (UMAP) visualization
of the latent layer of the under-complete autoencoder (AE).
(D) Confusion matrix for the classifier connected to the latent layer
of AE demonstrating excellent classification performance.
Fatecode accurately detects known regulators from simulated scRNA-seq data
To assess the accuracy by which Fatecode identifies cell fate
regulators using gene expression profiles, we applied the method to
simulated scRNA-seq data generated from known gene regulatory network
(GRN) structures using SERGIO.[78]^19 SERGIO allows users to specify
the number of cell types and key regulators in the simulated GRN
([79]Figure 3A). While Fatecode is not specific to GRNs (i.e., it can
identify a list of genes of any type, not just transcription factors),
a GRN-based simulation is expected to provide a good benchmark for our
method. A matrix of 400 cells and 2,700 genes with 20 known regulators
and 9 cell types was generated and run through Fatecode. Predicted
cell fate regulator genes and their prioritization scores were compared
to the known SERGIO regulator list. The number of known regulator genes
identified increases as more genes are prioritized ([80]Figure 3B).
Almost all of the known regulator genes (18 of 20) were identified when
150 genes were prioritized (of 2,700). To compare with a naive
baseline, we identified cell type markers (top 20 genes) using
differential gene expression (DGE) analysis on the same data using
Seurat’s non-parametric Wilcoxon rank-sum test.[81]^33 Fatecode
identifies a larger number of known regulators compared to DGE analysis
when examining up to 150 prioritized genes ([82]Figure 3B). As SERGIO
is a stochastic method, we analyzed five additional simulated datasets
of the same size, all of which yielded similar results (plotted as
shading in [83]Figure 3B). We repeated this analysis on a larger
dataset consisting of 2,700 cells, 1,200 genes, with 65 predefined
regulators, and 9 distinct cell types. We used Fatecode to identify the
top 180 key genes of these data, and DGE analysis to identify the top
20 differentially expressed genes from each cell type. Also, for
comparison, we included scFates, a method specifically designed for
trajectory-based DGE analysis.[84]^34 Fatecode consistently
outperformed both DGE methods in detecting known regulators. We further
evaluated performance by varying the top k gene threshold of DGE, and
Fatecode consistently outperformed DGE across all tested thresholds,
demonstrating its robustness while varying the number of genes
considered ([85]Figures 3C and [86]S3). Thus, Fatecode performs well at
identifying known regulators in simulated scRNA-seq data.
Figure 3.
[87]Figure 3
[88]Open in a new tab
Fatecode detects known regulators using simulated data generated by
SERGIO
(A) The schematic structure of the GRN to generate scRNA-seq. Red nodes
are known regulators, and green nodes are non-regulators whose
production rates are determined by their associated regulators. Our
goal is to identify known regulators from the generated scRNA-seq data
using Fatecode.
(B) Benchmark comparisons of the detection rate of predefined
regulators generated by SERGIO using Fatecode compared with a naive
differential gene expression (DGE) baseline. The red and green areas
represent the performance of Fatecode and DGE, respectively, on the
simulated data with 400 cells.
(C) Benchmark comparisons of the detection rate of known regulators
using Fatecode, scFates, and DGE on simulated data with 2,700 cells.
(D) Venn diagram showing the similarity between the number of known
regulators uncovered by Fatecode across various latent layer sizes.
We also examined the sensitivity of our model by the size of the latent
layer in the autoencoder by training Fatecode with different latent
layer sizes (n = 50, 75, and 100 dimensions) using the 2,700-cell
simulated data ([89]Figure 3D). Our results show general consistency
across the different latent layer sizes, indicating that Fatecode
exhibits robustness across a range of latent layer sizes.
Fatecode identifies known cell fate regulator genes in mouse hematopoiesis
Hematopoiesis is a cell differentiation process by which the body
produces mature blood cells from stem cells. We applied Fatecode to a
published mouse hematopoiesis single-cell differentiation dataset,
which involves the differentiation of myeloid progenitors into 9 cell
types ([90]Figure 4A).[91]^22 We then examined Fatecode’s accuracy in
predicting cell fate regulators that lead to the desired cell type
distribution by comparing the results with ground-truth experimental
perturbation data and known regulator
genes.[92]^9^,[93]^22^,[94]^35^,[95]^36 Fatecode learned a latent node
that, when perturbed, simultaneously increases the monocyte population
and decreases erythrocytes and granulocytes ([96]Figure 4B). Previous
studies have demonstrated that Irf8 is important in promoting the
differentiation of the GM (granulocyte-monocyte) lineage, particularly
monocytes, and functions as a key regulator in determining the fate
between granulocytes and monocytes. Fatecode accurately predicted Irf8
as an important cell fate regulator in the monocyte differentiation
process. It correctly assigned a high positive score for monocytes and
late_GMP (granulocyte-macrophage progenitor) and negative scores for
granulocytes and MEP (megakaryocyte-erythroid progenitor) lineages,
consistent with previous studies ([97]Figure 4C). Next, we investigated
the prediction results for Cebpa, knockout of which leads to a decline
in the population of differentiated myeloid cells, while concurrently
increasing the number of erythrocytes. Fatecode accurately assigned a
high positive score to Cebpa for monocytes and granulocytes and a
negative score to erythrocytes and MEPs ([98]Figures 4D and 4E). In
another example, Klf1 is a key regulator in driving differentiation
toward the ME (megakaryocyte-erythroid) lineage, specifically promoting
the development of erythrocytes, while simultaneously inhibiting the
GMP lineage. Fatecode correctly assigned a set of positive scores to
Klf1 for erythrocytes and MEPs, indicating its ability to capture a key
regulator in ME lineage differentiation ([99]Figure S4A). We also
tested Fatecode’s ability to detect genes that are known to be
important in maintaining stemness and inhibiting differentiation.
Fatecode correctly predicted Runx1 as a candidate that has negative
scores for perturbations that increase all mature cell types (all cell
types expect MEPs and GMPs) ([100]Figure S4B). Last, we examined the
prediction results for Fli1, which has diverse effects on
differentiation. Fatecode accurately gives positive scores for the
association between Fli1 and megakaryocytes, monocytes, and
granulocytes and also assigns a notable negative score to erythrocytes,
in agreement with the literature[101]^9^,[102]^37 ([103]Figure S4C).
These simulations show that Fatecode accurately identifies known cell
fate regulators that have been reported in previous perturbation-based
experimental studies.
Figure 4.
[104]Figure 4
[105]Open in a new tab
Fatecode accurately detects regulators and predicts the effect of
single-cell perturbations
(A) Hematopoiesis data from Paul et al.,[106]^22 visualized as a UMAP
and clustered into 9 cell types.
(B and D) The results of in silico perturbations that change the
initial cell frequency to the desired distribution. For (B), our
objective was to promote monocytes while reducing the number of
erythrocytes. For (D), we aimed for an increase in the erythroid
population and a decline in MEPs and megakaryocytes.
(C and E) Gene prioritization scores per cell type for Irf8 and Cebpa.
(F) Pathway enrichment analysis results. Gene Ontology biological
processes show significant processes related to cell development and
hematopoiesis.
Furthermore, to evaluate the role of the top 200 genes detected by
Fatecode for monocytes, we performed pathway enrichment analysis.
Pathways that are significantly enriched in these 200 genes include
those related to the immune system, hemopoiesis, cell development, and
cell differentiation, which agrees with their Fatecode-predicted role
in monocyte development ([107]Figure 4F).
We extended our analysis to larger hematopoiesis single-cell
differentiation data that involve differentiation into 12 cell types
([108]Figure S5A).[109]^23 We applied Fatecode to detect genes that can
increase the pool of undifferentiated cells in this system
([110]Figure S5B). One candidate detected by Fatecode in this process
is Entpd8, the deletion of which in mice elevates the neutrophil and
monocyte population.[111]^38 Fatecode predictions are consistent with
this experimental result. Fatecode also predicted Nlrp6 as a regulator
of neutrophil and monocyte differentiation. Cai et al. showed that the
number of hematopoietic stem cells and GM progenitors is reduced in
Kp-infected Nlrp6^−/− mice, while the survival of mature neutrophils in
bone marrow is increased.[112]^39 We repeated gene set enrichment
analysis using the top 200 genes detected by Fatecode. Biological
processes related to mouse hematopoiesis, stem cell development, and
metabolic signaling were enriched, showing that Fatecode can again
capture relevant pathways for this biological process
([113]Figure S5C).
Fatecode detects important regulators in cell differentiation and lineage
commitment in zebrafish
We applied Fatecode to zebrafish hematopoiesis data[114]^20 as an
additional demonstration and test. From all possible perturbations on
the latent layer performed by Fatecode, we selected ones that resulted
in the greatest predicted relative increase in hematopoietic stem and
progenitor cells (HSPCs) ([115]Figure 5A). As shown in [116]Figure 5B,
following the perturbation, some cells (mostly monocytes) are predicted
to switch to HSPCs ([117]Figure 5B). Fatecode gives a significant score
to signal transducer and activator of transcription 5A (stat5a) as one
of the most important genes for HSPCs. Stat5a is a key regulator of
normal hematopoiesis, with pleiotropic roles in hematopoietic stem
cells.[118]^40 Also, knockout studies have shown that the deletion of
stat5a led to an increase in HSPC cycling, gradually reduced survival,
and depleted the HSPC pool.[119]^41 Next, Fatecode gives irf8 a high
positive score for monocytes. Irf8 is a key regulator of monocyte
development, and it has been known to be important for myelopoiesis in
different model organisms.[120]^42^,[121]^43 It functions at an early
step of the transcriptional program that governs differentiation from
myeloid progenitors to monocytes/macrophages and plays a key role in
stem cell renewal and maintenance.[122]^43^,[123]^44 Fatecode also
identified a strong negative connection between foxo3 and myeloid cell
differentiation, consistent with foxo3 knockout studies, which show a
significant increase in granulocyte/monocyte progenitors in the spleen,
bone marrow, and blood and enhance short-term hematopoietic stem cell
proliferation.[124]^45^,[125]^46^,[126]^47 Fatecode found an important
role played by the otud gene family, a subgroup of deubiquitination
enzymes, by assigning a high positive score between HSPCs and the otud
gene family. Consistent with our prediction, knockout of otud genes in
Xenopus results in developmental impairments.[127]^48 Also, elevated
expression of otud genes leads to the acquisition of stem cell
properties.[128]^49 Fatecode also predicted the negative score between
thbs1 and HSPCs, where thbs1 has been shown previously to limit the
expression of essential self-renewal transcription factors, including
oct3 and oct4, sox2, klf4, and c-myc, within cells.[129]^50 Other key
gene candidates identified by Fatecode for this perturbation are also
known to be involved in hematopoiesis ([130]Table 1).
Figure 5.
[131]Figure 5
[132]Open in a new tab
In silico experiments to induce hematopoietic stem/progenitor cells
using hematopoiesis in zebrafish
(A) A series of latent layer perturbations and their effect on cell
distribution.
(B) Cells that switch from their initial cell type to HSPCs are
highlighted.
Table 1.
List of zebrafish hematopoiesis regulator genes predicted by Fatecode,
with literature evidence for involvement in this process
Gene Roles Reference
cdk1 plays an important role in the maintenance of pluripotency and
genomic stability in human pluripotent stem cells Neganova
et al.[133]^70
top2a controls the survival of human pluripotent stem cells Ben-David
et al.[134]^71
hmgb2a regulates hematopoietic stem cell maintenance Zhang
et al.[135]^72
ube2c highly expressed in human embryonic stem cells (hESCs) and a
biomarker of cancer stemness Fatima et al.[136]^73; Liu et al.[137]^74
fbxo11 depletion leads to the hematopoietic population with stem cell
characteristics Mo et al.[138]^75
hmgn2 facilitates the maintenance of active chromatin states required
for stem cell identity in a pluripotent stem cell model Garza-Manero
et al.[139]^76
aspm regulates symmetric stem cell division by tuning Cyclin E
ubiquitination Capecchi et al.[140]^77
myb regulates hematopoietic stem cell and myeloid progenitor cell
development Baker et al.[141]^78
kpna2 exhibits strong interactions with oct4 in embryonic stem cells Li
et al.[142]^79
[143]Open in a new tab
Fatecode identifies cell fate regulators in mouse hippocampus development
To demonstrate Fatecode on a larger biological dataset, we applied it
to developing mouse hippocampus scRNA-seq data,[144]^21 composed of
18,213 cells and 3,001 genes. The data are clustered in 14 annotated
cell types ([145]Figure 6A). We first sought to identify regulators in
the differentiation process that preferentially increase mature granule
cells ([146]Figure 6B). Fatecode predicts the ZFP gene family (Zfp94,
Zfp189, and Zfp706) as positively important in granule cell
differentiation. The Zfp family is a definitive marker for the
cerebellar granule neuron lineage and plays a critical role in granule
cell specification within the developing cerebellum.[147]^51 For
example, lack of Zfp521 results in a significant reduction in the
number of granule cells.[148]^52 Id2 and Id3 are important in
maintaining the size and cellular structure of the brains of adult
mice. It has also been shown that the absence of
[MATH:
Id2−/− :MATH]
leads to a decrease in the number of granule neurons.[149]^53^,[150]^54
In line with this earlier research, Fatecode assigns a high positive
score between both Id2 and Id3 for mature granule cells. These two
transcriptional regulators have also been found to determine the fate
of differentiating CD8^+ T cells.[151]^55
Figure 6.
[152]Figure 6
[153]Open in a new tab
Fatecode identifies key genes in mouse neurogenesis
(A) UMAP embedding of fourteen major cell types.
(B) Latent layer node perturbation leads to an increase in mature
granule cells while a decrease in immature granule cells.
(C) Pathway enrichment analysis shows the relevant biological process
using the top 200 genes selected based on their prioritization scores
for mature granule cells.
Next, we applied Fatecode to determine regulators that mediate the
differentiation process, which preferentially increases oligodendrocyte
progenitor cells (OPCs), and decreases granule cells (both mature and
immature) and oligodendrocytes. Fatecode predicted Igfbpl1 as having an
impact on OPC-to-oligodendrocyte differentiation, which is consistent
with published experimental studies.[154]^56^,[155]^57 Furthermore, we
considered Fth1, which provides neuroprotection and is enriched in
oligodendrocytes. Mice lacking Fth1 have more microglia cells compared
to the control and a significant reduction in neurons and
oligodendrocytes.[156]^58 Fatecode accurately assigned a high positive
score linking Fth1 to oligodendrocytes and mature granule cells and a
negative score for Fth1 and microglia cells, showing that knocking out
of Fth1 leads to an increase in microglia cells, consistent with the
experimental studies. Thymosin beta 4 (Tmsb4x) is a key candidate in
the context of neurogenesis during brain development.[157]^59 Its
expression is linked to neurogenic processes and exerts regulatory
control over the expansion of the stem cell pool within the early
neuroepithelium. Tmsb4x gene knockout elicits a pronounced effect on
the differentiation process in vitro. Specifically, it significantly
promotes the differentiation of stem cells, further emphasizing its
role in orchestrating cellular fate determination.[158]^60 Our method
correctly assigns a negative score for Tmsb4x and all cells except
neuroblasts and radial glia-like cells. To further validate the
performance of Fatecode in detecting key genes, we performed pathway
enrichment analysis on the top 200 Fatecode-predicted regulators. This
analysis showed that pathways related to brain development, synaptic
signaling, and protein synthesis were significantly enriched in these
genes ([159]Figure 6C).
To illustrate further downstream analysis that is possible based on
Fatecode results, we applied single-cell regulatory network inference
and clustering (SCENIC) on the mouse hippocampus development dataset to
construct a GRN consisting of the top 2,000 interactions based on their
SCENIC importance measure scores, which shows the significance of the
input gene (referred to as the “TF”) in determining the prediction
outcome for the target.[160]^61 We then mapped the top 400
Fatecode-predicted regulators to the SCENIC-inferred GRN. The resulting
networks can be used as a guide for identifying specific GRN mechanisms
to target in follow-up experiments (Ybx1 example, [161]Figure S6) to
test the regulatory relationships and potential roles of regulators in
cellular reprogramming. While SCENIC predicts useful additional
information to support experiment planning, it only considers
transcription factor regulators. Other types of genes in Fatecode’s
output can be identified as cell fate regulators and should also be
examined.
Discussion
Cell reprogramming is a promising technology for tissue repair and
regeneration, with the ultimate goal of accelerating recovery from
diseases or injuries, as well as the development of novel
therapies.[162]^62 An important component in successful cell
reprogramming is to correctly identify the regulators and trajectories
from single-cell transcriptomics data. However, the number of genes in
these datasets is large, and the number of underlying regulatory
interactions is much larger. Recent studies have demonstrated that the
expression of a single regulator is insufficient to produce an endpoint
phenotype.[163]^63 Instead, a group of control networks acts together
across a variety of biological processes and pathways to induce a
complete lineage conversion.[164]^64 To efficiently and accurately map
these control networks, we developed a deep learning method, Fatecode,
which we have successfully applied to analyze diverse datasets. First,
our method discovers an efficient architecture and latent layer for an
input single-cell dataset. Then, by performing operations on the latent
layer, it is able to predict perturbations for cell fate reprogramming.
Fatecode was validated using simulated scRNA-seq data with predefined
regulators and by predicting regulators in a variety of biological
scRNA-seq data and manually comparing the results to the literature.
The fundamental idea in Fatecode is similar to the minimum Hamiltonian
in physics and the potential energy landscape concept and
representation learning.[165]^27^,[166]^65 These authors have shown
that the most common autoencoders are naturally associated with an
energy function, independent of the training procedure. This reasoning
suggests that regulators can be seen as genes that allow the system to
achieve a target cell type distribution via the most efficient path
through the energy landscape. Fatecode uses the classifier as a guide
to determine what node in the latent layer must be perturbed to achieve
the desired reprogramming effect. Then the decoder maps the modified
latent layer to gene space for gene identification. It is also useful
to understand whether regulators are cell type specific. For example,
the mammalian target of rapamycin complex (mTORC1) is widely important
in cell fate decision-making and also important in the regulation of
T cell fate.[167]^66^,[168]^67^,[169]^68^,[170]^69 Running Fatecode for
different cell conversions can help identify cell-type-specific and
non-specific regulators.
In conclusion, we developed an effective computational framework for
predicting key players in cell fate control and the consequences of
perturbations on cell type frequencies. Fatecode’s modular design
enables users to select an autoencoder architecture that produces an
accurate model for their data. By leveraging the power of
classification-supervised autoencoders and the associated energy
manifold learning process, Fatecode generates useful hypotheses about
genes that could be manipulated to achieve desired cell transitions. We
hope this method accelerates the discovery of novel cell fate
regulators that can be used to engineer and grow cells for therapeutic
use in regenerative medicine applications.
Limitation of study
Fatecode can be thought of as an in silico CRISPR perturbation screen
that identifies genes that can influence cell fate. Unfortunately, we
were not able to find a published genome-wide CRISPR perturbation
screen of an appropriate cell line and with a cell fate readout. Most
genome-wide CRISPR-screens use standard cell lines that are not
naturally multi-potent and, thus, are not expected to generate multiple
cell fates. CRISPR has been used to evaluate cell fate regulators, but
only by examining one or a few candidate genes in a single paper. We
used these latter small-scale results to verify that Fatecode results
agree with these experiments. Because we could not find genome-wide
CRISPR screens with a cell fate readout, we used GRN simulations with
predefined regulators and small-scale CRISPR experiments to validate
our findings. In the future, we hope genome-scale CRISPR screens for
cell fate regulators will be published for us to compare to.
Despite offering a useful input data representation, how the
autoencoder latent layer represents the input data may be difficult to
understand. Future work will need to better understand how the input
data are represented and learned in the latent layer given diverse
input data. However, our results showed that Fatecode predictions are
relatively stable when changing the size of the latent layer,
indicating that latent information is likely captured consistently.
STAR★Methods
Key resources table
REAGENT or RESOURCE SOURCE IDENTIFIER
Deposited data
__________________________________________________________________
Zebrafish hematopoiesis data Athanasiadis et.al.[171]^20 E-MTAB-5530
Raw and analyzed data This paper
[172]https://doi.org/10.5281/zenodo.11511340
Dentate Gyrus neurogenesis Hochgerner et al.[173]^21 [174]GSE95753
Hematopoiesis Paul et al.[175]^22 [176]GSE72859
Hematopoiesis Weinreb et al.[177]^23 [178]GSE140802
SERGIO Dibaeinia et al.[179]^19
[180]https://github.com/PayamDiba/SERGIO
__________________________________________________________________
Software and algorithms
__________________________________________________________________
Fatecode This paper [181]https://doi.org/10.5281/zenodo.11511340
Cytoscape Franz et al.[182]^81
[183]https://doi.org/10.1093/bioinformatics/btad031
scFates Faure et al.[184]^34
[185]https://doi.org/10.1093/bioinformatics/btac746
Scenic Aibar et al.[186]^61 [187]https://doi.org/10.1038/nmeth.4463
gseapy Fang et al.[188]^82 [189]https://github.com/zqfang/GSEApy
scikit-learn Alex et al.[190]^83 [191]https://scikit-learn.org/
Seurat CRAN [192]https://satijalab.org/seurat/
[193]Open in a new tab
Resource availability
Lead contact
Lead contact Further information and requests for resources should be
directed to and will be fulfilled by the lead contact, Mehrshad Sadria
(msadria@uwaterloo.ca)
Materials availability
This study did not generate new unique reagents.
Data and code availability
* •
The datasets used in the present study are openly accessible in
public repositories. The zebrafish hematopoiesis data can be found
under the accession number E-MTAB-5530 on ArrayExpress. We
downloaded a preprocessed version of the “Dentate Gyrus
neurogenesis” data (under accession number [194]GSE95753
) from [195]https://scvelo.readthedocs.io/en/stable/. The
hematopoiesis Paul et al. data can be downloaded from the GEO under
accession code [196]GSE72859 and the preprocessed version was
downloaded from [197]https://celloracle.org/. To generate simulated
data, we used the same parameters for the differential equations as
in [198]https://github.com/PayamDiba/SERGIO.[199]^19 The
hematopoiesis Weinreb et al. data can be downloaded from GEO under
accession number [200]GSE140802 and the preprocessed version was
downloaded from [201]https://cospar.readthedocs.io/en/latest/.
* •
Code supporting this study is available on:
[202]https://github.com/MehrshadSD/Fatecode and
[203]https://doi.org/10.5281/zenodo.11511340.
* •
Any additional information required to re-analyze the results
reported by this study are available from the [204]lead contact
upon request.
Method details
Deep representation learning
Autoencoders are a class of neural networks with a latent layer capable
of learning nonlinear representations of the input data in an
unsupervised manner. An autoencoder consists of an encoder that maps
the input to the latent space and a decoder which transfers the latent
space back to the original space. It can be used for denoising,
reducing dimensionality, or learning the representation (or manifold)
of the data. We implemented three autoencoder architectures:
under-complete AutoEncoder (AE), Variational AutoEncoder (VAE), and
Conditional VAriational Encoder (CVAE)[205]^31 ([206]Figure 1). AE has
a single latent layer. VAE constrains the latent layer by modeling the
latent space as a multivariate Gaussian distribution with a mean and a
standard deviation. CVAE conditions the latent space on class labels
and thus can generate data based on a given class label. The biological
task for our autoencoder is to learn a reduced dimension representation
of a cell by gene matrix capturing measurements of a single-cell
transcriptomics experiment mapping cellular trajectories. Only the gene
dimension is reduced, so the latent space describes a reduced
representation of each input cell transcriptome. To make the latent
layer more specific for our biological task, we added a cell type
classification task to the standard regression tasks. The
classification task, described in more detail below, predicts the type
of each latent cell and compares it to a known input cell type. The
training process works to optimize both classification and regression
performance simultaneously. This reduces the space of latent layer
candidates since not all possible latent layers are useful for the
classification task.
VAE
VAE is a type of autoencoder that estimates a latent set of probability
density functions that model the input data. Unlike AE, which learns an
unconstrained representation of the data, VAE assumes a Gaussian
distribution for the prior. An input gene by cell matrix X is run
through an encoder, which generates parameters for the set of
distributions Q(z |X). Then, from Q, a latent k-vector z is sampled and
the decoder transforms z into an output, with the condition that the
output is similar to the input, where k equals the number of components
(or distributions) in the VAE. The VAE total loss consists of the
reconstruction loss (first term) and the KL-divergence loss (second
term):
[MATH:
E[logP(X|z)
]−DKL[Q(z|<
/mo>X)‖
P(z)] :MATH]
[MATH:
DKL[Q(z|X<
/mi>)‖P
(z)]=−12∑k=1K[1+logσk2−μk2−σk2<
/mrow>] :MATH]
where
[MATH: μk :MATH]
and
[MATH: σk :MATH]
are the k-th components of output vectors
[MATH: μk :MATH]
(X) and
[MATH: σk :MATH]
(X), respectively.
CVAE
CVAE is distinguished from VAE by its embedding of conditional
information in the objective function. CVAE relies on two inputs: the
features and the class labels, c, instead of using only the features,
as is done with a VAE and AE. The CVAE architecture allows the encoder
and the decoder to be conditioned by c. Hence, the variational lower
bound objective is changed to the following form.
[MATH:
E[logP(X|z,c)]−
DKL[Q(z|<
/mo>X,c)‖
P(z|c
)],
:MATH]
Overall network architecture of Fatecode
The Fatecode autoencoder architecture was chosen for each of the
datasets analyzed in this study using a hyperparameter search (More
details in [207]Hyperparameter search section). Encoder and decoder
architectures are constrained to have the same number of outer and
inner layer nodes. For the analysis of hematopoiesis regulation in
zebrafish, Fatecode consists of a fully connected encoder and decoder.
The encoder and decoder are both two-layer networks of 92 (outer layer)
and 48 (inner layer) nodes with the LeakyReLU activation function and
the latent layer has 18 nodes. For the analysis of hematopoiesis in
mouse data by Weinreb et al.,[208]^23 the encoder/decoder has a
506-node outer layer and a 253-node inner layer, and the latent layer
has 125 nodes. For the mouse hematopoiesis data by Paul et al.[209]^22
the encoder/decoder has a 100-node outer layer and a 40-node and the
latent layer has 20 nodes. For the developing mouse hippocampus data,
we used a two-layer encoder/decoder of 50 (outer), 26 (inner), and a
latent layer of 15 nodes. Our model was built using software packages
and libraries, including TensorFlow V2.10.0, scikit-learn V1.1.3,
scanpy V1.9.1, numpy V1.23.4, and pandas V1.5.1.
Classification
The classifier determines cell types using the latent layer as input to
a single hidden layer and then an output layer (with one node per cell
type), all fully connected. ReLu and softmax activation functions are
used for the hidden and output layers, respectively. The number of
nodes in the hidden layer is varied during the hyperparameter
optimization. For adult zebrafish blood data,[210]^20 we use 15 and
5 nodes for the hidden and output layers, respectively. We use 25 and
12 nodes for classifying hematopoiesis in mouse data by
Weinreb et al.,[211]^23 20 and 9 nodes for data from Paul
et al.,[212]^22 and 22 and 14 for the developing mouse hippocampus
data.[213]^21 All cell labels are assigned by using the predefined cell
type labels of the original studies.
Identifying key regulators in cell differentiation
Consider adjustments (e.g., one or more gene knock-outs or
over-expressions) that will transition a baseline cell type
distribution (“A”) to a given desired target distribution (“B”). For
example, in the target cell distribution, our objective is to increase
the number of cell type N while decreasing the number of cell type P
([214]Figure S1). To detect genes that are important in a given
transition, Fatecode analyzes the effects of perturbations on cell fate
by systematically perturbing individual autoencoder latent nodes
learned from a single-cell transcriptomics data capturing cellular
trajectories. Each latent variable perturbation results in a
single-cell transcriptome through the decoding process and a
corresponding cell type distribution, proceeding as follows after
training Fatecode.
* (1)
The gene expression data, denoted as E, corresponding to a mixture
of cells with cell type distribution A, undergoes encoding to
produce a matrix of latent variables represented as
[MATH: X :MATH]
(
[MATH: X=encoder(E) :MATH]
). Each column of
[MATH: X :MATH]
is associated with a cell in E; each row corresponds to a latent
variable).
* (2)
In a series of simulations, finite perturbations of different sizes
[MATH: K :MATH]
(e.g., from a 50% reduction to a 10-fold increase) are applied to
each row j (number of latent variables) in
[MATH: X :MATH]
sequentially. For each perturbed latent layer row,
[MATH:
Xj∗ :MATH]
[MATH:
Xj∗=kXj :MATH]
* (3)
We then run the cell type classifier trained within Fatecode on the
perturbed latent layer to predict the cell type distribution for
each across all perturbation conditions.
[MATH:
New_cel
l_type_<
/mo>distrib
mi>ution=clas
sifier(Xj∗) :MATH]
* (4)
Then, we can identify a perturbed latent layer row,
[MATH:
Xj∗ :MATH]
, and its associated perturbation size, k, that is closest to the
desired target distribution B.
* (5)
To identify genes important for the transition from cell type
distributions A to B, we compute the difference between the
selected
[MATH: X∗ :MATH]
and the
[MATH: X :MATH]
latent layers. For instance, if increasing latent node #9 5-fold
can best approximate the desired distribution B, then the
difference between the selected
[MATH: X∗ :MATH]
and
[MATH: X :MATH]
latent layers is a latent node by cell matrix with all zero
entries, except for the 9^th row, which is 5 times
[MATH: X9 :MATH]
.
* (6)
With this selected perturbation matrix
[MATH: (X∗−<
/mo>X) :MATH]
, the decoder produces a gene-by-cell matrix. Then the average gene
expression profile of all cells in each cell type is computed,
resulting in a gene by cell_type matrix M. The (i,j)-th entry of M
is the prioritization score for the i-th gene in cell_type j.
* (7)
To identify the regulators predicted to be important for
transitioning initial cell type distribution A to target B, we rank
the genes based on their prioritization scores for a cell type of
interest.
[MATH:
Regulat
ors=sort(Mdes
mi>ired_celltype
msub>) :MATH]
We note that M does not directly specify how much each gene should be
perturbed to yield target B. Nonetheless, M contains information about
genes that are important in transitioning cell type distribution from
initial state A to the desired state B. This idea is similar to
potential energy in physics and representation
learning.[215]^27^,[216]^65
We also examine the model’s performance in detecting regulators when
operating on the output of the decoder compared to the latent layer. To
achieve this, we feed the perturbed vector to the decoder and subtract
the result from the unperturbed condition. We then investigate the
genes that show significant changes. Our results indicate that working
on the latent layer leads to better outcomes in detecting regulators
than operating on the output of the decoder. This observation is in
line with previous research in computer vision and natural language
processing, where using the latent space consistently yields superior
results compared to the original data space.[217]^25^,[218]^80 We
assume this is true in general when using an autoencoder with a
non-linear activation function with reasonably complex data, as we have
in biology (in contrast to the linear activation function case where
[MATH:
Decoder
(Xper
mi>turbed<
/msub>)−Decoder(X)
mo>=Decoder(Xpe<
mi>rturbed
mrow>−X) :MATH]
).
Hyperparameter search
Hyperparameter tuning was conducted using a grid search approach. We
explored various combinations of hyperparameters, including learning
rate, batch size, number of layers, number of neurons per layer, and
activation functions. The hyperparameter space for each parameter was
defined as follows.
* (1)
Autoencoder type: [AE, VAE, CVAE]
* (2)
Activation function: [LeakyReLU, Relu, linear]
* (3)
Learning rate: [0.001, 0.01, 0.1]
* (4)
Batch size: [400, 500, 600]
* (5)
Number of hidden layers (encoder): [1, 2]
* (6)
Number of neurons in latent_layer: [input_size/40, input_size/60,
input_size/80, input_size/100, input_size/125, input_size/150]
Data visualization
Python package “UMAP” was used to visualize the latent layer as a
reduced dimensionality space and for network visualizations we used
Cytoscape.[219]^81
Data preprocessing
The scRNA-Seq gene expression data is log normalized, scaled, and
centered. In the training process, 80% of the data is allocated for
training the classification-supervised autoencoder, while the remaining
20% is used for testing purposes.
Quantification and statistical analysis
Differential gene expression analysis was performed using the Wilcoxon
rank-sum test. To account for multiple testing, we applied the
Benjamini–Hochberg correction to the calculated P-values obtained from
the DEG analysis. Genes with a corrected p-value below 0.05 were
considered statistically significant. For scFates we used the default
parameters. For the identification of enriched gene ontology terms in
our study, we used the GSEApy package V1.0.4 with its default parameter
settings.
Acknowledgments