Graphical abstract

   graphic file with name fx1.jpg
   [29]Open in a new tab

Highlights

     * •
       Fatecode is a computational method to predict cell fate regulators
       from scRNA-seq data
     * •
       Fatecode uses autoencoder perturbation to identify genes that
       influence cell populations
     * •
       Simulations and real scRNA-seq data show Fatecode detects cell fate
       regulators
     * •
       Fatecode can accelerate discovery of cell fate regulators using
       widely available data

Motivation

   How stem and progenitor cells decide which cell types they will
   generate via cell division is crucial for understanding tissue
   development and engineering cell types for use in regenerative medicine
   or cancer therapies. However, the identification of the regulators of
   these cell fate decisions within the complex and dynamic system of
   tissues is a major challenge. Experimental high-throughput perturbation
   screens can help to dissect regulators, but these are not practical or
   easy to implement in every context of interest. To address this
   challenge, we developed a computational method, Fatecode, to predict
   master regulators and key pathways controlling cell fate based on any
   single-cell transcriptomics dataset.
     __________________________________________________________________

   Identifying the genes that control cell fate is essential for designing
   cellular reprogramming strategies. To provide an accessible, in silico
   complement to high throughput perturbation screens, Sadria et al.
   develop Fatecode, a deep learning-based computational method that can
   predict cell fate regulators from scRNA-seq data.

Introduction

   In tissue development, specific gene regulators control how cells
   change state and type to form a complete tissue.[30]^1 These gene
   regulators are also important because they can be used to control cell
   fate for multiple applications, including in regenerative medicine and
   cancer.[31]^2 However, it remains a challenge to identify these
   regulators within complex and dynamic tissue systems.[32]^1

   Cell fate regulators can be identified using experimental methods such
   as high-throughput genetic perturbation screens (e.g., CRISPR-based)
   with single-cell gene expression (single-cell RNA sequencing
   [scRNA-seq]) readouts.[33]^3^,[34]^4 However, these methods are
   challenging to run on arbitrary biological systems. Computational
   methods have been developed to predict gene expression programs that
   explain the difference between perturbed and unperturbed
   states[35]^5^,[36]^6^,[37]^7^,[38]^8 or to predict the linear effect of
   perturbing a particular transcription factor.[39]^9 Also, computational
   methods that determine the ordering of cell states along a trajectory,
   based on their gene expression profiles using a pseudotime or actual
   time approach,[40]^10^,[41]^11^,[42]^12^,[43]^13^,[44]^14 have been
   used to examine the cell decision-making process by identifying genes
   that are differentially expressed between trajectory branches. However,
   these latter methods often have trouble identifying accurate
   trajectories and branchpoints.[45]^15^,[46]^16 Furthermore, none of the
   above methods are designed to identify cell fate regulators in normal
   developmental processes.

   We developed Fatecode, a computational method to predict important cell
   fate regulator genes for cell types of interest. Fatecode predicts cell
   fate regulators based only on scRNA-seq data covering a given range of
   cell types to be analyzed. Fatecode learns a latent representation of
   the scRNA-seq data using a deep learning-based
   classification-supervised autoencoder[47]^17^,[48]^18 and then performs
   in silico perturbation experiments on the latent representation to
   predict genes that, when perturbed, would alter the original cell type
   distribution to increase or decrease the population size of a cell type
   of interest. Fatecode can be thought of as an in silico CRISPR
   perturbation screen that identifies genes that influence cell fate,
   based on a cell type readout. These genes can be traditional (e.g.,
   transcription factors) or non-traditional regulators (any other genes).
   We assessed Fatecode’s performance using simulated data produced by a
   mechanistic model based on pre-defined gene-regulatory networks with
   known cell fate regulators[49]^19 and tested it on scRNA-seq maps of
   blood and developing brain from zebrafish and
   mouse.[50]^20^,[51]^21^,[52]^22^,[53]^23

Results

Fatecode method overview

   Fatecode uses a classification-supervised autoencoder to detect key
   genes that can shift the cell type frequencies in an input scRNA-seq
   dataset toward a desired distribution of cell types. Taking single-cell
   gene expression profiles as input, the autoencoder learns a latent
   space with reduced dimensions capturing the input information (reduce
   gene dimension x cell matrix). A supervised cell type classifier is
   included as part of the loss function to create a latent space composed
   of features that support optimal cell type classification in addition
   to input data reconstruction. Known cell type annotations in the input
   data are used to train the classifier. This ensures that the latent
   space is relevant for cell type classification used in later stages.
   Each latent layer node of the autoencoder, which represents a reduced
   dimension of the input, is systematically perturbed to simulate
   altering key gene expression programs (sets of genes that are
   correlated with each other that are represented by individual learned
   latent layer dimension). Cell types are then reclassified to
   characterize the effect of the perturbation, and the autoencoder’s
   decoder uses the perturbed and unperturbed latent embeddings to
   generate a gene-by-cell matrix of gene prioritization scores. This
   matrix is used to identify genes important for the perturbation effect
   ([54]STAR Methods; [55]Figures 1 and [56]S1). Resulting cell type
   distributions are generated for each possible perturbation and then
   manually evaluated to identify those that increase or decrease
   proportions of desired cell types. In this way, regulator genes are
   identified to increase or decrease a given cell type proportion
   relative to all other cell types, and these are predicted to be cell
   fate regulators for the given cell type. An average of the cell fate
   regulator prioritization scores across cells for the cell type is
   computed to produce a final regulator list for each cell type.

Figure 1.

   [57]Figure 1
   [58]Open in a new tab

   Fatecode workflow for in silico perturbation experiments and cell fate
   regulator detection

   The 3D model (top) represents a Waddington-like landscape depicting
   cellular reprogramming processes. We seek to identify genes (question
   marks) that regulate paths on this landscape (wavy lines) by
   transitioning them to another path (red arrows). A
   classification-supervised autoencoder learns a latent space
   representing the original data, optimized for both input reconstruction
   and cell type classification. The latent layer is systematically
   perturbed, and by investigating all resulting perturbation-generated
   cell type distributions, distributions with an increase or decrease in
   a cell type of interest are identified. Perturbation output is
   simulated by subtracting the perturbed from unperturbed latent layers
   and feeding it to the decoder to identify a cell-by-gene matrix of
   prioritization scores that can help us to prioritize genes predicted to
   be important for achieving a desired cell population distribution. An
   average of the cell fate regulator prioritization scores across cells
   in each cell type is computed. By sorting these genes based on their
   prioritization scores for a cell type of interest, the model predicts
   genes that are important for regulating the levels of a given cell
   type.

   Our latent layer perturbation approach is inspired by latent vector
   operations used in natural language processing and computer vision
   applications to generate novel text and
   images.[59]^24^,[60]^25^,[61]^26 In those applications, perturbation
   operations performed on the latent layer generally yield superior
   results compared to operations performed directly in the input space.
   The classification component of Fatecode is used to exclude possible
   latent space regions that do not conform to the overall structure of
   the data. This helps in learning a model that is more representative of
   the underlying data distribution.[62]^27

Optimizing model architecture and hyperparameters

   Fatecode relies on the latent embedding of an autoencoder, but
   different types of autoencoders may produce different
   results, depending on the input data (see [63]supplemental
   information).[64]^6^,[65]^28^,[66]^29^,[67]^30 To investigate this in
   our problem context, we evaluated the performance of three common
   autoencoder architectures: under-complete autoencoder (AE), variational
   autoencoder (VAE), and conditional VAE (CVAE).[68]^31 The first step of
   Fatecode evaluates these three autoencoder architectures and other
   hyperparameters (see the [69]Hyperparameter search section) to find the
   ones that best reconstruct the input data, measured by mean squared
   error for reconstruction and cross-entropy for cell type
   classification. To illustrate the importance of this step, we compared
   how the choice of autoencoder affects learning the underlying
   representation for two single-cell gene expression datasets in adult
   zebrafish blood[70]^20 and murine pancreatic development.[71]^32 AE
   produced the lowest reconstruction error for the zebrafish data
   (averaged over cell types) ([72]Figures 2A and 2B). AE also produced a
   latent layer that successfully reduces the dimension and cleanly
   separates the five known cell types in the data ([73]Figure 2C), and
   its cell type classifier yields a high accuracy ([74]Figure 2D).
   However, for the mouse data, VAE achieved a higher accuracy compared to
   the other autoencoders ([75]Figure S2).

Figure 2.

   [76]Figure 2
   [77]Open in a new tab

   Comparison of autoencoder architectures for analyzing data for
   hematopoiesis regulation in zebrafish blood

   (A) Comparison of correlation between input and output of AE,
   variational autoencoder (VAE), and conditional variational autoencoder
   (CVAE).

   (B) MSE between input and output of the three autoencoder
   architectures, showing that AE produces the lowest error rate for this
   dataset.

   (C) Uniform Manifold Approximation and Projection (UMAP) visualization
   of the latent layer of the under-complete autoencoder (AE).

   (D) Confusion matrix for the classifier connected to the latent layer
   of AE demonstrating excellent classification performance.

Fatecode accurately detects known regulators from simulated scRNA-seq data

   To assess the accuracy by which Fatecode identifies cell fate
   regulators using gene expression profiles, we applied the method to
   simulated scRNA-seq data generated from known gene regulatory network
   (GRN) structures using SERGIO.[78]^19 SERGIO allows users to specify
   the number of cell types and key regulators in the simulated GRN
   ([79]Figure 3A). While Fatecode is not specific to GRNs (i.e., it can
   identify a list of genes of any type, not just transcription factors),
   a GRN-based simulation is expected to provide a good benchmark for our
   method. A matrix of 400 cells and 2,700 genes with 20 known regulators
   and 9 cell types was generated and run through Fatecode. Predicted
   cell fate regulator genes and their prioritization scores were compared
   to the known SERGIO regulator list. The number of known regulator genes
   identified increases as more genes are prioritized ([80]Figure 3B).
   Almost all of the known regulator genes (18 of 20) were identified when
   150 genes were prioritized (of 2,700). To compare with a naive
   baseline, we identified cell type markers (top 20 genes) using
   differential gene expression (DGE) analysis on the same data using
   Seurat’s non-parametric Wilcoxon rank-sum test.[81]^33 Fatecode
   identifies a larger number of known regulators compared to DGE analysis
   when examining up to 150 prioritized genes ([82]Figure 3B). As SERGIO
   is a stochastic method, we analyzed five additional simulated datasets
   of the same size, all of which yielded similar results (plotted as
   shading in [83]Figure 3B). We repeated this analysis on a larger
   dataset consisting of 2,700 cells, 1,200 genes, with 65 predefined
   regulators, and 9 distinct cell types. We used Fatecode to identify the
   top 180 key genes of these data, and DGE analysis to identify the top
   20 differentially expressed genes from each cell type. Also, for
   comparison, we included scFates, a method specifically designed for
   trajectory-based DGE analysis.[84]^34 Fatecode consistently
   outperformed both DGE methods in detecting known regulators. We further
   evaluated performance by varying the top k gene threshold of DGE, and
   Fatecode consistently outperformed DGE across all tested thresholds,
   demonstrating its robustness while varying the number of genes
   considered ([85]Figures 3C and [86]S3). Thus, Fatecode performs well at
   identifying known regulators in simulated scRNA-seq data.

Figure 3.

   [87]Figure 3
   [88]Open in a new tab

   Fatecode detects known regulators using simulated data generated by
   SERGIO

   (A) The schematic structure of the GRN to generate scRNA-seq. Red nodes
   are known regulators, and green nodes are non-regulators whose
   production rates are determined by their associated regulators. Our
   goal is to identify known regulators from the generated scRNA-seq data
   using Fatecode.

   (B) Benchmark comparisons of the detection rate of predefined
   regulators generated by SERGIO using Fatecode compared with a naive
   differential gene expression (DGE) baseline. The red and green areas
   represent the performance of Fatecode and DGE, respectively, on the
   simulated data with 400 cells.

   (C) Benchmark comparisons of the detection rate of known regulators
   using Fatecode, scFates, and DGE on simulated data with 2,700 cells.

   (D) Venn diagram showing the similarity between the number of known
   regulators uncovered by Fatecode across various latent layer sizes.

   We also examined the sensitivity of our model by the size of the latent
   layer in the autoencoder by training Fatecode with different latent
   layer sizes (n = 50, 75, and 100 dimensions) using the 2,700-cell
   simulated data ([89]Figure 3D). Our results show general consistency
   across the different latent layer sizes, indicating that Fatecode
   exhibits robustness across a range of latent layer sizes.

Fatecode identifies known cell fate regulator genes in mouse hematopoiesis

   Hematopoiesis is a cell differentiation process by which the body
   produces mature blood cells from stem cells. We applied Fatecode to a
   published mouse hematopoiesis single-cell differentiation dataset,
   which involves the differentiation of myeloid progenitors into 9 cell
   types ([90]Figure 4A).[91]^22 We then examined Fatecode’s accuracy in
   predicting cell fate regulators that lead to the desired cell type
   distribution by comparing the results with ground-truth experimental
   perturbation data and known regulator
   genes.[92]^9^,[93]^22^,[94]^35^,[95]^36 Fatecode learned a latent node
   that, when perturbed, simultaneously increases the monocyte population
   and decreases erythrocytes and granulocytes ([96]Figure 4B). Previous
   studies have demonstrated that Irf8 is important in promoting the
   differentiation of the GM (granulocyte-monocyte) lineage, particularly
   monocytes, and functions as a key regulator in determining the fate
   between granulocytes and monocytes. Fatecode accurately predicted Irf8
   as an important cell fate regulator in the monocyte differentiation
   process. It correctly assigned a high positive score for monocytes and
   late_GMP (granulocyte-macrophage progenitor) and negative scores for
   granulocytes and MEP (megakaryocyte-erythroid progenitor) lineages,
   consistent with previous studies ([97]Figure 4C). Next, we investigated
   the prediction results for Cebpa, knockout of which leads to a decline
   in the population of differentiated myeloid cells, while concurrently
   increasing the number of erythrocytes. Fatecode accurately assigned a
   high positive score to Cebpa for monocytes and granulocytes and a
   negative score to erythrocytes and MEPs ([98]Figures 4D and 4E). In
   another example, Klf1 is a key regulator in driving differentiation
   toward the ME (megakaryocyte-erythroid) lineage, specifically promoting
   the development of erythrocytes, while simultaneously inhibiting the
   GMP lineage. Fatecode correctly assigned a set of positive scores to
   Klf1 for erythrocytes and MEPs, indicating its ability to capture a key
   regulator in ME lineage differentiation ([99]Figure S4A). We also
   tested Fatecode’s ability to detect genes that are known to be
   important in maintaining stemness and inhibiting differentiation.
   Fatecode correctly predicted Runx1 as a candidate that has negative
   scores for perturbations that increase all mature cell types (all cell
   types expect MEPs and GMPs) ([100]Figure S4B). Last, we examined the
   prediction results for Fli1, which has diverse effects on
   differentiation. Fatecode accurately gives positive scores for the
   association between Fli1 and megakaryocytes, monocytes, and
   granulocytes and also assigns a notable negative score to erythrocytes,
   in agreement with the literature[101]^9^,[102]^37 ([103]Figure S4C).
   These simulations show that Fatecode accurately identifies known cell
   fate regulators that have been reported in previous perturbation-based
   experimental studies.

Figure 4.

   [104]Figure 4
   [105]Open in a new tab

   Fatecode accurately detects regulators and predicts the effect of
   single-cell perturbations

   (A) Hematopoiesis data from Paul et al.,[106]^22 visualized as a UMAP
   and clustered into 9 cell types.

   (B and D) The results of in silico perturbations that change the
   initial cell frequency to the desired distribution. For (B), our
   objective was to promote monocytes while reducing the number of
   erythrocytes. For (D), we aimed for an increase in the erythroid
   population and a decline in MEPs and megakaryocytes.

   (C and E) Gene prioritization scores per cell type for Irf8 and Cebpa.

   (F) Pathway enrichment analysis results. Gene Ontology biological
   processes show significant processes related to cell development and
   hematopoiesis.

   Furthermore, to evaluate the role of the top 200 genes detected by
   Fatecode for monocytes, we performed pathway enrichment analysis.
   Pathways that are significantly enriched in these 200 genes include
   those related to the immune system, hemopoiesis, cell development, and
   cell differentiation, which agrees with their Fatecode-predicted role
   in monocyte development ([107]Figure 4F).

   We extended our analysis to larger hematopoiesis single-cell
   differentiation data that involve differentiation into 12 cell types
   ([108]Figure S5A).[109]^23 We applied Fatecode to detect genes that can
   increase the pool of undifferentiated cells in this system
   ([110]Figure S5B). One candidate detected by Fatecode in this process
   is Entpd8, the deletion of which in mice elevates the neutrophil and
   monocyte population.[111]^38 Fatecode predictions are consistent with
   this experimental result. Fatecode also predicted Nlrp6 as a regulator
   of neutrophil and monocyte differentiation. Cai et al. showed that the
   number of hematopoietic stem cells and GM progenitors is reduced in
   Kp-infected Nlrp6^−/− mice, while the survival of mature neutrophils in
   bone marrow is increased.[112]^39 We repeated gene set enrichment
   analysis using the top 200 genes detected by Fatecode. Biological
   processes related to mouse hematopoiesis, stem cell development, and
   metabolic signaling were enriched, showing that Fatecode can again
   capture relevant pathways for this biological process
   ([113]Figure S5C).

Fatecode detects important regulators in cell differentiation and lineage
commitment in zebrafish

   We applied Fatecode to zebrafish hematopoiesis data[114]^20 as an
   additional demonstration and test. From all possible perturbations on
   the latent layer performed by Fatecode, we selected ones that resulted
   in the greatest predicted relative increase in hematopoietic stem and
   progenitor cells (HSPCs) ([115]Figure 5A). As shown in [116]Figure 5B,
   following the perturbation, some cells (mostly monocytes) are predicted
   to switch to HSPCs ([117]Figure 5B). Fatecode gives a significant score
   to signal transducer and activator of transcription 5A (stat5a) as one
   of the most important genes for HSPCs. Stat5a is a key regulator of
   normal hematopoiesis, with pleiotropic roles in hematopoietic stem
   cells.[118]^40 Also, knockout studies have shown that the deletion of
   stat5a led to an increase in HSPC cycling, gradually reduced survival,
   and depleted the HSPC pool.[119]^41 Next, Fatecode gives irf8 a high
   positive score for monocytes. Irf8 is a key regulator of monocyte
   development, and it has been known to be important for myelopoiesis in
   different model organisms.[120]^42^,[121]^43 It functions at an early
   step of the transcriptional program that governs differentiation from
   myeloid progenitors to monocytes/macrophages and plays a key role in
   stem cell renewal and maintenance.[122]^43^,[123]^44 Fatecode also
   identified a strong negative connection between foxo3 and myeloid cell
   differentiation, consistent with foxo3 knockout studies, which show a
   significant increase in granulocyte/monocyte progenitors in the spleen,
   bone marrow, and blood and enhance short-term hematopoietic stem cell
   proliferation.[124]^45^,[125]^46^,[126]^47 Fatecode found an important
   role played by the otud gene family, a subgroup of deubiquitination
   enzymes, by assigning a high positive score between HSPCs and the otud
   gene family. Consistent with our prediction, knockout of otud genes in
   Xenopus results in developmental impairments.[127]^48 Also, elevated
   expression of otud genes leads to the acquisition of stem cell
   properties.[128]^49 Fatecode also predicted the negative score between
   thbs1 and HSPCs, where thbs1 has been shown previously to limit the
   expression of essential self-renewal transcription factors, including
   oct3 and oct4, sox2, klf4, and c-myc, within cells.[129]^50 Other key
   gene candidates identified by Fatecode for this perturbation are also
   known to be involved in hematopoiesis ([130]Table 1).

Figure 5.

   [131]Figure 5
   [132]Open in a new tab

   In silico experiments to induce hematopoietic stem/progenitor cells
   using hematopoiesis in zebrafish

   (A) A series of latent layer perturbations and their effect on cell
   distribution.

   (B) Cells that switch from their initial cell type to HSPCs are
   highlighted.

Table 1.

   List of zebrafish hematopoiesis regulator genes predicted by Fatecode,
   with literature evidence for involvement in this process
   Gene Roles Reference
   cdk1 plays an important role in the maintenance of pluripotency and
   genomic stability in human pluripotent stem cells Neganova
   et al.[133]^70
   top2a controls the survival of human pluripotent stem cells Ben-David
   et al.[134]^71
   hmgb2a regulates hematopoietic stem cell maintenance Zhang
   et al.[135]^72
   ube2c highly expressed in human embryonic stem cells (hESCs) and a
   biomarker of cancer stemness Fatima et al.[136]^73; Liu et al.[137]^74
   fbxo11 depletion leads to the hematopoietic population with stem cell
   characteristics Mo et al.[138]^75
   hmgn2 facilitates the maintenance of active chromatin states required
   for stem cell identity in a pluripotent stem cell model Garza-Manero
   et al.[139]^76
   aspm regulates symmetric stem cell division by tuning Cyclin E
   ubiquitination Capecchi et al.[140]^77
   myb regulates hematopoietic stem cell and myeloid progenitor cell
   development Baker et al.[141]^78
   kpna2 exhibits strong interactions with oct4 in embryonic stem cells Li
   et al.[142]^79
   [143]Open in a new tab

Fatecode identifies cell fate regulators in mouse hippocampus development

   To demonstrate Fatecode on a larger biological dataset, we applied it
   to developing mouse hippocampus scRNA-seq data,[144]^21 composed of
   18,213 cells and 3,001 genes. The data are clustered in 14 annotated
   cell types ([145]Figure 6A). We first sought to identify regulators in
   the differentiation process that preferentially increase mature granule
   cells ([146]Figure 6B). Fatecode predicts the ZFP gene family (Zfp94,
   Zfp189, and Zfp706) as positively important in granule cell
   differentiation. The Zfp family is a definitive marker for the
   cerebellar granule neuron lineage and plays a critical role in granule
   cell specification within the developing cerebellum.[147]^51 For
   example, lack of Zfp521 results in a significant reduction in the
   number of granule cells.[148]^52 Id2 and Id3 are important in
   maintaining the size and cellular structure of the brains of adult
   mice. It has also been shown that the absence of
   [MATH:
   <mrow><mi>I</mi><mi>d</mi><msup><mn>2</mn><mrow><mo>−</mo><mo>/</mo><mo
   >−</mo></mrow></msup></mrow> :MATH]
   leads to a decrease in the number of granule neurons.[149]^53^,[150]^54
   In line with this earlier research, Fatecode assigns a high positive
   score between both Id2 and Id3 for mature granule cells. These two
   transcriptional regulators have also been found to determine the fate
   of differentiating CD8^+ T cells.[151]^55

Figure 6.

   [152]Figure 6
   [153]Open in a new tab

   Fatecode identifies key genes in mouse neurogenesis

   (A) UMAP embedding of fourteen major cell types.

   (B) Latent layer node perturbation leads to an increase in mature
   granule cells while a decrease in immature granule cells.

   (C) Pathway enrichment analysis shows the relevant biological process
   using the top 200 genes selected based on their prioritization scores
   for mature granule cells.

   Next, we applied Fatecode to determine regulators that mediate the
   differentiation process, which preferentially increases oligodendrocyte
   progenitor cells (OPCs), and decreases granule cells (both mature and
   immature) and oligodendrocytes. Fatecode predicted Igfbpl1 as having an
   impact on OPC-to-oligodendrocyte differentiation, which is consistent
   with published experimental studies.[154]^56^,[155]^57 Furthermore, we
   considered Fth1, which provides neuroprotection and is enriched in
   oligodendrocytes. Mice lacking Fth1 have more microglia cells compared
   to the control and a significant reduction in neurons and
   oligodendrocytes.[156]^58 Fatecode accurately assigned a high positive
   score linking Fth1 to oligodendrocytes and mature granule cells and a
   negative score for Fth1 and microglia cells, showing that knocking out
   of Fth1 leads to an increase in microglia cells, consistent with the
   experimental studies. Thymosin beta 4 (Tmsb4x) is a key candidate in
   the context of neurogenesis during brain development.[157]^59 Its
   expression is linked to neurogenic processes and exerts regulatory
   control over the expansion of the stem cell pool within the early
   neuroepithelium. Tmsb4x gene knockout elicits a pronounced effect on
   the differentiation process in vitro. Specifically, it significantly
   promotes the differentiation of stem cells, further emphasizing its
   role in orchestrating cellular fate determination.[158]^60 Our method
   correctly assigns a negative score for Tmsb4x and all cells except
   neuroblasts and radial glia-like cells. To further validate the
   performance of Fatecode in detecting key genes, we performed pathway
   enrichment analysis on the top 200 Fatecode-predicted regulators. This
   analysis showed that pathways related to brain development, synaptic
   signaling, and protein synthesis were significantly enriched in these
   genes ([159]Figure 6C).

   To illustrate further downstream analysis that is possible based on
   Fatecode results, we applied single-cell regulatory network inference
   and clustering (SCENIC) on the mouse hippocampus development dataset to
   construct a GRN consisting of the top 2,000 interactions based on their
   SCENIC importance measure scores, which shows the significance of the
   input gene (referred to as the “TF”) in determining the prediction
   outcome for the target.[160]^61 We then mapped the top 400
   Fatecode-predicted regulators to the SCENIC-inferred GRN. The resulting
   networks can be used as a guide for identifying specific GRN mechanisms
   to target in follow-up experiments (Ybx1 example, [161]Figure S6) to
   test the regulatory relationships and potential roles of regulators in
   cellular reprogramming. While SCENIC predicts useful additional
   information to support experiment planning, it only considers
   transcription factor regulators. Other types of genes in Fatecode’s
   output can be identified as cell fate regulators and should also be
   examined.

Discussion

   Cell reprogramming is a promising technology for tissue repair and
   regeneration, with the ultimate goal of accelerating recovery from
   diseases or injuries, as well as the development of novel
   therapies.[162]^62 An important component in successful cell
   reprogramming is to correctly identify the regulators and trajectories
   from single-cell transcriptomics data. However, the number of genes in
   these datasets is large, and the number of underlying regulatory
   interactions is much larger. Recent studies have demonstrated that the
   expression of a single regulator is insufficient to produce an endpoint
   phenotype.[163]^63 Instead, a group of control networks acts together
   across a variety of biological processes and pathways to induce a
   complete lineage conversion.[164]^64 To efficiently and accurately map
   these control networks, we developed a deep learning method, Fatecode,
   which we have successfully applied to analyze diverse datasets. First,
   our method discovers an efficient architecture and latent layer for an
   input single-cell dataset. Then, by performing operations on the latent
   layer, it is able to predict perturbations for cell fate reprogramming.
   Fatecode was validated using simulated scRNA-seq data with predefined
   regulators and by predicting regulators in a variety of biological
   scRNA-seq data and manually comparing the results to the literature.

   The fundamental idea in Fatecode is similar to the minimum Hamiltonian
   in physics and the potential energy landscape concept and
   representation learning.[165]^27^,[166]^65 These authors have shown
   that the most common autoencoders are naturally associated with an
   energy function, independent of the training procedure. This reasoning
   suggests that regulators can be seen as genes that allow the system to
   achieve a target cell type distribution via the most efficient path
   through the energy landscape. Fatecode uses the classifier as a guide
   to determine what node in the latent layer must be perturbed to achieve
   the desired reprogramming effect. Then the decoder maps the modified
   latent layer to gene space for gene identification. It is also useful
   to understand whether regulators are cell type specific. For example,
   the mammalian target of rapamycin complex (mTORC1) is widely important
   in cell fate decision-making and also important in the regulation of
   T cell fate.[167]^66^,[168]^67^,[169]^68^,[170]^69 Running Fatecode for
   different cell conversions can help identify cell-type-specific and
   non-specific regulators.

   In conclusion, we developed an effective computational framework for
   predicting key players in cell fate control and the consequences of
   perturbations on cell type frequencies. Fatecode’s modular design
   enables users to select an autoencoder architecture that produces an
   accurate model for their data. By leveraging the power of
   classification-supervised autoencoders and the associated energy
   manifold learning process, Fatecode generates useful hypotheses about
   genes that could be manipulated to achieve desired cell transitions. We
   hope this method accelerates the discovery of novel cell fate
   regulators that can be used to engineer and grow cells for therapeutic
   use in regenerative medicine applications.

Limitation of study

   Fatecode can be thought of as an in silico CRISPR perturbation screen
   that identifies genes that can influence cell fate. Unfortunately, we
   were not able to find a published genome-wide CRISPR perturbation
   screen of an appropriate cell line and with a cell fate readout. Most
   genome-wide CRISPR-screens use standard cell lines that are not
   naturally multi-potent and, thus, are not expected to generate multiple
   cell fates. CRISPR has been used to evaluate cell fate regulators, but
   only by examining one or a few candidate genes in a single paper. We
   used these latter small-scale results to verify that Fatecode results
   agree with these experiments. Because we could not find genome-wide
   CRISPR screens with a cell fate readout, we used GRN simulations with
   predefined regulators and small-scale CRISPR experiments to validate
   our findings. In the future, we hope genome-scale CRISPR screens for
   cell fate regulators will be published for us to compare to.

   Despite offering a useful input data representation, how the
   autoencoder latent layer represents the input data may be difficult to
   understand. Future work will need to better understand how the input
   data are represented and learned in the latent layer given diverse
   input data. However, our results showed that Fatecode predictions are
   relatively stable when changing the size of the latent layer,
   indicating that latent information is likely captured consistently.

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Deposited data
     __________________________________________________________________

   Zebrafish hematopoiesis data Athanasiadis et.al.[171]^20 E-MTAB-5530
   Raw and analyzed data This paper
   [172]https://doi.org/10.5281/zenodo.11511340
   Dentate Gyrus neurogenesis Hochgerner et al.[173]^21 [174]GSE95753
   Hematopoiesis Paul et al.[175]^22 [176]GSE72859
   Hematopoiesis Weinreb et al.[177]^23 [178]GSE140802
   SERGIO Dibaeinia et al.[179]^19
   [180]https://github.com/PayamDiba/SERGIO
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   Fatecode This paper [181]https://doi.org/10.5281/zenodo.11511340
   Cytoscape Franz et al.[182]^81
   [183]https://doi.org/10.1093/bioinformatics/btad031
   scFates Faure et al.[184]^34
   [185]https://doi.org/10.1093/bioinformatics/btac746
   Scenic Aibar et al.[186]^61 [187]https://doi.org/10.1038/nmeth.4463
   gseapy Fang et al.[188]^82 [189]https://github.com/zqfang/GSEApy
   scikit-learn Alex et al.[190]^83 [191]https://scikit-learn.org/
   Seurat CRAN [192]https://satijalab.org/seurat/
   [193]Open in a new tab

Resource availability

Lead contact

   Lead contact Further information and requests for resources should be
   directed to and will be fulfilled by the lead contact, Mehrshad Sadria
   (msadria@uwaterloo.ca)

Materials availability

   This study did not generate new unique reagents.

Data and code availability

     * •
       The datasets used in the present study are openly accessible in
       public repositories. The zebrafish hematopoiesis data can be found
       under the accession number E-MTAB-5530 on ArrayExpress. We
       downloaded a preprocessed version of the “Dentate Gyrus
       neurogenesis” data (under accession number [194]GSE95753
       ) from [195]https://scvelo.readthedocs.io/en/stable/. The
       hematopoiesis Paul et al. data can be downloaded from the GEO under
       accession code [196]GSE72859 and the preprocessed version was
       downloaded from [197]https://celloracle.org/. To generate simulated
       data, we used the same parameters for the differential equations as
       in [198]https://github.com/PayamDiba/SERGIO.[199]^19 The
       hematopoiesis Weinreb et al. data can be downloaded from GEO under
       accession number [200]GSE140802 and the preprocessed version was
       downloaded from [201]https://cospar.readthedocs.io/en/latest/.
     * •
       Code supporting this study is available on:
       [202]https://github.com/MehrshadSD/Fatecode and
       [203]https://doi.org/10.5281/zenodo.11511340.
     * •
       Any additional information required to re-analyze the results
       reported by this study are available from the [204]lead contact
       upon request.

Method details

Deep representation learning

   Autoencoders are a class of neural networks with a latent layer capable
   of learning nonlinear representations of the input data in an
   unsupervised manner. An autoencoder consists of an encoder that maps
   the input to the latent space and a decoder which transfers the latent
   space back to the original space. It can be used for denoising,
   reducing dimensionality, or learning the representation (or manifold)
   of the data. We implemented three autoencoder architectures:
   under-complete AutoEncoder (AE), Variational AutoEncoder (VAE), and
   Conditional VAriational Encoder (CVAE)[205]^31 ([206]Figure 1). AE has
   a single latent layer. VAE constrains the latent layer by modeling the
   latent space as a multivariate Gaussian distribution with a mean and a
   standard deviation. CVAE conditions the latent space on class labels
   and thus can generate data based on a given class label. The biological
   task for our autoencoder is to learn a reduced dimension representation
   of a cell by gene matrix capturing measurements of a single-cell
   transcriptomics experiment mapping cellular trajectories. Only the gene
   dimension is reduced, so the latent space describes a reduced
   representation of each input cell transcriptome. To make the latent
   layer more specific for our biological task, we added a cell type
   classification task to the standard regression tasks. The
   classification task, described in more detail below, predicts the type
   of each latent cell and compares it to a known input cell type. The
   training process works to optimize both classification and regression
   performance simultaneously. This reduces the space of latent layer
   candidates since not all possible latent layers are useful for the
   classification task.

VAE

   VAE is a type of autoencoder that estimates a latent set of probability
   density functions that model the input data. Unlike AE, which learns an
   unconstrained representation of the data, VAE assumes a Gaussian
   distribution for the prior. An input gene by cell matrix X is run
   through an encoder, which generates parameters for the set of
   distributions Q(z |X). Then, from Q, a latent k-vector z is sampled and
   the decoder transforms z into an output, with the condition that the
   output is similar to the input, where k equals the number of components
   (or distributions) in the VAE. The VAE total loss consists of the
   reconstruction loss (first term) and the KL-divergence loss (second
   term):
   [MATH:
   <mrow><mi>E</mi><mrow><mo>[</mo><mrow><mi>log</mi><mrow><mi>P</mi><mrow
   ><mo>(</mo><mrow><mi>X</mi><mo>|</mo><mi>z</mi></mrow><mo>)</mo></mrow>
   </mrow></mrow><mo>]</mo></mrow><mo>−</mo><msub><mi>D</mi><mrow><mi>K</m
   i><mi>L</mi></mrow></msub><mrow><mo>[</mo><mrow><mi
   mathvariant="normal">Q</mi><mrow><mo>(</mo><mrow><mrow><mi>z</mi><mo>|<
   /mo></mrow><mi
   mathvariant="normal">X</mi></mrow><mo>)</mo></mrow><mrow><mo>‖</mo><mi>
   P</mi></mrow><mrow><mo>(</mo><mi>z</mi><mo>)</mo></mrow></mrow><mo>]</m
   o></mrow></mrow> :MATH]
   [MATH:
   <mrow><msub><mi>D</mi><mrow><mi>K</mi><mi>L</mi></mrow></msub><mrow><mo
   >[</mo><mrow><mi>Q</mi><mrow><mo>(</mo><mrow><mi>z</mi><mo>|</mo><mi>X<
   /mi></mrow><mo>)</mo></mrow><mrow><mo>‖</mo><mi>P</mi></mrow><mrow><mo>
   (</mo><mi>z</mi><mo>)</mo></mrow></mrow><mo>]</mo></mrow><mo
   linebreak="badbreak">=</mo><mo>−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac
   ><mrow><munderover><mo>∑</mo><mrow><mi>k</mi><mo
   linebreak="badbreak">=</mo><mn>1</mn></mrow><mi>K</mi></munderover><mro
   w><mo stretchy="true">[</mo><mrow><mn>1</mn><mo
   linebreak="badbreak">+</mo><mi mathvariant="italic">log</mi><mspace
   width="0.25em"></mspace><msup><msub><mi>σ</mi><mi>k</mi></msub><mn>2</m
   n></msup><mo>−</mo><msup><msub><mi>μ</mi><mi>k</mi></msub><mn>2</mn></m
   sup><mo>−</mo><msup><msub><mi>σ</mi><mi>k</mi></msub><mn>2</mn></msup><
   /mrow><mo stretchy="true">]</mo></mrow></mrow></mrow> :MATH]

   where
   [MATH: <mrow><msub><mi>μ</mi><mi>k</mi></msub></mrow> :MATH]
   and
   [MATH: <mrow><msub><mi>σ</mi><mi>k</mi></msub></mrow> :MATH]
   are the k-th components of output vectors
   [MATH: <mrow><msub><mi>μ</mi><mi>k</mi></msub></mrow> :MATH]
   (X) and
   [MATH: <mrow><msub><mi>σ</mi><mi>k</mi></msub></mrow> :MATH]
   (X), respectively.

CVAE

   CVAE is distinguished from VAE by its embedding of conditional
   information in the objective function. CVAE relies on two inputs: the
   features and the class labels, c, instead of using only the features,
   as is done with a VAE and AE. The CVAE architecture allows the encoder
   and the decoder to be conditioned by c. Hence, the variational lower
   bound objective is changed to the following form.
   [MATH:
   <mrow><mi>E</mi><mrow><mo>[</mo><mrow><mi>log</mi><mrow><mi>P</mi><mrow
   ><mo>(</mo><mrow><mi>X</mi><mo>|</mo><mi>z</mi><mo>,</mo><mi>c</mi></mr
   ow><mo>)</mo></mrow></mrow></mrow><mo>]</mo></mrow><mo>−</mo><msub><mi>
   D</mi><mrow><mi>K</mi><mi>L</mi></mrow></msub><mrow><mo>[</mo><mrow><mi
   mathvariant="normal">Q</mi><mrow><mo>(</mo><mrow><mrow><mi>z</mi><mo>|<
   /mo></mrow><mi mathvariant="normal">X</mi><mo>,</mo><mi
   mathvariant="normal">c</mi></mrow><mo>)</mo></mrow><mrow><mo>‖</mo><mi>
   P</mi></mrow><mrow><mo>(</mo><mrow><mi>z</mi><mrow><mo>|</mo><mi>c</mi>
   </mrow></mrow><mo>)</mo></mrow></mrow><mo>]</mo></mrow><mtext>,</mtext>
   </mrow> :MATH]

Overall network architecture of Fatecode

   The Fatecode autoencoder architecture was chosen for each of the
   datasets analyzed in this study using a hyperparameter search (More
   details in [207]Hyperparameter search section). Encoder and decoder
   architectures are constrained to have the same number of outer and
   inner layer nodes. For the analysis of hematopoiesis regulation in
   zebrafish, Fatecode consists of a fully connected encoder and decoder.
   The encoder and decoder are both two-layer networks of 92 (outer layer)
   and 48 (inner layer) nodes with the LeakyReLU activation function and
   the latent layer has 18 nodes. For the analysis of hematopoiesis in
   mouse data by Weinreb et al.,[208]^23 the encoder/decoder has a
   506-node outer layer and a 253-node inner layer, and the latent layer
   has 125 nodes. For the mouse hematopoiesis data by Paul et al.[209]^22
   the encoder/decoder has a 100-node outer layer and a 40-node and the
   latent layer has 20 nodes. For the developing mouse hippocampus data,
   we used a two-layer encoder/decoder of 50 (outer), 26 (inner), and a
   latent layer of 15 nodes. Our model was built using software packages
   and libraries, including TensorFlow V2.10.0, scikit-learn V1.1.3,
   scanpy V1.9.1, numpy V1.23.4, and pandas V1.5.1.

Classification

   The classifier determines cell types using the latent layer as input to
   a single hidden layer and then an output layer (with one node per cell
   type), all fully connected. ReLu and softmax activation functions are
   used for the hidden and output layers, respectively. The number of
   nodes in the hidden layer is varied during the hyperparameter
   optimization. For adult zebrafish blood data,[210]^20 we use 15 and
   5 nodes for the hidden and output layers, respectively. We use 25 and
   12 nodes for classifying hematopoiesis in mouse data by
   Weinreb et al.,[211]^23 20 and 9 nodes for data from Paul
   et al.,[212]^22 and 22 and 14 for the developing mouse hippocampus
   data.[213]^21 All cell labels are assigned by using the predefined cell
   type labels of the original studies.

Identifying key regulators in cell differentiation

   Consider adjustments (e.g., one or more gene knock-outs or
   over-expressions) that will transition a baseline cell type
   distribution (“A”) to a given desired target distribution (“B”). For
   example, in the target cell distribution, our objective is to increase
   the number of cell type N while decreasing the number of cell type P
   ([214]Figure S1). To detect genes that are important in a given
   transition, Fatecode analyzes the effects of perturbations on cell fate
   by systematically perturbing individual autoencoder latent nodes
   learned from a single-cell transcriptomics data capturing cellular
   trajectories. Each latent variable perturbation results in a
   single-cell transcriptome through the decoding process and a
   corresponding cell type distribution, proceeding as follows after
   training Fatecode.
     * (1)
       The gene expression data, denoted as E, corresponding to a mixture
       of cells with cell type distribution A, undergoes encoding to
       produce a matrix of latent variables represented as
       [MATH: <mrow><mi>X</mi></mrow> :MATH]
       (
       [MATH: <mrow><mi>X</mi><mo linebreak="goodbreak"
       linebreakstyle="after">=</mo><mi>e</mi><mi>n</mi><mi>c</mi><mi>o</m
       i><mi>d</mi><mi>e</mi><mi>r</mi><mrow><mo>(</mo><mi>E</mi><mo>)</mo
       ></mrow></mrow> :MATH]
       ). Each column of
       [MATH: <mrow><mi>X</mi></mrow> :MATH]
       is associated with a cell in E; each row corresponds to a latent
       variable).
     * (2)
       In a series of simulations, finite perturbations of different sizes
       [MATH: <mrow><mi mathvariant="bold-italic">K</mi></mrow> :MATH]
       (e.g., from a 50% reduction to a 10-fold increase) are applied to
       each row j (number of latent variables) in
       [MATH: <mrow><mi>X</mi></mrow> :MATH]
       sequentially. For each perturbed latent layer row,
       [MATH:
       <mrow><msup><msub><mi>X</mi><mi>j</mi></msub><mo>∗</mo></msup></mro
       w> :MATH]

   [MATH:
   <mrow><msup><msub><mi>X</mi><mi>j</mi></msub><mo>∗</mo></msup><mo
   linebreak="badbreak">=</mo><mi>k</mi><msub><mi
   mathvariant="normal">X</mi><mi
   mathvariant="normal">j</mi></msub></mrow> :MATH]
     * (3)
       We then run the cell type classifier trained within Fatecode on the
       perturbed latent layer to predict the cell type distribution for
       each across all perturbation conditions.

   [MATH:
   <mrow><mi>N</mi><mi>e</mi><mi>w</mi><mo>_</mo><mi>c</mi><mi>e</mi><mi>l
   </mi><mi>l</mi><mo>_</mo><mi>t</mi><mi>y</mi><mi>p</mi><mi>e</mi><mo>_<
   /mo><mi>d</mi><mi>i</mi><mi>s</mi><mi>t</mi><mi>r</mi><mi>i</mi><mi>b</
   mi><mi>u</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mo
   linebreak="badbreak">=</mo><mi>c</mi><mi>l</mi><mi>a</mi><mi>s</mi><mi>
   s</mi><mi>i</mi><mi>f</mi><mi>i</mi><mi>e</mi><mi>r</mi><mrow><mo
   stretchy="true">(</mo><msup><msub><mi>X</mi><mi
   mathvariant="normal">j</mi></msub><mo>∗</mo></msup><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
     * (4)
       Then, we can identify a perturbed latent layer row,
       [MATH:
       <mrow><msup><msub><mi>X</mi><mi>j</mi></msub><mo>∗</mo></msup></mro
       w> :MATH]
       , and its associated perturbation size, k, that is closest to the
       desired target distribution B.
     * (5)
       To identify genes important for the transition from cell type
       distributions A to B, we compute the difference between the
       selected
       [MATH: <mrow><msup><mi>X</mi><mo>∗</mo></msup></mrow> :MATH]
       and the
       [MATH: <mrow><mi>X</mi></mrow> :MATH]
       latent layers. For instance, if increasing latent node #9 5-fold
       can best approximate the desired distribution B, then the
       difference between the selected
       [MATH: <mrow><msup><mi>X</mi><mo>∗</mo></msup></mrow> :MATH]
       and
       [MATH: <mrow><mi>X</mi></mrow> :MATH]
       latent layers is a latent node by cell matrix with all zero
       entries, except for the 9^th row, which is 5 times
       [MATH: <mrow><msub><mi>X</mi><mn>9</mn></msub></mrow> :MATH]
       .
     * (6)
       With this selected perturbation matrix
       [MATH: <mrow><mo
       stretchy="true">(</mo><mrow><msup><mi>X</mi><mo>∗</mo></msup><mo>−<
       /mo><mi>X</mi></mrow><mo stretchy="true">)</mo></mrow> :MATH]
       , the decoder produces a gene-by-cell matrix. Then the average gene
       expression profile of all cells in each cell type is computed,
       resulting in a gene by cell_type matrix M. The (i,j)-th entry of M
       is the prioritization score for the i-th gene in cell_type j.
     * (7)
       To identify the regulators predicted to be important for
       transitioning initial cell type distribution A to target B, we rank
       the genes based on their prioritization scores for a cell type of
       interest.

   [MATH:
   <mrow><mi>R</mi><mi>e</mi><mi>g</mi><mi>u</mi><mi>l</mi><mi>a</mi><mi>t
   </mi><mi>o</mi><mi>r</mi><mi>s</mi><mo
   linebreak="badbreak">=</mo><mi>s</mi><mi>o</mi><mi>r</mi><mi>t</mi><mro
   w><mo
   stretchy="true">(</mo><msub><mi>M</mi><mrow><mi>d</mi><mi>e</mi><mi>s</
   mi><mi>i</mi><mi>r</mi><mi>e</mi><mi>d</mi><mo>_</mo><mi>c</mi><mi>e</m
   i><mi>l</mi><mi>l</mi><mi>t</mi><mi>y</mi><mi>p</mi><mi>e</mi></mrow></
   msub><mo stretchy="true">)</mo></mrow></mrow> :MATH]

   We note that M does not directly specify how much each gene should be
   perturbed to yield target B. Nonetheless, M contains information about
   genes that are important in transitioning cell type distribution from
   initial state A to the desired state B. This idea is similar to
   potential energy in physics and representation
   learning.[215]^27^,[216]^65

   We also examine the model’s performance in detecting regulators when
   operating on the output of the decoder compared to the latent layer. To
   achieve this, we feed the perturbed vector to the decoder and subtract
   the result from the unperturbed condition. We then investigate the
   genes that show significant changes. Our results indicate that working
   on the latent layer leads to better outcomes in detecting regulators
   than operating on the output of the decoder. This observation is in
   line with previous research in computer vision and natural language
   processing, where using the latent space consistently yields superior
   results compared to the original data space.[217]^25^,[218]^80 We
   assume this is true in general when using an autoencoder with a
   non-linear activation function with reasonably complex data, as we have
   in biology (in contrast to the linear activation function case where
   [MATH:
   <mrow><mi>D</mi><mi>e</mi><mi>c</mi><mi>o</mi><mi>d</mi><mi>e</mi><mi>r
   </mi><mrow><mo
   stretchy="true">(</mo><msub><mi>X</mi><mrow><mi>p</mi><mi>e</mi><mi>r</
   mi><mi>t</mi><mi>u</mi><mi>r</mi><mi>b</mi><mi>e</mi><mi>d</mi></mrow><
   /msub><mo
   stretchy="true">)</mo></mrow><mo>−</mo><mi>D</mi><mi>e</mi><mi>c</mi><m
   i>o</mi><mi>d</mi><mi>e</mi><mi>r</mi><mrow><mo>(</mo><mi>X</mi><mo>)</
   mo></mrow><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mi>D</mi><mi>e</mi><mi>c</mi><mi>o</mi><m
   i>d</mi><mi>e</mi><mi>r</mi><mrow><mo
   stretchy="true">(</mo><mrow><msub><mi>X</mi><mrow><mi>p</mi><mi>e</mi><
   mi>r</mi><mi>t</mi><mi>u</mi><mi>r</mi><mi>b</mi><mi>e</mi><mi>d</mi></
   mrow></msub><mo>−</mo><mi>X</mi></mrow><mo
   stretchy="true">)</mo></mrow></mrow> :MATH]
   ).

Hyperparameter search

   Hyperparameter tuning was conducted using a grid search approach. We
   explored various combinations of hyperparameters, including learning
   rate, batch size, number of layers, number of neurons per layer, and
   activation functions. The hyperparameter space for each parameter was
   defined as follows.
     * (1)
       Autoencoder type: [AE, VAE, CVAE]
     * (2)
       Activation function: [LeakyReLU, Relu, linear]
     * (3)
       Learning rate: [0.001, 0.01, 0.1]
     * (4)
       Batch size: [400, 500, 600]
     * (5)
       Number of hidden layers (encoder): [1, 2]
     * (6)
       Number of neurons in latent_layer: [input_size/40, input_size/60,
       input_size/80, input_size/100, input_size/125, input_size/150]

Data visualization

   Python package “UMAP” was used to visualize the latent layer as a
   reduced dimensionality space and for network visualizations we used
   Cytoscape.[219]^81

Data preprocessing

   The scRNA-Seq gene expression data is log normalized, scaled, and
   centered. In the training process, 80% of the data is allocated for
   training the classification-supervised autoencoder, while the remaining
   20% is used for testing purposes.

Quantification and statistical analysis

   Differential gene expression analysis was performed using the Wilcoxon
   rank-sum test. To account for multiple testing, we applied the
   Benjamini–Hochberg correction to the calculated P-values obtained from
   the DEG analysis. Genes with a corrected p-value below 0.05 were
   considered statistically significant. For scFates we used the default
   parameters. For the identification of enriched gene ontology terms in
   our study, we used the GSEApy package V1.0.4 with its default parameter
   settings.

Acknowledgments