Graphical abstract graphic file with name fx1.jpg [29]Open in a new tab Highlights * • Fatecode is a computational method to predict cell fate regulators from scRNA-seq data * • Fatecode uses autoencoder perturbation to identify genes that influence cell populations * • Simulations and real scRNA-seq data show Fatecode detects cell fate regulators * • Fatecode can accelerate discovery of cell fate regulators using widely available data Motivation How stem and progenitor cells decide which cell types they will generate via cell division is crucial for understanding tissue development and engineering cell types for use in regenerative medicine or cancer therapies. However, the identification of the regulators of these cell fate decisions within the complex and dynamic system of tissues is a major challenge. Experimental high-throughput perturbation screens can help to dissect regulators, but these are not practical or easy to implement in every context of interest. To address this challenge, we developed a computational method, Fatecode, to predict master regulators and key pathways controlling cell fate based on any single-cell transcriptomics dataset. __________________________________________________________________ Identifying the genes that control cell fate is essential for designing cellular reprogramming strategies. To provide an accessible, in silico complement to high throughput perturbation screens, Sadria et al. develop Fatecode, a deep learning-based computational method that can predict cell fate regulators from scRNA-seq data. Introduction In tissue development, specific gene regulators control how cells change state and type to form a complete tissue.[30]^1 These gene regulators are also important because they can be used to control cell fate for multiple applications, including in regenerative medicine and cancer.[31]^2 However, it remains a challenge to identify these regulators within complex and dynamic tissue systems.[32]^1 Cell fate regulators can be identified using experimental methods such as high-throughput genetic perturbation screens (e.g., CRISPR-based) with single-cell gene expression (single-cell RNA sequencing [scRNA-seq]) readouts.[33]^3^,[34]^4 However, these methods are challenging to run on arbitrary biological systems. Computational methods have been developed to predict gene expression programs that explain the difference between perturbed and unperturbed states[35]^5^,[36]^6^,[37]^7^,[38]^8 or to predict the linear effect of perturbing a particular transcription factor.[39]^9 Also, computational methods that determine the ordering of cell states along a trajectory, based on their gene expression profiles using a pseudotime or actual time approach,[40]^10^,[41]^11^,[42]^12^,[43]^13^,[44]^14 have been used to examine the cell decision-making process by identifying genes that are differentially expressed between trajectory branches. However, these latter methods often have trouble identifying accurate trajectories and branchpoints.[45]^15^,[46]^16 Furthermore, none of the above methods are designed to identify cell fate regulators in normal developmental processes. We developed Fatecode, a computational method to predict important cell fate regulator genes for cell types of interest. Fatecode predicts cell fate regulators based only on scRNA-seq data covering a given range of cell types to be analyzed. Fatecode learns a latent representation of the scRNA-seq data using a deep learning-based classification-supervised autoencoder[47]^17^,[48]^18 and then performs in silico perturbation experiments on the latent representation to predict genes that, when perturbed, would alter the original cell type distribution to increase or decrease the population size of a cell type of interest. Fatecode can be thought of as an in silico CRISPR perturbation screen that identifies genes that influence cell fate, based on a cell type readout. These genes can be traditional (e.g., transcription factors) or non-traditional regulators (any other genes). We assessed Fatecode’s performance using simulated data produced by a mechanistic model based on pre-defined gene-regulatory networks with known cell fate regulators[49]^19 and tested it on scRNA-seq maps of blood and developing brain from zebrafish and mouse.[50]^20^,[51]^21^,[52]^22^,[53]^23 Results Fatecode method overview Fatecode uses a classification-supervised autoencoder to detect key genes that can shift the cell type frequencies in an input scRNA-seq dataset toward a desired distribution of cell types. Taking single-cell gene expression profiles as input, the autoencoder learns a latent space with reduced dimensions capturing the input information (reduce gene dimension x cell matrix). A supervised cell type classifier is included as part of the loss function to create a latent space composed of features that support optimal cell type classification in addition to input data reconstruction. Known cell type annotations in the input data are used to train the classifier. This ensures that the latent space is relevant for cell type classification used in later stages. Each latent layer node of the autoencoder, which represents a reduced dimension of the input, is systematically perturbed to simulate altering key gene expression programs (sets of genes that are correlated with each other that are represented by individual learned latent layer dimension). Cell types are then reclassified to characterize the effect of the perturbation, and the autoencoder’s decoder uses the perturbed and unperturbed latent embeddings to generate a gene-by-cell matrix of gene prioritization scores. This matrix is used to identify genes important for the perturbation effect ([54]STAR Methods; [55]Figures 1 and [56]S1). Resulting cell type distributions are generated for each possible perturbation and then manually evaluated to identify those that increase or decrease proportions of desired cell types. In this way, regulator genes are identified to increase or decrease a given cell type proportion relative to all other cell types, and these are predicted to be cell fate regulators for the given cell type. An average of the cell fate regulator prioritization scores across cells for the cell type is computed to produce a final regulator list for each cell type. Figure 1. [57]Figure 1 [58]Open in a new tab Fatecode workflow for in silico perturbation experiments and cell fate regulator detection The 3D model (top) represents a Waddington-like landscape depicting cellular reprogramming processes. We seek to identify genes (question marks) that regulate paths on this landscape (wavy lines) by transitioning them to another path (red arrows). A classification-supervised autoencoder learns a latent space representing the original data, optimized for both input reconstruction and cell type classification. The latent layer is systematically perturbed, and by investigating all resulting perturbation-generated cell type distributions, distributions with an increase or decrease in a cell type of interest are identified. Perturbation output is simulated by subtracting the perturbed from unperturbed latent layers and feeding it to the decoder to identify a cell-by-gene matrix of prioritization scores that can help us to prioritize genes predicted to be important for achieving a desired cell population distribution. An average of the cell fate regulator prioritization scores across cells in each cell type is computed. By sorting these genes based on their prioritization scores for a cell type of interest, the model predicts genes that are important for regulating the levels of a given cell type. Our latent layer perturbation approach is inspired by latent vector operations used in natural language processing and computer vision applications to generate novel text and images.[59]^24^,[60]^25^,[61]^26 In those applications, perturbation operations performed on the latent layer generally yield superior results compared to operations performed directly in the input space. The classification component of Fatecode is used to exclude possible latent space regions that do not conform to the overall structure of the data. This helps in learning a model that is more representative of the underlying data distribution.[62]^27 Optimizing model architecture and hyperparameters Fatecode relies on the latent embedding of an autoencoder, but different types of autoencoders may produce different results, depending on the input data (see [63]supplemental information).[64]^6^,[65]^28^,[66]^29^,[67]^30 To investigate this in our problem context, we evaluated the performance of three common autoencoder architectures: under-complete autoencoder (AE), variational autoencoder (VAE), and conditional VAE (CVAE).[68]^31 The first step of Fatecode evaluates these three autoencoder architectures and other hyperparameters (see the [69]Hyperparameter search section) to find the ones that best reconstruct the input data, measured by mean squared error for reconstruction and cross-entropy for cell type classification. To illustrate the importance of this step, we compared how the choice of autoencoder affects learning the underlying representation for two single-cell gene expression datasets in adult zebrafish blood[70]^20 and murine pancreatic development.[71]^32 AE produced the lowest reconstruction error for the zebrafish data (averaged over cell types) ([72]Figures 2A and 2B). AE also produced a latent layer that successfully reduces the dimension and cleanly separates the five known cell types in the data ([73]Figure 2C), and its cell type classifier yields a high accuracy ([74]Figure 2D). However, for the mouse data, VAE achieved a higher accuracy compared to the other autoencoders ([75]Figure S2). Figure 2. [76]Figure 2 [77]Open in a new tab Comparison of autoencoder architectures for analyzing data for hematopoiesis regulation in zebrafish blood (A) Comparison of correlation between input and output of AE, variational autoencoder (VAE), and conditional variational autoencoder (CVAE). (B) MSE between input and output of the three autoencoder architectures, showing that AE produces the lowest error rate for this dataset. (C) Uniform Manifold Approximation and Projection (UMAP) visualization of the latent layer of the under-complete autoencoder (AE). (D) Confusion matrix for the classifier connected to the latent layer of AE demonstrating excellent classification performance. Fatecode accurately detects known regulators from simulated scRNA-seq data To assess the accuracy by which Fatecode identifies cell fate regulators using gene expression profiles, we applied the method to simulated scRNA-seq data generated from known gene regulatory network (GRN) structures using SERGIO.[78]^19 SERGIO allows users to specify the number of cell types and key regulators in the simulated GRN ([79]Figure 3A). While Fatecode is not specific to GRNs (i.e., it can identify a list of genes of any type, not just transcription factors), a GRN-based simulation is expected to provide a good benchmark for our method. A matrix of 400 cells and 2,700 genes with 20 known regulators and 9 cell types was generated and run through Fatecode. Predicted cell fate regulator genes and their prioritization scores were compared to the known SERGIO regulator list. The number of known regulator genes identified increases as more genes are prioritized ([80]Figure 3B). Almost all of the known regulator genes (18 of 20) were identified when 150 genes were prioritized (of 2,700). To compare with a naive baseline, we identified cell type markers (top 20 genes) using differential gene expression (DGE) analysis on the same data using Seurat’s non-parametric Wilcoxon rank-sum test.[81]^33 Fatecode identifies a larger number of known regulators compared to DGE analysis when examining up to 150 prioritized genes ([82]Figure 3B). As SERGIO is a stochastic method, we analyzed five additional simulated datasets of the same size, all of which yielded similar results (plotted as shading in [83]Figure 3B). We repeated this analysis on a larger dataset consisting of 2,700 cells, 1,200 genes, with 65 predefined regulators, and 9 distinct cell types. We used Fatecode to identify the top 180 key genes of these data, and DGE analysis to identify the top 20 differentially expressed genes from each cell type. Also, for comparison, we included scFates, a method specifically designed for trajectory-based DGE analysis.[84]^34 Fatecode consistently outperformed both DGE methods in detecting known regulators. We further evaluated performance by varying the top k gene threshold of DGE, and Fatecode consistently outperformed DGE across all tested thresholds, demonstrating its robustness while varying the number of genes considered ([85]Figures 3C and [86]S3). Thus, Fatecode performs well at identifying known regulators in simulated scRNA-seq data. Figure 3. [87]Figure 3 [88]Open in a new tab Fatecode detects known regulators using simulated data generated by SERGIO (A) The schematic structure of the GRN to generate scRNA-seq. Red nodes are known regulators, and green nodes are non-regulators whose production rates are determined by their associated regulators. Our goal is to identify known regulators from the generated scRNA-seq data using Fatecode. (B) Benchmark comparisons of the detection rate of predefined regulators generated by SERGIO using Fatecode compared with a naive differential gene expression (DGE) baseline. The red and green areas represent the performance of Fatecode and DGE, respectively, on the simulated data with 400 cells. (C) Benchmark comparisons of the detection rate of known regulators using Fatecode, scFates, and DGE on simulated data with 2,700 cells. (D) Venn diagram showing the similarity between the number of known regulators uncovered by Fatecode across various latent layer sizes. We also examined the sensitivity of our model by the size of the latent layer in the autoencoder by training Fatecode with different latent layer sizes (n = 50, 75, and 100 dimensions) using the 2,700-cell simulated data ([89]Figure 3D). Our results show general consistency across the different latent layer sizes, indicating that Fatecode exhibits robustness across a range of latent layer sizes. Fatecode identifies known cell fate regulator genes in mouse hematopoiesis Hematopoiesis is a cell differentiation process by which the body produces mature blood cells from stem cells. We applied Fatecode to a published mouse hematopoiesis single-cell differentiation dataset, which involves the differentiation of myeloid progenitors into 9 cell types ([90]Figure 4A).[91]^22 We then examined Fatecode’s accuracy in predicting cell fate regulators that lead to the desired cell type distribution by comparing the results with ground-truth experimental perturbation data and known regulator genes.[92]^9^,[93]^22^,[94]^35^,[95]^36 Fatecode learned a latent node that, when perturbed, simultaneously increases the monocyte population and decreases erythrocytes and granulocytes ([96]Figure 4B). Previous studies have demonstrated that Irf8 is important in promoting the differentiation of the GM (granulocyte-monocyte) lineage, particularly monocytes, and functions as a key regulator in determining the fate between granulocytes and monocytes. Fatecode accurately predicted Irf8 as an important cell fate regulator in the monocyte differentiation process. It correctly assigned a high positive score for monocytes and late_GMP (granulocyte-macrophage progenitor) and negative scores for granulocytes and MEP (megakaryocyte-erythroid progenitor) lineages, consistent with previous studies ([97]Figure 4C). Next, we investigated the prediction results for Cebpa, knockout of which leads to a decline in the population of differentiated myeloid cells, while concurrently increasing the number of erythrocytes. Fatecode accurately assigned a high positive score to Cebpa for monocytes and granulocytes and a negative score to erythrocytes and MEPs ([98]Figures 4D and 4E). In another example, Klf1 is a key regulator in driving differentiation toward the ME (megakaryocyte-erythroid) lineage, specifically promoting the development of erythrocytes, while simultaneously inhibiting the GMP lineage. Fatecode correctly assigned a set of positive scores to Klf1 for erythrocytes and MEPs, indicating its ability to capture a key regulator in ME lineage differentiation ([99]Figure S4A). We also tested Fatecode’s ability to detect genes that are known to be important in maintaining stemness and inhibiting differentiation. Fatecode correctly predicted Runx1 as a candidate that has negative scores for perturbations that increase all mature cell types (all cell types expect MEPs and GMPs) ([100]Figure S4B). Last, we examined the prediction results for Fli1, which has diverse effects on differentiation. Fatecode accurately gives positive scores for the association between Fli1 and megakaryocytes, monocytes, and granulocytes and also assigns a notable negative score to erythrocytes, in agreement with the literature[101]^9^,[102]^37 ([103]Figure S4C). These simulations show that Fatecode accurately identifies known cell fate regulators that have been reported in previous perturbation-based experimental studies. Figure 4. [104]Figure 4 [105]Open in a new tab Fatecode accurately detects regulators and predicts the effect of single-cell perturbations (A) Hematopoiesis data from Paul et al.,[106]^22 visualized as a UMAP and clustered into 9 cell types. (B and D) The results of in silico perturbations that change the initial cell frequency to the desired distribution. For (B), our objective was to promote monocytes while reducing the number of erythrocytes. For (D), we aimed for an increase in the erythroid population and a decline in MEPs and megakaryocytes. (C and E) Gene prioritization scores per cell type for Irf8 and Cebpa. (F) Pathway enrichment analysis results. Gene Ontology biological processes show significant processes related to cell development and hematopoiesis. Furthermore, to evaluate the role of the top 200 genes detected by Fatecode for monocytes, we performed pathway enrichment analysis. Pathways that are significantly enriched in these 200 genes include those related to the immune system, hemopoiesis, cell development, and cell differentiation, which agrees with their Fatecode-predicted role in monocyte development ([107]Figure 4F). We extended our analysis to larger hematopoiesis single-cell differentiation data that involve differentiation into 12 cell types ([108]Figure S5A).[109]^23 We applied Fatecode to detect genes that can increase the pool of undifferentiated cells in this system ([110]Figure S5B). One candidate detected by Fatecode in this process is Entpd8, the deletion of which in mice elevates the neutrophil and monocyte population.[111]^38 Fatecode predictions are consistent with this experimental result. Fatecode also predicted Nlrp6 as a regulator of neutrophil and monocyte differentiation. Cai et al. showed that the number of hematopoietic stem cells and GM progenitors is reduced in Kp-infected Nlrp6^−/− mice, while the survival of mature neutrophils in bone marrow is increased.[112]^39 We repeated gene set enrichment analysis using the top 200 genes detected by Fatecode. Biological processes related to mouse hematopoiesis, stem cell development, and metabolic signaling were enriched, showing that Fatecode can again capture relevant pathways for this biological process ([113]Figure S5C). Fatecode detects important regulators in cell differentiation and lineage commitment in zebrafish We applied Fatecode to zebrafish hematopoiesis data[114]^20 as an additional demonstration and test. From all possible perturbations on the latent layer performed by Fatecode, we selected ones that resulted in the greatest predicted relative increase in hematopoietic stem and progenitor cells (HSPCs) ([115]Figure 5A). As shown in [116]Figure 5B, following the perturbation, some cells (mostly monocytes) are predicted to switch to HSPCs ([117]Figure 5B). Fatecode gives a significant score to signal transducer and activator of transcription 5A (stat5a) as one of the most important genes for HSPCs. Stat5a is a key regulator of normal hematopoiesis, with pleiotropic roles in hematopoietic stem cells.[118]^40 Also, knockout studies have shown that the deletion of stat5a led to an increase in HSPC cycling, gradually reduced survival, and depleted the HSPC pool.[119]^41 Next, Fatecode gives irf8 a high positive score for monocytes. Irf8 is a key regulator of monocyte development, and it has been known to be important for myelopoiesis in different model organisms.[120]^42^,[121]^43 It functions at an early step of the transcriptional program that governs differentiation from myeloid progenitors to monocytes/macrophages and plays a key role in stem cell renewal and maintenance.[122]^43^,[123]^44 Fatecode also identified a strong negative connection between foxo3 and myeloid cell differentiation, consistent with foxo3 knockout studies, which show a significant increase in granulocyte/monocyte progenitors in the spleen, bone marrow, and blood and enhance short-term hematopoietic stem cell proliferation.[124]^45^,[125]^46^,[126]^47 Fatecode found an important role played by the otud gene family, a subgroup of deubiquitination enzymes, by assigning a high positive score between HSPCs and the otud gene family. Consistent with our prediction, knockout of otud genes in Xenopus results in developmental impairments.[127]^48 Also, elevated expression of otud genes leads to the acquisition of stem cell properties.[128]^49 Fatecode also predicted the negative score between thbs1 and HSPCs, where thbs1 has been shown previously to limit the expression of essential self-renewal transcription factors, including oct3 and oct4, sox2, klf4, and c-myc, within cells.[129]^50 Other key gene candidates identified by Fatecode for this perturbation are also known to be involved in hematopoiesis ([130]Table 1). Figure 5. [131]Figure 5 [132]Open in a new tab In silico experiments to induce hematopoietic stem/progenitor cells using hematopoiesis in zebrafish (A) A series of latent layer perturbations and their effect on cell distribution. (B) Cells that switch from their initial cell type to HSPCs are highlighted. Table 1. List of zebrafish hematopoiesis regulator genes predicted by Fatecode, with literature evidence for involvement in this process Gene Roles Reference cdk1 plays an important role in the maintenance of pluripotency and genomic stability in human pluripotent stem cells Neganova et al.[133]^70 top2a controls the survival of human pluripotent stem cells Ben-David et al.[134]^71 hmgb2a regulates hematopoietic stem cell maintenance Zhang et al.[135]^72 ube2c highly expressed in human embryonic stem cells (hESCs) and a biomarker of cancer stemness Fatima et al.[136]^73; Liu et al.[137]^74 fbxo11 depletion leads to the hematopoietic population with stem cell characteristics Mo et al.[138]^75 hmgn2 facilitates the maintenance of active chromatin states required for stem cell identity in a pluripotent stem cell model Garza-Manero et al.[139]^76 aspm regulates symmetric stem cell division by tuning Cyclin E ubiquitination Capecchi et al.[140]^77 myb regulates hematopoietic stem cell and myeloid progenitor cell development Baker et al.[141]^78 kpna2 exhibits strong interactions with oct4 in embryonic stem cells Li et al.[142]^79 [143]Open in a new tab Fatecode identifies cell fate regulators in mouse hippocampus development To demonstrate Fatecode on a larger biological dataset, we applied it to developing mouse hippocampus scRNA-seq data,[144]^21 composed of 18,213 cells and 3,001 genes. The data are clustered in 14 annotated cell types ([145]Figure 6A). We first sought to identify regulators in the differentiation process that preferentially increase mature granule cells ([146]Figure 6B). Fatecode predicts the ZFP gene family (Zfp94, Zfp189, and Zfp706) as positively important in granule cell differentiation. The Zfp family is a definitive marker for the cerebellar granule neuron lineage and plays a critical role in granule cell specification within the developing cerebellum.[147]^51 For example, lack of Zfp521 results in a significant reduction in the number of granule cells.[148]^52 Id2 and Id3 are important in maintaining the size and cellular structure of the brains of adult mice. It has also been shown that the absence of [MATH: Id2/ :MATH] leads to a decrease in the number of granule neurons.[149]^53^,[150]^54 In line with this earlier research, Fatecode assigns a high positive score between both Id2 and Id3 for mature granule cells. These two transcriptional regulators have also been found to determine the fate of differentiating CD8^+ T cells.[151]^55 Figure 6. [152]Figure 6 [153]Open in a new tab Fatecode identifies key genes in mouse neurogenesis (A) UMAP embedding of fourteen major cell types. (B) Latent layer node perturbation leads to an increase in mature granule cells while a decrease in immature granule cells. (C) Pathway enrichment analysis shows the relevant biological process using the top 200 genes selected based on their prioritization scores for mature granule cells. Next, we applied Fatecode to determine regulators that mediate the differentiation process, which preferentially increases oligodendrocyte progenitor cells (OPCs), and decreases granule cells (both mature and immature) and oligodendrocytes. Fatecode predicted Igfbpl1 as having an impact on OPC-to-oligodendrocyte differentiation, which is consistent with published experimental studies.[154]^56^,[155]^57 Furthermore, we considered Fth1, which provides neuroprotection and is enriched in oligodendrocytes. Mice lacking Fth1 have more microglia cells compared to the control and a significant reduction in neurons and oligodendrocytes.[156]^58 Fatecode accurately assigned a high positive score linking Fth1 to oligodendrocytes and mature granule cells and a negative score for Fth1 and microglia cells, showing that knocking out of Fth1 leads to an increase in microglia cells, consistent with the experimental studies. Thymosin beta 4 (Tmsb4x) is a key candidate in the context of neurogenesis during brain development.[157]^59 Its expression is linked to neurogenic processes and exerts regulatory control over the expansion of the stem cell pool within the early neuroepithelium. Tmsb4x gene knockout elicits a pronounced effect on the differentiation process in vitro. Specifically, it significantly promotes the differentiation of stem cells, further emphasizing its role in orchestrating cellular fate determination.[158]^60 Our method correctly assigns a negative score for Tmsb4x and all cells except neuroblasts and radial glia-like cells. To further validate the performance of Fatecode in detecting key genes, we performed pathway enrichment analysis on the top 200 Fatecode-predicted regulators. This analysis showed that pathways related to brain development, synaptic signaling, and protein synthesis were significantly enriched in these genes ([159]Figure 6C). To illustrate further downstream analysis that is possible based on Fatecode results, we applied single-cell regulatory network inference and clustering (SCENIC) on the mouse hippocampus development dataset to construct a GRN consisting of the top 2,000 interactions based on their SCENIC importance measure scores, which shows the significance of the input gene (referred to as the “TF”) in determining the prediction outcome for the target.[160]^61 We then mapped the top 400 Fatecode-predicted regulators to the SCENIC-inferred GRN. The resulting networks can be used as a guide for identifying specific GRN mechanisms to target in follow-up experiments (Ybx1 example, [161]Figure S6) to test the regulatory relationships and potential roles of regulators in cellular reprogramming. While SCENIC predicts useful additional information to support experiment planning, it only considers transcription factor regulators. Other types of genes in Fatecode’s output can be identified as cell fate regulators and should also be examined. Discussion Cell reprogramming is a promising technology for tissue repair and regeneration, with the ultimate goal of accelerating recovery from diseases or injuries, as well as the development of novel therapies.[162]^62 An important component in successful cell reprogramming is to correctly identify the regulators and trajectories from single-cell transcriptomics data. However, the number of genes in these datasets is large, and the number of underlying regulatory interactions is much larger. Recent studies have demonstrated that the expression of a single regulator is insufficient to produce an endpoint phenotype.[163]^63 Instead, a group of control networks acts together across a variety of biological processes and pathways to induce a complete lineage conversion.[164]^64 To efficiently and accurately map these control networks, we developed a deep learning method, Fatecode, which we have successfully applied to analyze diverse datasets. First, our method discovers an efficient architecture and latent layer for an input single-cell dataset. Then, by performing operations on the latent layer, it is able to predict perturbations for cell fate reprogramming. Fatecode was validated using simulated scRNA-seq data with predefined regulators and by predicting regulators in a variety of biological scRNA-seq data and manually comparing the results to the literature. The fundamental idea in Fatecode is similar to the minimum Hamiltonian in physics and the potential energy landscape concept and representation learning.[165]^27^,[166]^65 These authors have shown that the most common autoencoders are naturally associated with an energy function, independent of the training procedure. This reasoning suggests that regulators can be seen as genes that allow the system to achieve a target cell type distribution via the most efficient path through the energy landscape. Fatecode uses the classifier as a guide to determine what node in the latent layer must be perturbed to achieve the desired reprogramming effect. Then the decoder maps the modified latent layer to gene space for gene identification. It is also useful to understand whether regulators are cell type specific. For example, the mammalian target of rapamycin complex (mTORC1) is widely important in cell fate decision-making and also important in the regulation of T cell fate.[167]^66^,[168]^67^,[169]^68^,[170]^69 Running Fatecode for different cell conversions can help identify cell-type-specific and non-specific regulators. In conclusion, we developed an effective computational framework for predicting key players in cell fate control and the consequences of perturbations on cell type frequencies. Fatecode’s modular design enables users to select an autoencoder architecture that produces an accurate model for their data. By leveraging the power of classification-supervised autoencoders and the associated energy manifold learning process, Fatecode generates useful hypotheses about genes that could be manipulated to achieve desired cell transitions. We hope this method accelerates the discovery of novel cell fate regulators that can be used to engineer and grow cells for therapeutic use in regenerative medicine applications. Limitation of study Fatecode can be thought of as an in silico CRISPR perturbation screen that identifies genes that can influence cell fate. Unfortunately, we were not able to find a published genome-wide CRISPR perturbation screen of an appropriate cell line and with a cell fate readout. Most genome-wide CRISPR-screens use standard cell lines that are not naturally multi-potent and, thus, are not expected to generate multiple cell fates. CRISPR has been used to evaluate cell fate regulators, but only by examining one or a few candidate genes in a single paper. We used these latter small-scale results to verify that Fatecode results agree with these experiments. Because we could not find genome-wide CRISPR screens with a cell fate readout, we used GRN simulations with predefined regulators and small-scale CRISPR experiments to validate our findings. In the future, we hope genome-scale CRISPR screens for cell fate regulators will be published for us to compare to. Despite offering a useful input data representation, how the autoencoder latent layer represents the input data may be difficult to understand. Future work will need to better understand how the input data are represented and learned in the latent layer given diverse input data. However, our results showed that Fatecode predictions are relatively stable when changing the size of the latent layer, indicating that latent information is likely captured consistently. STAR★Methods Key resources table REAGENT or RESOURCE SOURCE IDENTIFIER Deposited data __________________________________________________________________ Zebrafish hematopoiesis data Athanasiadis et.al.[171]^20 E-MTAB-5530 Raw and analyzed data This paper [172]https://doi.org/10.5281/zenodo.11511340 Dentate Gyrus neurogenesis Hochgerner et al.[173]^21 [174]GSE95753 Hematopoiesis Paul et al.[175]^22 [176]GSE72859 Hematopoiesis Weinreb et al.[177]^23 [178]GSE140802 SERGIO Dibaeinia et al.[179]^19 [180]https://github.com/PayamDiba/SERGIO __________________________________________________________________ Software and algorithms __________________________________________________________________ Fatecode This paper [181]https://doi.org/10.5281/zenodo.11511340 Cytoscape Franz et al.[182]^81 [183]https://doi.org/10.1093/bioinformatics/btad031 scFates Faure et al.[184]^34 [185]https://doi.org/10.1093/bioinformatics/btac746 Scenic Aibar et al.[186]^61 [187]https://doi.org/10.1038/nmeth.4463 gseapy Fang et al.[188]^82 [189]https://github.com/zqfang/GSEApy scikit-learn Alex et al.[190]^83 [191]https://scikit-learn.org/ Seurat CRAN [192]https://satijalab.org/seurat/ [193]Open in a new tab Resource availability Lead contact Lead contact Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Mehrshad Sadria (msadria@uwaterloo.ca) Materials availability This study did not generate new unique reagents. Data and code availability * • The datasets used in the present study are openly accessible in public repositories. The zebrafish hematopoiesis data can be found under the accession number E-MTAB-5530 on ArrayExpress. We downloaded a preprocessed version of the “Dentate Gyrus neurogenesis” data (under accession number [194]GSE95753 ) from [195]https://scvelo.readthedocs.io/en/stable/. The hematopoiesis Paul et al. data can be downloaded from the GEO under accession code [196]GSE72859 and the preprocessed version was downloaded from [197]https://celloracle.org/. To generate simulated data, we used the same parameters for the differential equations as in [198]https://github.com/PayamDiba/SERGIO.[199]^19 The hematopoiesis Weinreb et al. data can be downloaded from GEO under accession number [200]GSE140802 and the preprocessed version was downloaded from [201]https://cospar.readthedocs.io/en/latest/. * • Code supporting this study is available on: [202]https://github.com/MehrshadSD/Fatecode and [203]https://doi.org/10.5281/zenodo.11511340. * • Any additional information required to re-analyze the results reported by this study are available from the [204]lead contact upon request. Method details Deep representation learning Autoencoders are a class of neural networks with a latent layer capable of learning nonlinear representations of the input data in an unsupervised manner. An autoencoder consists of an encoder that maps the input to the latent space and a decoder which transfers the latent space back to the original space. It can be used for denoising, reducing dimensionality, or learning the representation (or manifold) of the data. We implemented three autoencoder architectures: under-complete AutoEncoder (AE), Variational AutoEncoder (VAE), and Conditional VAriational Encoder (CVAE)[205]^31 ([206]Figure 1). AE has a single latent layer. VAE constrains the latent layer by modeling the latent space as a multivariate Gaussian distribution with a mean and a standard deviation. CVAE conditions the latent space on class labels and thus can generate data based on a given class label. The biological task for our autoencoder is to learn a reduced dimension representation of a cell by gene matrix capturing measurements of a single-cell transcriptomics experiment mapping cellular trajectories. Only the gene dimension is reduced, so the latent space describes a reduced representation of each input cell transcriptome. To make the latent layer more specific for our biological task, we added a cell type classification task to the standard regression tasks. The classification task, described in more detail below, predicts the type of each latent cell and compares it to a known input cell type. The training process works to optimize both classification and regression performance simultaneously. This reduces the space of latent layer candidates since not all possible latent layers are useful for the classification task. VAE VAE is a type of autoencoder that estimates a latent set of probability density functions that model the input data. Unlike AE, which learns an unconstrained representation of the data, VAE assumes a Gaussian distribution for the prior. An input gene by cell matrix X is run through an encoder, which generates parameters for the set of distributions Q(z |X). Then, from Q, a latent k-vector z is sampled and the decoder transforms z into an output, with the condition that the output is similar to the input, where k equals the number of components (or distributions) in the VAE. The VAE total loss consists of the reconstruction loss (first term) and the KL-divergence loss (second term): [MATH: E[logP(X|z) ]DKL[Q(z|< /mo>X) P(z)] :MATH] [MATH: DKL[Q(z|X< /mi>)P (z)]=12k=1K[1+logσk2μk2σk2< /mrow>] :MATH] where [MATH: μk :MATH] and [MATH: σk :MATH] are the k-th components of output vectors [MATH: μk :MATH] (X) and [MATH: σk :MATH] (X), respectively. CVAE CVAE is distinguished from VAE by its embedding of conditional information in the objective function. CVAE relies on two inputs: the features and the class labels, c, instead of using only the features, as is done with a VAE and AE. The CVAE architecture allows the encoder and the decoder to be conditioned by c. Hence, the variational lower bound objective is changed to the following form. [MATH: E[logP(X|z,c)] DKL[Q(z|< /mo>X,c) P(z|c )], :MATH] Overall network architecture of Fatecode The Fatecode autoencoder architecture was chosen for each of the datasets analyzed in this study using a hyperparameter search (More details in [207]Hyperparameter search section). Encoder and decoder architectures are constrained to have the same number of outer and inner layer nodes. For the analysis of hematopoiesis regulation in zebrafish, Fatecode consists of a fully connected encoder and decoder. The encoder and decoder are both two-layer networks of 92 (outer layer) and 48 (inner layer) nodes with the LeakyReLU activation function and the latent layer has 18 nodes. For the analysis of hematopoiesis in mouse data by Weinreb et al.,[208]^23 the encoder/decoder has a 506-node outer layer and a 253-node inner layer, and the latent layer has 125 nodes. For the mouse hematopoiesis data by Paul et al.[209]^22 the encoder/decoder has a 100-node outer layer and a 40-node and the latent layer has 20 nodes. For the developing mouse hippocampus data, we used a two-layer encoder/decoder of 50 (outer), 26 (inner), and a latent layer of 15 nodes. Our model was built using software packages and libraries, including TensorFlow V2.10.0, scikit-learn V1.1.3, scanpy V1.9.1, numpy V1.23.4, and pandas V1.5.1. Classification The classifier determines cell types using the latent layer as input to a single hidden layer and then an output layer (with one node per cell type), all fully connected. ReLu and softmax activation functions are used for the hidden and output layers, respectively. The number of nodes in the hidden layer is varied during the hyperparameter optimization. For adult zebrafish blood data,[210]^20 we use 15 and 5 nodes for the hidden and output layers, respectively. We use 25 and 12 nodes for classifying hematopoiesis in mouse data by Weinreb et al.,[211]^23 20 and 9 nodes for data from Paul et al.,[212]^22 and 22 and 14 for the developing mouse hippocampus data.[213]^21 All cell labels are assigned by using the predefined cell type labels of the original studies. Identifying key regulators in cell differentiation Consider adjustments (e.g., one or more gene knock-outs or over-expressions) that will transition a baseline cell type distribution (“A”) to a given desired target distribution (“B”). For example, in the target cell distribution, our objective is to increase the number of cell type N while decreasing the number of cell type P ([214]Figure S1). To detect genes that are important in a given transition, Fatecode analyzes the effects of perturbations on cell fate by systematically perturbing individual autoencoder latent nodes learned from a single-cell transcriptomics data capturing cellular trajectories. Each latent variable perturbation results in a single-cell transcriptome through the decoding process and a corresponding cell type distribution, proceeding as follows after training Fatecode. * (1) The gene expression data, denoted as E, corresponding to a mixture of cells with cell type distribution A, undergoes encoding to produce a matrix of latent variables represented as [MATH: X :MATH] ( [MATH: X=encoder(E) :MATH] ). Each column of [MATH: X :MATH] is associated with a cell in E; each row corresponds to a latent variable). * (2) In a series of simulations, finite perturbations of different sizes [MATH: K :MATH] (e.g., from a 50% reduction to a 10-fold increase) are applied to each row j (number of latent variables) in [MATH: X :MATH] sequentially. For each perturbed latent layer row, [MATH: Xj :MATH] [MATH: Xj=kXj :MATH] * (3) We then run the cell type classifier trained within Fatecode on the perturbed latent layer to predict the cell type distribution for each across all perturbation conditions. [MATH: New_cel l_type_< /mo>distribution=clas sifier(Xj) :MATH] * (4) Then, we can identify a perturbed latent layer row, [MATH: Xj :MATH] , and its associated perturbation size, k, that is closest to the desired target distribution B. * (5) To identify genes important for the transition from cell type distributions A to B, we compute the difference between the selected [MATH: X :MATH] and the [MATH: X :MATH] latent layers. For instance, if increasing latent node #9 5-fold can best approximate the desired distribution B, then the difference between the selected [MATH: X :MATH] and [MATH: X :MATH] latent layers is a latent node by cell matrix with all zero entries, except for the 9^th row, which is 5 times [MATH: X9 :MATH] . * (6) With this selected perturbation matrix [MATH: (X−< /mo>X) :MATH] , the decoder produces a gene-by-cell matrix. Then the average gene expression profile of all cells in each cell type is computed, resulting in a gene by cell_type matrix M. The (i,j)-th entry of M is the prioritization score for the i-th gene in cell_type j. * (7) To identify the regulators predicted to be important for transitioning initial cell type distribution A to target B, we rank the genes based on their prioritization scores for a cell type of interest. [MATH: Regulat ors=sort(Mdesired_celltype) :MATH] We note that M does not directly specify how much each gene should be perturbed to yield target B. Nonetheless, M contains information about genes that are important in transitioning cell type distribution from initial state A to the desired state B. This idea is similar to potential energy in physics and representation learning.[215]^27^,[216]^65 We also examine the model’s performance in detecting regulators when operating on the output of the decoder compared to the latent layer. To achieve this, we feed the perturbed vector to the decoder and subtract the result from the unperturbed condition. We then investigate the genes that show significant changes. Our results indicate that working on the latent layer leads to better outcomes in detecting regulators than operating on the output of the decoder. This observation is in line with previous research in computer vision and natural language processing, where using the latent space consistently yields superior results compared to the original data space.[217]^25^,[218]^80 We assume this is true in general when using an autoencoder with a non-linear activation function with reasonably complex data, as we have in biology (in contrast to the linear activation function case where [MATH: Decoder (Xperturbed< /msub>)Decoder(X)=Decoder(Xpe< mi>rturbedX) :MATH] ). Hyperparameter search Hyperparameter tuning was conducted using a grid search approach. We explored various combinations of hyperparameters, including learning rate, batch size, number of layers, number of neurons per layer, and activation functions. The hyperparameter space for each parameter was defined as follows. * (1) Autoencoder type: [AE, VAE, CVAE] * (2) Activation function: [LeakyReLU, Relu, linear] * (3) Learning rate: [0.001, 0.01, 0.1] * (4) Batch size: [400, 500, 600] * (5) Number of hidden layers (encoder): [1, 2] * (6) Number of neurons in latent_layer: [input_size/40, input_size/60, input_size/80, input_size/100, input_size/125, input_size/150] Data visualization Python package “UMAP” was used to visualize the latent layer as a reduced dimensionality space and for network visualizations we used Cytoscape.[219]^81 Data preprocessing The scRNA-Seq gene expression data is log normalized, scaled, and centered. In the training process, 80% of the data is allocated for training the classification-supervised autoencoder, while the remaining 20% is used for testing purposes. Quantification and statistical analysis Differential gene expression analysis was performed using the Wilcoxon rank-sum test. To account for multiple testing, we applied the Benjamini–Hochberg correction to the calculated P-values obtained from the DEG analysis. Genes with a corrected p-value below 0.05 were considered statistically significant. For scFates we used the default parameters. For the identification of enriched gene ontology terms in our study, we used the GSEApy package V1.0.4 with its default parameter settings. Acknowledgments