Abstract
Single-cell sequencing provides transcriptomic profiling at single-cell
resolution, uncovering cellular heterogeneity with unprecedented
precision. Yet, current single-cell data analysis suffers from
inherent noise, batch effects, and sparsity, highlighting the need for
a unified model to represent cellular states. To address this problem,
many recent efforts have focused on training
single-cell foundation models based on large datasets. However, current
human foundation models are still limited by the sizes of training data
and model parameters. Here, we have collected a diverse dataset of 100
million human cells, on which we train a single-cell foundation model
(CellFM) containing 800 million parameters. To balance efficiency and
performance, the model is trained through a modified RetNet framework
on the MindSpore framework. Extensive experiments have shown that CellFM
outperforms existing models in cell annotation, perturbation
prediction, gene function prediction, and gene-gene relationship
capturing.
Subject terms: Bioinformatics, Genetic models, Computational models,
Gene regulatory networks, Gene expression
__________________________________________________________________
Single-cell sequencing reveals cellular heterogeneity but is challenged
by technical noise and batch effects. Here, authors present CellFM, an
800-million-parameter foundation model trained on 100 million human
cells through the MindSpore framework, which outperforms existing
models in downstream tasks.
Introduction
Single-cell RNA sequencing (scRNA-seq) technologies have revolutionized
molecular biology by enabling the measurement of transcriptome profiles
with unparalleled scale and precision^[54]1,[55]2. As single-cell
technologies advance, the rapid accumulation of extensive datasets has
posed significant analytical challenges^[56]3,[57]4, primarily due to
the data’s inherent noise, sparsity, and batch effects. Despite the
development of numerous single-cell-specific tools^[58]5–[59]7 to
address these challenges, their performance often falls short when
applied to new datasets and struggles to scale with the growing data
size. More importantly, these tools fail to fully leverage the rich
information embedded in large atlas datasets, underscoring the need for
novel computational strategies.
To address this challenge, several single-cell foundation models have
been developed to analyze single-cell data^[60]8. Drawing inspiration
from the remarkable success of large language models (LLMs) in natural
language processing (NLP)^[61]9, and aiming to reduce training costs,
researchers have begun exploring the fine-tuning of these LLMs using
relatively small amounts of single-cell data. For example,
Cell2Sentence^[62]10 converts gene expression profiles of individual
cells into sequences of gene names ordered by expression levels and
uses these sequences to fine-tune the GPT-2 model. Similarly,
GenePT^[63]11 utilizes GPT-3.5 to generate gene embeddings based on
gene descriptions and metadata. While these approaches have improved by
fine-tuning the GPT models, they still fall short of fully harnessing
the rich gene expression data in large single-cell datasets,
highlighting the need for more comprehensive strategies.
To create single-cell foundation models from scratch, three types of
models have been proposed: gene ordering, value categorization, and
value projection. We first introduce
gene-ranking-based models, such as iSEEEK^[64]12, which was the first
model trained on over 10 million cells by predicting gene rank.
Similarly, tGPT^[65]13 learns gene embeddings by autoregressively
modeling gene ranks relative to their neighbors, processing sequences
of genes ordered by expression levels, and predicting the next gene’s
rank based on prior context. Trained on 22.3 million single-cell
transcriptomes from humans and mice, tGPT demonstrated superior
performance across multiple datasets. Geneformer^[66]14 predicts gene
positions within the cellular context to derive rank-based gene
embeddings. With training on a dataset of 30 million single-cell
transcriptomes spanning diverse human tissues, Geneformer has achieved
outstanding predictive performance.
Recently, value categorization strategies were applied to leverage gene
counts. By representing each gene with an embedding and binning its RNA
counts, these models convert the continuous gene expression data into
categorical values, enabling the use of methods designed for
categorical data. For example, scBERT^[67]15 bins gene expression
values into discrete “buckets,” transforming the continuous task of
predicting gene expression into a classification problem. Trained on
millions of human cells, scBERT has shown improved performance across
various datasets. Similarly, scGPT^[68]16 also segments gene expression
values but enhances the process with an attention mask mechanism for
autoregressive prediction. Using a self-supervised approach, scGPT
optimizes both cell and gene representations. Trained on over 33
million human cells, scGPT excels in various single-cell tasks. The
Universal Cell Embedding (UCE)^[69]17 captures molecular diversity
across species by integrating genetic data using protein language
models. It uses self-supervised learning to predict gene expression by
masking a subset of genes, refining predictions with binary
cross-entropy loss. With over 650 million parameters, UCE is trained on
more than 36 million cells, offering insights into gene expression
across diverse cellular contexts.
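As an illustration of the value-categorization strategy, the sketch
below (our own minimal example, not code from scBERT or scGPT; the bin
count and quantile scheme are illustrative assumptions) converts a
cell's continuous expression vector into discrete bucket indices so
that expression prediction becomes a classification task:

```python
import numpy as np

def bin_expression(expr: np.ndarray, n_bins: int = 50) -> np.ndarray:
    """Map a cell's continuous expression values to discrete bucket
    indices (0 = not expressed), posing expression prediction as a
    classification problem."""
    binned = np.zeros_like(expr, dtype=np.int64)
    nonzero = expr > 0
    if nonzero.any():
        # Quantile-based edges adapt the buckets to each cell's depth;
        # the published models use their own binning schemes.
        edges = np.quantile(expr[nonzero], np.linspace(0, 1, n_bins))
        binned[nonzero] = np.digitize(expr[nonzero], edges[1:-1]) + 1
    return binned

# Toy cell with 6 genes: zeros stay in bucket 0, others get buckets 1..4.
cell = np.array([0.0, 1.2, 0.0, 3.5, 0.7, 9.8])
print(bin_expression(cell, n_bins=5))
```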
To further predict precise gene values, the value projection-based
single-cell models were proposed. In this strategy, the input
representation of each gene i is expressed as the sum of two
components: a projection of its expression value x_i and a positional
or gene embedding. Unlike
ordering and value categorization methods, the key advantage of value
projection is that it preserves the full resolution of the data. For
example, scFoundation^[70]18 directly predicts raw gene expression
values using a mask autoencoder (MAE). Trained on a large dataset of
around 50 million human cells with ~0.1 billion parameters,
scFoundation demonstrates decent performance in single-cell analysis.
Similarly, GeneCompass^[71]19 incorporates four types of biological
prior knowledge to enhance the understanding of gene regulatory
mechanisms during gene expression prediction. Trained on about 50
million human and 50 million mouse cells, GeneCompass has around 100
million parameters. To integrate metadata information, scELMo^[72]20
leverages LLMs like GPT-3.5 to generate metadata descriptions and
embeddings, combining them with raw data in both zero-shot and
fine-tuning frameworks to address a variety of tasks.
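To make the value-projection strategy concrete, the following sketch
(our own illustration; module names and dimensions are assumptions
rather than any published architecture) forms each gene's input token
as the sum of a linear projection of its scalar expression value and a
learned gene-identity embedding:

```python
import torch
import torch.nn as nn

class ValueProjectionEmbedding(nn.Module):
    """Token for gene i = value_proj(x_i) + gene_embedding_i, preserving
    the full resolution of the expression data (a sketch of the
    value-projection strategy)."""
    def __init__(self, n_genes: int, d_model: int):
        super().__init__()
        self.value_proj = nn.Linear(1, d_model)         # projects scalar x_i
        self.gene_emb = nn.Embedding(n_genes, d_model)  # gene identity

    def forward(self, expr: torch.Tensor) -> torch.Tensor:
        # expr: (batch, n_genes) continuous expression values
        gene_ids = torch.arange(expr.size(1), device=expr.device)
        return self.value_proj(expr.unsqueeze(-1)) + self.gene_emb(gene_ids)

tokens = ValueProjectionEmbedding(n_genes=2000, d_model=64)(torch.rand(2, 2000))
print(tokens.shape)  # torch.Size([2, 2000, 64])
```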
Despite these endeavors, the potential of foundation models trained
exclusively on 100 million human cells has yet to be fully explored.
The limited availability of sufficient single-species training data,
such as for human cells, has hindered the development of large-scale,
single-species models. Existing single-species models are typically
trained on around 50 million cells, resulting in fewer than 100 million
parameters. One reason for this limitation is the difficulty in
collecting single-cell datasets, which are often stored in various
formats (e.g., FASTQ, h5ad, Seurat objects, 10x Genomics) and dispersed
across different repositories to accommodate diverse data processing
and analysis needs.
Here, we collected a large number of single-cell datasets from public
databases and then performed data cleansing and standardization into a
unified format, compiling a dataset of approximately 100 million
human cells sequenced through various technologies. These datasets are
twice as large as those used in the current largest single-species
model, providing a rich foundation for training a larger
model^[73]21,[74]22. We propose CellFM (Fig. [75]1), a robust
single-cell foundation model with 800 million parameters, marking
an eightfold increase in model parameters over the current largest
single-species model. To enhance the training of CellFM’s extensive
parameters and to handle its substantial dataset, we have integrated
ERetNet, a Transformer architecture variant with linear complexity.
ERetNet’s design ensures a balance between efficiency and performance,
serving as the backbone of our model. CellFM is categorized as a
value-projection-based single-cell foundation model, as it aims to
recover the vector embeddings of masked genes derived from their linear
projections based on gene expression values. CellFM is developed using
the MindSpore AI framework from Huawei and is trained on four Huawei
Atlas800 servers, each equipped with eight Ascend910 NPUs. Our
comprehensive experiments have shown that CellFM outperforms existing
models across diverse applications, such as cell annotation,
perturbation prediction, and gene function prediction.
Fig. 1. Overview of the CellFM Framework.
a CellFM consists of the Embedding module, the ERetNet module, and the
LoRA module. The expression of the cell is first fed into the Embedding
module to obtain the initial token values of each gene. The embedded
gene tokens of the cell are then input into the ERetNet module to learn
the gene-to-gene relationships and gene embeddings. Finally, the
Low-Rank Adaptation (LoRA) is implemented to minimize the number of
training parameters of CellFM. b Each ERetNet Layer block integrates
Multi-Head Attention (MHA), the Simple Gated Linear Unit (SGLU), and
Layer Normalization (LN). c The collecting workflow and constituents of
the training dataset employed within CellFM. d The pre-trained CellFM
model is adaptable for a multitude of single-cell downstream analyses,
including cell type annotation, perturbation prediction, gene network
inference, and gene function prediction.
Results
Overview of CellFM
Single-cell sequencing technology is crucial for revealing the detailed
landscape of cellular diversity and function at the single-cell
resolution. With the development of single-cell sequencing
technologies, a vast array of datasets has been amassed, laying a
robust groundwork for the training of single-cell foundation models.
However, these datasets are available across various public
repositories, including the National Center for Biotechnology
Information (NCBI) Gene Expression Omnibus (GEO)^[78]23, the European
Nucleotide Archive (ENA)^[79]24, the Genome Sequence Archive
(GSA)^[80]25,[81]26, and ImmPort^[82]27. We have meticulously
curated single-cell data from these esteemed public databases
(Fig. [83]1a). These datasets are stored in multiple formats, such as
FASTQ data, expression matrices, or Seurat/Scanpy objects. We first
process raw FASTQ data into the gene expression matrix through primary
analysis software provided by manufacturers. Subsequently, all acquired
expression matrices were processed using a standardized data analysis
workflow facilitated by the SynEcoSys® single-cell database (Singleron
Biotechnologies)^[84]28. This process involved three key steps:
quality control for filtering cells and genes, gene name
standardization according to HUGO Gene Nomenclature Committee (HGNC)
guidelines, and conversion of the data to a unified sparse matrix
format
for subsequent analysis. Our efforts have successfully aggregated
19,914 samples, totaling 102,304,686 human cells from different organs
and single-cell sequencing technologies. We have provided a dataset
summary (Supplementary Fig. [85]S1) and shared a detailed list of the
sources for the training datasets used in CellFM
(Supplementary_Metadata_information.xlsx). Concretely, 46.3 million
cells were derived from normal donors, while the remaining cells came
from diseased donors, including 7.1 million cells from donors with
viral infections and 3.5 million cells from lung cancer donors. Most
datasets (66.7 million cells) were sequenced with the 10x Genomics 3′
assay. Among the curated dataset, approximately
70 million cells had annotated cell types. The training dataset
includes a diverse range of cell types, such as T cells (19.2 million),
mononuclear phagocytes (7.01 million), neurons (6.29 million), and
fibroblasts (3 million).
Building on our comprehensive collection of human cell data, we
introduce CellFM, an efficient foundation model endowed with 800
million parameters, designed to streamline the analysis of single-cell
data (Fig. [86]1b). The model’s core comprises an embedding
module, a series of stacked ERetNet Layers, and the low-rank adaptive
module (LoRA) mechanism. CellFM begins by converting scalar gene
expression data into rich, high-dimensional embedding features through
its embedding module. These gene embeddings are then fed into L ERetNet
Layers, which are adept at capturing the nuanced relationships among
genes based on their expression profiles. Each ERetNet Layer is
composed of several key components: the Gated Multi-head Attention
(MHA) unit, the Simple Gated Linear Unit (SGLU), and Layer
Normalization (LN). Collectively, these elements empower the ERetNet
Layer to achieve training parallelism, cost-effective inference, and
superior performance (Fig. [87]1c). Furthermore, CellFM integrates the
LoRA module to reduce the number of trainable parameters during the
fine-tuning phase when adapting the model to new datasets. Once
pre-trained, CellFM can be applied to multiple single-cell downstream
applications, such as gene function prediction, cell type annotation,
perturbation effect prediction, and gene network analysis
(Fig. [88]1d).
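As a schematic of the ERetNet Layer described above, the following
PyTorch stand-in (our own sketch; the actual MindSpore implementation
uses a retention-style mechanism with linear complexity, whereas
standard multi-head attention is used here purely for illustration)
combines MHA, an SGLU feedforward block, and layer normalization with
residual connections:

```python
import torch
import torch.nn as nn

class ERetNetLayerSketch(nn.Module):
    """Stand-in for one ERetNet Layer: Multi-Head Attention (MHA), a
    Simple Gated Linear Unit (SGLU), and Layer Normalization (LN)."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        # SGLU: an elementwise sigmoid gate over a linear "up" branch.
        self.gate = nn.Linear(d_model, 4 * d_model)
        self.up = nn.Linear(d_model, 4 * d_model)
        self.down = nn.Linear(4 * d_model, d_model)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.mha(h, h, h)      # gene-to-gene mixing
        h = self.ln1(h + attn_out)           # residual + LN
        sglu = self.down(torch.sigmoid(self.gate(h)) * self.up(h))
        return self.ln2(h + sglu)            # residual + LN

# CellFM uses d = 1536 and h = 48; tiny values here keep the demo light.
h = torch.randn(1, 2048, 64)                 # 2048 gene tokens of one cell
print(ERetNetLayerSketch()(h).shape)         # torch.Size([1, 2048, 64])
```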
CellFM improves the accuracy of gene function prediction
Gene function prediction is a cornerstone for deciphering the roles and
properties of genes under diverse conditions^[89]14. With the human
genome comprising ~20,000 protein-coding genes^[90]29, and a
significant portion lacking functional annotations, the accurate
prediction of gene functions is imperative for a deeper understanding
of their roles within biological systems. Here, we evaluated the
performance of CellFM for identifying gene function through three
distinct gene categories including Dosage sensitivity (referred to as
T1), Bivalent methylation status versus non-methylated (T2), and
Bivalent methylation versus Lys4-only methylation (T3). These
categorizations represent binary classification challenges, where model
predictions are assessed against actual gene function labels. Given
the limited number of genes in these three tasks, typically fewer than
1000, fine-tuning existing single-cell foundation models presents a
challenge.
To make a fair comparison, we adopted a zero-shot learning strategy for
each model on the gene function prediction task. As shown in
Fig. [91]2a, our model demonstrated remarkable results, achieving the
best performance on three tasks. CellFM surpassed existing methods,
with a 5.68% and 5.86% increase over the top two competing models UCE
and scGPT in terms of average accuracy values, respectively. A similar
trend was observed for Macro-F1 scores (Fig. [92]2b). The superior
performance of our model was further substantiated by visualization
results generated by Uniform Manifold Approximation and Projection
(UMAP) using the gene embeddings from each pre-trained model. Our
model’s ability to distinctly categorize dosage-sensitive from
non-sensitive genes was evident in Fig. [93]2c and Supplementary
Fig. [94]S2. However, scGPT and Geneformer exhibited an overlap in the
embedding space that could undermine gene function prediction accuracy.
In summary, our findings underscore the model’s proficiency in
accurately predicting gene functions utilizing a zero-shot approach,
showcasing its efficacy without the need for extensive model
fine-tuning.
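The zero-shot protocol amounts to training a lightweight classifier on
embeddings from a frozen model, as in the sketch below (our own
minimal example using scikit-learn; the embedding matrix and labels
are random stand-ins for frozen gene embeddings and T1/T2/T3
annotations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
gene_emb = rng.normal(size=(800, 128))   # stand-in: frozen gene embeddings
labels = rng.integers(0, 2, size=800)    # stand-in: e.g. dosage-sensitive or not

# Zero-shot: the foundation model stays frozen; only this classifier
# is fit on its pre-computed embeddings.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, gene_emb, labels, cv=5, scoring="accuracy")
print(f"ACC: {scores.mean():.3f} +/- {scores.std():.3f}")
```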
Fig. 2. Comparison of Gene Function Prediction performance in a zero-shot
setting.
a Accuracy (ACC) scores and b Macro-F1 values for CellFM along with
other competing single-cell foundation models on the binary
classification data. c The visualization results of CellFM, scGPT, and
Geneformer are plotted by the Uniform Manifold Approximation and
Projection (UMAP) using the gene embedding generated through each
model. d The Area Under the Precision-Recall Curve (AUPR) values for
multiple gene functions predicted by CellFM and other competing
single-cell foundation models on Gene Ontology (GO) data, where each
gene is annotated with numerous functions. MF Molecular Function, CC
Cellular Component, and BP Biological Process. Source data are provided
as a Source Data file.
To evaluate the ability of CellFM on multi-class classification, we
further performed the gene function prediction using the data from the
Gene Ontology (GO). This dataset includes three major categories:
biological process (BP), cellular component (CC), and molecular
function (MF). Detailed information about the GO dataset can be found
in the referenced study^[97]30. Given the complexity of predicting all
gene functions (BP: 1578, CC: 253, MF: 299), we focused our evaluation
on the top 10 most frequent functions within each category. This
approach ensures a realistic yet manageable benchmark for model
comparison. To guarantee fairness across methods, we intersected the
gene sets from each foundational single-cell model and maintained
consistent training, validation, and test sets. As shown in
Fig. [98]2d, CellFM demonstrated superior average performance,
outperforming the top two models GeneCompass and UCE by 1.6% and 1.94%
in average AUPR, respectively. Other models such as scFoundation and
Geneformer also delivered competitive results with scGPT achieving an
AUPR of 71.3%. We did not compare with scELMo because it employs
large language models (LLMs) such as GPT-3.5 to generate embeddings
from metadata descriptions during training, and these descriptions
already include gene function information.
CellFM enables predicting perturbation responses
Recent advancements in sequencing and gene editing have enabled
large-scale experimental perturbation simulations to study changes in
gene expression and cellular behavior. These simulations are essential
for understanding cellular responses to various stimuli and are
increasingly applied to investigate drug effects, disease mechanisms,
and therapeutic interventions. However, the vast combinatorial space of
potential gene perturbations quickly exceeds the practical limits of
experimental feasibility. To overcome this, single-cell foundation
models adopt perturbation modeling, leveraging knowledge from known
experimental perturbations to predict responses to unknown
perturbations. By utilizing self-attention mechanisms over the gene
dimension, these models can capture intricate gene interactions and
accurately predict gene expression responses in unseen perturbations.
The predictive power of perturbation modeling becomes especially
valuable in AI-driven drug discovery, where it is used to forecast how
existing drugs or genes will affect cellular processes, identify new
drug targets, and explore drug repurposing opportunities. Additionally,
these models offer deep insights into cellular heterogeneity, crucial
for developing personalized medicine strategies.
To assess CellFM’s proficiency in predicting perturbation responses, we
utilized two Perturb-seq datasets: (1) the Adamson dataset^[99]31,
encompassing 87 single-gene perturbations with roughly 100 cells per
perturbation and at least 7000 control cells; and (2) the Norman
dataset^[100]32, which includes 131 dual-gene and 105 single-gene
perturbations. As depicted in Fig. [101]3a, we employed the Pearson
correlation metric on the top 20 differentially expressed genes (De) to
evaluate each model, where Δ denotes the degree of gene expression
alteration post-perturbation relative to the control state. We
evaluated all single-cell foundation models on the perturbation task by
combining them with the classic perturbation model GEARS, as suggested
by the scFoundation study. GEARS is a computational tool specifically
designed for predicting single and multi-gene perturbations based on
scRNA-seq datasets. GEARS operates by integrating a gene-gene
interaction network as prior knowledge, which allows it to leverage
existing biological information to improve prediction accuracy. GEARS
has demonstrated state-of-the-art performance in gene perturbation
predictions, making it a leading choice in the field. Concretely, we
replaced GEARS’ gene embeddings with those derived from CellFM
(Fig. [102]3a). As shown in Fig. [103]3b-c, our model consistently
outperformed all competing single-cell foundation models, achieving
improvements of 1% and 1.45% in average PCC and MSE compared to the
second-ranked model scFoundation, respectively. Additionally, CellFM
consistently surpassed GEARS with 4.75% and 7% improvement regarding
average PCC and MSE values, respectively. As shown in Supplementary
Fig. [104]S4, CellFM consistently outperformed all other single-cell
foundation models, as well as the non-foundational model GEARS, when
measured by the R^2 metric. CellFM achieved an R^2 value that was 1.3%
higher than the second-best scGPT in terms of average R^2. The visual
results for two specific perturbation cases from the Adamson dataset in
Fig. [105]3d have further shown that CellFM could accurately predict
the direction of perturbation.
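The Pearson-Δ metric used above can be computed as in this sketch (our
own illustration of the described evaluation; array names and the toy
data are hypothetical):

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_delta(pred_post, true_post, ctrl_mean, top_de_idx):
    """Pearson correlation between predicted and observed expression
    changes (delta = post-perturbation mean - control mean), restricted
    to the top differentially expressed (DE) genes."""
    delta_pred = pred_post[top_de_idx] - ctrl_mean[top_de_idx]
    delta_true = true_post[top_de_idx] - ctrl_mean[top_de_idx]
    return pearsonr(delta_pred, delta_true)[0]

rng = np.random.default_rng(1)
ctrl = rng.random(5000)                        # control mean expression
true = ctrl + rng.normal(0, 0.5, 5000)         # observed post-perturbation
pred = true + rng.normal(0, 0.2, 5000)         # model prediction
top20 = np.argsort(np.abs(true - ctrl))[-20:]  # top 20 DE genes
print(pearson_delta(pred, true, ctrl, top20))
```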
Fig. 3. Analysis of Perturbation Response and Reverse Perturbation
Predictions.
a A diagram showcasing the perturbation prediction model leveraging
cell-specific gene embeddings derived from CellFM. b The mean square
error (MSE) between predicted and actual post-gene expressions for the
top differentially expressed (DE) genes in a zero-shot setting. c
Comparison of CellFM with other single-cell foundation models and the
perturbation prediction method GEARS. Pearson correlation coefficients
between predicted and actual gene expression changes are reported for
the top differentially expressed (DE) genes in a zero-shot setting. d
Analysis of gene expression changes following perturbations of AHR+ctrl
(n = 464 cells) and BPGM+ZBTB1 (n = 280 cells). The plots compare
predicted versus actual expression changes for the top 20
differentially expressed genes. Box plot elements represent: center
line, median (50th percentile); box limits, upper (75th) and lower
(25th) quartiles; whiskers, 1.5 × interquartile range (IQR) from the
box; points beyond whiskers are considered outliers. The horizontal
dashed line indicates the null effect baseline (0 change). Minimum and
maximum values are represented by the whisker endpoints, with all
percentiles calculated from the empirical distribution of expression
changes. e A graphical representation of potential perturbation
combinations across a 20-gene space, differentiated by experiment type
(train, valid, test, unseen). Predicted perturbations are indicated by
square boxes, with the actual source perturbation marked by a cross.
The boxes are colored as follows: dark purple for test data, light
purple for validation, medium purple for training, and gray for unseen.
f The accuracy of each model in predicting the correct source of
perturbation among the top 10 predictions for test cases in a
fine-tuning setting. Source data are provided as a Source Data file.
The Norman dataset targeted 105 genes with 236 perturbations and
represented just 5% of the expansive 5565 possible gene combinations,
highlighting the vast unexplored perturbation space. Consequently, we
harnessed CellFM to extend the scope of perturbations virtually and
graphically represent the anticipated average response for each gene
combination. Concretely, we fine-tuned CellFM on the existing
knockouts (KOs) in the original Norman dataset, which covers 236
perturbations targeting 105 genes, and extrapolated to other
perturbations. We then used the fine-tuned model to
predict and expand the responses to untested perturbation combinations
in silico. These predictions were visualized as the response for each
perturbation combination. We excluded all perturbed genes and plotted
the UMAP graph. As shown in Supplementary Fig. [108]S5, the clusters
exhibited overlap in certain regions while remaining distinct in
others, consistent with the expectation that several perturbations
either have no effect or produce similar effects. The genes depicted in
each cluster represent the “dominant gene” within the perturbation
combinations. The results have demonstrated a strong association
between the clusters and their respective dominant genes. For example,
the cluster associated with the SET gene indicates that the data points
in this cluster correspond to combined perturbations involving SET and
another gene (e.g., SET+CLDN6, SET+MIDN).
In addition to gene perturbation, we further validated the
performance of CellFM on drug-perturbed data. Similarly, we also
combined CellFM with CellOT^[109]33, a classic non-foundation model
for drug perturbation prediction. Concretely,
we integrated CellFM and CellOT by replacing the input cell
representation of CellOT with the cell representation from CellFM
(Supplementary Fig. [110]S9a), as CellOT is specifically tailored for
drug perturbation and does not require gene embeddings. In this
setting, we compared CellFM with CellOT, scGEN, and Identity. As
indicated in Supplementary Fig. S9b, CellFM outperformed CellOT in perturbation
prediction on four drugs, achieving improvements of 66.6% and 2.2% in
average l2 distance and PCC, respectively.
Reverse perturbation prediction in silico using CellFM
Beyond forecasting the outcomes of gene perturbations, the accurate
prediction of CRISPR target genes that prompt cellular recovery from
disease states is equally significant. Here, we conducted “in silico
reverse perturbation prediction” following the scGPT study^[111]16,
utilizing the Norman dataset. Concretely, we followed scGPT, selecting
20 perturbation genes from the Norman dataset to construct perturbation
cases for fine-tuning and testing. This combinatorial space consists of
210 one-gene or two-gene perturbation combinations. The subset was
selected to maximize the representation of ground truth perturbation
data across both training and testing cases, using a random train-test
split. Since scGPT did not specify a particular seed for splitting, we
used the default seed. The resulting dataset contained 47 (22%) known
perturbations, including 33 training cases, 3 validation cases, and 11
test cases, with the remaining perturbation cases left as unseen. Both
scGPT and CellFM were evaluated on this newly split dataset, as we were
unable to replicate the exact splits used in the original scGPT study.
CellFM demonstrated remarkable success, accurately predicting the
perturbations that would yield the observed cellular outcomes. For
example, it accurately identified the combination of CNN1 and ETS2, as
well as the pairing of ETS2 and IGDCC3, as the top predictions for a
specific test case (Fig. [112]3e). A similar accuracy trend was
observed for the perturbations involving the combinations of IGDCC3 and
MAPK1, and CEBPE and CNN1 genes (Supplementary Fig. [113]S6). CellFM
and scGPT achieved similar performance in relevant perturbations
(purple bars in Supplementary Fig. [114]S8). However, CellFM
correctly identified the source perturbation within its top 10
predictions for an average of 81.8% of test cases (Fig. [115]3f),
18.1% higher than scGPT. As the
number of top predictions considered increased, both CellFM and scGPT
demonstrated enhanced performance. However, GEARS did not exhibit a
comparable level of improvement in accuracy. The performance of GEARS
was consistent with the findings reported in the scGPT study, where the hit
rates for top 1 to top 8 predictions also remained constant. This
behavior is likely due to the limitations of GEARS as a smaller model,
which struggles to identify perturbation combinations effectively. To
further assess the robustness of CellFM, we performed additional
evaluations by varying the random seeds during the dataset splitting
process to generate the new training and test cases. As shown in
Supplementary Fig. [116]S7b, CellFM outperformed scGPT and GEARS in
terms of average hit rate accuracy. Specifically, CellFM achieved an
average of 36.3% and 54.5% correctly identified perturbations in the
top 3 and top 5 predictions, which were 18.1% and 18.2% higher than
scGPT, respectively. To further examine whether the performance of
CellFM was affected by random seeds during model fine-tuning, we
re-evaluated CellFM, scGPT, and GEARS under different random seeds. The
results demonstrated that CellFM consistently maintained comparable top
1 prediction performance, outperforming scGPT by 9.1% (Supplementary
Fig. [117]S7a). For the reverse perturbation prediction, we maintained
the original comparison for CellFM and only evaluated it against scGPT
and GEARS since the newly added single-cell foundation models neither
provided code nor published corresponding results in their original
studies (Supplementary Fig. [118]S3) and reimplementing these models
would have required substantial programming effort and time.
The current study is limited to gene perturbation without considering
drug molecules. Thus, our method does not perform in silico reverse
perturbation prediction on drug perturbation datasets such as sciPlex,
since drug perturbation datasets require additional information
specific to the drug molecules themselves. In the future, we will
expand CellFM’s capabilities to support drug perturbation datasets,
which will involve adapting the model architecture to incorporate
drug-specific molecular information.
Cell type annotation with CellFM
Cell type annotation is a cornerstone of single-cell data analysis,
essential for uncovering the cellular heterogeneity within biological
samples. To evaluate CellFM’s competency in cell type annotation, we
conducted an exhaustive benchmark against several recent single-cell
foundation models. We have also included two baseline methods—SVM and
scmap—which were also suggested simultaneously in the benchmark
studies^[119]34,[120]35. The cell annotation benchmarks used in our
study, including intra-dataset and inter-dataset evaluations, were
based on the latest benchmarking framework, scEval^[121]8, designed to
evaluate single-cell foundation models on cell annotation tasks.
Following prior studies such as scGPT and scEval, we utilized
randomized train-test splits for intra-dataset evaluation to assess
model performance under consistent experimental conditions. While
intra-datasets were a part of our evaluation, we also performed
inter-dataset testing, which better reflects real-world scenarios
involving large batch effects. For inter-datasets, we partitioned the
data by batch or patient ID, iteratively using each batch as the test
set while training on the remaining batches. For hPancreas, we directly
adopted the train-test settings established in scGPT, ensuring training
and testing data came from different batches. This cross-batch
validation ensures a thorough and fair assessment of model robustness
across batches. Importantly, the cell annotation datasets used in our
study were not included during the pre-training of CellFM. For the
zero-shot cell type annotation task, all foundation models, including
ours, were fully frozen during training. Classifiers (e.g., MLP or
CNN+MLP) were then trained on embeddings extracted from these frozen
models using labeled training data and the cross-entropy loss function.
These trained classifiers were subsequently used to predict cell types
on test datasets. We used the default classifiers implemented in each
single-cell foundation model. For models like UCE, which lack a
predefined classifier, we implemented a multi-layer perceptron (MLP)
approach, in line with standard foundation model practices. For the
CellFM, we followed it with the MLP classifier for the classification
task.
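In code, this zero-shot annotation protocol reduces to the following
sketch (our own minimal PyTorch example; `cell_emb` is a random
stand-in for embeddings extracted from any frozen foundation model):

```python
import torch
import torch.nn as nn

cell_emb = torch.randn(1024, 512)        # stand-in: (n_cells, d) frozen embeddings
labels = torch.randint(0, 10, (1024,))   # stand-in: 10 cell type labels

# Only the MLP head is trained; the foundation model stays fully frozen.
mlp = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()          # as described above

for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(mlp(cell_emb), labels)
    loss.backward()
    opt.step()

pred = mlp(cell_emb).argmax(dim=1)       # predicted cell types
print((pred == labels).float().mean())   # training accuracy
```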
Based on the findings in the “Scaling of data and model size” section
(Supplementary Note [122]4), we used CellFM with 80 million model
parameters (CellFM-80M) for cell annotation and batch effect correction
tasks. Our initial evaluation involved eight intra-datasets. Following
the methodology established by scGPT^[123]16, we segmented each
intra-dataset, allocating 70% for training and the remainder for
testing. As illustrated in Fig. [124]4a, CellFM(80M) outperformed the
baselines in terms of ACC across intra-datasets (hereafter we simply
refer to CellFM(80M) as CellFM). The average ACC for CellFM was
92.91%, surpassing the second-ranked single-cell foundation model
scFoundation by 2.02%. CellFM was also 1.97%, 26.86%, and 10.2% higher
than SVM,
scmap, and scBERT, respectively. A similar trend could be found when
measured by the Macro-F1 values (Fig. [125]4b). To substantiate the
superior outcomes of CellFM, we present a case study of the immune
dataset. As visualized in Fig. [126]4c, d and Supplementary
Fig. [127]S10, 11, CellFM achieved high precision for
most cell types. To evaluate the efficacy of our model in the context
of batch effects, we assessed its performance on 7 paired cross-batch
datasets. In each scenario, a distinct batch of data was designated as
the test set, with the remaining data constituting the training set. As
depicted in Fig. [128]4e, f and Supplementary Fig. [129]S12, our model
continued to outperform its competitors and was 2.3% higher than the
second-ranked single-cell foundation model scFoundation. CellFM
outperformed SVM, scmap, and scBERT on inter-dataset cell annotation
tasks, with average accuracy improvements of 2%, 21.1%, and 3.4%,
respectively. We further present a visualization case study of the
hPancreas dataset in Fig. [130]4g, h and Supplementary
Fig. [131]S10-11. CellFM showed high precision for most cell types. We
also evaluated the embedding quality of all single-cell foundation
models on inter-datasets using the scIB metric scores. As shown in
Supplementary Fig. [132]S15, CellFM outperformed competing models in
AvgBio scores, aligning with its enhanced cell classification
performance. Additionally, we incorporated the scIB metric scores to
evaluate the impact of MLP layers in CellFM and found minimal
performance variation when adjusting layers from one to three.
Supplementary Fig. [133]S16 shows a positive correlation between
classification accuracy and AvgBio scores, but both metrics exhibited
only minor changes across layer configurations.
Fig. 4. Zero-shot cell type annotation performance of each model.
Heatmaps illustrating (a) classification accuracy and (b) Macro-F1
scores of each model across intra-datasets. Red indicates lower
performance, and blue represents the highest performance. c The river
plot of CellFM illustrates the predicted cell types and their
relationships to the actual cell types on the Immune dataset. d The
river plot of scGPT illustrates the predicted cell types and their
relationships to the actual cell types on the Immune dataset. Heatmaps
show (e) classification accuracy and (f) Macro-F1 scores of each model
across inter-datasets, with each value representing the average
accuracy calculated from five independent runs using different random
seeds. Red indicates lower performance, and blue represents the highest
performance. g The river plot of CellFM illustrates the predicted cell
types and their relationships to the actual cell types on the hPancreas
dataset. h The river plot of scGPT illustrates the predicted cell
and their relationships to the actual cell types on the hPancreas
dataset. Source data are provided as a Source Data file.
To further assess CellFM’s capability to distinguish subtypes such as
exhausted and activated CD8+ T cells, we evaluated it on the basal cell
carcinoma (BCC) dataset ([136]GSE123813) and the liver hepatocellular
carcinoma (LIHC) dataset ([137]GSE140228), both obtained from a
database article^[138]36 due to their availability in h5ad format and
detailed cell type annotations. As shown in Supplementary
Figs. [139]S13 and S14, CellFM consistently outperformed other models
across all cell types, achieving an average accuracy 2.3% higher than
the second-best model, UCE. Notably, CellFM demonstrated exceptional
performance in distinguishing exhausted and activated CD8+ T cells,
surpassing UCE by an average of 6.5%. On the BCC_GSE123813 dataset,
CellFM achieved accuracy scores of 77% and 74% for exhausted and
activated CD8+ T cells, respectively, outperforming UCE by 6% and 7%. A
similar trend was observed on the LIHC_GSE140228 dataset, further
confirming CellFM’s robustness in identifying these cell states.
We also fine-tuned CellFM, scGPT, GeneCompass, and Geneformer. As
shown in Supplementary Fig. [140]S17, the performance of fine-tuned
scGPT aligns closely with the results reported in its original article
and the benchmark study scEval^[141]8. For instance, fine-tuned scGPT
achieved 92.2% cell type annotation accuracy on the human pancreas
dataset, comparable to its original study and that reported in scEval.
Similarly, fine-tuned Geneformer reached 85.3% accuracy on the same
dataset, consistent with the performance reported in scEval, although
its original paper did not provide results for cell type
classification. The fine-tuned GeneCompass achieved comparable results
with the fine-tuned CellFM. CellFM(800M) obtained lower performance
on the intra- and inter-datasets when evaluated in the zero-shot
setting.
However, across all inter-datasets, CellFM (800M) demonstrated an
average fine-tuning accuracy that was 12.8% and 15.92% higher than
scGPT and Geneformer, respectively. This performance indicated the
potential power of the larger model. To evaluate how LoRA affects the
performance of CellFM during the fine-tuning phase, we conducted
ablation experiments on the cell type annotation task using the
inter-datasets. As shown in Supplementary Fig. [142]S18, the
performance of CellFM showed minimal changes when LoRA was applied
compared to when it was not, across the inter-datasets. However, using
LoRA reduced the time required for fine-tuning. Based on these
findings, we recommend applying LoRA during the fine-tuning of CellFM
to enhance efficiency without sacrificing performance.
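For reference, the LoRA mechanism can be sketched as a generic
low-rank adapter (our own PyTorch illustration, not CellFM's MindSpore
implementation; the rank and scaling values are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W augmented with a trainable low-rank
    update B @ A: fine-tuning touches only r * (d_in + d_out) parameters
    instead of d_in * d_out."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze the pre-trained layer
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

lora = LoRALinear(nn.Linear(1536, 1536), r=8)
trainable = sum(p.numel() for p in lora.parameters() if p.requires_grad)
print(trainable)  # 24576 trainable vs ~2.36M frozen parameters
```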
Since Long Non-Coding RNAs (lncRNAs) were included in CellFM’s training
data, we evaluated its ability to identify cell-type-specific lncRNAs
using attention scores. By analyzing CLS-gene attention, we selected
the top 100 genes per cell type, identifying critical lncRNAs for
classification. Trained on PBMC data, CellFM highlighted HOTAIRM1 as a
top gene for CD14+ Monocytes. This myeloid-specific lncRNA regulates
monocyte differentiation via miR-3960 and HOXA1, with silencing
reducing CD14 and monocyte marker expression. Including HOTAIRM1 in
scRNA-seq annotation improves CD14+ Mono identification accuracy,
demonstrating lncRNAs’ value in cell type classification^[143]37.
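The CLS-gene attention analysis can be sketched as follows (our own
illustration; the attention matrix, cell type labels, and gene list
are hypothetical stand-ins for quantities extracted from CellFM):

```python
import numpy as np

def top_genes_per_cell_type(cls_attn, cell_types, gene_names, k=100):
    """cls_attn: (n_cells, n_genes) attention from the CLS token to each
    gene. Average within each cell type and keep the k highest-scoring
    genes, as used to surface lncRNAs such as HOTAIRM1."""
    top = {}
    for ct in np.unique(cell_types):
        mean_attn = cls_attn[cell_types == ct].mean(axis=0)
        idx = np.argsort(mean_attn)[::-1][:k]
        top[ct] = [gene_names[i] for i in idx]
    return top

rng = np.random.default_rng(2)
attn = rng.random((300, 2000))                     # toy CLS-gene attention
types = rng.choice(["CD14+ Mono", "T cell"], 300)  # toy cell type labels
genes = [f"gene{i}" for i in range(2000)]
print(top_genes_per_cell_type(attn, types, genes, k=5)["CD14+ Mono"])
```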
To evaluate whether alternative normalization methods might further
improve CellFM’s performance, we tested scTransform^[144]38, which
corrects the variance-mean bias. Due to computational constraints, we
performed these experiments using a smaller CellFM model with 80
million parameters instead of the original 800 million. As shown in
Supplementary Fig. [145]S21(a), the performance with scTransform was
slightly lower than with log1p normalization used in CellFM. These
results suggested that while scTransform addresses variance-mean bias
explicitly, it does not yield substantial improvements in CellFM’s
performance for the cell-type annotation tasks evaluated.
To evaluate the efficiency of the modified ERetNet model used in
CellFM, we assessed two key modifications to RetNet: replacing
the traditional feedforward network with a gated bilinear network, and
substituting the pre-layer LayerNorm with the DeepNorm layer
normalization technique. We also benchmarked it against the classic
Transformer model. All ablation experiments were conducted using a
newly trained CellFM model with 80 million parameters, with a focus on
the cell type annotation task, due to limitations in time and
computational resources. As shown in Supplementary Fig. [146]S21(b),
removing the Simple Gated Linear Unit and DeepNorm resulted in
decreases of 0.8% and 0.9%, respectively. Additionally, the removal of
the L_cls loss led to a slight drop (0.4%) in performance. When
compared to the classic Transformer, CellFM demonstrated a 1.2%
improvement. In conclusion, these modifications collectively
contributed to the robust performance of CellFM, as evidenced by the
benchmarking results. Additionally, the implementation of Gated
Multi-head Attention (MHA) in CellFM improved the computational
complexity from $O(l_{\max}^2 d)$ to $O(l_{\max} d^2/h)$, where $d$
was set to 1536 and the number of attention heads ($h$) was set to 48.
Consequently, the actual computational complexity of CellFM is
$O(2048 \times 1536^2/48)$, which is smaller than
$O(2048^2 \times 1536)$. The formula derivation can be found in
Supplementary Note [147]1.
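As a back-of-the-envelope check of these figures (our own arithmetic,
not taken from the supplementary derivation):

```latex
\[
\frac{O(l_{\max}^{2} d)}{O(l_{\max} d^{2}/h)}
  = \frac{l_{\max}\, h}{d}
  = \frac{2048 \times 48}{1536} = 64,
\]
% i.e. 2048^2 * 1536 ~ 6.4e9 versus 2048 * 1536^2 / 48 ~ 1.0e8
% operations per layer, a roughly 64-fold reduction at these settings.
```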
To further evaluate the performance of CellFM in integrating datasets
with batch effects, we conducted a comparison against three
single-cell foundation models, scELMo, scGPT, and UCE, whose original
studies reported batch correction capabilities. The deep learning
framework used for CellFM,
MindSpore, does not support the Gradient Reversal Layer (GRL)
technique. Similarly, other single-cell foundation models with batch
effect correction functionalities also lack GRL implementation. To
ensure a fair comparison, we re-evaluated scGPT by removing the GRL
loss while retaining its other loss functions. We evaluated CellFM
across multiple datasets, including PBMC 10k, the human brain cell
atlas, and two versions of Tabula Sapiens. As demonstrated in
Supplementary Fig. [148]S19, CellFM achieved the highest average AvgBio
scores on these datasets, outperforming the second-best method, UCE, by
2.1%.
Deciphering gene relationships with CellFM
The intricate interplay among target genes within a Gene Regulatory
Network (GRN) is pivotal for orchestrating key biological processes.
Here, we examined the ability of CellFM to encode these gene
relationships through its gene embeddings and attention maps. To
evaluate the gene relationships efficiently captured by the pre-trained
CellFM, we fine-tuned CellFM using 32,484 immune cells from the
Immune dataset and about 200,000 non-immune cells from the human brain
dataset. As
shown in Fig. [149]5a and Supplementary Fig. [150]S22, we present three
gene relationship graphs: Fig. [151]5a shows the pre-trained CellFM,
Supplementary Fig. S22b displays CellFM fine-tuned on immune cells,
and Supplementary Fig. S22c illustrates CellFM trained on non-immune
cells. The results show
that the relationships among genes IL-2, IL-3, and IL-4, observed in
the pre-trained CellFM, were preserved when CellFM was fine-tuned on
immune cells. However, these relationships were absent when CellFM was
trained solely on non-immune cells. Previous studies have shown that
IL-2, IL-3, and IL-4 are involved in the JAK/STAT pathway. IL-2,
secreted primarily by Th1 cells, promotes immune activation, while
IL-4, secreted by Th2 cells, facilitates anti-inflammatory signaling.
Together, they regulate immune responses by mediating cell
proliferation, differentiation, and immune homeostasis. Additionally,
IL-3 stimulates the STAT5 pathway, which regulates cell proliferation,
differentiation, and anti-apoptotic signaling^[152]39,[153]40. In
summary, CellFM effectively preserved biologically relevant immune gene
relationships. On the other hand, we observed that the gene
relationship structures between IL1RAP and IL1R1 were only present in
the CellFM trained on immune cells. Previous studies have shown that
IL1RAP and IL1R1 form a functional receptor complex that mediates IL-1β
signaling. This interaction plays a critical role in neutrophilic
inflammation, exacerbating airway inflammation and contributing to
worsened pulmonary obstruction.
Fig. 5. Gene-Gene Relationships Unveiled by CellFM.
a A gene cluster comprising IL2, IL3, and IL4 was identified through
the cosine similarity of gene embeddings generated by the pre-trained
CellFM (zero-shot). b This analysis compares the number of enriched
pathways derived from gene programs extracted by CellFM (zero-shot) and
a coexpression network within an immune-related human dataset, across
various Leiden clustering resolutions. c This workflow outlines the
process of identifying the most influenced genes through attention
maps, where attention scores from perturbed cell states are
sequentially ranked to select the most impacted genes. d The heatmap
displays the connectivity changes in the network of the top 20 genes
most influenced by the gene SPI1 in a fine-tuning setting. The color
gradient represents the correlation strength between SPI1 and its
perturbed genes, ranging from blue (weakest) to purple (strongest
positive). e The network graph represents the top 20 genes, with
ChIP-seq predicted targets validated in the ChIP-Atlas database,
highlighted in light blue. f The heatmap displays KEGG (Kyoto
Encyclopedia of Genes and Genomes) pathways enriched for the top 20
SPI1-impacted genes, identified through one-sided hypergeometric tests
with Benjamini-Hochberg correction for multiple comparisons.
To further verify the identified gene programs, we followed scGPT to
perform Leiden clustering on the gene similarity graph constructed by
K-Nearest Neighbors (KNN) and extracted gene programs from gene
clusters that consisted of five or more genes. Subsequently, we
conducted a comprehensive pathway enrichment analysis based on the
gene programs using the Kyoto Encyclopedia of Genes and Genomes
(KEGG). As
illustrated in Fig. [156]5b, we juxtaposed the results yielded by
CellFM with those obtained from co-expression network analysis. CellFM
consistently revealed a significantly higher number of enriched
pathways across all clustering resolutions, except for the resolution
at 10. To further validate the efficiency of the pathways identified by
CellFM, we conducted a comparative analysis of the pathways between
CellFM and the co-expression network at a resolution of 40. As shown in
Supplementary Table [157]S1–S3, both methodologies identified 25 common
pathways. CellFM uniquely identified an additional 59 pathways, 7 of
which were pertinent to immune system processes. Conversely, the
co-expression network uniquely identified 32 pathways, of which only 2
were associated with immune functions. These comprehensive findings
underscore CellFM’s superior capacity to capture subtle and intricate
gene-gene interactions, thereby enabling the elucidation of specific
biological mechanisms within a broader cellular context.
CellFM efficiently identified genes most affected by perturbations
In this section, we analyzed the perturbed genes in the perturbation
experiments and their most significantly affected genes through the
attention map (Fig. [158]5c). Concretely, we modeled the effects of the
perturbing gene by providing the model with the control cell expression
profiles (non-perturbed) and explicitly indicating the gene as the
perturbed gene. The model then predicted how the gene perturbation
would affect the expression of other genes. Using the attention
mechanism, we identified the 20 genes most influenced by perturbing the
gene.
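This attention-based selection can be expressed as in the sketch below
(our own illustration; the attention vector is a hypothetical
stand-in, and the rank normalization approximates the connection-score
strategy described for Fig. 5d further on):

```python
import numpy as np

def most_influenced_genes(attn_row, gene_names, top_k=20):
    """attn_row: attention scores from the perturbed gene to all genes
    in the perturbed cell state. Rank genes by attention score and
    normalize the ranks by the total number of genes."""
    order = np.argsort(attn_row)[::-1]         # descending attention
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(attn_row))
    connection = 1.0 - ranks / len(attn_row)   # 1 = most influenced
    return [(gene_names[i], connection[i]) for i in order[:top_k]]

rng = np.random.default_rng(4)
attn = rng.random(5000)                  # toy attention row for one gene
genes = [f"g{i}" for i in range(5000)]
print(most_influenced_genes(attn, genes, top_k=3))
```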
We have provided nine cases in Supplementary Fig. [159]S23. Across
these nine case genes, of the top 20 genes that CellFM identified as
most influenced by each perturbation gene, an average of 18 were found
in the ChIP-Atlas database^[160]41. The results showed that CellFM can
correctly identify
influenced genes through attention scores. Additionally, we further
conducted the pathway analysis on case perturbation genes JUN and SPI1
to show that CellFM captured distinct pathway-activation patterns
through the genes most influenced by perturbation genes. As illustrated
in Fig. [161]5d, in the Adamson dataset, CellFM identified the top 20
genes most influenced by gene SPI1. Most genes were confirmed to be
associated with SPI1, as validated by the ChIP-Atlas database
(Fig. [162]5e). We noted that SPI1 in the column has connectivity,
while SPI1 in the row does not; this asymmetry likely stems from the
attention mechanism used in CellFM (Supplementary Note [163]3). Most
gene pairs in the heatmap of Fig. [164]5d have values near zero due to
our normalization strategy. We ranked all genes in the perturbed
dataset from Adamson based on their attention scores and recalculated
the connection scores by dividing these ranked scores by the total
number of genes. Additionally, the partial relationships between
CCNB2, CCNA2, TOP2A, MKI67, KPNA2, and CENPF in Fig. [165]5d were
likely not captured because CellFM focused on capturing relationships
with SPI1.
When not designating perturbation targets in CellFM (i.e., during
regular gene recovery tasks), we could observe relations between
these genes, as shown in Supplementary Fig. [166]S24. In addition,
based on gene
embeddings derived from CellFM, the cosine similarity scores of these
genes were significantly higher compared to those of other genes,
indicating that CellFM can learn strong intrinsic relationships among
them.
To further directly demonstrate the relationships between SPI1 and the
SPI1 perturbed genes, we obtained the SPI1 ChIP-seq BigWig files
(SRX2770855) from the public ChIP-Atlas database
([167]https://chip-atlas.org/peak_browser) and visualized them using
the Integrative Genomics Viewer (IGV). To select the top 5 genes with
the strongest SPI1 binding, we identified those with the highest MACS2
scores, indicating the most significant SPI1 binding at their genomic
loci. These genes were prioritized based on peak intensity and the
extent of overlap with key genomic regions. The IGV snapshots allowed
us to visually confirm the presence and strength of SPI1 binding at
these loci, providing direct evidence of SPI1’s regulatory impact on
these genes. As shown in Supplementary Fig. [168]S25, the results
clearly show the binding peaks of SPI1 at the genomic loci of the
target genes, indicating potential regulatory interactions.
To construct a comprehensive gene-transcription factor (TF) interaction
network for the genes influenced by SPI1, we utilized the TRRUST
database^[169]42 ([170]http://www.grnpedia.org/trrust) to identify the
top 300 genes most influenced by SPI1. As shown in Supplementary
Fig. [171]S26, the network diagram illustrates the visualization of
these interactions, where nodes represent genes or transcription
factors and edges denote the interactions between them. Red nodes
represent transcription factors, while blue nodes represent genes.
Notably, genes such as MYC emerged as hub genes, interacting with
multiple key TFs, including CTNNB1, JUNB, and CCNB1^[172]43–[173]45.
This network analysis provides valuable insights into the regulatory
mechanisms underlying SPI1-mediated gene expression, highlighting
potential transcription factor interactions that may drive downstream
cellular processes.
Furthermore, CellFM was able to capture distinct pathway-activation
patterns (Fig. [174]5f) through the genes most influenced by
SPI1^[175]46–[176]48. For pathway enrichment analysis, we included all
genes from the KEGG pathway database as the background gene set to
provide a broader biological context. This approach was intentional to
highlight the pathways enriched by the top 20 genes most influenced by
SPI1 perturbation. The enriched pathways presented in Fig. [177]5f
emphasize those in which SPI1 and its perturbed genes may exert a
coordinated influence on important biological processes. For instance,
the Human T-cell leukemia virus 1 infection (HTLV-1) contained SPI1,
CCNB2, and CCNA2, holding particular significance. First, the SPI1 gene
plays a critical role in leukemia stem cell (LSC) self-renewal^[178]49.
In HTLV-1 infection, SPI1 activation may promote the expansion of
leukemia stem cells, enhancing their self-renewal capacity and
accelerating leukemia progression. Second, the CCNA2 gene primarily
regulates cell cycle transitions. HTLV-1-encoded oncoproteins Tax and
HBZ disrupt CCNA2 expression, causing cell cycle dysregulation and
increasing cell immortalization potential^[179]50. In HAM/TSP
patients, the observed downregulation of CCNA2 may be due to IRF-1
suppression, which could prevent transformation into ATLL. The other
enriched pathway, Acute myeloid leukemia (AML), contained SPI1 and
CCNA2 and also held particular significance. First, SPI1 is crucial
for normal blood cell
differentiation, and its dysregulation in AML leads to uncontrolled
cell growth and blocked differentiation, promoting leukemia
development^[180]46. In AML, CCNA2 is abnormally overexpressed in
certain subtypes, particularly in therapy-related AML (t-AML)^[181]51.
This overexpression, along with other cell cycle regulation genes like
CCNE2 and CDC2, is linked to poor prognosis and is often associated
with chromosomal deletions of 5 and 7, which correlate with lower
survival rates.
We also examined the case of the perturbed gene JUN (Supplementary
Fig. [182]S27) and obtained the ChIP-seq BigWig files (SRX10976151) for
the JUN transcription factor. The ChIP-seq data for JUN were also
obtained from publicly available datasets and visualized using IGV.
Representative snapshots of JUN binding sites across various genomic
regions are provided in the Supplementary Fig. [183]S28. These
visualizations highlight significant ChIP-seq peaks, offering direct
evidence of JUN’s binding to specific genomic loci. Additionally, the
Focal adhesion pathway in Supplementary Fig. [184]S27c contained key
genes such as JUN, CCND3, and CTNNB1, all of which play essential roles
in cell adhesion and proliferation. Concretely, JUN encodes c-Jun, a
component of the AP-1 complex that regulates gene expression and
proliferation. It is closely linked to focal adhesion pathways,
particularly through FAK (Focal Adhesion Kinase), which mediates cell
attachment to the extracellular matrix (ECM) and influences migration
and invasion^[185]52. CCND3, a regulator of the G1/S phase transition,
is connected to FAK signaling, linking cell adhesion to cell cycle
progression^[186]53. Additionally, CTNNB1 (β-catenin) through the
Wnt/β-catenin pathway promotes cell adhesion by interacting with
cadherins and modulating FAK activity. This coordinated regulation of
JUN, CCND3, and CTNNB1 suggests their collaborative role in driving
cancer development and progression through cell adhesion and
proliferation pathways^[187]54.
Discussion
To aid the efficient analysis of single-cell data and harness the
wealth of knowledge contained within single-cell atlas datasets, we
have introduced a state-of-the-art foundation model known as CellFM.
This model was pre-trained on our meticulously curated datasets,
encompassing about 100 million human cells. These datasets enabled
CellFM to scale to 800 million model parameters,
marking an eightfold increase over the parameters present in the
current single-cell models trained on a single species. To augment the
training efficiency of CellFM, we have adopted the ERetNet architecture
as its core. This network represents an advancement over the
traditional RetNet framework, offering enhanced parallel training
capabilities and cost-effective inference. These features collectively
contribute to CellFM’s exceptional performance. Moreover, CellFM
incorporates a Low-Rank Adaptive module designed to minimize parameter
count during fine-tuning, thereby optimizing the model for specific
tasks without compromising its generalizability. Through a series of
comprehensive experiments, CellFM has demonstrated its effectiveness
across a range of single-cell tasks including cell type annotation,
prediction of responses to perturbations, gene network analysis, and
gene function prediction.
To support the model’s training on large-scale datasets, we have chosen
a variant of the RetNet architecture as CellFM’s foundation, diverging
from the Transformers used in other single-cell foundation models. The
RetNet architecture facilitates parallel, recurrent, and chunkwise
processing, which we have refined by integrating the SGLU module,
amplifying training efficiency. Additionally, we have embedded the
Low-Rank Adaptive (LoRA) strategy within CellFM, optimizing its
training on new datasets with similar characteristics. The combination
of this efficient training architecture and the comprehensive datasets
forms the basis for developing the current largest CellFM model,
equipped with 800 million parameters. CellFM is developed using the
MindSpore AI framework from Huawei and is trained on four Huawei
Atlas 800 servers, each equipped with eight Ascend 910 NPUs. Our rigorous
experiments demonstrated the model’s adaptability and potency in
multiple single-cell downstream tasks. In the spirit of research
collaboration, we are committed to sharing our progress by making the
CellFM code and the pre-trained model publicly available. This
initiative aims to provide researchers with a unified framework that
streamlines the adoption of pre-trained models for their distinct
research goals. While large human datasets have also been used to train
multi-species models like UCE and GeneCompass, the number of human
cells in these models did not exceed 50 million. In contrast, our model
was trained on ~100 million human cells. Moreover, our model’s
parameter size is eight times larger than GeneCompass and 1.23 times
larger than UCE. As demonstrated in Figs. [188]2–[189]4, our model
consistently outperformed GeneCompass and UCE in tasks such as cell
type annotation, gene function prediction, and cell perturbation.
Despite the advances in CellFM, several limitations remain to be
explored. First, the attention maps in CellFM are limited in capturing
gene relationships related to static or global biological knowledge. In
the future, we will explore new explainability techniques to overcome
this challenge. Furthermore, the current model is limited by the
absence of multi-species data, which restricts its potential for
broader biological contexts and cross-species comparisons. Finally, the
model’s construction did not leverage existing biological prior
knowledge, which could affect its depth and accuracy in interpreting
biological phenomena.
Methods
Data collection
All training data utilized in this study were sourced from reputable
public databases. Specifically, from April 2021 to August 2023, we
identified datasets leveraging keywords like “single-cell RNA
sequencing,” “single-cell transcriptome,” and “single-cell RNA.” These
keywords were used to search through databases such as NCBI
GEO^[190]23, ENA^[191]24, GSA^[192]25,[193]26, ImmPort^[194]27, and
others. In our selection process, we carefully curated the search
results, retaining only human single-cell datasets relevant to our
study. These datasets came in multiple formats,
including FASTQ data, expression matrices, and Seurat/Scanpy objects.
Our initial step involved transforming the raw FASTQ data into
expression matrices using primary analysis software supplied by the
manufacturers. Following this, all obtained and transformed expression
matrices underwent pre-processing through a standardized workflow
provided by the SynEcoSys® single-cell database from Singleron
Biotechnologies^[195]28. This workflow included several critical steps:
(1) Quality control: cells were filtered using a minimum threshold of
200 detected genes per cell; (2) Gene name standardization: following
the HUGO Gene Nomenclature Committee (HGNC) guidelines, gene aliases in
each dataset were converted to their respective HGNC-approved gene
symbols, guaranteeing the uniqueness and consistency of gene names
across all datasets; (3) Sparse matrix conversion: the expression
matrices for each sample were converted into a unified sparse matrix
format, preparing them for subsequent model training.
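To make the curation workflow concrete, the following is a minimal
Scanpy sketch of steps (1)-(3). It is illustrative only: the file
names, the HGNC alias table, and its column names are hypothetical
placeholders, not part of the SynEcoSys® pipeline described above.

```python
# Minimal sketch of curation steps (1)-(3); file and column names are
# hypothetical placeholders for the actual curation pipeline.
import pandas as pd
import scanpy as sc
from scipy import sparse

adata = sc.read_h5ad("sample.h5ad")        # one expression matrix per sample

# (1) Quality control: keep cells with at least 200 detected genes.
sc.pp.filter_cells(adata, min_genes=200)

# (2) Gene name standardization: map aliases to HGNC-approved symbols.
hgnc = pd.read_csv("hgnc_alias_map.tsv", sep="\t")   # hypothetical mapping file
alias_to_symbol = dict(zip(hgnc["alias"], hgnc["approved_symbol"]))
adata.var_names = [alias_to_symbol.get(g, g) for g in adata.var_names]

# (3) Convert to a unified sparse format for model training.
adata.X = sparse.csr_matrix(adata.X)
adata.write_h5ad("sample_curated.h5ad")
```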
CellFM architecture
The CellFM model comprises three core components: the embedding module,
the ERetNet module, and the LoRA module (Fig. [196]1). The embedding
module maps one-dimensional scalar gene expression values to
high-dimensional embedding features, enabling the representation of
gene expression in a high-dimensional space. CellFM then applies the
ERetNet module to learn the relationships among genes from the gene
expression information. In parallel, the LoRA module makes training
more efficient by reducing the number of trainable parameters when
adjusting model weights with new data.
The embedding module
To efficiently train CellFM, we set an upper limit on the number of
input genes, defined by the threshold l[max] = 2048. For each cell, if
the number of expressed genes exceeds l[max], we randomly select l[max]
genes with high expression values. Conversely, if a cell has fewer
expressed genes than l[max], we pad the gene IDs and set the padded
values to zero, because the model architecture requires a fixed length
l[max] for parallel computing. These padded values do not participate
in the calculations during CellFM’s training and thus do not influence
the model. We then apply a Multilayer Perceptron (MLP) to map the
scalar expression values of genes to the embedding vectors required by
the ERetNet module as follows:
$$
\begin{aligned}
X_1 &= \mathrm{LeakyReLU}(X_0 W_0) \\
X_2 &= X_1 W_1 + \alpha \cdot X_1 \\
X_3 &= \mathrm{Softmax}(X_2)\, W_2
\end{aligned} \tag{1}
$$
where W[0] ∈ R^1×b, W[1] ∈ R^b×b, and W[2] ∈ R^b×d are learnable
parameter matrices. The coefficient α is a learnable residual
coefficient. The hyperparameters b and d are set to b = 256 and
d = 1536, respectively. The term X[0] ∈ R^l[max]×1 represents the
initial input cell pre-processed by the aforementioned workflow.
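For illustration, a minimal PyTorch sketch of the embedding MLP in
Eq. (1) is given below. The released model is implemented in MindSpore,
so the layer names and the axis over which Softmax is applied are our
assumptions.

```python
# Illustrative sketch of Eq. (1); not the MindSpore implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExprEmbedding(nn.Module):
    def __init__(self, b: int = 256, d: int = 1536):
        super().__init__()
        self.w0 = nn.Linear(1, b, bias=False)       # W0 in R^1xb
        self.w1 = nn.Linear(b, b, bias=False)       # W1 in R^bxb
        self.w2 = nn.Linear(b, d, bias=False)       # W2 in R^bxd
        self.alpha = nn.Parameter(torch.ones(1))    # learnable residual coefficient

    def forward(self, x0: torch.Tensor) -> torch.Tensor:
        # x0: (l_max, 1) scalar expression values of one cell
        x1 = F.leaky_relu(self.w0(x0))              # X1 = LeakyReLU(X0 W0)
        x2 = self.w1(x1) + self.alpha * x1          # X2 = X1 W1 + alpha * X1
        return self.w2(torch.softmax(x2, dim=-1))   # X3 = Softmax(X2) W2
```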
As in previous LLMs, we randomly mask the expressions of 20% of the
genes (denoted as M) and then recover them from the non-masked genes.
The 20% of genes masked during the pre-training task were selected
exclusively from non-padded genes, ensuring that CellFM focuses on
reconstructing meaningful gene expression patterns using the remaining
relevant gene information. This design allows the model to learn
effective compression and meaningful representations without
interference from padding values. Specifically, for the M masked gene
expressions, we replace the gene expressions of the cell with a
learnable weight vector X[M] ∈ R^1×d initialized to zero. Consequently,
the feature X[tmp] is obtained as follows:
$$X_{\mathrm{tmp}} = \vec{M} \odot X_3 + (1 - \vec{M}) \odot X_M \tag{2}$$
where ⊙ is the element-wise product, and M⃗ ∈ {0,1}^l[max] is the mask
vector, in which a value of 0 indicates the position of a masked gene.
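A minimal sketch of this masking step (Eq. (2)) is shown below,
assuming PyTorch tensors; sampling masked positions only from
non-padded genes follows the description above, while the tensor shapes
are our assumptions.

```python
# Sketch of Eq. (2): mask 20% of the non-padded genes and substitute the
# shared learnable vector X_M (initialized to zero).
import torch

def mask_expressions(x3, pad_mask, x_mask, mask_rate=0.2):
    """x3: (l_max, d) embedded expressions; pad_mask: (l_max,) bool, True for
    real (non-padded) genes; x_mask: (1, d) learnable replacement vector."""
    real_idx = torch.nonzero(pad_mask).squeeze(-1)
    n_mask = int(mask_rate * real_idx.numel())
    picked = real_idx[torch.randperm(real_idx.numel())[:n_mask]]
    m = torch.ones(x3.shape[0], 1)               # M: 1 = visible, 0 = masked
    m[picked] = 0.0
    x_tmp = m * x3 + (1.0 - m) * x_mask          # Eq. (2)
    return x_tmp, m
```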
To learn the specific characteristics of each gene, we initialize a
learnable embedding matrix E^G ∈ R^24079×d, where 24079 is the number
of gene IDs and d = 1536 is the dimension of the embedding vectors
initialized for each unique gene ID. We then integrate the gene
expression and gene ID embeddings as follows:
$$X_{\mathrm{emb}} = \left[E^{G}_{g_1}, \ldots, E^{G}_{g_{l_{\max}}}\right] + X_{\mathrm{tmp}} \tag{3}$$
Furthermore, we incorporate an additional learnable weight
X[cls] ∈ R^1×d, which is concatenated with the gene expression
embeddings. This weight facilitates the learning of cell-level features
by aggregating gene information in the following manner:

$$X_{\mathrm{expr}} = X_{\mathrm{cls}} \parallel X_{\mathrm{emb}} \tag{4}$$
where the ∥ symbol denotes the concatenation of two vectors.
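Eqs. (3)-(4) can be sketched as follows; the embedding-table size and
the zero initialization of the cls token come from the text, while the
default initialization of E^G is our assumption.

```python
# Sketch of Eqs. (3)-(4): add gene-ID embeddings, then prepend the cls token.
import torch
import torch.nn as nn

n_genes, d = 24079, 1536
gene_emb = nn.Embedding(n_genes, d)           # E^G, one row per gene ID
x_cls = nn.Parameter(torch.zeros(1, d))       # learnable cell-level token

def build_input(gene_ids, x_tmp):
    """gene_ids: (l_max,) integer IDs; x_tmp: (l_max, d) masked features."""
    x_emb = gene_emb(gene_ids) + x_tmp        # Eq. (3)
    return torch.cat([x_cls, x_emb], dim=0)   # Eq. (4): (l_max + 1, d)
```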
The ERetNet module
CellFM learns gene embeddings and relationships from gene expression
through the ERetNet module, a variant of the RetNet^[197]55. RetNet is
an efficient, high-performance variant of the Transformer architecture.
To better adapt to large-scale single-cell datasets, we modified the
RetNet module in two ways. First, we replaced the traditional
feedforward network in RetNet with a gated bilinear network, which
improves model performance and smooths the training process. Second, we
refined the model’s training stability and performance by substituting
the pre-layer LayerNorm in RetNet with the DeepNorm layer normalization
technique^[198]56. Collectively, these modifications yield the ERetNet
module, which comprises a Gated Multi-head Attention (MHA), a Simple
Gated Linear Unit (SGLU), and layer normalization (LN), all
contributing to a more stable and effective gene expression analysis
model.
Gated multi-head attention (MHA)
The Gated Multi-head Attention (MHA) block, a variant of the Retention
mechanism in RetNet, is used to learn dependencies between genes. To
address the computational inefficiency of the quadratic attention
operations (O(l[max]^2 d)) implemented in RetNet, we adopt the method
proposed by Shen et al.^[199]57. By first computing the keys (K) and
values (V), followed by the queries (Q), this approach achieves a
linear complexity of O(l[max] d^2 / h), significantly reducing
overhead. Furthermore, we scaled Q, K, and V following the
study^[200]57 to ensure stable half-precision training as follows:
$$
\begin{aligned}
Q &= \frac{\mathrm{ReLU}(X W_Q)}{d}, \quad K = \frac{\mathrm{ReLU}(X W_K)}{d}, \quad V = \frac{X W_V}{l_{\mathrm{cell}}} \\
\mathrm{Attention}(X) &= Q \left(K^{T} (\vec{M} \odot V)\right)
\end{aligned} \tag{5}
$$
where l[cell] = l[max] + 1 denotes the number of gene tokens within
each cell (the l[max] gene positions plus the cls token), and M⃗
denotes the mask vector.
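To illustrate why this ordering is linear in sequence length, below is
a single-head sketch of Eq. (5): computing K^T(M⃗ ⊙ V) first yields a
small d[head] × d[head] matrix, so no l[cell] × l[cell] attention map
is ever materialized. Tensor shapes and argument names are our
assumptions.

```python
# Single-head sketch of the linear-complexity attention in Eq. (5).
import torch

def linear_attention(x, w_q, w_k, w_v, m_vec, d, l_cell):
    """x: (l_cell, d_head) one head's input; m_vec: (l_cell, 1) mask vector."""
    q = torch.relu(x @ w_q) / d               # scaled for half-precision stability
    k = torch.relu(x @ w_k) / d
    v = (x @ w_v) / l_cell
    kv = k.transpose(-2, -1) @ (m_vec * v)    # (d_head, d_head), computed first
    return q @ kv                             # cost O(l_cell * d_head^2)
```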
To enhance CellFM’s representational power, we use h = d/d[head]
attention heads in each ERetNet layer, where d[head] = 32 is the head
dimension. Each head consists of three parameter matrices W[Q], W[K],
W[V] ∈ R^d[head]×d[head]. In addition, we add a swish gate^[201]58 to
increase the non-linearity of the ERetNet layers. Formally, given input
X[expr], we define the MHA layer as follows:
$$
\begin{aligned}
\mathrm{head}_i &= \mathrm{Attention}_i(X_{\mathrm{expr}}) \\
Y &= \mathrm{GroupNorm}_h\!\left(\mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\right) \\
\mathrm{MHA}(X_{\mathrm{expr}}) &= \left(\mathrm{Swish}(X_{\mathrm{expr}} W_G) \odot Y\right) W_O
\end{aligned} \tag{6}
$$
where W[G], W[O] ∈ R^d×d are learnable parameters, and
GroupNorm^[202]59 normalizes the output of each head, following the
SubLN proposed in the study^[203]60.
Simple Gated Linear Unit (SGLU)
To improve model performance and smooth the training process, we
replaced the traditional feedforward network in RetNet with a gated
bilinear network. The gated linear unit (GLU) introduces a
multiplicative gating mechanism that explicitly modulates the model’s
memory of each feature dimension, thereby smoothing the training
process and facilitating better integration between channels. Since the
gating mechanism inherently introduces non-linearity, we further
accelerate computation by adopting, following the literature^[204]61,
the SGLU, which is based on the GLU formulation^[205]62 but omits the
Swish activation function:
$$\mathrm{SGLU}(X) = (X W_u \odot X W_v)\, W_o \tag{7}$$
where ⊙ is the element-wise product.
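A PyTorch sketch of Eq. (7) follows; the hidden width is a hypothetical
choice, since the text does not specify the SGLU expansion factor.

```python
# Sketch of Eq. (7): two linear projections combined by an element-wise
# product, with no Swish activation on the gate.
import torch.nn as nn

class SGLU(nn.Module):
    def __init__(self, d: int, hidden: int):
        super().__init__()
        self.wu = nn.Linear(d, hidden, bias=False)   # W_u
        self.wv = nn.Linear(d, hidden, bias=False)   # W_v
        self.wo = nn.Linear(hidden, d, bias=False)   # W_o

    def forward(self, x):
        return self.wo(self.wu(x) * self.wv(x))     # (X W_u ⊙ X W_v) W_o
```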
Layer normalization (LN)
The Transformer architecture in large single-cell models typically uses
post-norm normalization after residual connections to enhance model
depth and convergence. However, this can cause a gradient explosion as
the model size grows. To counteract this, a pre-norm strategy is
applied in RetNet for a stabilized training process, albeit with a
potential performance trade-off. To address this gap, CellFM employs
the new post-norm normalization method DeepNorm^[206]56. DeepNorm
reduces each network block’s contribution to the output, thereby
shrinking the gradient updates required and ensuring the stability of
training.
$$
\begin{aligned}
Y^{(l)} &= \mathrm{LN}\!\left(\mathrm{MHA}(X^{(l)}) + \lambda \cdot X^{(l)}\right) \\
X^{(l+1)} &= \mathrm{LN}\!\left(\mathrm{SGLU}(Y^{(l)}) + \lambda \cdot Y^{(l)}\right)
\end{aligned} \tag{8}
$$
where LN( ⋅ ) is LayerNorm and λ is a hyperparameter.
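Putting the pieces together, one ERetNet layer with DeepNorm-style
residuals (Eq. (8)) can be sketched as below; the sub-module objects
are placeholders for the blocks defined above.

```python
# Sketch of Eq. (8): DeepNorm applies post-normalization with a scaled residual.
def eretnet_layer(x, mha, sglu, ln1, ln2, lam):
    """mha, sglu: callables for the blocks above; ln1, ln2: LayerNorm; lam: λ."""
    y = ln1(mha(x) + lam * x)        # Y^(l) = LN(MHA(X^(l)) + λ·X^(l))
    return ln2(sglu(y) + lam * y)    # X^(l+1) = LN(SGLU(Y^(l)) + λ·Y^(l))
```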
Low-rank adaptation (LoRA) module
Large models typically comprise hundreds of millions of parameters,
making full model training time-consuming. To alleviate the burden of
training on various datasets, we employ the Low-Rank Adaptation (LoRA)
algorithm^[207]63. LoRA operates under the assumption that the updates
to pre-trained weights during fine-tuning have low intrinsic rank.
Hence, for a pre-trained weight matrix W[0] ∈ R^n×k, we use a low-rank
decomposition to constrain the weight increment ΔW:
$$W_0 + \Delta W = W_0 + BA \tag{9}$$
where B ∈ R^n×r, A ∈ R^r×k, and the rank r ≪ min(n, k).
During the forward computation, both W[0] and the matrices A and B are
used in calculations with the input X; during backward propagation,
however, W[0] is frozen and does not receive gradient updates, while
only A and B are updated. In regular training, the trainable parameter
count for W[0] is n × k, whereas with LoRA-based training it is the sum
of the parameters of A and B, which is (n + k) × r. Since the rank r is
significantly smaller than n and k, the number of trainable weight
parameters is greatly reduced, leading to a substantial decrease in
computational overhead.
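A sketch of a LoRA-adapted linear layer illustrating Eq. (9) and the
(n + k) × r parameter count is given below; the rank r = 8 and the
initialization scale are hypothetical choices.

```python
# Sketch of Eq. (9): W0 is frozen, only the low-rank factors A and B train.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, n: int, k: int, r: int = 8):      # r << min(n, k)
        super().__init__()
        self.w0 = nn.Linear(n, k, bias=False)
        self.w0.weight.requires_grad = False             # freeze pre-trained W0
        self.b = nn.Parameter(torch.zeros(n, r))         # B in R^nxr
        self.a = nn.Parameter(torch.randn(r, k) * 0.01)  # A in R^rxk

    def forward(self, x):
        return self.w0(x) + x @ (self.b @ self.a)        # (W0 + BA) x
```

For a d × d weight with d = 1536 and r = 8, this shrinks the trainable
weights per matrix from 1536 × 1536 ≈ 2.36M to (1536 + 1536) × 8 = 24,576.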
In the ERetNet architecture, the MHA has five weight matrices (W[Q],
W[K], W[V], W[G], and W[O]) and the SGLU gating unit has three (W[u],
W[v], and W[o]); we consider the dimensions of these 8 matrices to be
d × d. In this experiment, we apply LoRA only to the ERetNet encoder
and freeze all model parameters except the weights of the LayerNorm
layers.
Loss functions
Mean squared error (MSE)
In CellFM, we focus on minimizing the Mean Squared Error (MSE) as the
primary metric because it effectively measures the discrepancy between
the predicted and actual gene vector embeddings for masked genes. MSE
is particularly suitable in this context as it penalizes larger errors
more heavily, making it crucial for accurately recovering gene
representations. Additionally, MSE has been widely adopted in similar
tasks involving gene expression prediction and representation learning.
For example, scFoundation^[208]18 and GeneCompass^[209]19 employ MSE to
optimize gene expression prediction in high-dimensional spaces,
demonstrating its effectiveness in promoting precise modeling.
Specifically, we employ a fully connected MLP after the ERetNet module
to estimate the expression values of the M masked genes. This objective
is optimized with the MSE loss at the masked positions, denoted as
M[mask]:
$$
\begin{aligned}
\hat{y}_i &= \mathrm{MLP}\!\left(x_i^{(L)}\right) \\
L_{\mathrm{MSE}} &= \frac{1}{\lvert M_{\mathrm{mask}}\rvert} \sum_{\forall i,\, M_{(\mathrm{mask},i)}=1} \left(\hat{y}_i - y_i\right)^2
\end{aligned} \tag{10}
$$
where ∣M[mask]∣ denotes the count of ones in the mask gene vector
M[mask] for each cell, and x[i]^(L) denotes the feature derived from
the ERetNet for gene i at layer L.
To further enhance CellFM’s learning capability and channel
aggregation, the feature corresponding to the cls token,
X[cls]^(L) ∈ R^1×d, is also leveraged. This feature passes through an
additional network designed to predict expression values. Specifically,
a learnable parameter matrix W[cls] ∈ R^d×d and an activation function
are employed. After mapping X[cls]^(L), the result is multiplied with
the embeddings of the genes to be predicted in the vocabulary E^G to
compute inner products, yielding another set of predicted values. These
are then compared with the actual values to calculate a second mean
squared error loss:
$$
\begin{aligned}
x &= \sigma\!\left(x_{\mathrm{cls}}^{(L)} W_{\mathrm{cls}}\right) \\
\bar{y}_i &= x \mathbin{@} E_i^{G} \\
L_{\mathrm{cls}} &= \frac{1}{\lvert M_{\mathrm{mask}}\rvert} \sum_{\forall i,\, M_{(\mathrm{mask},i)}=1} \left(\bar{y}_i - y_i\right)^2
\end{aligned} \tag{11}
$$
where the @ symbol represents matrix multiplication, σ is the Sigmoid
activation function, and E[i]^G denotes the embedding of gene i in the
vocabulary E^G. Finally, the total loss function of CellFM is obtained
as follows:

$$L_{\mathrm{total}} = L_{\mathrm{MSE}} + L_{\mathrm{cls}} \tag{12}$$
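Combining Eqs. (10)-(12), the training objective can be sketched as a
masked mean squared error over the two prediction heads; the tensor
layout is our assumption.

```python
# Sketch of Eqs. (10)-(12): masked MSE over the MLP head and the cls head.
import torch

def cellfm_loss(y_hat, y_bar, y_true, mask):
    """y_hat: MLP-head predictions; y_bar: cls-head predictions;
    mask: 1.0 at masked positions, 0.0 elsewhere (all shaped (l_max,))."""
    n = mask.sum().clamp(min=1)
    l_mse = (mask * (y_hat - y_true) ** 2).sum() / n   # Eq. (10)
    l_cls = (mask * (y_bar - y_true) ** 2).sum() / n   # Eq. (11)
    return l_mse + l_cls                               # Eq. (12)
```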
Baseline single-cell foundation models
We have incorporated several recent single-cell foundation models into
our benchmarking, including scFoundation, GeneCompass, UCE, scELMo,
scBERT, scGPT, and Geneformer. These models represent different
approaches to single-cell analysis, and we have categorized them based
on their methodological focus. Specifically, models including CellFM,
scFoundation, scELMo, and GeneCompass fall under the value projection
category, while UCE, scGPT, and scBERT belong to the value
categorization category, and Geneformer is placed in the ordering
category.
Implementation details
CellFM consists of 40 stacked ERetNet blocks, with each block having 48
attention heads. During pre-training, we used ~100 million cells for
training CellFM. The model was optimized using the Adam optimizer with
a starting learning rate of 1e-7 and trained for a total of 2 epochs.
The total batch size was 128, distributed equally across 4 Huawei
Atlas 800 servers, each equipped with 8 Ascend 910 NPUs. The MindSpore
AI development framework powered the automatic data-parallel training.
The decision to train the CellFM model for two epochs was informed by
standard practices in large-scale model training^[210]9, where rapid
convergence is typically observed within the initial epochs. To
validate this convergence behavior, we conducted an experiment using
the 80-million-parameter version of CellFM on all training datasets.
The results confirmed the same pattern: the loss dropped sharply, from
8 to below 1, during the first epoch, with only minimal changes in the
second epoch (Supplementary Fig. [211]S29). This behavior reflects
the typical convergence dynamics of large-scale models and supports our
choice to limit training to two epochs to balance efficiency and
performance.
Pre-processing
The gene symbols across all raw count gene expression matrices were
standardized using the reference mapping provided by the HUGO Gene
Nomenclature Committee. This process included both human protein-coding
genes and common mitochondrial genes, resulting in a comprehensive gene
set G consisting of 24,078 genes. Finally, normalization and a log1p
transformation were applied across all gene expression matrices to
alleviate skewness in the data. For the cell type annotation task, we
excluded cell types with fewer than 10 cells in the training data, as
well as those present in the query data but absent in the reference
data.
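The normalization described above corresponds to a standard Scanpy call
sequence, sketched here with a placeholder input file:

```python
# Sketch of the final normalization step; the file name is a placeholder.
import scanpy as sc

adata = sc.read_h5ad("curated_counts.h5ad")
sc.pp.normalize_total(adata)   # per-cell depth normalization
sc.pp.log1p(adata)             # log1p transform to alleviate skewness
```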
Reporting summary
Further information on research design is available in the [212]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[213]Supplementary Information^ (8.2MB, pdf)
[214]Reporting Summary^ (86.7KB, pdf)
[215]Transparent Peer Review file^ (14.9MB, pdf)
Acknowledgements