Abstract

   Single-cell sequencing is a crucial tool for dissecting the cellular
   intricacies of complex diseases. Its prohibitive cost, however, hampers
   its application in expansive biomedical studies. Traditional cellular
   deconvolution approaches can infer cell type proportions from more
   affordable bulk sequencing data, yet they fall short in providing the
   detailed resolution required for single-cell-level analyses. To
   overcome this challenge, we introduce “scSemiProfiler”, an innovative
   computational framework that marries deep generative models with active
   learning strategies. This method adeptly infers single-cell profiles
   across large cohorts by fusing bulk sequencing data with targeted
   single-cell sequencing from a few rigorously chosen representatives.
   Extensive validation across heterogeneous datasets verifies the
   precision of our semi-profiling approach, aligning closely with true
   single-cell profiling data and empowering refined cellular analyses.
   Originally developed for extensive disease cohorts, “scSemiProfiler” is
   adaptable for broad applications. It provides a scalable,
   cost-effective solution for single-cell profiling, facilitating
   in-depth cellular investigation in various biological domains.

   Subject terms: Computational models, Machine learning, Software, Data
   acquisition, Biotechnology
     __________________________________________________________________

   Single-cell sequencing is vital for studying complex diseases but is
   costly. Here, authors introduce scSemiProfiler, a deep generative
   learning framework that infers single-cell profiles by combining bulk
   sequencing with single-cell data from selected samples, offering a
   cost-effective solution.

Introduction

   The advent of single-cell sequencing has dramatically reshaped the
   landscape of biological research, providing an unparalleled view into
   the cellular complexities of organisms^[28]1,[29]2. This technique has
   unearthed the subtle distinctions among individual cells, enabling a
   richer understanding of cellular dynamics^[30]3,[31]4. High-resolution
   data from such analyses are essential for delineating and
   characterizing the myriad of cellular subpopulations within patient
   samples, leading to transformative developments in biomarker discovery
   and personalized therapy strategies^[32]5–[33]7. Cohort studies, which
   offer longitudinal insights by observing specific groups over time, are
   particularly poised to benefit from these advances^[34]8–[35]10.
   However, the substantial financial cost associated with single-cell
   sequencing—such as the estimated $6000 required to sequence 20,000
   cells as of 2023 ([36]costpercell)—can often be a limiting factor for
   extensive research endeavors.

   A range of computational strategies, particularly deconvolution
   methods, are available for dissecting bulk data^[37]11 into distinct
   cell populations. These approaches enable a harmonious balance between
   affordability and data resolution. Prominent among these are
   deconvolution techniques such as CIBERSORTx^[38]12,[39]13,
   Bisque^[40]14, DWLS^[41]15, MuSiC^[42]16, and NNLS^[43]16, which have
   become particularly popular. Another method, EPIC^[44]17, operates
   similarly but necessitates a pre-defined signature gene matrix. In
   addition to these classical deconvolution approaches, Scaden^[45]18 and
   TAPE^[46]19 present an emerging category of methods that employ more
   sophisticated deep neural networks to estimate cell type proportions
   using single-cell reference data and deconvolute bulk gene expression.
   These methods estimate the proportions of different cell types within
   bulk sequencing samples by utilizing signature profiles from
   single-cell reference datasets. While valuable for enhancing cohort
   study analyses, such conventional bulk decomposition methods exhibit
   limitations in their resolution and accuracy^[47]20,[48]21. Capable of
   dissecting bulk samples into different cell types and ascertaining
   their gene expression patterns, they offer beneficial insights at the
   cell type level. Yet, these methods often fail to deliver high
   deconvolution accuracy. The aforementioned conventional cell
   deconvolution methods often do not achieve high accuracy and perform
   substantially worse particularly when dealing with real bulk data. More
   importantly, conventional methods are constrained to cell type level
   resolution and are incapable of achieving the true single-cell
   resolution required to capture the significant variability within cell
   types. While cell type proportions offer a view of changes in cell
   fractions within bulk samples, and cell type signature matrices provide
   gene expression profiles at the cell type level, neither can delve into
   the heterogeneity within individual cell types. This single-cell level
   granularity is critical for deciphering the complexities of diseases
   and their response to therapies^[49]22–[50]26. The limitations of
   traditional bulk decomposition methods restrict our ability to fully
   explore cellular dynamics and to appreciate the unique contributions of
   each cell to the overall sample profile. True single-cell resolution
   allows for in-depth analyses essential for understanding cellular
   heterogeneity, such as UMAP^[51]27, pathway activation pattern
   analysis, biomarker discovery, gene functional enrichment, as well as
   cell-cell interaction^[52]28 and pseudotime trajectory
   analysis^[53]29–[54]32. These analyses are further enhanced when
   combined with advanced machine learning techniques^[55]33–[56]37, which
   are particularly adept at decoding the subtleties of cellular
   heterogeneity and dynamics and typically necessitate single-cell
   datasets.

   In response to the challenges previously highlighted and with the goal
   of offering a cost-effective approach for extensive single-cell
   sequencing, we introduce the single-cell Semi-profiler
   (scSemiProfiler). This deep generative computational tool is crafted to
   significantly improve the precision and depth of single-cell analysis.
   It stands out as a more economical and scalable option for single-cell
   sequencing, thereby facilitating advanced single-cell analysis with
   greater accessibility. This tool effectively integrates active learning
   techniques^[57]38 with deep generative neural network
   algorithms^[58]39, aiming to provide single-cell resolution data at a
   more affordable price. scSemiProfiler aims to simultaneously achieve
   two fundamental goals in the semi-profiling process. On one hand,
   scSemiProfiler’s active learning module integrates information from the
   deep learning model and bulk data, intelligently selecting the most
   informative samples for actual single-cell sequencing. On the other
   hand, scSemiProfiler’s deep generative model component^[59]40–[60]43
   effectively merges single-cell data from representative samples with
   the bulk sequencing data of the cohort, computationally inferring the
   single-cell data for the remaining non-representative samples. This
   deep neural network approach leads to more detailed “deconvolution” of
   the target bulk data into precise single-cell level measurements.
   Consequently, with only the budget for bulk sequencing and single-cell
   sequencing of representatives, scSemiProfiler outputs single-cell data
   for all samples in the study. To the best of our knowledge,
   scSemiProfiler is the first of its kind designed for such intricate
   single-cell level computational decomposition from bulk sequencing
   data.

   Through comprehensive evaluations across a variety of datasets,
   scSemiProfiler has consistently produced semi-profiled single-cell data
   that closely align with actual single-cell datasets and accurately
   reflects the results of downstream tasks. As a result, scSemiProfiler
   facilitates improved access to single-cell data for large-scale
   studies, including disease cohort investigations and beyond. By making
   large-scale single-cell studies more affordable, scSemiProfiler
   promises to catalyze the application of single-cell technologies in a
   wide array of expansive biomedical research. This advancement is set to
   expand the scope and enhance the depth of biological research globally.

Results

Method overview

   The scSemiProfiler approach represents a deep neural network framework
   for decomposing broad-scope bulk sequencing data into detailed
   single-cell cohorts. It achieves this by conducting single-cell
   profiling on only a select few representative samples and then
   computationally inferring single-cell data for the remainder. This
   approach substantially lowers the costs associated with large-scale
   single-cell studies. As depicted in Fig. [61]1, our method is designed
   to deliver cost-effective, semi-profiled single-cell sequencing data,
   enabling deep exploration of cellular dynamics in large cohorts. In
   this context, “semi-profiling” refers to the generation of single-cell
   data for an entire cohort, achieved either through direct single-cell
   sequencing of selected representative samples or via in silico
   inference using a deep generative model. This in silico inference
   process combines actual single-cell sequencing data from a
   representative sample with bulk sequencing data, encompassing both the
   target and the representative sample. Thus, a “semi-profiled cohort”
   includes real single-cell data for representative samples and inferred
   data for the non-representative ones. This deep generative approach
   facilitates a thorough examination of individual cellular profiles
   within a larger dataset, seamlessly linking the extensive scope of bulk
   sequencing with the granularity of single-cell analysis.

Fig. 1. Overview of the scSemiProfiler method.

   [62]Fig. 1
   [63]Open in a new tab

   a Initial Configuration: Bulk sequencing is first conducted on the
   entire cohort, followed by clustering analysis of this data. This
   analysis identifies representative samples, typically those closest to
   the cluster centroids. b Representative Profiling: The identified
   representatives are then subjected to single-cell sequencing. The data
   obtained from this sequencing is further processed to determine gene
   set scores and feature importance weights, enriching the subsequent
   analysis steps. c Deep Generative Inference: Utilizing a VAE-GAN-based
   model, the process integrates comprehensive bulk data from the cohort
   with the single-cell data derived from the representatives, and infers
   the single-cell data for the non-representative (target) samples.
   During the model’s 3-stage training, the generator aims to optimize
   losses
   [MATH:
   <msub><mrow><mi>L</mi></mrow><mrow><msub><mrow><mi>G</mi></mrow><mrow><
   mi>P</mi><mi>r</mi><mi>e</mi><mi>t</mi><mi>r</mi><mi>a</mi><mi>i</mi><m
   i>n</mi><mn>1</mn></mrow></msub></mrow></msub> :MATH]
   ,
   [MATH:
   <msub><mrow><mi>L</mi></mrow><mrow><msub><mrow><mi>G</mi></mrow><mrow><
   mi>P</mi><mi>r</mi><mi>e</mi><mi>t</mi><mi>r</mi><mi>a</mi><mi>i</mi><m
   i>n</mi><mn>2</mn></mrow></msub></mrow></msub> :MATH]
   , and L[inference], respectively, whereas the discriminator focuses on
   minimizing the discriminator loss L[D]. In L[D], G and D are the
   generator and discriminator respectively. The term D((x[i], s[i]))
   represents the discriminator’s predicted probability that a given input
   cell is real, under the condition that it is indeed a real cell.
   Conversely, D(G((x[i], s[i]))) denotes the discriminator’s predicted
   probability that the input cell is real, when in fact it is a cell
   reconstructed by the generator. d Representative Selection Decision:
   Decisions on further representative selection are made, taking into
   account budget constraints and the effectiveness of the current
   representatives. An active learning algorithm, which draws on insights
   from the bulk data and the generative models, is employed to pinpoint
   additional optimal representatives. These newly selected
   representatives then undergo further single-cell sequencing (b) and
   serve as new reference points for the ongoing in silico inference
   process (c). This active learning step is optional if the user prefers
   the all-in-one “global mode”. e Comprehensive Downstream Analyses: The
   final panel shows extensive downstream analyses enabled by the
   semi-profiled single-cell data. This is pivotal in demonstrating the
   model’s capacity to provide deep and wide-ranging insights, showcasing
   the full potential and applicability of the semi-profiled single-cell
   data.

   Initially, the semi-profiling pipeline commences with bulk sequencing
   of each cohort member (Fig. [64]1a), laying the foundational data layer
   for all subsequent analyses. Following this foundational step, the
   methodology employs a clustering analysis to form B (sample batch size)
   sample clusters utilizing the extensive data derived from the initial
   bulk sequencing and selecting a “representative” sample for each
   cluster (Fig. [65]1b). Single-cell profiling will be conducted on the
   selected representative samples in preparation for the following steps.
   The core of the scSemiProfiler involves a deep generative learning
   model (Fig. [66]1c). This model is engineered to intricately meld
   actual single-cell data profiles with the gathered bulk sequencing
   data, thereby capturing complex biological patterns and nuances.
   Specifically, it uses a VAE-GAN^[67]44 architecture initially
   pretrained on single-cell sequencing data of selected representatives
   for self-reconstruction. Subsequently, the VAE-GAN is further
   pretrained with a representative reconstruction bulk loss, aligning
   pseudobulk estimations from the reconstructed single-cell data with
   real pseudobulk. Finally, the model undergoes fine-tuning with another
   target bulk loss tied to the real bulk sequencing data of the target
   samples, facilitating precise in silico inference of the targets’
   single-cell profiles. Once the in silico single-cell inference is
   finished for all non-representative samples in the cohort, an active
   learning module can be used for selecting the next batch of potentially
   most informative representatives for single-cell sequencing to further
   improve the semi-profiling performance (Fig. [68]1d). When studying a
   smaller dataset, or when more cells per sample are required, a smaller
   batch size, such as 2, may be preferred. However, a batch size of 4 is
   set as default to maximize the usage of a 10x genomics single-cell
   toolkit, which can typically capture up to 20,000 cells (4 samples if
   assuming 5000 cells each). This dynamic, iterative process is
   continuously augmented with newly acquired single-cell data, ensuring
   that the most informative samples are selected for real single-cell
   profiling, leading to more accurate in silico single-cell inference for
   non-representative samples. This iterative process concludes when the
   budgetary constraint is met or when a sufficient number of
   representatives have been chosen to ensure satisfactory semi-profiling
   performance.

   When any of the stop criteria are met, the semi-profiled single-cell
   data can be used for a broad spectrum of downstream single-cell
   analyses (Fig. [69]1e), such as cell feature visualization, biomarker
   and function enrichment analysis, tracking cell type compositions in
   various tissues/conditions, cell-cell interaction analysis, and
   pseudotime analysis. Ultimately, scSemiProfiler offers a holistic
   sequencing perspective, delivering nuanced single-cell insights from
   bulk sequencing data. The method exhibits acceptable performance in
   terms of runtime and memory usage (refer to Supplementary Fig. [70]S1,
   making it suitable for scaling to large-scale datasets.

   While active learning allows for selecting a minimal number of
   representatives to conserve budget, the necessity for multiple
   sequencing rounds can sometimes challenge many laboratories and
   introduce batch effects. To counter this, scSemiProfiler features a
   “global mode” that enables all-in-one selection of representatives
   based on initial bulk data analysis. Furthermore, this “global mode”
   can be used in conjunction with active learning to enhance the
   selection process.

The semi-profiled COVID-19 single-cell cohort exhibits significant similarity
to its real counterpart

   To test the performance of scSemiProfiler in generating semi-profiled
   single-cell data that resonates with the granularity and details of
   actual single-cell sequencing, we utilized a COVID-19 cohort
   single-cell sequencing dataset^[71]45. After quality controls (please
   refer to the section titled “Representative single-cell profiling and
   processing” in our Methods for more details.), this dataset includes
   124 samples, including healthy controls and infected patients of
   different severity levels: asymptomatic, mild, moderate, severe, or
   critical. Here, we produced pseudobulk data by taking the average of
   the normalized count single-cell data. We then tested scSemiProfiler’s
   ability to regenerate the single-cell cohort from the pseudobulk data
   and real-profiled single-cell representatives using semi-profiling, as
   the actual bulk data for those samples in the COVID-19 cohort is
   absent. In generating our semi-profiled dataset, 28 representatives
   were selected in batches of 4 using our active learning algorithm for
   real single-cell profiling. Deep generative models then inferred the
   single-cell profiles for the remaining samples based on the
   representatives’ real-profiled single-cell data and the bulk data of
   all samples. The estimated total cost of these bulk and single-cell
   sequencing is $62,640. This is only 33.7% of the estimated price,
   $186,000, for actually conducting single-cell sequencing for the entire
   cohort. Additionally, our approach offers the advantage of generating
   extra bulk data for the cohort, a benefit not provided by single-cell
   sequencing of the entire cohort. The price of bulk sequencing is
   estimated based on the cost at McGill Genome Centre in the year 2023.
   This estimation assumes the use of one NovaSeq 6000 S2 system (capable
   of sequencing up to 4 billion reads per run) at an approximate cost of
   $7000, plus an additional $110 for library preparation per bulk sample.
   The cost for single-cell sequencing is based on the tool
   ([72]costpercell). This tool provides a cost estimate for capturing 0.8
   billion total reads from 20,000 cells across four samples in one 10x
   lane, equating to $0.3 per cell. Consequently, the estimated cost for
   each sample (5000 cells) is $1500. In the following sections, we
   provide justification for scSemiProfiler’s capacity to match the
   analytical results of a fully profiled single-cell cohort. We compare
   the outcomes from the semi-profiled cohort, which includes real
   single-cell data for representative samples and inferred single-cell
   data for target samples, to those from the real-profiled cohort, the
   ground truth with single-cell data for all samples.

   Using UMAP^[73]27 visualizations, we show a significant alignment
   between the semi-profiled and the real-profiled in Fig. [74]2a and b.
   We also annotated the semi-profiled COVID-19 cohort using an unbiased
   approach. We trained a Multi-layer Perceptron (MLP) classifier using
   the annotated representatives’ cells and used it to predict the cell
   types of the rest of the non-representative samples’ cells generated by
   the deep learning model. As shown in Fig. [75]2b, cell clustering
   remains intact in the semi-profiled cohort, from the distinctive
   clusters of B cells, plasmablasts, and platelets to the nuanced
   similarities of the CD14 and CD16 cells. The semi-profiled dataset
   illustrates its high fidelity to the real-profiled version. This
   fidelity extends to capturing even the subtle batch effects, as
   observed in the twin CD4 clusters, further accentuating
   scSemiProfiler’s robust in silico inference capabilities. The UMAP
   plots in Fig. [76]2c weave together directly sequenced samples with the
   semi-profiled ones, showcasing the tool’s finesse. The overlapping data
   points found in both sequencing techniques resonate with the
   transformative nature of scSemiProfiler—it harmonizes accuracy and
   cost-efficiency seamlessly. This comparison validates our tool’s
   ability to approximate a fully profiled dataset. This prompts an
   inquiry: Does this alignment owe its credit to the active learning
   mechanism that identifies the most informative representatives, or does
   it also hail from the deep generative model’s prowess in inferring
   target samples’ cells with finesse? To answer this, Fig. [77]2d uses
   distinct colors to delineate real-profiled cells from representatives
   and generated cells for target samples. It shows that our
   representative selection strategy is able to select representatives
   such that their cells have a relatively good coverage of the overall
   cell distribution. This emphasizes our method’s capacity to extend
   beyond the representatives, enriching the dataset. Meanwhile, the deep
   learning model managed to generate cells to complement the rest and
   make the overall semi-profiled cohort almost identical to the original
   real-profiled cohort. The effectiveness of the deep learning model in
   the semi-profiling process is further demonstrated in Fig. [78]2e and f
   and Supplementary Fig. [79]S2. These figures illustrate the model’s
   capability in reconstructing data from single representative samples
   and its proficiency in inferring data for individual target samples.
   Supplementary Figs. [80]S3–[81]S5 provides further details regarding
   the cell type distribution in each sample/sample cluster and
   scSemiProfiler is capable of generating similar sample clusters in
   terms of UMAP visualization and cell type distribution.

Fig. 2. Overall comparisons of the semi-profiled and real-profiled COVID-19
dataset.

   [82]Fig. 2
   [83]Open in a new tab

   a UMAP visualization of the real-profiled data. Colors correspond to
   cell types and are consistent with (g). b UMAP visualization of the
   semi-profiled data. c UMAP visualization of semi-profiled data and
   real-profiled data together. The color differentiation signifies
   whether cells originate from the semi-profiled or the real-profiled
   dataset. Areas of overlap between the two indicate where the
   semi-profiled data closely resembles the real-profiled data. d UMAP
   visualization of the semi-profiled cohort, displaying different colors
   to distinguish cells produced by a deep generative model (labeled as
   “Generated”) from the representative cells obtained through
   real-profiling (labeled as “Representatives”). e, f An illustrative
   case of in silico single-cell data inference for a target sample (AP1)
   using a representative sample (AP6) from the COVID-19 cohort is
   presented. e UMAP visualization compares the reconstructed single-cell
   data of the representative (in gray) against the real-profiled
   single-cell data of the representative (in blue). f UMAP illustrates
   the cell distribution of the representative sample (blue), the inferred
   target sample cells (yellow), and the real-profiled single-cell data of
   the target sample (red). The inferred cells exhibit a much higher
   resemblance to the real-profiled target sample data than to the
   representative’s data. g Visualization illustrates the relative
   activation patterns of the interferon pathway. The comparison of these
   values between the semi-profiled and real-profiled matrices yields a
   Pearson correlation coefficient of 0.808 and a two-sided p-value of
   5.74 × 10^−22. h Graph depicting the normalized error in semi-profiled
   data with an increasing number of representatives. The terms
   “scSemiProfiler” and “Selection-only” represent our semi-profiling
   method and a method that only selects representatives using an active
   learning algorithm, respectively. It is important to note that actual
   costs may vary based on the sequencing technology and the specific
   number of cells sequenced. i Stacked bar plot illustrating the
   proportions of cell types across various disease conditions. The upper
   portion represents the real-profiled data, while the lower portion
   depicts the semi-profiled data. Pearson correlation coefficients
   comparing cell type proportions between the real-profiled and
   semi-profiled datasets are provided for different conditions: Healthy
   (0.987), Asymptomatic (0.970), Mild (0.996), Moderate (0.992), Severe
   (0.978), and Critical (0.989), indicating a high degree of similarity
   between the two datasets across these conditions. j–l Cell type
   deconvolution benchmarking. Results for all N = 124 samples in this
   COVID-19 dataset are shown. The boxes represent the interquartile
   ranges (IQRs), and the solid lines indicate the means. The whiskers
   extend to points within 1.5 IQRs of the lower and upper quartiles.
   P-values are from one-sided Wilcoxon tests without adjustment. j Figure
   displaying Lin’s Concordance Correlation Coefficient (CCC) between
   actual (ground truth) cell type proportions and those estimated by
   various deconvolution methods. Except for the first two columns, all
   other columns' results are based on 4 representatives' single-cell data
   as reference. k Comparison of Pearson correlation across various
   deconvolution methods. l Comparison of Root Mean Square Error (RMSE)
   across various deconvolution methods. Source data are provided as a
   Source Data file.

   To test the fidelity of single-cell gene expression semi-profiling, we
   tested the interferon (IFN) pathway gene set—crucial for the innate
   immune response against COVID-19^[84]46–[85]49. Through the prism of
   our semi-profiled dataset, Fig. [86]2g reveals IFN activation patterns
   that harmonize with the real-profiled dataset. The uniformity in IFN
   activation patterns across various key cell types and severity levels,
   as highlighted by similar heatmaps, confirms the effectiveness of the
   semi-profiling technique. This uniformity indicates that the critical
   disease-related pathways were effectively captured and maintained in
   the semi-profiled data.

   Further, we explored the quantitative metrics of efficacy and
   cost-effectiveness in semi-profiling, Fig. [87]2h. Our analysis
   centered on understanding the relationship between the number of
   representative samples used for single-cell sequencing and the
   associated semi-profiling error. While an increase in representative
   samples intuitively raises the costs of single-cell sequencing,
   scSemiProfiler effectively leverages the single-cell data of these
   representatives for accurate inference of target samples. As more
   representatives are selected, our semi-profiling method effectively
   reduces the normalized error (see the Methods section for detailed
   error computation). We also compared our method to a selection-only
   method, which uses the same representatives chosen by scSemiProfiler
   using the active learning algorithm. For each target sample, it
   performs the single-cell data inference in a naive manner: merely
   copying the corresponding representative’s data. The dashed line shows
   that our semi-profiling method has a huge lead over the selection-only
   method. The star symbol denotes the number of representatives selected
   for the specific semi-profiled cohort that was utilized for subsequent
   analyses, along with the associated error. The vertical dashed line
   underscores that by using the same representatives, our method achieves
   substantially lower errors compared to the selection-only method.
   Moreover, the horizontal dashed line demonstrates that the
   selection-only method requires a considerably larger pool of
   representatives to attain the same level of semi-profiling accuracy as
   our method. The comparative analysis between scSemiProfiler and the
   selection-only method highlights the deep learning model’s efficacy in
   reducing costs and minimizing errors.

   In further evaluating the effectiveness of semi-profiling, defined as a
   single-cell granularity bulk decomposition, we turned to the
   conventional realm of cell type proportion metrics, as anchored by
   Fig. [88]2i. Existing research^[89]45 elucidates that PBMC cell type
   proportions undergo dynamic shifts with evolving disease conditions.
   True to this, our real-profiled dataset indicates a pronounced
   expansion of B cells and CD14 cells under aggravated conditions—a
   pattern mirrored in the semi-profiled dataset. The Pearson correlation
   coefficient^[90]50 of cell type composition associated with different
   disease conditions between the semi-profiled and the real single-cell
   datasets consistently surpasses 0.9.

   In our comprehensive analysis of cell deconvolution methods, we
   meticulously compared scSemiProfiler with several leading-edge
   techniques, including CIBERSORTx^[91]12,[92]13, Bisque^[93]14,
   Scaden^[94]18, TAPE^[95]19, EPIC^[96]17, NNLS and MuSiC^[97]16, as
   depicted in Fig. [98]2j–l. Each method was tested under identical
   conditions, using the same bulk data and single-cell reference data. A
   key challenge in this analysis was the memory constraints encountered
   by most methods, elaborated in Supplementary Fig. [99]S1b, which
   hindered their ability to process the full set of 28 representative
   single-cell data. To ensure a fair comparison across all methods, we
   limited the single-cell reference to the same initial batch of four
   representative samples. Additionally, we demonstrated the performance
   of our method using a set of 28 representatives that can not be fully
   exploited by other benchmarked tools. Our results unequivocally
   demonstrate the superior deconvolution performance of scSemiProfiler
   over all benchmarked methods. Its effectiveness is not only apparent
   when utilizing a smaller reference set of 4 samples but also becomes
   increasingly pronounced with a larger set of 28 samples. This distinct
   superiority of scSemiProfiler is a testament to its efficiency and
   versatility. Capable of excelling with both compact and extensive
   single-cell reference datasets, scSemiProfiler stands out as the most
   adept tool in cell type deconvolution, surpassing its contemporaries in
   handling diverse data scales with unparalleled precision and
   reliability. In Supplementary Fig. [100]S6, we also show the high
   deconvolution accuracy using a side-by-side comparison between each
   sample’s predicted cell type proportion using 28 representatives with
   the ground truth.

   Different from existing deconvolution methods, since our method is
   capable of “deconvoluting” the bulk samples into single-cell
   resolution, we can perform de novo cell type annotation for the
   generated single-cell data. To further substantiate this unique
   functionality, we applied de novo cell type identification to the
   semi-profiled COVID-19 dataset. We then compared the de novo cell type
   annotation results, derived from the single-cell data profiles
   generated by our model, with the ground truth. Briefly, de novo
   labeling of cell types can be achieved from our semi-profiled
   single-cell data using a biomarker-based strategy similar to those
   employed in other relevant studies^[101]51,[102]52 Specifically, we
   identified the top biomarkers associated with each cell cluster
   identified from the semi-profiled single-cell data. These biomarkers
   were then compared with known cell type markers from databases such as
   CellMarker^[103]53,[104]54 or Panglao DB^[105]55 to annotate the cell
   types de novo. We compared the de novo cell type identification results
   with supervised cell type annotations and demonstrated the
   effectiveness of scSemiprofiler in de novo cell type identification
   (Supplementary Fig. [106]S7).

   Results presented above are based on selecting representatives using
   our active learning algorithm. To examine the effectiveness of
   scSemiProfiler’s global mode and compare it with results obtained under
   active learning for selecting representative samples, we employed the
   global mode to select the same number of representatives (28). We then
   used the deep generative model to infer the single-cell data for the
   remaining samples, thereby creating a semi-profiled dataset. This
   dataset underwent the same analytical procedures as those applied to
   the dataset generated under active learning. The results, which are
   detailed in Supplementary Fig. [107]S8, indicate that although the
   global mode’s performance is slightly inferior to that of the active
   learning mode, it remains quite comparable. However, it is important to
   note that, unlike active learning, the global mode cannot terminate
   before the predefined number of representatives has been profiled—even
   if fewer representatives could achieve similar performance. Therefore,
   this one-round global mode does not offer the same budget-minimizing
   flexibility as the multiple-round active learning mode.

The semi-profiled COVID-19 single-cell cohort proves reliable for single-cell
downstream analyses

   We have previously demonstrated the capability of scSemiProfiler in
   accurately generating semi-profiled single-cell data that closely
   aligns with its real-profiled counterpart. Moving beyond basic cell
   type proportion predictions, which is the primary focus of other
   methods, scSemiProfiler excels in predicting gene expression for each
   cell within a population, thereby more authentically mimicking true
   single-cell data. This advancement is crucial for more complex
   downstream single-cell analysis tasks.

   To illustrate the effectiveness of semi-profiled data in standard
   downstream single-cell analyses, we conducted a series of evaluations.
   A key task in these analyses is the identification of biomarkers within
   distinct cell clusters, highlighting genes with distinct expression
   patterns. Utilizing the semi-profiled data generated by scSemiProfiler,
   we performed various single-cell level downstream analyses. The results
   from these analyses, as depicted in Fig. [108]3, demonstrate a
   significant consistency with outcomes derived from real-profiled data.
   For instance, we identified top cell type signature genes using the
   real-profiled cohort. When comparing their expression patterns in both
   real-profiled and semi-profiled datasets, the similarities were
   striking. The dot plots in Fig. [109]3a display these patterns,
   showcasing an almost indiscernible difference between the datasets. The
   semi-profiling at the single-cell level provides high-resolution
   expression data for marker genes within each cell population (cell
   type). Our approach reveals the distribution of marker gene expression
   across all cells within a specific type. While existing bulk
   deconvolution methods can offer average gene expression data for cell
   type markers, they fall short in depicting detailed gene expression
   distribution and variations among individual cells within the same
   population. This high level of cell type biomarker concordance
   underscores the scSemiProfiler’s robustness not only in replicating
   single-cell data but also in ensuring the fidelity of downstream
   analytical processes.

Fig. 3. Comparative analyses of single-cell level downstream analysis tasks
using real-profiled and semi-profiled COVID-19 datasets.

   [110]Fig. 3
   [111]Open in a new tab

   a Dot plots elucidating the expression proportion and intensity of
   discerned cell type signature genes. The top half showcases the
   real-profiled dataset, while the bottom delineates the semi-profiled
   version. P-values are from two-sided Pearson correlation tests. b RRHO
   plot emphasizing the congruence between the CD4 positive and negative
   markers in both datasets. One-sided hypergeometric test p-values
   without adjustment are presented. c Visualization of the GO term
   enrichment outcomes rooted in CD4 signature genes from both dataset
   versions. The plot accentuates the union of the top 10 enriched terms,
   with the Pearson correlation coefficients of between the bar lengths,
   which is based on the corresponding p-value and therefore represents
   the significant level. Benjamini-Hochberg (BH) adjusted one-sided
   hypergeometric test p-values are plotted. The p-value on the top is
   from two-sided Pearson correlation test. d A juxtaposition of cell-cell
   interaction analyses stemming from real-profiled and semi-profiled
   cells from moderate COVID-19 patients, underscoring the similarity in
   interaction types and counts. A two-sided Pearson correlation p-value
   is presented to show the similarity between the two interaction number
   matrices. e Comparative depiction of pseudotime trajectories for CD4
   cells across both datasets, highlighting their striking similarity in
   reconstructing dynamic cellular processes. The p-value is from a
   two-sided Pearson correlation test. Source data are provided as a
   Source Data file.

   We further explored the similarities between biomarkers identified
   using semi-profiled data and those from real-profiled data. Biomarker
   discovery is possible with lower-resolution data, such as the average
   cell type gene expression data provided by current bulk deconvolution
   tools. However, these methods are less reliable compared to single-cell
   data. The limitation of average cell type gene expression data is its
   lack of replicates for each cell type, leading to reduced statistical
   power. The absence of replicates makes it challenging to estimate
   variance within a cell type, which is essential for standard
   differential expression tests. In contrast, our semi-profiled
   single-cell data supports robust biomarker discovery through rigorous
   statistical testing, as it includes multiple samples (i.e. cells) for
   each cell type. We demonstrate the similarity of biomarkers identified
   using real-profiled and semi-profiled datasets through a rank-rank
   hypergeometric overlap (RRHO) plot. An RRHO plot^[112]56 visualizes the
   overlap between two ranked gene lists, highlighting the degree of
   similarity and the significance of the overlap between them (see the
   Methods section for more details). Leveraging the RRHO plots, we
   compared the top 50 positive and top 50 negative gene lists associated
   with different cell types (Fig. [113]3b for CD4 cells and Supplementary
   Fig. [114]S9 for all other cell types. The plots show positive marker
   and negative marker lists from both datasets are highly similar. A
   marked dissimilarity was evident between the positive and negative
   marker lists, which is intuitively anticipated. By definition, positive
   markers are genes that are higher expressed in the corresponding cell
   types, and negative markers are the opposite—lower expressed.
   Therefore, they should have no overlap. The compelling concordance
   demonstrated in Fig. [115]3a, b bolsters our claim that the
   semi-profiled data from scSemiProfiler can viably supplant
   real-profiled data for the pivotal task of biomarker discovery.

   Next, we used the biomarkers derived from the semi-profiled dataset and
   those from the real-profiled dataset for gene functional enrichment
   analysis, assessing whether the two versions of the analysis yield
   consistent results. Fig. [116]3c compares the Gene Ontology
   (GO)^[117]57,[118]58 enrichment^[119]59,[120]60 outcomes derived from
   real-profiled and semi-profiled datasets. The top 100 signature genes
   from both datasets are used for the enrichment analysis. We observed an
   overlap of 95 genes between the two lists, yielding a highly
   significant hypergeometric test p-value of 9.10 × 10^−196 (the
   population size of the hypergeometric test is the number of highly
   variable genes used for this dataset, 6030). A comparison of the top 10
   overlapping terms from both versions reveals nearly identical
   significance (Pearson correlation coefficient of 0.998 with a p-value
   of 4.13 × 10^−12 for comparing the significant levels). The results for
   other cell types are in Supplementary Fig. [121]S10. Reactome
   pathway^[122]61 enrichment analysis results are in Supplementary
   Fig. [123]S11. This further corroborates the reliability of
   semi-profiled data in downstream analyses.

   Progressing to yet another pivotal single-cell level downstream
   analysis task, we evaluated the congruence in cell-cell interaction
   analyses derived from real-profiled and semi-profiled datasets. Given
   the paramount role of cell-cell interactions in orchestrating a myriad
   of multicellular processes, their analysis often unveils pivotal
   biological insights^[124]62,[125]63. Fig. [126]3d juxtaposes cell-cell
   interaction analyses rooted in real-profiled and semi-profiled cells
   from moderate COVID-19 patients (see results for other severity levels
   in Supplementary Fig. [127]S12). The evident concordance in types and
   counts of interactions in both renditions reinforces the reliability of
   our semi-profiled data (R = 0.996, P < 2.23 × 10^−308). We also show a
   comparison of partition-based graph abstraction (PAGA) plots generated
   using real-profiled cohort and semi-profiled cohort in Supplementary
   Fig. [128]S13, which demonstrates that the semi-profiled data can
   accurately capture the cellular trajectories and relationships between
   cell types. Given that such analyses intrinsically require single-cell
   data, scSemiProfiler emerges as the sole contender capable of producing
   data apt for this task from bulk sources.

   Delving further into the capacity of semi-profiled data for other
   downstream single-cell level analysis tasks, we turned our attention to
   pseudotime analysis^[129]29,[130]30. Pseudotime is a pivotal tool in
   reconstructing dynamic cellular processes, ranging from differentiation
   pathways to developmental timelines or disease trajectories. As
   depicted in Fig. [131]3e, the pseudotime trajectories derived from
   real-profiled and semi-profiled CD4 cells are strikingly similar
   (Consistent results for the pseudotime analysis of other cell types can
   be found in Supplementary Fig. [132]S14). This similarity is supported
   by a high Pearson correlation of 0.809 (see the Methods section for
   details regarding pseudotime analysis and Pearson correlation test).
   Such compelling evidence underscores that the semi-profiled data
   retains its reliability even for intricate biological explorations like
   cell trajectory and differentiation analyses. The same downstream
   analysis has also been performed for the “global selection mode” of
   scSemiProfiler and is presented in Supplementary Fig. [133]S15, which
   turns out to be only slightly worse than selecting representatives
   using active learning.

In silico generated single-cell data offer additional insights beyond
representative samples for understanding the studied cohort

   Given the high similarity between the single-cell analysis results from
   the real-profiled and semi-profiled COVID-19 cohorts, a critical
   question arises: Does this similarity stem more from the accurate in
   silico inference of the deep generative learning model, or is it
   primarily due to effective representative selection through our active
   learning algorithm? To address this question, we divided the
   semi-profiled cohort into two components: the representative cells and
   the inferred cells, and conducted further comparative analyses. Our
   findings are presented from three perspectives: Analysis of the entire
   semi-profiled cohort is more informative than analyzing only the
   representative cells; the high similarity of the semi-profiled cohort
   with the real-profiled cohort, especially considering that most cells
   in the semi-profiled cohort are generated by the deep generative model,
   underscores the effectiveness of the in silico inference; comparisons
   between the analyses based on the real-profiled cohort and those using
   only inferred cells demonstrate the reliability and value of the
   inferred cells.

   At the cohort level: The inclusion of generated cells significantly
   improves the overall data similarity to the real cohort and enhances
   analysis results compared to using only representative cells. The UMAP
   results, as indicated in Supplementary Fig. [134]S16a–c, show that
   relying solely on cells from representatives fails to encompass many
   areas of the original cohort’s UMAP. This leads to the omission of
   certain cell subtypes and a lack of intra-cluster variability. This
   observation is supported by a comparison between the semi-profiling and
   the “Representative-only” methods shown in Supplementary Fig. [135]S16.
   These plots demonstrate that semi-profiling more accurately captures
   the real-profiled cohort, especially in terms of cell type proportions
   and overall gene expression, compared to using only representatives.
   While a sufficient number of representatives can achieve comparable
   results to the real cohort for straightforward tasks such as cell type
   marker identification (Supplementary Figs. [136]S17 and [137]S18), the
   semi-profiled cohort still presents advantages when conducting detailed
   single-cell analyses, such as pseudotime (Pearson correlation: 0.809
   for semi-profiled vs. 0.545 for “representative-only”, Supplementary
   Fig. [138]S18d). It is also crucial to note that the effectiveness of
   representative data is enhanced by our selection strategies, which
   include bulk clustering and active learning.

   At the disease condition level: Using only representative cells often
   fails to cover all disease conditions, which either precludes the study
   of specific conditions or leads to dramatically worse results compared
   to using the semi-profiled cohort. In the case of the COVID-19 cohort,
   we utilized 28 representatives out of 124 total samples for most
   analyses. These 28 representatives do not cover condition
   “Asymptomatic” condition, making it challenging to investigate this
   specific disease condition using single-cell data. For instance, the
   estimation of cell type proportions for the “Asymptomatic” condition
   cannot be achieved using only representatives, as there is no cells
   from this condition were profiled at the single-cell level. shown in
   Supplementary Fig. [139]S16d. Moreover, even for conditions that are
   covered by the representatives, the analysis results are often less
   reliable due to a lack of statistical power or failure to capture the
   internal variety within each condition. For example, Supplementary
   Fig. [140]S16d, g show that the semi-profiled cohort is significantly
   more similar to the real-profiled cohort than the “representative-only”
   version in terms of cell type proportion and pathway activation
   patterns. Notably, when investigating biomarkers for disease
   conditions, the semi-profiled dataset demonstrates a stronger
   similarity to the real-profiled dataset than the “representative-only”
   version, as illustrated in Supplementary Fig. [141]S19. For instance,
   in studying the “Critical” condition, the top 100 markers identified by
   the semi-profiled dataset have 68 overlaps (P = 5.43 × 10^−109) with
   the real-profiled ground truth, whereas the “representative-only”
   dataset only has 30 overlaps(P = 4.31 × 10^−31). This disparity becomes
   even more pronounced when the number of representatives is reduced,
   often due to limited budget constraints in cohort studies. To
   illustrate this scenario with fewer representatives, we examined the
   COVID study using just 12 representatives and compared the findings
   with those from the semi-profiled results (utilizing the same 12
   representatives). The results, presented in Supplementary Fig. [142]S20
   for the 12 representatives version, demonstrate that even with only 12
   representatives, the semi-profiled dataset continues to maintain high
   similarity in terms of disease marker analysis, while the similarity of
   the “representative-only” results declines further.

   At the individual sample level: The reliance on single-cell data from
   representatives alone for studying non-representative samples is not
   feasible without in silico inference, as these samples are not directly
   profiled. This limitation highlights two major concerns: Equity,
   Diversity, and Inclusion (EDI) and the potential for missed scientific
   discoveries. Excluding non-representative samples from analysis
   introduces biases in our understanding of the disease, as these samples
   are not represented, potentially leading to unfair representation and
   oversight of specific groups that may not be well represented by the
   selected representatives. Moreover, excluding non-representatives risks
   missing unique biological traits that representatives may not exhibit.
   However, scSemiProfiler enables the inclusion of these
   non-representative samples in single-cell analyses without additional
   costs. The generated single-cell data for individual samples closely
   resemble their corresponding target samples, as evidenced by UMAP
   visualizations in Supplementary Fig. [143]S2.

   Furthermore, representing non-representative samples directly using
   single-cell data from their corresponding representatives (as described
   in the “selection-only” method in our manuscript) can yield data too
   divergent from the ground truth to be meaningful. For example,
   Supplementary Fig. [144]S21 shows a non-representative sample with a
   “Critical” severity level represented by its corresponding
   representative. This figure illustrates that the inferred single-cell
   data has cell type proportions dramatically more similar to the target
   sample than to the representative sample. In Supplementary
   Fig. [145]S21b, we present 40 target sample biomarkers, differentially
   expressed genes (DEGs) of this sample against a healthy control. The
   left 20 DEGs show gene expression more similar to the target sample
   than the representatives, and the right 20 show where the
   representatives have more similar patterns than our inferred sample.
   The visualizations clearly indicate that the inferred sample is
   significantly more similar to the target than the representative. Even
   among the least similar 20 genes on the right-hand side, more genes
   show similar average expressions in the three samples. The Pearson
   correlations between the inferred sample and the target sample (0.904)
   are significantly higher than those between the representative and the
   target sample (0.568). Importantly, enrichment analysis in
   Supplementary Fig. [146]S21c, d reveals that these DEGs are crucially
   relevant to the immune response, such as “regulation of immune
   response”, “Immune System”, “Innate Immune System”, “Cytokine Signaling
   in Immune System”, and several terms relevant to MHC^[147]64,
   interferon pathway^[148]49,[149]65,[150]66, indicating that using only
   the representative for analysis could likely lead to incorrect
   conclusions about the studied disease, since many of those key
   disease-associated immune terms would be missed. Additionally,
   extensive analyses have been conducted to further validate the
   effectiveness of the in silico generated cells. Supplementary
   Fig. [151]S22a demonstrates that the deep generative learning model
   successfully generates the majority of cells, specifically those
   inferred for non-representative samples, across all datasets examined.
   The generally positive outcomes highlight the accuracy of these
   generated cells; any significant deviation from real-profiled data
   could compromise the overall results of single-cell analyses. Moreover,
   the cell type proportions obtained using only the generated
   non-representative cells closely align with those from the
   real-profiled datasets, as shown in Supplementary Fig. [152]S22b,
   further affirming the reliability of the inferred cells. Our model’s
   detailed generative process is designed to maintain the integrity and
   variety of cell types, addressing concerns that replicating
   representatives might not accurately represent their associated target
   samples. Another comparative analysis between these semi-profiled
   non-representative cells and the original real-profiled cells within
   the COVID-19 dataset serves as a robust demonstration of our model’s
   effectiveness. This comparison focuses on cell type distributions and
   pathway activations, revealing high concordance in cell type
   percentages and the preservation of key functional attributes, such as
   interferon activation pathways crucial for the immune response to
   COVID-19. The similarity in these critical biological aspects is
   illustrated in Supplementary Fig. [153]S23. Comprehensive downstream
   analyses—including biomarker identification, pathway enrichment,
   cell-cell interaction studies, and pseudotime trajectory
   analyses-performed using the inferred cells further validate the
   integrity and biological relevance of our generated data. The results
   of these analyses, detailed in Supplementary Fig. [154]S24, demonstrate
   the model’s ability to produce data that is not only quantitatively
   robust but also qualitatively insightful, thus enhancing the dataset
   with a more representative spectrum of cellular states for advanced
   biological exploration.

Semi-profiling maintains accuracy on a heterogeneous colorectal cancer
dataset

   We further tested the effectiveness of scSemiProfiler by validating it
   against a notably heterogeneous colorectal cancer dataset^[155]67,
   which encompassed 112 single-cell sequencing samples that passed the
   quality control. This collection comprised 19 normal tissues, 86 tumor
   tissues (including colorectal cancer subtype iCMS2 and iCMS3), and 7
   lymph node tissues. Considering the inherent diversity of this dataset,
   achieving accurate semi-profiling could ostensibly test the limits of
   scSemiProfiler. Again, this study does not include paired bulk
   sequencing data. Therefore, we have utilized the pseudobulk data
   derived from single-cell analysis as a surrogate for actual bulk
   sequencing. Nevertheless, following a consistent data processing and
   semi-profiling protocol, and selecting 36 representatives in batches of
   4, the semi-profiled dataset mirrored its real-profiled counterpart.
   This congruence manifested not only in visual similarity but also in
   cell type distributions and subsequent analyses outcomes. Using the
   same estimation method as applied to the COVID-19 cohort, the total
   cost for both bulk and single-cell sequencing to obtain this highly
   similar semi-profiled single-cell cohort is approximately $73,320. This
   price also includes the cost of bulk data for the cohort and represents
   only 43.6% of the $168,000 estimated for conducting single-cell
   sequencing on the entire cohort.

   Figure [156]4a, b graphically highlight the similarity between the
   semi-profiled and real-profiled data using UMAP visualizations. These
   visualizations, color-coded according to cell types, show a substantial
   alignment between the datasets. The semi-profiled data mirrors the
   real-profiled data in terms of the location and shape of each cell type
   cluster. Notably, both datasets effectively segregate cell types such
   as plasma B, enteric glial, Mast, and epithelial. Additionally, a
   nuanced connection between fibroblast and endothelial cells is evident
   in both versions. Immune-centric cells like McDC, T_NK, and B cells are
   also accurately positioned in close proximity in both datasets,
   underscoring the precision of the semi-profiled data. This similarity
   is further emphasized in Fig. [157]4c, which showcases the significant
   overlap between the two datasets. Moreover, Fig. [158]4d employs
   distinct color schemes to differentiate between cells generated by the
   deep generative learning model and those from real-profiled
   representative data. The cells from the representatives cover a
   substantial portion, indicating their well-chosen representative
   selection. However, numerous cells that fall outside the
   representatives’ distribution are accurately generated by the deep
   learning model, highlighting the critical role of both active learning
   and the deep generative model in achieving effective semi-profiling.
   The accuracy of the deep learning model’s generation can be further
   shown in Fig. [159]4e, f where we present the model’s single-cell
   inference for an individual sample. More examples can be found in
   Supplementary Fig. [160]S25. Supplementary Figs. [161]S26–[162]S28
   provide further details regarding the cell type distribution in each of
   the 36 sample clusters and individual samples and how scSemiProfiler
   performs in generating each of these sample clusters.

Fig. 4. Detailed comparisons between the semi-profiled and real-profiled data
in the heterogeneous colorectal cancer dataset.

   [163]Fig. 4
   [164]Open in a new tab

   a UMAP visualization of the real-profiled data, with colors denoting
   distinct cell types. Colors are consistent with (g). b UMAP
   visualization of the semi-profiled data, with colors denoting distinct
   cell types. c Joint UMAP visualization highlighting the close
   resemblance between the semi-profiled and real-profiled data. d UMAP
   plot of the semi-profiled dataset, with color-coding distinguishing
   cells from the actual sequenced representatives and the ones generated
   through semi-profiling. e, f UMAP visualization of the in silico
   inference of the target sample (MUX9009) using representative (EXT095).
   e UMAP showing the similarity of the reconstructed and the original
   representative. f UMAP showing the inferred target sample is more
   similar to the target ground truth than to the representative’s cells.
   g “Activation of immune response” gene set relative activation pattern
   calculated for different tumor tissue types as compared to the “Normal”
   type in the real-profiled and semi-profiled datasets. Entries with
   fewer than 500 cells are left blank. The p-value is from a two-sided
   Pearson correlation test. h, Performance trajectory of the
   scSemiProfiler on the colorectal cancer dataset, showcasing its
   superiority over the selection-only approach, with costs computed
   similarly to Fig. [165]2d. i Stacked bar plots comparing cell type
   compositions between the semi-profiled and real-profiled datasets
   across different tissues. The Pearson correlation coefficients between
   the real-profiled and semi-profiled tissues are LymphNode: 0.995,
   Normal: 0.993, iCMS2: 0.994, iCMS3: 0.988. j–l Cell type deconvolution
   benchmarking. Results for all N = 112 samples in the colorectal cancer
   dataset are shown. The interquartile ranges (IQRs) are depicted by the
   boxes, and the solid lines denote the mean values. Whiskers extend to
   points within 1.5 times the IQR from the first and third quartiles.
   P-values are derived from one-sided Wilcoxon tests without adjustment.
   j CCC between the actual cell type proportions and those estimated by
   various deconvolution methodologies. k The Same comparison using
   Pearson correlation coefficients. l Root Mean Square Error (RMSE)
   comparisons among different deconvolution techniques, highlighting the
   computational efficiency and accuracy of the scSemiProfiler, especially
   with an extended set of representatives. Source data are provided as a
   Source Data file.

   To further justify the semi-profiled gene expression values are
   accurate and can be used for biological analysis, we computed the gene
   set activation pattern of the GO term “activation of immune
   response” (GO:0002253) for the two tumor tissue types the same way as
   we did for the COVID-19 cohort (Fig. [166]4g). We chose this term
   because the immune response plays a significant role in the body’s
   defense against cancer, and its activation or suppression can influence
   cancer progression and patient outcomes^[167]68,[168]69. The gene set
   activation scores are calculated and then adjusted by subtracting the
   score of the “Normal” tissue. The activation patterns in the
   real-profiled and semi-profiled datasets are highly similar, leading to
   a high Pearson correlation coefficient of 0.909 between them.

   We also quantitatively examined the overall performance of
   scSemiProfiler as different numbers of representatives are selected. As
   shown in Fig. [169]4h, while our approach trumps the selection-only
   method, the gap narrows in comparison to results on the COVID-19
   dataset-owing largely to the colorectal cancer dataset’s inherent
   heterogeneity. Despite this, the deep generative model’s efficacy
   remains conspicuous, ensuring cost-effective error reduction. Also, if
   we aim to achieve an error as low as the previous COVID-19 cohort,
   which leads to almost identical analysis results as the real data, only
   half of the samples need to be selected as representatives. This still
   reduces the cost significantly.

   Diving deeper into the cell type proportions within the colorectal
   cancer cohort, one discerns variations across different tissue
   types-“Lymph Node”, “Normal”, “iCMS2”, and “iCMS3”. Fig. [170]4i
   illustrates these differences. For example, “Lymph Node” contains an
   expanded population of B cells compared with other tissue types,
   “Normal” is enriched with PlasmaB cells, and the two tumor subtypes
   have a pronounced epithelial presence. Remarkably, the semi-profiled
   dataset captures these nuances with precision, underlining its
   capability to replicate intricate analyses with fidelity.

   Lastly, the benchmarking of deconvolution results presented in
   Fig. [171]4j–l positions scSemiProfiler at the forefront, significantly
   outperforming most existing methods such as Bisque, TAPE, and Scaden.
   While its performance with four representatives is on par with
   CIBERSORTx, scSemiProfiler excels in computational memory efficiency.
   Further extending the representatives to 36 dramatically boosts the
   deconvolution accuracy. In contrast, existing methods such as CIBERSORT
   falter when tasked with handling a large reference set like 36
   representatives, mainly due to their computational inefficiencies. This
   distinction underscores the scSemiProfiler’s distinct advantage
   unshared by its peers. To provide a more detailed perspective on our
   deconvolution outcomes, Supplementary Fig. [172]S29 showcases a
   comparative analysis. It displays our predicted cell type proportions
   for each individual sample alongside the ground truth, enabling a
   side-by-side evaluation.

scSemiProfiler ensures consistent downstream analyses between semi-profiled
and real single-cell data in heterogeneous colorectal cancer cohorts

   In the context of a heterogeneous dataset like this colorectal cancer
   one, the semi-profiled dataset stands robust, offering downstream
   analysis results that mirror the real-profiled data. This close
   resemblance is consolidated in Fig. [173]5.

Fig. 5. Downstream analysis results comparisons for the colorectal cancer
dataset.

   [174]Fig. 5
   [175]Open in a new tab

   a Dot plots visualizing the cell type signature genes. The p-values are
   from two-sided Pearson correlation tests. b RRHO plot visualizing the
   comparison between semi-profiled and real-profiled markers of
   epithelial cells. One-sided hypergeometric p-values are plotted. c
   Epithelial cell type signature genes GO enrichment analysis results
   comparison. Benjamini-Hochberg corrected one-sided hypergeometric
   p-values are plotted. The p-value on the top is from the two-sided
   Pearson correlation test. d Cell-cell interaction results comparison
   between the real-profiled tumor tissue cells and semi-profiled tumor
   tissue cells. A two-sided Pearson correlation p-value is plotted. e
   Pseudotime results comparison using the epithelial cells. A two-sided
   Pearson correlation p-value is plotted. Source data are provided as a
   Source Data file.

   A notable observation is the accuracy in analyzing biomarker expression
   pattern and their intra-cluster variation using semi-profiled data.
   Fig. [176]5a showcases dot plots for top cell type signature genes
   derived from the real dataset. These plots reflect an identical pattern
   in both the real-profiled and semi-profiled datasets. Further
   affirmation comes from the strong Pearson correlation coefficient
   between the colors (0.994) and sizes (0.996) of the dots. Notably,
   these correlation coefficients even surpass those observed in the more
   homogeneous COVID-19 dataset. The semi-profiled dataset also reproduces
   biomarker discovery results, establishing its credibility as a suitable
   stand-in for the real-profiled data in such analyses. This consistency
   is exemplified by genes like KRT18 and KRT8, exclusive to epithelial
   cells, corroborated by existing literature^[177]70,[178]71. Another
   illustration is the unique expression of CPA3 and TPSB2 in Mast across
   both datasets. Beyond these top cell type signature genes, a granular
   examination of epithelial cells—encompassing the top 50 positive and
   negative markers—reinforces the congruence. All markers were identified
   using data at single-cell resolution through thorough statistical
   testing, a process unachievable with decomposed cell type-level data
   from standard bulk decomposition methods. As depicted in Fig. [179]5b,
   the preponderance of highly significant entries, with many p-values
   lower than 10^−50, strongly indicates a high degree of similarity in
   marker lists between the two datasets. RRHO plots for additional cell
   types can be found in Supplementary Fig. [180]S30.

   Diving deeper into the gene functional enrichment analysis,
   Fig. [181]5c offers further validation. Analyzing the top 100 signature
   genes across both datasets reveals a staggering 96 common markers,
   yielding a hypergeometric test p-value of 2.13 × 10^−183 (population
   size = 4053). GO terms from both datasets, along with their respective
   p-values, showcase pronounced similarity. Although the heterogeneity of
   the dataset leads to a relatively lower Pearson correlation coefficient
   of 0.593, the overall patterns in the two plots remain statistically
   similar (P = 0.042), leading to the same scientific conclusions.
   Moreover, when considering the union of the top 10 Gene Ontology (GO)
   terms from both the semi-profiled and actual datasets (comprising a
   total of 12 terms), there is a significant similarity in terms of
   enrichment p-values. Notably, the two versions of the top 10 GO terms
   have 9 overlap terms. More comprehensive GO term and pathway enrichment
   analysis results for other cell types’ signature genes are also
   consistent for the real-profiled and semi-profiled versions
   (Supplementary Figs. [182]S31 and [183]S32).

   Despite the increased heterogeneity of the dataset, the analysis of
   cell-cell interactions with the colorectal cancer semi-profiled cohort
   remains promising. As illustrated in Fig. [184]5d, the cell-cell
   interaction analysis, when executed on the real-profiled tumor tissue
   cells and the semi-profiled counterpart, reveals substantial
   consistency. Navigating the intricate interaction patterns
   characteristic of tumor tissues, the semi-profiled data astonishingly
   replicates the intricate layout with a robust Pearson correlation
   coefficient of 0.933. Both versions highlight enteric glial and
   fibroblast as the primary senders, while neutrophils emerge as the
   predominant receivers. Significantly, the most intense interactions
   identified across both sets involve the enteric glial with itself, the
   enteric glial with neutrophils, and the fibroblast with neutrophils.
   The cell-cell interaction results for other tissues also exhibit a high
   degree of similarity between the real-profiled and semi-profiled
   versions, as shown in Supplementary Fig. [185]S33. Additionally, the
   strikingly similar PAGA plots presented in Supplementary Fig. [186]S34
   further demonstrate the utility of semi-profiled data in studying
   cellular trajectories and relationships between different cell types.

   Shifting the focus to pseudotime analysis, we evaluated epithelial
   cells across all tissues as an example demonstration (Fig. [187]5e).
   Given the presence of cells from tumor tissues, this might introduce
   elevated heterogeneity within the cell type. Yet, the consistency
   between pseudotime analysis results from both versions is significant.
   Both versions discern lower pseudotime values concentrated at the base
   of the cluster, culminating in larger values towards the upper regions,
   with the pinnacle being the top-right quadrant. The statistical
   significance of the similarity is further validated by a Pearson
   correlation of 0.461 with a significant p-value of 1.87 × 10^−22 (see
   the Methods section for details about the Pearson correlation test).
   The pseudotime analyses for other cell types also demonstrate a high
   degree of similarity, as evidenced in Supplementary Fig. [188]S35. Such
   a finding underscores the capability of the semi-profiled data to
   adeptly capture intra-cluster nuances in detailed analyses.

Semi-profiling with real bulk measurements yields a dataset nearly identical
to the original single-cell data

   To further illustrate the adaptability of scSemiProfiler in real-world
   applications, we directed our analyses towards the iMGL
   dataset^[189]72, which uniquely profiles both single-cell and real bulk
   RNA-seq measurements for the human inducible pluripotent stem cell
   (iPSC)-derived microglia-like (iMGL) cells, differing from the
   pseudobulk datasets previously used. There are 25 samples having both
   single-cell data and bulk RNA sequencing data. Samples are of different
   conditions (grown in cell culture for 0-4 days and under various
   treatments). The availability of such datasets, which include both
   single-cell and bulk sequencing data on a large scale, remains very
   limited, partially due to the unnecessity of doing both sequencing for
   the same large-scale cohort and its prohibitive cost. Pseudobulk,
   created by averaging out the single-cell data, is often quite different
   from the bulk sequencing from the same sample^[190]73. This is because
   single-cell data is typically noisier, largely due to shallow
   sequencing and the presence of “dropouts” (genes not detected) in
   single-cell data^[191]74–[192]76.

   To navigate this complexity, we devised a method to infer the target
   sample pseudobulk directly from its real bulk data, real bulk, and
   pseudobulk for the representative (refer to the Methods section
   “Fine-tune the deep generative learning model to infer the single-cell
   measurements for the target samples” for comprehensive details). This
   technique allows us to accurately estimate the pseudobulk data of the
   target samples. Supplementary Fig. [193]S36 shows that the average
   Pearson correlation between the inferred pseudobulk and ground truth
   pseudo bulk data is 0.997, significantly higher than the Pearson
   correlation between the real bulk and ground truth pseudobulk, which is
   0.880. This approach enables us to more effectively utilize our deep
   learning model in the pseudobulk data space, which often aligns more
   closely with single-cell data as it is computed by directly averaging
   the single-cell data without any extra bias. Through this method,
   despite the intricate challenges of the iMGL dataset, the results in
   Fig. [194]6 illustrate that our semi-profiled data parallels the
   real-profiled data quite closely. By selecting only eight
   representative samples out of a total of 25 samples, we could notably
   reduce the overall cost without compromising on accuracy. For this
   smaller dataset, we employed active learning to select 8 representative
   samples in batches of 2. Using the same estimation method as in the
   other two studies, the total cost for acquiring both bulk and
   semi-profiled single-cell data through our method is approximately
   $21,750. This amount is just 58% of the estimated $37,500 required for
   conducting single-cell sequencing across the entire cohort.

Fig. 6. Comparative analyses between semi-profiled and real-profiled iMGL
datasets.

   [195]Fig. 6
   [196]Open in a new tab

   a UMAP visualization of the real-profiled iMGL cohort. Different colors
   represent different cell types. Colors are consistent with (g). b UMAP
   visualization of the semi-profiled iMGL cohort. c Combined UMAP
   visualization showcasing the consistent cell distribution across both
   data versions. d UMAP visualization highlighting the representatives'
   cells alongside the semi-profiled cells within the semi-profiled
   dataset. e, f UMAP visualization of the process of generating target
   sample’s (iMGL_D4_rep3) single-cell data using a representative
   (iMGL_D2_rep3) based on deep generative learning model. e UMAP
   visualization of the representative cells reconstruction. f UMAP
   visualization of inference. Compared to representative cells, inferred
   cells' distribution shows higher similarity with the target
   real-profiled cells (example region highlighted in red box). g Relative
   pathway activation pattern of the GO term “activation of immune
   response” calculated for cells of different treatments as compared to
   cell type “C1: Homeostatic non-proliferative” in the real-profiled and
   semi-profiled cohorts. Entries with fewer than 100 cells are left as
   blank. The p-value is from a two-sided Pearson correlation test. h
   Performance evaluation of the scSemiProfiler on the iMGL dataset,
   emphasizing its efficiency in error reduction. Source data are provided
   as a Source Data file. i A comparative illustration of cell type
   proportions under varying experimental conditions, accentuating the
   similarity in patterns between datasets. The Pearson correlation
   coefficients between the real-profiled and semi-profiled versions of
   cell type proportions under different conditions are: iMGL_D0: 0.999,
   iMGL_D1: 0.871, iMGL_D2: 0.993, iMGL_D3: 0.989, iMGL_D4: 0.992,
   iMGL_DMSO: 0.998, iMGL_GW_30: 0.960, iMGL_GW_300: 0.999, iMGL_T_30:
   0.987, iMGL_T_300: 0.999. j–l Deconvolution performance benchmarking
   using CCC, Pearson correlation and RMSE. The analysis covers all N = 25
   samples in this iMGL dataset. The boxes illustrate the interquartile
   ranges (IQRs), with solid lines indicating the means. Whiskers extend
   to data points within 1.5 times the IQR from the lower and upper
   quartiles. P-values are obtained from one-sided Wilcoxon tests without
   adjustment.

   The UMAP visualization, as presented in Fig. [197]6a, b, solidifies our
   findings. Here, the two versions - semi-profiled and real-profiled -
   show a high level of consistency. The cell distributions of various
   iMGL subtypes (as delineated by the dataset provider) in the UMAP
   follow nearly identical patterns. The detailed observation showcases
   that clusters like “C2: Activated, immediate-early”, “C4: Activated,
   non-immediate-early”, and “C5: Activated, immediate-early” are
   interconnected and primarily located on the right-hand side. Likewise,
   “C3: Homeostatic, proliferative” finds its position at the upper left,
   with “C1: Homeostatic, non-proliferative” and “C6: Freshly thawed”
   lying at the bottom left. This consistency transcends to Fig. [198]6c,
   emphasizing a consistent cell distribution across the two versions.
   Furthermore, Fig. [199]6d underlines the precision of scSemiProfiler,
   where a majority of cells in the semi-profiled version were accurately
   generated. The efficient coverage of representative cells in this
   figure also highlights that our active learning strategy remains robust
   even when navigating the challenges of real bulk measurements.

   UMAP visualizations in Fig. [200]6e, f and Supplementary Fig. [201]S37
   show scSemiProfiler’s capability of accurately reconstructing a
   representative sample and infer single-cell data of a target sample
   from its real bulk data. Despite the similarity, there are discernible
   differences in cell distributions between the target and representative
   samples, with the semi-profiled version showing more resemblance. We
   can still see the tendency of the generated cells (yellow), distributed
   closer to the target ground truth (red) than to the representatives’
   cells (blue). This indicates that the semi-profiled samples are either
   denser or sparser in the same areas as their counterparts. Importantly,
   these findings underscore scSemiProfiler’s sensitivity in detecting
   subtle differences between cell subtypes—a task that poses a challenge
   for other deconvolution methods that rely on cell type signature genes.
   Supplementary Figs. [202]S38–[203]S40 shows that the cell distribution
   varies in each sample cluster and sample. However, the semi-profiled
   version of the data is able to capture these nuance variances.

   Figure [204]6g provides an in-depth look at the accuracy of
   semi-profiled gene expression values. Since microglia are the resident
   immune cells in the brain, we checked the activation pattern of GO term
   “activation of immune response” in each cell type under each treatment.
   The semi-profiled dataset presents a highly similar activation pattern
   as the real-profiled dataset.

   Figure [205]6h offers further evidence of the scSemiProfiler’s
   effectiveness. Here, the performance curve of the semi-profiled
   approach significantly undercuts the selection-only one, demonstrating
   its capability to limit semi-profiling errors while optimizing on
   costs.

   An intriguing observation from the iMGL dataset is the cell type
   proportion’s dynamic shifts under various experimental conditions
   (Fig. [206]6i). The transitions from iMGL_D0 to iMGL_D4, for instance,
   reveal a progressive increase in the proportions of C1 and C4 cells.
   Contrastingly, “C2: Activated, immediate-early” cells peak at iMGL_D1
   and then decrease steadily. Although the intricate effects of drugs
   iMGL_GW and iMGL_T need further investigation, preliminary data
   suggests that elevated doses result in a surge of “C1: Homeostatic,
   non-proliferative” cells. Impressively, these intricate variations are
   mirrored in the semi-profiled dataset.

   A juxtaposition of our deconvolution method against others on the iMGL
   dataset offers illuminating insights. The performance of CIBERSORTx and
   MuSiC, which was previously high in other datasets, significantly
   declines, possibly due to the inherent challenges of real bulk data and
   the nuanced similarities among cell types. It has been previously
   reported that MuSiC fails to generate reasonable results on real bulk
   datasets, leading to negative CCC^[207]19. Other methods, such as
   Bisque, TAPE, and Scaden, are more robust to those challenges, showing
   decent deconvolution performance. Despite these challenges,
   scSemiProfiler showcases resilience and consistently outperforms all
   its peers except Bisque, a fact further corroborated by the Wilcoxon
   test^[208]77 (see p-values in Fig. [209]6j–l). The marginal difference
   between our scSemiProfiler and the selection-only method is a testament
   to both approaches nearing optimal performance in this specific
   context. Supplementary Fig. [210]S41 further presents our accurate
   deconvolution for individual samples. The superior performance is
   attributed to our pseudobulk inference strategy described in the
   Methods Section, which was designed to address the systematic
   differences between pseudobulk (the average expression) of aggregated
   cells from the target sample. As shown in Supplementary Fig. [211]S36,
   the real bulk expression differs significantly from the ground-truth
   pseudobulk of the target single-cell data, while our inferred
   pseudobulk closely resembles the true pseudobulk.

   To further demonstrate the effectiveness of scSemiProfiler’s ability to
   deal with real bulk data, we generated pseudobulk of the iMGL dataset
   and performed the semi-profiling using these pseudobulk data. We used
   the generated single-cell data to perform the same analysis and
   compared it with the real bulk version of semi-profiling. The
   pseudobulk version of semi-profiling analysis results were detailed in
   Supplementary Fig. [212]S42. Overall the results are very similar to
   the real bulk version, demonstrating that scSemiProfiler is able to
   handle the challenge brought by the real bulk data and maintain its
   high performance on pseudobulk data.

   To further assess scSemiProfiler’s reliability with real bulk data and
   its adaptability to other species and tissues, we conducted an
   extensive search in the GEO database for datasets that include both
   bulk and single-cell data. We discovered a hamster lung
   dataset^[213]78, which represents a different species and tissue
   compared to the previously analyzed datasets. Our selection criteria
   focused on datasets with paired single-cell and bulk RNA-seq profiling
   of at least 15 samples, each averaging over 1000 cells. The average
   number of cells in each sample in the hamster dataset is 1,151.4, while
   in the colorectal cancer dataset is 1914.0, in the COVID-19 dataset is
   5138.3, and in the iMGL dataset is 2974.9. Most samples in the hamster
   dataset could not pass our previous filtering criteria—samples with
   fewer than 1000 cells should be filtered. A low number of cells
   significantly decreases the number of training samples for the deep
   generative learning model, making it hard for the model to learn the
   underlying data distribution. Despite its relatively lower data
   quality, characterized by fewer cells per sample and a smaller overall
   sample size, we still selected this dataset for analysis due to its
   unique composition and limited availability. We implemented the same
   analytical approach as with previous datasets, with results detailed in
   Supplementary Fig. [214]S43. The close resemblance between the
   semi-profiled and real-profiled datasets, combined with
   scSemiProfiler’s significant advantage over other deconvolution methods
   (as evident in Supplementary Fig. [215]S43f, h, i), confirms its
   effectiveness across diverse species and tissue types. By selecting
   only 2 out of 16 samples as representative (Supplementary
   Fig. [216]S43d indicates these samples contained the majority of
   cells), scSemiProfiler successfully semi-profiled a single-cell dataset
   that closely mirrors the real-profiled ground truth. Additional
   outputs, including UMAP visualizations (Supplementary Fig. [217]S43a–c)
   and precise cell type proportion estimates (Supplementary
   Fig. [218]S43f–i), highlight its accuracy.

Semi-profiling using real bulk measurement also leads to reliable downstream
results

   In this more realistic setting, where scSemiProfiler is tasked with
   semi-profiling using real bulk data for downstream analyses, the
   semi-profiled data consistently mirrored results from the real-profiled
   version. This is particularly remarkable given the unique challenges
   presented by the real bulk data (see the “Fine-tune the deep generative
   learning model to infer the single-cell measurements for the target
   samples” section in Methods for a detailed description of the
   challenges brought by real bulk dataset and our strategies for
   resolving them). We present these downstream analysis results in
   Fig. [219]7.

Fig. 7. Similarities in downstream single-cell analyses between real-profiled
and semi-profiled data for the iMGL cohort.

   [220]Fig. 7
   [221]Open in a new tab

   a Dot plots visualizing the nearly identical expression patterns of
   cell type signature genes across both datasets. P-values are from
   two-sided Pearson correlation tests. b RRHO plots highlighting the
   striking similarities between the top 50 positive and negative C3
   markers from both real and semi-profiled datasets. P-values are from
   one-sided hypergeometric tests. c Overlapping GO enrichment analysis
   results for the top C3 signature genes, emphasizing consistent
   analytical outcomes between the datasets. P-values plotted are from
   one-sided hypergeometric tests and are Benjamini-Hochberg corrected.
   The p-value on the top is from a two-sided Pearson correlation test. d
   PAGA plots illustrating the consistent major cell type links observed
   in both datasets. The p-value is from a two-sided Pearson correlation
   test. e Pseudotime plots affirming the topographical alignment between
   the real-profiled and semi-profiled cohort. The two versions of
   datasets' pseudotime distributions show high similarity (Pearson
   correlation 0.762, two-sided p-value 3.76 × 10^−77). Source data are
   provided as a Source Data file.

   The markers identified using single-cell resolution data in the
   real-profiled dataset were almost identical in expression patterns to
   those in the semi-profiled dataset. Fig. [222]7a visually reinforces
   this, showing “C1: Homeostatic, non-proliferative” and “C10:
   Homeostatic, non-proliferative” with virtually indistinguishable
   expression patterns across both data types. Additionally, unique
   expressions in “C11: Myeloid, progenitors”, such as GAPA2 and HPGDS,
   were consistently observed. The overarching similarities were further
   quantified with impressive Pearson correlation coefficients for both
   dot sizes and colors, clocking in at 0.980 and 0.989, respectively.

   Further validating the congruency of our method, Fig.[223]7b presented
   the RRHO plot of the top 50 positive and negative C3 markers from both
   single-cell datasets (see RRHO plots for other cell types in
   Supplementary Fig. [224]S44). The degree of similarity is substantial,
   with most of the overlapping test entries showcasing p-values less than
   10^−50. Such findings strongly suggest that the scSemiProfiler is adept
   at producing reliable data for biomarker discovery.

   Proceeding to more in-depth downstream analysis using the top 100
   signature genes for GO enrichment, we observed an overlap of 90 genes
   between the semi-profiled and the real single-cell datasets
   (Fig. [225]7c). The hypergeometric test revealed a significant p-value
   of 1.81 × 10^−183 (population size of the hypergeometric test equals
   the total number of highly variable genes used, 6013). The enriched
   terms identified from both the semi-profiled and real datasets matched
   closely. The Pearson correlation coefficient between the two versions’
   significance is 0.995, underscoring the consistency in their analytical
   outcomes. Extended GO and Reactome enrichment analysis in Supplementary
   Figs. [226]S45 and [227]S46 further confirm the accuracy of finding
   signature genes using semi-profiled data.

   In the case of cell-cell interactions, both real and semi-profiled
   single-cell data did not capture any significant interactions, probably
   due to the similarity between cell clusters in the iMGL data. Instead,
   the partition-based graph abstraction (PAGA) analysis^[228]79 showcased
   in Fig.[229]7d highlighted that major cell type links were consistent
   across datasets. Further cementing this was the strong Pearson
   correlation coefficient of 0.865 between the adjacency matrices of the
   two networks. Pseudotime analysis in Fig. [230]7e also affirmed the
   alignment between the two datasets. The topographical pseudotime
   alignment between them is almost congruent, justified by a Pearson
   correlation of 0.762 with a p-value 3.76 × 10^−77.

   Additionally, as shown in Supplementary Fig. [231]S47, the pseudobulk
   semi-profiling’s downstream analysis results are similar to the real
   bulk version, further demonstrating scSemiProfiler’s ability to handle
   real bulk data. Furthermore, even in the challenging hamster dataset,
   all the downstream analysis as illustrated in Supplementary
   Fig. [232]S48, demonstrates that the semi-profiled dataset supports
   robust downstream analytical tasks such as biomarker discovery,
   cell-cell interactions, and pseudotime analysis.

   In conclusion, our findings robustly demonstrate that scSemiProfiler
   seamlessly adapts to real-world scenarios employing real bulk data. The
   downstream analytical outcomes derived from the semi-profiled data are
   significantly consistent with those based on real data.

Active learning demonstrates its prowess in selecting the most informative
samples for enhanced single-cell profiling

   The crux of scSemiProfiler’s strategy revolves around judiciously
   selecting representative samples. The rationale is straightforward: the
   more informative the chosen representatives, the better the
   semi-profiling performance. This not only enhances the fidelity of the
   generated profiles but also provides a cost-effective approach by
   minimizing the number of necessary representatives.

   Initially, our methodology bore similarities to uncertainty
   sampling^[233]38,[234]80,[235]81, a heuristic active learning
   technique. Here, the intuition is to query samples with the most
   uncertainty, thereby maximizing the incremental information acquired.
   In our algorithm, we employed bulk data to pick out batches of samples
   that exhibited the most variance from their designated representatives.
   Since this method does not use information from the base learner (the
   deep generative model), it is still a passive learning algorithm.

   To improve the representative selection, we then turn the algorithm
   into an active learning algorithm by incorporating the information from
   the deep generative models. The algorithm also utilizes the clustering
   information in the cohort and is thus a type II active learning
   algorithm^[236]82. Combining these two ideas, the algorithm aims to
   reduce the total heterogeneity of each sample cluster, ensuring that
   each target sample has a similar representative, thus optimizing
   semi-profiling performance.

   As depicted in Fig. [237]8, we juxtaposed our advanced active learning
   algorithm against the rudimentary passive learning approach. Each
   panel, from Fig. [238]8a–c, encapsulates the comparative analyses
   derived from distinct datasets. While the x-axis maps the
   representatives earmarked for single-cell sequencing, the y-axis
   portrays the single-cell in silico inference difficulty, which is
   quantified using the average single-cell-level difference (see Equation
   ([239]17) in the Methods section) between target samples and their
   representatives. This premise holds that as the dissimilarity between
   the target and the representatives increases, the complexity of the in
   silico inference task also rises. This metric is crucial as it
   highlights the challenges encountered by the deep generative learning
   model in semi-profiling, thereby illuminating the effectiveness of the
   strategies used for selecting representatives.

Fig. 8. Active learning demonstrates its prowess in selectively profiling the
most informative samples at the single-cell level.

   [240]Fig. 8
   [241]Open in a new tab

   The x-axis represents the number of samples selected for single-cell
   profiling (representatives). The y-axis shows the single-cell in silico
   inference difficulty of the dataset, which is quantified by the average
   single-cell difference from each sample to its representative,
   showcasing the efficiency of representative selection strategies. The
   marked stars signify the iterations chosen for our methodology, with
   the generated data underpinning the analyses detailed in previous
   sections. a Results from the COVID-19 dataset. Active learning shows
   significantly better performance, especially in the beginning when a
   few representatives are selected. b Observations derived from the
   colorectal cancer dataset. Active learning continues to show
   significantly better performance even when more representatives are
   selected. c Insights from the iMGL dataset with real bulk measurements.
   Active learning still manages to outperform passive learning
   significantly. Source data are provided as a Source Data file.

   The empirical evidence is resounding. Across all datasets, the active
   learning algorithm showcased its mettle by consistently pinpointing
   representatives that considerably reduced the total distance to other
   samples. This underscores the algorithm’s capability to foster superior
   representative selection for semi-profiling. In Fig. [242]8a, when
   applying our method to the COVID-19 dataset, active learning shows
   better performance than passive learning, especially in the beginning
   iterations of the semi-profiling. Active learning reduces the inference
   difficulty when the same number of representatives are selected. Also,
   consider the marked point where 28 representatives are selected, which
   is consistent with the representatives we selected for our analyses in
   previous sections. To reach the same level of inference difficulty,
   passive learning needs to select 4 more representatives, i.e. the cost
   of 4 samples’ single-cell sequencing experiment is saved. For the
   colorectal cancer dataset (Fig. [243]8b), active learning continues to
   perform better than passive learning, and at the point we selected, the
   cost for more than two batches (8) of representatives can be saved
   using active learning. Fig. [244]8c shows the results for the iMGL
   dataset, in which active learning is also significantly better than
   passive learning. In all experiments, active learning consistently
   outperforms passive learning in selecting representatives. With the
   same budget, active learning achieves lower inference difficulty.
   Furthermore, to reach a comparable level of inference difficulty,
   active learning requires a lower cost.

Discussion

   In the present work, we introduced scSemiProfiler, an innovative
   computational framework designed to produce affordable single-cell
   measurements for a given large cohort. The allure of scSemiProfiler
   stems from its unparalleled capacity to mirror the outcomes of
   conventional single-cell sequencing, yet with substantial
   cost-effectiveness. Marrying expansive sequencing with pinpoint
   profiling and optimizing data utilization, it charts a robust,
   intricate, and economical path for large-scale single-cell exploration.
   This tool is not merely a means of generating data but ensures the
   derived single-cell information remains dependable for an array of
   downstream analyses. The pipeline ingests bulk data, leverages an
   active learning module to judiciously select representatives for
   authentic single-cell sequencing, and employs a deep generative
   learning model, VAE-GAN, to infer the single-cell data of the remainder
   of the cohort. Through this approach, every sample in the cohort has
   access to single-cell data, either semi-profiled or real-profiled. We
   subjected scSemiProfiler to rigorous evaluation using three distinct
   cohorts, simulating a scenario devoid of single-cell data and then
   juxtaposing the output of our tool with the authentic single-cell
   datasets. The striking resemblance between semi-profiled and
   real-profiled data in facets like UMAP visualization and cell type
   proportions ascertains the robustness of scSemiProfiler. Furthermore,
   both datasets align almost perfectly in subsequent analyses,
   encompassing biomarkers, enrichment analyses, cell-cell interactions,
   and pseudotime trajectories, offering a cost-effective means of
   integrating single-cell data in cohort studies without compromising on
   analytical precision.

   The significance of scSemiProfiler’s contributions lies in its defining
   properties: the precision of the method and its transformative capacity
   within the field of single-cell genomics.

   Precision-driven computational modeling: Our study’s primary
   contribution is the development and implementation of an iterative deep
   generative neural network framework with active learning, designed to
   enable single-cell level bulk decomposition. scSemiProfiler
   distinguishes itself from traditional cell deconvolution methods, which
   typically achieve only cell type level decomposition of bulk data. Our
   tool significantly advances this by actively deconvoluting bulk data
   into genuine single-cell data. This capability enables sophisticated
   analyses that delve into cellular heterogeneity, such as UMAP, pathway
   activation patterns, biomarker discovery, gene functional enrichment
   analysis, cell-cell interaction analysis, and pseudotime
   analysis-applications not possible with conventional deconvolution
   outputs. Notably, given the semi-profiled cohort, one can perform de
   novo cell type annotation for the single cells, which is another
   flexibility not owned by any other existing tools. Further setting
   scSemiProfiler apart, evaluations across four cohorts demonstrate its
   superior performance in traditional deconvolution tasks, particularly
   with real bulk datasets. This superior performance is attributed to its
   robust pseudobulk inference capabilities, which allow for more accurate
   and meaningful interpretations in practical settings, making
   scSemiProfiler a pioneering tool in the field. Furthermore, when placed
   under rigorous scrutiny, the semi-profiled data generated by our model
   consistently showcases significant similarity to real-profiled
   single-cell data, underscoring its reliability and potential to reshape
   the paradigm of single-cell analytics, particularly for large-scale
   cohort studies in which real single-cell sequencing cost would be
   prohibitive.

   Intelligent active learning mechanism: Our second salient contribution
   hinges on the power of active learning. Instead of passively selecting
   representatives, our model embarks on an intelligent, iterative
   journey. By constantly evaluating the data and learning from previous
   rounds of selection, this module discerns which representatives would
   offer the most informative insights for semi-profiling. This isn’t a
   mere addition; it’s a reinvention of how representative selection
   operates. The synchronization of insights from the deep learning model
   with this active learning module ensures that the selection process is
   not just data-driven, but also insight-driven. The implications of this
   are twofold: firstly, it enhances the quality and relevance of the
   selected representatives, ensuring that they truly resonate with the
   cohort’s cellular composition. Secondly, it offers an economic
   advantage. By intelligently choosing representatives, the financial
   overheads associated with single-cell experiments can be minimized,
   ensuring maximum return on investment both in terms of data quality and
   budgetary constraints. Together, these innovations not only elevate the
   efficacy of scSemiProfiler but also pioneer a new direction in how
   single-cell data can be derived, analyzed, and utilized, holding
   profound implications for future biomedical research endeavors.

   Future work on scSemiProfiler is focused on expanding its capabilities
   from single-cell RNA sequencing (scRNA-seq) to other biological
   modalities, such as proteomics. The adaptability of the tool’s current
   methodology will require adjustments in algorithms to suit the unique
   data characteristics of each new modality, along with comprehensive
   performance evaluations benchmarked against established standards. A
   pivotal element of this development is the utilization of cross-modal
   information exchange, which aims to improve the accuracy of
   semi-profiling. This approach could significantly reduce the costs
   associated with single-cell multi-omics studies beyond the scope of the
   current study, by minimizing the need for extensive representative
   samples from each modality and leveraging the inherent similarities
   across various biological data types.

   The scSemiProfiler represents a groundbreaking shift in single-cell
   analysis, particularly for large-scale and cohort studies, by offering
   a cost-effective yet comprehensive approach. It achieves this by
   selectively performing single-cell experiments on a few samples using
   active learning, coupled with deep generative models for in silico
   inference of single-cell profiles for the remaining samples, thereby
   creating similar semi-profiled cohorts at a substantially lower cost.
   This approach not only alleviates financial constraints in expansive
   biomedical research but also extends its utility beyond cohort studies.
   The tool’s deep generative model enables “single-cell level
   deconvolution” across various studies, provided there is a single-cell
   reference. Initially centered on RNA sequencing datasets,
   scSemiProfiler’s adaptability to other single-cell and bulk data
   modalities opens new avenues in personalized medicine and can
   significantly enrich global data repositories. Significantly,
   scSemiProfiler is designed to complement, rather than compete with,
   large-scale single-cell sequencing technologies, such as WT Mega by
   Parse Biosciences. WT Mega is notable for its ability to process up to
   1 million cells across 96 samples. When scSemiProfiler is integrated
   with these advanced sequencing platforms, it further reduces costs,
   enabling single-cell level profiling of thousands of samples in
   large-scale studies at a more manageable expense. This synergy
   significantly broadens the scope and depth of possible research,
   enhancing the efficiency and affordability of large-scale single-cell
   studies.

Methods

   In this section, we outline the methodological details for each module
   depicted in Fig. [245]1, presented in the following sequence: initial
   setup procedures, representative single-cell profiling and processing
   techniques, specifics of the deep learning model and the in silico
   inference process, the active learning algorithm for iterative
   representative selection, criteria for termination, and additional
   information concerning downstream analysis and evaluations.

The initial setup of scSemiProfiler (INITIAL SETUP)

   The initial setup of the scSemiProfiler method plays a pivotal role
   (depicted in Fig. [246]1a), serving as the foundation for subsequent
   semi-profiling iterations. Initially, each individual sample within the
   designated cohort undergoes bulk sequencing. Upon obtaining the raw
   count data from this sequencing, we perform library size normalization
   followed by a log1p transformation. From the processed data, we
   identify and select highly variable genes and relevant markers for
   further analysis. In our experiments, we included more genes chosen by
   the default SCANPY’s ‘pp.highly_variable_genes’ tool to avoid missing
   important markers. We also reviewed the original study generating the
   datasets to find important markers and manually added them to the
   highly variable genes. The highly variable genes are chosen using the
   ‘pp.highly_variable_genes’ tool in SCANPY. For the colorectal cancer
   dataset, due to its relatively sparser gene expression, we chose the
   top 4000 highly variable genes plus a few manually added markers. For
   the rest, we chose the top 6000 highly variable genes plus a few
   markers. The user can decide the number of highly variable genes based
   on their own dataset and GPU memory or stick with our default value
   6000. Subsequently, we employ Principal Component Analysis
   (PCA)^[247]83–[248]85, implemented in the Python package
   Scikit-learn^[249]86, to reduce the dimensionality of the processed
   bulk data. We employ a consistent approach to determine the number of
   principal components (PCs). Specifically, we use 100 PCs when there are
   more than 100 data samples to minimize information loss. For datasets
   with fewer than 100 samples, we use a number of PCs equal to the sample
   size minus one. We then determine the batch size of representatives (B)
   representing the number of selected representatives for actual
   single-cell profiling in each iteration. Once the representative batch
   size is defined, we cluster the dimensionality-reduced bulk data to
   select the initial batch of representatives. The dimensionality-reduced
   data is clustered into B distinct clusters using the K-Means
   algorithm^[250]87, the sample that is closest to the cluster centroid
   is designated as the representative for that cluster. This
   well-prepared initial setup provides the clustered bulk sequencing data
   and identified representatives for high-resolution single-cell
   sequencing.

Representative single-cell profiling and processing

   After the selection of representatives, the scSemiProfiler method
   proceeds to the single-cell profiling phase, focusing specifically on
   the chosen representatives (Fig. [251]1b). The raw count data obtained
   from single-cell profiling undergoes a sequence of preprocessing steps.
   These steps encompass traditional quality controls as well as advanced
   processing techniques aimed at enhancing the learning of the deep
   generative model.

   To initiate, expression values that are extremely low (below 0.1% of
   the library size) are adjusted to zero. This adjustment is based on
   insights from prior studies^[252]88,[253]89, which have demonstrated
   that these minimal values are often representative of background noise
   and, therefore, should be excluded. The threshold for this processing
   is empirically determined by selecting the value that yields the most
   high-quality UMAP visualizations consistent with the original cell type
   annotations.

   Following the noise removal, our preprocessing workflow encompasses
   several standard steps, including the removal of low-quality cells
   (expressing fewer than 200 genes, per SCANPY’s default settings) and
   genes (expressed in fewer than three cells). We also remove dying cells
   identified by a high proportion of mitochondrial reads and doublets
   expressing an unusually high number of genes, both determined using
   quality control plots generated by SCANPY’s ‘pp.calculate_qc_metrics’
   tool. Subsequently, we normalize the data to a library size of 10,000
   per cell and apply a log1p transformation. Low-quality samples with
   fewer than 1000 cells are removed, with an exception made for the
   hamster dataset to preserve an adequate sample size. The data matrix is
   then refined to include only highly variable genes identified from bulk
   data analysis. The COVID-19 dataset arrives already preprocessed as
   detailed by^[254]45, including doublet removal with Scrublet^[255]90,
   cell filtering, library normalization, and log1p transformation. In
   contrast, the colorectal cancer dataset, provided in raw counts,
   undergoes a comprehensive preprocessing regimen. We filter out cells
   expressing fewer than 200 genes and genes detected in fewer than three
   cells. Additionally, doublets expressing more than 6000 genes and dead
   cells with mitochondrial reads above 20% are removed. Samples with
   fewer than 1000 genes are also discarded. The dataset undergoes
   background noise removal, library size normalization, and log1p
   transformation, with a further selection of data columns based on
   highly variable genes. For the iMGL dataset, we process cells that
   passed the dataset provider’s stringent quality controls, including
   thresholds of over 1500 genes detected per cell, fewer than 100,000
   reads per cell, and mitochondrial RNA fractions below 20%. We also
   exclude any genes detected in less than three individual cells. Our
   preprocessing steps for this dataset include standard procedures like
   filtering low-quality samples, background noise removal, library size
   normalization, log1p transformation, and gene feature selection. The
   hamster dataset, characterized by its sparse single-cell data
   (averaging 1,151.4 cells per sample), bypasses cell, gene, and sample
   filtering to preserve data integrity, focusing solely on selecting
   genes common to single-cell and bulk data, removing background noise,
   normalizing library sizes, and selecting highly variable genes. This
   approach ensures we retain all samples for analysis despite the
   dataset’s overall low quality, leveraging its unique pairing of
   single-cell and bulk data.

   After the initial preprocessing, the single-cell data is further
   refined by integrating two types of prior knowledge to enhance cell
   representations. Firstly, gene set scores are calculated using all
   curated gene sets from MSigDB^[256]59 and Ernst et al.^[257]91, sourced
   from UNIFAN^[258]92. Gene set scores are determined by averaging the
   expression values for each gene set, followed by a log1p
   transformation, and then combined with the preprocessed single-cell
   gene expression matrix via concatenation. In the deep generative model,
   there are two separate encoders for encoding gene set scores and gene
   expression values, and two separate decoders for reconstructing their
   values. Both reconstructions will be involved in the reconstruction
   loss computation. The percentage of genes in the final features are
   COVID-19: 88.8%, colorectal cancer: 84.5%, iMGL: 89.8%, hamster: 90.0%.
   Secondly, we adopt a feature weighting method, giving more weight to
   features with higher variance in each cell’s input when calculating
   reconstruction loss. More comprehensive details of this strategy are
   available in the section focusing on single-cell inference.

Pretrain the deep generative learning model for reconstructing the
single-cell data of the selected representatives (REPRESENTATIVE
RECONSTRUCTION)

   The next phase, depicted in Fig. [259]1c, involves training a deep
   generative learning model for the in silico single-cell data inference
   of a non-representative target sample using processed representative
   single-cell data and bulk data of both samples. This will be executed
   for all non-representative samples to ensure everyone in the study
   cohort will have single-cell data (real-profiled or semi-profiled) in
   the end. Firstly, our deep generative model aims to reconstruct the
   single-cell profiles of representatives. This reconstruction lays the
   foundation of the single-cell inference of the target samples, which
   can be viewed as a modified single-cell data reconstruction task.
   Specifically, we first pretrain a reconstruction model that’s tailored
   to the representative of each cluster. We then fine-tune this model for
   each associated target sample, employing the target bulk loss to direct
   the transformation from the representative’s profile to that of the
   target. This process is critical for ensuring the model’s accuracy in
   reflecting the unique characteristics of each target sample (i.e.,
   matching the bulk sequencing). When performing the single-cell
   inference after the reconstruction, the initial single-cell data
   reconstruction of the representative is modified by introducing the
   difference between the target sample and its representative based on
   their distinct bulk data profiling, aligning the synthetic single-cell
   data for the target sample closer to its actual single-cell profiling
   counterpart. The fundamental reasoning is that the single-cell data
   matrix of a specific target sample should bear a resemblance to the
   representatives chosen from the same study. Significant differences
   between them can be adjusted and guided by the disparity as delineated
   by the bulk sequencing data.

   We designed a VAE-GAN-based model for the representative reconstruction
   and the target single-cell generation guided by their expression
   difference at the bulk level. It ingests gene expression data into an
   MLP encoder, which subsequently outputs parameters of a multivariate
   Gaussian. A random variable, z, is sampled from this Gaussian and
   processed through an MLP decoder, resulting in parameters of a
   Zero-Inflated Negative Binomial (ZINB) distribution - optimal for
   modeling single-cell RNA-seq data^[260]93–[261]95. Training the VAE
   parameters involves maximizing the data likelihood and minimizing the
   KL divergence^[262]96 between the latent variable distribution and a
   standard Gaussian distribution, following the Evidence Lower BOund
   (ELBO) loss framework^[263]40. The geneset scores were concatenated
   with the gene expression data and factored into the loss calculations,
   essentially functioning as additional genes. A separate MLP in the
   encoder network is used for encoding gene set score values to the
   encoder’s hidden layer, and a separate MLP in the decoder is used for
   generating gene set values from the hidden layer. However, their
   utilization is specifically tailored to improve the learning of the
   cell representation and they do not play a role in the computation of
   bulk tissue losses to guide in silico inference. To demonstrate the
   effectiveness of the gene set strategy, we compared the cell embedding
   quality generated with/without using gene set scores using an
   independent dataset^[264]97 that is also employed in the SCVI study. We
   trained models using cells with only gene expression data and cells
   augmented with gene set scores separately. Then we used these two
   models to perform the reconstruction task for the dataset and found the
   one augmented with gene set score achieves lower MSE (19.33 vs. 25.48
   Supplementary Fig. [265]S49).

   Innovating upon this foundational VAE structure, we integrated the
   following four techniques to enhance effective cell representation
   learning. Gene set score inclusion: As mentioned in the data
   preprocessing section, beyond mere gene expression reconstruction, we
   compute gene set scores for cells and concatenate them to the gene
   expression data. During the two stages of pretrain, we also compute the
   reconstruction loss for gene set scores. Such inclusion furnishes the
   model with an enriched biological context, fostering a more
   comprehensive learning of the input cells. Hence, input for each cell
   becomes the concatenation of gene expression and gene set scores:
   (x[i], s[i]). Feature importance weight: We also compute a weight
   vector w to weight the contribution of each feature to the VAE’s
   reconstruction loss based on the feature’s variance. Graph
   convolutional networks (GCN)^[266]98,[267]99: A GCN layer assimilates
   adjacent cells’ information, mitigating dropout concerns and the
   inherent noise of single-cell sequencing. Generative adversarial
   network (GAN)^[268]41 dynamics: We employ a discriminator network,
   resonating with GAN structures, which discerns between genuine and
   generated cell data. This guides the generator towards producing more
   authentic single-cell data. The resultant loss function for the
   generator discussed above is:
   [MATH: <msub><mrow><mi
   mathvariant="normal">L</mi></mrow><mrow><msub><mrow><mi
   mathvariant="normal">G</mi></mrow><mrow><mi
   mathvariant="normal">Pretrain1</mi></mrow></msub></mrow></msub><mo>=</m
   o><msub><mrow><mi mathvariant="normal">L</mi></mrow><mrow><mi
   mathvariant="normal">VAE</mi></mrow></msub><mo>−</mo><msub><mrow><mi
   mathvariant="normal">λ</mi></mrow><mrow><mi>d</mi></mrow></msub><msub><
   mrow><mi mathvariant="normal">L</mi></mrow><mrow><msub><mrow><mi
   mathvariant="normal">D</mi></mrow><mrow><mi
   mathvariant="normal">Fake</mi></mrow></msub></mrow></msub> :MATH]
   1
   [MATH: <msub><mrow><mi mathvariant="normal">L</mi></mrow><mrow><mi
   mathvariant="normal">VAE</mi></mrow></msub><mo>=</mo><munderover
   accent="false" accentunder="false"><mrow><mo mathsize="big">
   ∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</mi
   ></mrow></munderover><mfenced close="]" open="["><mrow><mo>−</mo><mi
   mathvariant="bold">w</mi><mo>⋅</mo><mi>log</mi><mfenced
   close=")"><mrow><mrow><mo>(</mo><mrow><mi
   mathvariant="normal">ZINB</mi><mrow><mo>(</mo><mrow><mrow><mo>(</mo><mr
   ow><msub><mrow><mi mathvariant="bold">x</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>∣</m
   o><mi
   mathvariant="bold-italic">ρ</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><m
   o>)</mo></mrow><mo>,</mo><mi
   mathvariant="bold-italic">π</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">z</mi></mrow><mrow><mi>i</mi></mrow></msub></mrow><m
   o>)</mo></mrow><mo>,</mo><mi
   mathvariant="bold-italic">θ</mi></mrow><mo>)</mo></mrow></mrow><mo>)</m
   o></mrow></mrow></mfenced><mo>+</mo><mi
   mathvariant="normal">KL</mi><mrow><mo>(</mo><mrow><mi>q</mi><mrow><mo>(
   </mo><mrow><mi
   mathvariant="bold">z</mi><mo>∣</mo><mrow><mo>(</mo><mrow><msub><mrow><m
   i mathvariant="bold">x</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><
   mo>)</mo></mrow><mo>∥</mo><mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">0</mi><mo>,</mo><mi>I</mi></mrow><mo>)</mo></mrow></
   mrow><mo>)</mo></mrow></mrow></mfenced> :MATH]
   2
   [MATH: <msub><mrow><mi
   mathvariant="normal">L</mi></mrow><mrow><msub><mrow><mi
   mathvariant="normal">D</mi></mrow><mrow><mi
   mathvariant="normal">Fake</mi></mrow></msub></mrow></msub><mo>=</mo><mu
   nderover accent="false" accentunder="false"><mrow><mo mathsize="big">
   ∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</mi
   ></mrow></munderover><mo>−</mo><mi>log</mi><mrow><mo>(</mo><mrow><mn>1<
   /mn><mo>−</mo><mi>D</mi><mrow><mo>(</mo><mrow><mi>G</mi><mrow><mo>(</mo
   ><mrow><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><
   mo>)</mo></mrow></mrow><mo>)</mo></mrow></mrow><mo>)</mo></mrow> :MATH]
   3

   [MATH: <msub><mrow><mi
   mathvariant="normal">L</mi></mrow><mrow><msub><mrow><mi
   mathvariant="normal">G</mi></mrow><mrow><mi
   mathvariant="normal">Pretrain1</mi></mrow></msub></mrow></msub> :MATH]
   is the loss used to train the VAE-based generator network during the
   first pertrain stage. It has two components, a VAE loss L[VAE] and
   another loss
   [MATH: <msub><mrow><mi
   mathvariant="normal">L</mi></mrow><mrow><msub><mrow><mi
   mathvariant="normal">D</mi></mrow><mrow><mi
   mathvariant="normal">Fake</mi></mrow></msub></mrow></msub> :MATH]
   including the feedback from the discriminator. In L[VAE], N is the
   number of cells in the representative’s single-cell dataset. x[i] is
   the vector representing the gene expression value of the i-th cell in
   the dataset and s[i] is the corresponding gene set score.
   [MATH: <mi class="MJX-tex-caligraphic"
   mathvariant="script">N</mi><mrow><mo>(</mo><mrow><mi
   mathvariant="bold">0</mi><mo>,</mo><mi>I</mi></mrow><mo>)</mo></mrow>
   :MATH]
   is the standard Gaussian distribution. z[i] is a low-dimensional latent
   variable sampled from the Gaussian distribution q(z∣(x[i], s[i])) whose
   parameters μ[i] and Σ[i] are generated by the encoder network:
   [MATH: <msub><mrow><mi mathvariant="normal">MLP</mi></mrow><mrow><mi
   mathvariant="normal">Encoder</mi></mrow></msub><mrow><mo>(</mo><mrow><m
   i
   mathvariant="normal">GCN</mi><mrow><mo>(</mo><mrow><mrow><mo>(</mo><mro
   w><msub><mrow><mi mathvariant="bold">x</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><
   mo>)</mo></mrow></mrow><mo>)</mo></mrow> :MATH]
   . We follow the ZINB distribution design of SCVI^[269]93 and generate
   ZINB parameter using the decoder network:
   ρ(z[i]) = MLP[Decoder,ρ](z[i]), π(z[i]) = MLP[Decoder,π](z[i]) and θ is
   a free parameter vector that are learned in the training process. In
   [MATH:
   <msub><mrow><mi>L</mi></mrow><mrow><msub><mrow><mi>D</mi></mrow><mrow><
   mi mathvariant="normal">Fake</mi></mrow></msub></mrow></msub> :MATH]
   , G represents the VAE-based generator so G((x[i], s[i])) represents
   the i-th cell reconstructed by the generator. The reconstructed cell
   data is the mean of the generator’s ZINB distribution. And D is our
   discriminator network, so D(G((x[i], s[i]))) is the discriminator’s
   predicted probability of the i-th reconstructed cell being a
   real-profiled cell. λ[d] denotes an empirically determined scaling
   factor. Meanwhile, the discriminator network will be trained using
   cross-entropy loss to reach higher classification performance.
   [MATH: <msub><mrow><mi mathvariant="normal">L</mi></mrow><mrow><mi
   mathvariant="normal">D</mi></mrow></msub><mo>=</mo><munderover
   accent="false" accentunder="false"><mrow><mo mathsize="big">
   ∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</mi
   ></mrow></munderover><mfenced close="]"
   open="["><mrow><mo>−</mo><mi>log</mi><mrow><mo>(</mo><mrow><mi>D</mi><m
   row><mo>(</mo><mrow><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><
   mo>)</mo></mrow></mrow><mo>)</mo></mrow><mo>−</mo><mi>log</mi><mrow><mo
   >(</mo><mrow><mn>1</mn><mo>−</mo><mi>D</mi><mrow><mo>(</mo><mrow><mi>G<
   /mi><mrow><mo>(</mo><mrow><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><
   mo>)</mo></mrow></mrow><mo>)</mo></mrow></mrow><mo>)</mo></mrow></mrow>
   </mfenced> :MATH]
   4

   Here, D((x[i], s[i])) represents the discriminator’s predicted
   probability of the i-th input cell being real when it is indeed real,
   and D(G((x[i], s[i]))), as mentioned in a previous paragraph,
   represents the discriminator’s predicted probability of the i-th cell
   being real when it is, in fact, a cell reconstructed by the generator
   network. When trained together, the generator will first be trained
   until convergence, followed by alternating training with the
   discriminator every three epochs. Together, they form a GAN with a
   min-max objective:
   [MATH:
   <msub><mrow><mi>min</mi></mrow><mrow><mi>G</mi></mrow></msub><msub><mro
   w><mi>max</mi></mrow><mrow><mi>D</mi></mrow></msub><munderover
   accent="false" accentunder="false"><mrow><mo mathsize="big">
   ∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</mi
   ></mrow></munderover><mrow><mo>[</mo><mrow><mi>log</mi><mrow><mo>(</mo>
   <mrow><mi>D</mi><mrow><mo>(</mo><mrow><mrow><mo>(</mo><mrow><msub><mrow
   ><mi mathvariant="bold">x</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><
   mo>)</mo></mrow></mrow><mo>)</mo></mrow><mo>+</mo><mi>log</mi><mrow><mo
   >(</mo><mrow><mn>1</mn><mo>−</mo><mi>D</mi><mrow><mo>(</mo><mrow><mi>G<
   /mi><mrow><mo>(</mo><mrow><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">s</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow><
   mo>)</mo></mrow></mrow><mo>)</mo></mrow></mrow><mo>)</mo></mrow></mrow>
   <mo>]</mo></mrow> :MATH]
   5

   In this context, the discriminator’s objective is to maximize its
   accuracy, aiming for precise classification, while the generator’s goal
   is to minimize discrepancies, generating realistic cells to challenge
   the discriminator.

   Overall, our approach involves two pretrain stages and one fine-tune
   stage for performing single-cell inference. In the initial pretrain
   stage, we employ
   [MATH: <msub><mrow><mi
   mathvariant="normal">L</mi></mrow><mrow><msub><mrow><mi
   mathvariant="normal">G</mi></mrow><mrow><mi
   mathvariant="normal">Pretrain1</mi></mrow></msub></mrow></msub> :MATH]
   and L[D] to train the generator G and discriminator D, respectively,
   forming a GAN. The primary objective here is to train a generator
   capable of reconstructing genuine cell data from the representatives.
   Subsequently, the second pretraining stage closely resembles the first,
   without freezing any subnetwork, but it functions in full-batch mode.
   This modification enables the inclusion of an extra bulk loss term in
   the generator’s loss function. Such an addition improves the model’s
   capacity to leverage insights from bulk data and better aligns the
   single-cell data with its bulk counterpart. This enhancement is crucial
   for priming the model for the in silico inference of single-cell data
   from bulk data, while also strengthening the data reconstruction
   process.
   [MATH: <msub><mrow><mi
   mathvariant="normal">L</mi></mrow><mrow><msub><mrow><mi
   mathvariant="normal">G</mi></mrow><mrow><mi
   mathvariant="normal">Pretrain2</mi></mrow></msub></mrow></msub><mo>=</m
   o><msub><mrow><mi mathvariant="normal">L</mi></mrow><mrow><mi
   mathvariant="normal">VAE</mi></mrow></msub><mo>−</mo><msub><mrow><mi
   mathvariant="normal">λ</mi></mrow><mrow><mi>d</mi></mrow></msub><msub><
   mrow><mi mathvariant="normal">L</mi></mrow><mrow><msub><mrow><mi
   mathvariant="normal">D</mi></mrow><mrow><mi
   mathvariant="normal">Fake</mi></mrow></msub></mrow></msub><mo>+</mo><ms
   ub><mrow><mi mathvariant="normal">λ</mi></mrow><mrow><mi
   mathvariant="normal">BulkR</mi></mrow></msub><msub><mrow><mi
   mathvariant="normal">L</mi></mrow><mrow><mi
   mathvariant="normal">BulkR</mi></mrow></msub> :MATH]
   6
   [MATH: <msub><mrow><mi mathvariant="normal">L</mi></mrow><mrow><mi
   mathvariant="normal">BulkR</mi></mrow></msub><mo>=</mo><msub><mrow><mfe
   nced close="∥" open="∥"><mrow><mfrac><mrow><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>N</mi></mrow></msubsup><msubsup><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo>′</mo>
   </mrow></msubsup></mrow><mrow><mi>N</mi></mrow></mfrac><mo>−</mo><mfrac
   ><mrow><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>N</mi></mrow></msubsup><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mrow><mi>N</mi></mrow></
   mfrac></mrow></mfenced></mrow><mrow><mn>2</mn></mrow></msub> :MATH]
   7

   The term λ[BulkR] represents another empirical hyperparameter for
   regularization weight.
   [MATH: <msubsup><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo>′</mo>
   </mrow></msubsup> :MATH]
   represents the reconstructed i-th cell. The bulk loss is defined as the
   disparity between the pseudobulk of the real representative dataset and
   that of the reconstructed representative dataset.

Fine-tune the deep generative learning model to infer the single-cell
measurements for the target samples blue (IN SILICO INFERENCE)

   The in silico inference is achieved by performing reconstruction of the
   representative single-cell sample and then “transforming” it to match
   the pseudobulk (the average expression) of aggregated cells from the
   target sample. After successfully pretraining our deep generative model
   so that it can perform accurate single-cell data reconstruction, the
   subsequent stage of our method leverages this foundation and
   incorporates bulk data information to perform the “transformation”.
   Intuitively, the incorporation is achieved by matching the bulk
   expression calculated from the generated cells and the bulk expression
   of the target sample. However, it will be inaccurate and even
   misleading to “guide” the transformation from the reference single-cell
   to the target single-cell if using the actual bulk measurement to
   calculate the “loss” for the in-silico inference, since the pseudobulk
   of the target single-cell sample is very different from its true bulk
   counterpart (Supplementary Fig. [270]S36). This is because due to the
   difference in sequencing platform and technology, as well as the
   relatively higher level of dropouts and noise, the pseudobulk of the
   target single-cell sample is often quite different from the bulk
   sequencing from the same sample^[271]73, due to the “dropouts” (gene
   not detected)^[272]75,[273]76 in single-cell data. As a result, it will
   be inaccurate and even misleading to “guide” the transformation from
   the reference single-cell to the target single-cell if using the real
   bulk measurement to calculate the “loss” for the in silico inference.
   To overcome this potential issue, here we employed the above strategy
   to estimate pseudobulk data for the representative and the target
   sample and keep the model running on pseudobulk space to steer away
   from modeling the difference between bulk and pseudobulk expression of
   the samples.

   In this in silico inference phase, to perform the bulk data alignment
   described above, we introduce an extra term to the loss function that
   accounts for the difference between the bulk data of the representative
   and that of the target sample. This fine-tuning process refines the
   model’s capabilities, enabling it to perform in silico generation of
   cells from the target sample while maintaining consistency with the
   observed bulk expression patterns. Consequently, our deep generative
   model transitions from being solely focused on reconstruction to
   generating single-cell data for the target sample. In this crucial
   stage, we retire the discriminator from our model architecture, as our
   focus shifts from generating cells that resemble the representatives’
   cells to adapting the model for the target samples. Consequently, the
   training process exclusively involves the generator. The loss function
   for the generator adopted in this stage bears resemblance to that of
   the second pretrain phase, with the distinction of the bulk loss
   transitioning from L[BulkR] to L[BulkT]:
   [MATH: <msub><mrow><mi mathvariant="normal">L</mi></mrow><mrow><mi
   mathvariant="normal">Inference</mi></mrow></msub><mo>=</mo><msub><mrow>
   <mi mathvariant="normal">L</mi></mrow><mrow><mi
   mathvariant="normal">VAE</mi></mrow></msub><mo>+</mo><msub><mrow><mi
   mathvariant="normal">λ</mi></mrow><mrow><mi
   mathvariant="normal">BulkT</mi></mrow></msub><msub><mrow><mi
   mathvariant="normal">L</mi></mrow><mrow><mi
   mathvariant="normal">BulkT</mi></mrow></msub> :MATH]
   8
   [MATH: <msub><mrow><mi mathvariant="normal">L</mi></mrow><mrow><mi
   mathvariant="normal">BulkT</mi></mrow></msub><mo>=</mo><msub><mrow><mfe
   nced close="∥" open="∥"><mrow><mfrac><mrow><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>N</mi></mrow></msubsup><msub><mrow><msup><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mo>′</mo></mrow></msup></mrow><m
   row><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mrow><mi>N</mi></mrow></
   mfrac><mo>−</mo><mfrac><mrow><msubsup><mrow><mo>∑</mo></mrow><mrow><mi>
   i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>N</mi></mrow></msubsup><msub
   ><mrow><mi mathvariant="bold">x</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mrow><mi>N</mi></mrow></
   mfrac><mo>⋅</mo><mi
   mathvariant="bold-italic">α</mi></mrow></mfenced></mrow><mrow><mn>2</mn
   ></mrow></msub> :MATH]
   9
   [MATH: <mi
   mathvariant="bold-italic">α</mi><mo>=</mo><mrow><mo>(</mo><mrow><msub><
   mrow><mi mathvariant="bold">B</mi></mrow><mrow><mi
   mathvariant="bold">T</mi></mrow></msub><mo>+</mo><mi
   mathvariant="bold-italic">ϵ</mi></mrow><mo>)</mo></mrow><mo>⊘</mo><mrow
   ><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">B</mi></mrow><mrow><mi
   mathvariant="bold">R</mi></mrow></msub><mo>+</mo><mi
   mathvariant="bold-italic">ϵ</mi></mrow><mo>)</mo></mrow> :MATH]
   10

   The second term of equation ([274]9),
   [MATH:
   <mfrac><mrow><msubsup><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>=</mo>
   <mn>1</mn></mrow><mrow><mi>N</mi></mrow></msubsup><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mrow><mi>N</mi></mrow></
   mfrac><mo>⋅</mo><mi mathvariant="bold-italic">α</mi> :MATH]
   , functions as an estimation of the pseudobulk value for the target
   sample, derived from the actual bulk data. This term is incorporated to
   reduce the systematic discrepancy between the pseudobulk and the actual
   bulk data corresponding to the same sample. To derive this estimation,
   we first compute the conversion ratio
   [MATH: <mi mathvariant="bold-italic">α</mi> :MATH]
   as the element-wise Hadamard division between the expression level of
   the representative’s genes and the expression level of the target
   sample’s genes using the real bulk RNA-seq data. Here, B[T] and B[R]
   represent the real bulk data of the target and representative samples,
   respectively, while ϵ, whose elements are all set to a small fixed
   scalar value (0.1 by default), is introduced as a pseudocount vector to
   avoid division by zero. We expect that the ratio, which reflects the
   relative gene expression differences between the representative and
   target samples, will remain consistent whether it is derived from
   pseudobulk or true bulk expressions. This consistency underlines the
   robustness of our method in processing expression data across various
   bulk analyses. Following this, we compute the target sample’s
   pseudobulk by multiplying the representative’s pseudobulk from
   single-cell data (already known) with this conversion ratio
   [MATH: <mi mathvariant="bold-italic">α</mi> :MATH]
   , derived from the true bulk RNA-seq data. This process significantly
   boosts the robustness of our method, as the expression ratio between
   samples tends to remain stable, irrespective of the bulk sequencing
   technology and normalization methods applied. In contrast, the absolute
   differences between real bulk RNA-seq and pseudobulk data can vary
   based on their respective sequencing and normalization techniques,
   making it challenging to model them accurately. Our method’s focus on
   relative rather than absolute differences allows for more consistent
   and reliable analyses across varied datasets. Supplementary
   Fig. [275]S36 shows that in contrast to the real bulk data, the
   inferred pseudobulk from our method closely resembles the true
   pseudobulk observed from the profiled target single-cell sample, which
   suggests that our pseudobulk inference method can mitigate the
   technical difference between pseudobulk and real bulk and convert the
   real bulk into the corresponding pseudobulk that can better guide the
   in silico inference of the target single-cell sample. In cases such as
   the COVID-19 and colorectal cancer dataset, where the pseudobulk is
   already employed and no need to acquire it via estimation, the entire
   second term simplifies to the target’s pseudobulk. Furthermore, as the
   bulk loss L[BulkT] inherently propagates more gradient to higher gene
   expression values (given the target bulk loss is in a squared error
   form), we implement a strategy to gradually reduce the gradient
   received by larger values. See details in the “Deep Generative Learning
   Model Training Details” section. Concurrently, we increase the weight
   of the bulk loss in the loss function over the course of training. This
   refined approach ensures that, after adequately training for larger
   values, the model can fine-tune smaller values, resulting in a
   comprehensive and precise single-cell data generation for the target
   samples. To counteract the influence of potentially noisy gene
   expression with minuscule values, we introduced a neuron activation
   threshold. This threshold converts output gene expression values that
   are exceedingly small, specifically those below 0.1% of the input
   library size (a threshold consistent with that used for background
   noise removal in preprocessing), to zero. In all of these stages,
   please note that none of the weights or subnetworks are frozen, except
   in the inference the discriminator network is no longer involved as
   mentioned in the paragraph describing inference. This is because we
   want the model to perform reconstruction and consider the
   reconstruction loss all the time so that the model does not deviate too
   aggressively from the representative’s data distribution. Supplementary
   Fig. [276]S50 shows the cells generated using models trained with
   different training stages, demonstrating the effectiveness and
   necessity of these training stages.

Deep generative learning model architecture

   In this study, our model integrates a Variational Autoencoder (VAE) as
   the generator and a Multi-Layer Perceptron (MLP) as the discriminator.
   We developed both components using PyTorch^[277]100. A concise overview
   of the model architecture is provided below, and for an in-depth
   understanding, we encourage consulting the complete details available
   in our GitHub repository (refer to the code availability section).

   Before sending the cell data into the model, we performed some
   precomputation for the cells, which is part of a Graph Convolutional
   Network (GCN) layer. We aggregated information from the K (15 by
   default) nearest neighboring cells for each input cell. The neighbor
   finding for each cell was based on the ‘neighbors.kneighbors_graph’
   tool in the Scikit-learn package. We preprocessed this message-passing
   step on the log1p transformed cell expression data, subsequently
   inputting the aggregated cell data into the deep learning model. The
   encoder of the VAE then employs linear layers and activation functions,
   leading to a hidden layer comprising 256 neurons. The message passing,
   the first linear layer, and the first activation function in the
   encoder form a GCN layer together. The design of the rest of the VAE
   structure, inclusive of both the encoder and decoder, was adapted from
   SCVI^[278]93. ReLU^[279]101 activation is used for both the encoder and
   decoder. Weight decay^[280]102,[281]103 is used for avoiding
   overfitting. We chose 32 for the latent space dimension and 256 for the
   decoder’s hidden layer size. Our model’s VAE generator also utilizes a
   standard Gaussian prior for the latent space distribution, serving as a
   regularization mechanism to avert over-adaptation of the model
   parameters to the training data. For the discriminator, we constructed
   a 4-layer MLP, with Leaky ReLU activations^[282]104 between linear
   layers. The hidden layers are configured with 256, 128, and 10 neurons,
   respectively. The output layer employs a Sigmoid activation to produce
   the prediction probability of the input cell being real. Weight
   initialization in the model utilizes the Kaiming method^[283]105.

Deep generative learning model training details

   Following the detailed outline of the VAE-GAN model structure, we next
   describe the detailed training settings, which were designed to
   optimize the model’s performance. The training process was structured
   in a three-stage sequence to ensure comprehensive learning and
   fine-tuning of the model parameters.

   Initially, in the first pretraining stage (Pretrain 1), we focus on
   training the deep generative learning model to learn the distribution
   of the data and making it capable of reconstructing the representative
   single-cell data. This pretraining stage can further be split into two
   substages: in Pretrain 1.1, the VAE generator is trained independently
   until convergence (setting 100 epochs as a default value). Then, in
   Pretrain 1.2 the generator and discriminator are trained jointly until
   the generator cannot further reduce its loss. When trained together,
   the generator and the discriminator are trained alternatively for 3
   epochs each, and this will go for 100 iterations by default. During
   this phase, we employed the default SCVI learning rate of 1 × 10^−3.

   The second pretraining stage (Pretrain 2) mirrors the first in its
   basic approach but incorporates notable modifications. It is executed
   in full-batch mode, with an additional representative bulk loss
   integrated into the training process. Here, in Pretrain 2.1, similar to
   Pretrain 1.1, the generator is again trained until convergence (set to
   50 epochs by default), and then trained jointly with the discriminator
   in Pretrain 2.2 until convergence. The default setting is training the
   two networks for 50 iterations, where each network is trained for 3
   epochs in each iteration. To facilitate more refined adjustments during
   this stage, the learning rate is reduced to 1 × 10^−4.

   In the final stage of training, the single-cell inference fine-tuning
   stage, the discriminator is set aside, and training is executed through
   a series of mini-stages. Each mini-stage involves setting the gradients
   for highly expressed genes above specific thresholds (each mini-stage
   corresponds to a different threshold) to zero, a strategy aimed at
   ensuring equitable optimization for genes with smaller expression
   values. These thresholds are hyperparameters estimated relative to the
   peak expression value (
   [MATH: <mi>max</mi><mo>_</mo><mi mathvariant="normal">expr</mi> :MATH]
   ) from the representative’s normalized count matrix, set at No
   threshold,
   [MATH:
   <mfrac><mrow><mn>1</mn></mrow><mrow><mn>2</mn></mrow></mfrac><mi>max</m
   i><mo>_</mo><mi mathvariant="normal">expr</mi> :MATH]
   ,
   [MATH:
   <mfrac><mrow><mn>1</mn></mrow><mrow><mn>4</mn></mrow></mfrac><mi>max</m
   i><mo>_</mo><mi mathvariant="normal">expr</mi> :MATH]
   ,
   [MATH:
   <mfrac><mrow><mn>1</mn></mrow><mrow><mn>8</mn></mrow></mfrac><mi>max</m
   i><mo>_</mo><mi mathvariant="normal">expr</mi> :MATH]
   , and
   [MATH:
   <mfrac><mrow><mn>1</mn></mrow><mrow><mn>8</mn><msqrt><mrow><mn>2</mn></
   mrow></msqrt></mrow></mfrac><mi>max</mi><mo>_</mo><mi
   mathvariant="normal">expr</mi> :MATH]
   (The last threshold should not be too close to 0 to avoid emphasizing
   too much on background noise). Training is continued until convergence
   in each mini-stage, typically not exceeding 150 epochs, with a learning
   rate of 2 × 10^−4. As the thresholds are lowered, some generated gene
   expressions are not involved in the loss computation, so the weight for
   the target bulk loss in the loss function is progressively quadrupled
   (i.e. increased by four times), ensuring a consistent total magnitude
   for this loss component. Upon completion of all mini-stages, the
   trained model is then utilized to generate single-cell data for the
   target sample. The rationale behind this mini-stage strategy is to
   guide the model to equally tune the different levels of gene expression
   values it generates. When no threshold is applied since our bulk loss
   has the form of a square error, larger gradient values are passed to
   larger gene expression values. In other words, the training will focus
   too much on tuning the larger gene expression values and lack attention
   to the smaller ones. Therefore, we apply this mini-stage strategy to
   progressively fine-tune the model, encouraging it to adjust the
   generated gene expressions from the larger ones to the lower ones.
   Supplementary Fig. [284]S50 shows that the cells generated using models
   trained after each mini-stage become progressively closer to the ground
   truth (target sample’s real-profiled data), justifying the
   effectiveness of this mini-stage training strategy.

Incrementally select representatives using active learning to improve
single-cell inference (ACTIVE LEARNING)

   Our strategy to experimentally select representative samples
   prioritized by active learning for sequencing is detailed as follows.
   Our approach to selecting representative samples for single-cell
   sequencing employs an iterative active learning process. Initially, an
   algorithm identifies a set of samples that provide a diverse and
   informative snapshot of the biological system under study, prioritizing
   them for the first round of sequencing (as described in the “The
   initial setup of scSemiProfiler” section in Methods). After sequencing,
   the gathered data is used to update and refine our semi-profiling
   model. If the accuracy of the resulting semi-profiles does not meet our
   standards (the inferred data is different from the sequenced data from
   the previous round), the active learning algorithm is re-engaged. It
   then selects samples as new representatives for subsequent rounds of
   sequencing based on their expected contribution to improving the
   model’s accuracy. This iterative method ensures that the process does
   not exhaust the pool of informative samples in the initial round.
   Instead, it strategically selects a portion of samples at each step,
   preserving ample candidates for future sequencing. Our active learning
   design is meant to make efficient, incremental use of the sample pool,
   improving the model progressively with each round of sequencing.

   Upon performing the initial round of representative selection and
   executing the deep generative learning models for in silico inference,
   every sample in the cohort possesses single-cell data, either
   real-profiled or semi-profiled. While researchers can opt to conclude
   here and proceed with downstream analysis using the single-cell data,
   when budget permits, our active learning module can further refine the
   process. By pinpointing additional informative representatives for
   single-cell sequencing, some target samples will be assigned a more
   similar representative, leading to enhanced single-cell inference
   performance.

   Our initial idea of representative selection strategy is based on the
   belief that selecting the most “uncertain” samples as the
   representatives results in the most inference improvement and yields
   the richest insights for our model. Consequently, we select the sample
   with the largest bulk data difference compared to its representative to
   serve as the new representative:
   [MATH: <msub><mrow><mi>R</mi></mrow><mrow><mi
   mathvariant="normal">new</mi></mrow></msub><mo>=</mo><msub><mrow><mi
   mathvariant="normal">arg</mi><mspace width="0.25em"></mspace><mi
   mathvariant="normal">max</mi></mrow><mrow><mi>i</mi><mo>∈</mo><mi
   mathvariant="normal">Cohort</mi></mrow></msub><msub><mrow><mi>D</mi></m
   row><mrow><mi>b</mi></mrow></msub><mrow><mo>(</mo><mrow><mi>R</mi><mrow
   ><mo>(</mo><mrow><mi>i</mi></mrow><mo>)</mo></mrow><mspace
   width="0.25em"></mspace><mo>,</mo><mi>i</mi></mrow><mo>)</mo></mrow>
   :MATH]
   11

   Where R(i) is the representative of i, and D[b] is the Euclidean
   distance between the PCA-reduced bulk data of two samples. This process
   is executed B (representative batch size) times. Upon choosing new
   representatives, we update the membership for all samples. This ensures
   that, if a sample aligns more closely with a new representative than
   their previous one, their affiliation shifts:
   [MATH:
   <mi>R</mi><mrow><mo>(</mo><mrow><mi>i</mi></mrow><mo>)</mo></mrow><mo>=
   </mo><msub><mrow><mi mathvariant="normal">arg</mi><mspace
   width="0.25em"></mspace><mi
   mathvariant="normal">min</mi></mrow><mrow><mi>r</mi><mo>∈</mo><mi
   mathvariant="normal">Rs</mi></mrow></msub><msub><mrow><mi>D</mi></mrow>
   <mrow><mi>b</mi></mrow></msub><mrow><mo>(</mo><mrow><mi>r</mi><mspace
   width="0.25em"></mspace><mo>,</mo><mi>i</mi></mrow><mo>)</mo></mrow>
   :MATH]
   12

   where R(i) is the representative of the sample i and Rs is the set of
   representatives. Conceptually, this approach is similar to the
   uncertainty sampling^[285]80 algorithm, a subtype of active learning.
   However, this method is passive learning since it does not utilize the
   knowledge acquired by the base learners, our deep generative learning
   models.

   To improve the representative selection, we developed our active
   learning algorithm by merging the information gained by the deep
   generative model base learners into the selection of new
   representatives. To commence, we pinpoint the B sample clusters
   exhibiting the highest in-cluster heterogeneity, which is the aggregate
   distance from each sample to the cluster’s representative. The
   heterogeneity for a cluster c is expressed as:
   [MATH:
   <msub><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>c</mi></mrow
   ></msub><mfenced
   open="["><mrow><msub><mrow><mi>D</mi></mrow><mrow><mi>b</mi></mrow></ms
   ub><mrow><mo>(</mo><mrow><msub><mrow><mi>R</mi></mrow><mrow><mi>c</mi><
   /mrow></msub><mspace
   width="0.25em"></mspace><mo>,</mo><mi>i</mi></mrow><mo>)</mo></mrow><mo
   >+</mo><msub><mrow><mi
   mathvariant="normal">λ</mi></mrow><mrow><mi>s</mi><mi>c</mi></mrow></ms
   ub><msub><mrow><mi>D</mi></mrow><mrow><mi>s</mi><mi>c</mi></mrow></msub
   ><mrow><mo>(</mo><mrow><msub><mrow><mi>R</mi></mrow><mrow><mi>c</mi></m
   row></msub><mo>,</mo><mspace
   width="0.25em"></mspace><mi>i</mi></mrow><mo>)</mo></mrow><mo>+</mo><ms
   ub><mrow><mi
   mathvariant="normal">λ</mi></mrow><mrow><mi>p</mi><mi>b</mi></mrow></ms
   ub><msub><mrow><mi>D</mi></mrow><mrow><mi>p</mi><mi>b</mi></mrow></msub
   ><mrow><mo>(</mo><mrow><msub><mrow><mi>R</mi></mrow><mrow><mi>c</mi></m
   row></msub><mo>,</mo><mspace
   width="0.25em"></mspace><mi>i</mi></mrow><mo>)</mo></mrow></mrow></mfen
   ced> :MATH]
   , where i can be any sample in the cluster and R[c] is the
   representative. Essentially, a sample’s heterogeneity is the combined
   distance-across three distance metrics-between the sample and its
   representative. The total cluster heterogeneity is derived from the
   summation of all sample heterogeneity. D[b] denotes the Euclidean
   distance between the PCA-reduced bulk data of two samples. The term
   D[sc](R[c], i) represents the difficulty of transforming the
   representative single-cell data to that of a target. This is quantified
   by averaging the Euclidean distances between the representative cells
   and their K nearest neighbors in the target sample, all within a
   PCA-reduced space. Formally:
   [MATH:
   <msub><mrow><mi>D</mi></mrow><mrow><mi>s</mi><mi>c</mi></mrow></msub><m
   row><mo>(</mo><mrow><mi>a</mi><mo>,</mo><mi>b</mi></mrow><mo>)</mo></mr
   ow><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><msub><mrow><mi>N</mi>
   </mrow><mrow><mi>a</mi></mrow></msub></mrow></mfrac><msubsup><mrow><mo>
   ∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><msub><mr
   ow><mi>N</mi></mrow><mrow><mi>a</mi></mrow></msub></mrow></msubsup><msu
   b><mrow><mo mathsize="big">∑</mo></mrow><mrow><mi>j</mi><mo>∈</mo><mi
   mathvariant="normal">KNN</mi><mrow><mo>(</mo><mrow><mi>i</mi></mrow><mo
   >)</mo></mrow></mrow></msub><mi
   mathvariant="normal">ED</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">v</mi></mrow><mrow><mi
   mathvariant="bold">a,i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">v</mi></mrow><mrow><mi
   mathvariant="bold">b,j</mi></mrow></msub></mrow><mo>)</mo></mrow>
   :MATH]
   13

   with a, b being two samples in the semi-profiled cohort, v[a], v[b]
   being PCA reduced single-cell data matrices for the two samples,
   ED(x, y) computing the Euclidean distance between x and y, and KNN(i)
   identifying the cell’s K (1 by default) nearest neighbor in another
   sample’s data matrix. Subscript represents rows, e.g. v[a,i] is the ith
   row (ith cell) of the PCA reduced data of sample a. D[pb] is the
   Euclidean distance of the log1p transformed pseudobulk data of two
   samples, which is positively related to the amount of bulk loss and
   thus quantifies the deep learning model’s difficulty in performing the
   in silico inference. Finally, λ[sc] and λ[pb] are empirical scaling
   factors (both are 1 by default).

   The most heterogeneous cluster C[H] is chosen as:
   [MATH:
   <msub><mrow><mi>C</mi></mrow><mrow><mi>H</mi></mrow></msub><mo>=</mo><m
   sub><mrow><mi>arg</mi><mi>max</mi></mrow><mrow><mi>c</mi><mo>∈</mo><mi
   mathvariant="normal">clusters</mi></mrow></msub><munder><mrow><mo
   mathsize="big">
   ∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>c</mi></mrow></munder><mrow>
   <mo>[</mo><mrow><msub><mrow><mi>D</mi></mrow><mrow><mi>b</mi></mrow></m
   sub><mrow><mo>(</mo><mrow><msub><mrow><mi>R</mi></mrow><mrow><mi>c</mi>
   </mrow></msub><mo>,</mo><mi>i</mi></mrow><mo>)</mo></mrow><msub><mrow><
   mo>+</mo><mi
   mathvariant="normal">λ</mi></mrow><mrow><mi>s</mi><mi>c</mi></mrow></ms
   ub><msub><mrow><mi>D</mi></mrow><mrow><mi>s</mi><mi>c</mi></mrow></msub
   ><mrow><mo>(</mo><mrow><msub><mrow><mi>R</mi></mrow><mrow><mi>c</mi></m
   row></msub><mo>,</mo><mspace
   width="0.25em"></mspace><mi>i</mi></mrow><mo>)</mo></mrow><msub><mrow><
   mo>+</mo><mi
   mathvariant="normal">λ</mi></mrow><mrow><mi>p</mi><mi>b</mi></mrow></ms
   ub><msub><mrow><mi>D</mi></mrow><mrow><mi>p</mi><mi>b</mi></mrow></msub
   ><mrow><mo>(</mo><mrow><msub><mrow><mi>R</mi></mrow><mrow><mi>c</mi></m
   row></msub><mo>,</mo><mspace
   width="0.25em"></mspace><mi>i</mi></mrow><mo>)</mo></mrow></mrow><mo>]<
   /mo></mrow> :MATH]
   14

   After selecting the B most heterogeneous clusters to split, the
   subsequent step involves selecting a new representative for each of
   these B clusters to minimize total in-cluster heterogeneity. Within
   each cluster, every non-representative sample is a potential new
   representative, each representing a different way of splitting the
   cluster. In the cluster, samples that are closer in bulk distance D[b]
   to the potential new representative than to the original representative
   can be reassigned to the new representative, splitting the cluster into
   two. The total heterogeneity of these two new clusters is then
   calculated based on the bulk distance. For each of the B most
   heterogeneous clusters, we select the sample whose corresponding split
   results in the minimal total in-cluster heterogeneity to be the new
   representative R[new], as shown in the equation below.
   [MATH: <msub><mrow><mi>R</mi></mrow><mrow><mi
   mathvariant="normal">new</mi></mrow></msub><mo>=</mo><msub><mrow><mi>ar
   g</mi><mi>min</mi></mrow><mrow><mi>j</mi><mo>∈</mo><msub><mrow><mi>C</m
   i></mrow><mrow><mi>H</mi></mrow></msub></mrow></msub><munder><mrow><mo
   mathsize="big">
   ∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><msub><mrow><mi>C</mi></mrow><mr
   ow><mi>H</mi></mrow></msub></mrow></munder><mi>min</mi><mrow><mo>(</mo>
   <mrow><msub><mrow><mi>D</mi></mrow><mrow><mi>b</mi></mrow></msub><mrow>
   <mo>(</mo><mrow><mi>j</mi><mo>,</mo><mspace
   width="0.25em"></mspace><mi>i</mi></mrow><mo>)</mo></mrow><mo>,</mo><ms
   pace
   width="0.25em"></mspace><msub><mrow><mi>D</mi></mrow><mrow><mi>b</mi></
   mrow></msub><mrow><mo>(</mo><mrow><msub><mrow><mi>R</mi></mrow><mrow><m
   i>c</mi></mrow></msub><mo>,</mo><mi>i</mi></mrow><mo>)</mo></mrow></mro
   w><mo>)</mo></mrow> :MATH]
   15

   Where R[c] denotes the original representative of this cluster.
   Finally, the cluster membership updating will be executed to make sure
   each sample i is assigned to the closest representative R(i).
   [MATH:
   <mi>R</mi><mrow><mo>(</mo><mrow><mi>i</mi></mrow><mo>)</mo></mrow><mo>=
   </mo><msub><mrow><mi>arg</mi><mi>min</mi></mrow><mrow><mi>r</mi><mo>∈</
   mo><mi
   mathvariant="normal">Rs</mi></mrow></msub><msub><mrow><mi>D</mi></mrow>
   <mrow><mi>b</mi></mrow></msub><mrow><mo>(</mo><mrow><mi>r</mi><mo>,</mo
   ><mspace width="0.25em"></mspace><mi>i</mi></mrow><mo>)</mo></mrow>
   :MATH]
   16

   With the identification of new representatives, they are subjected to
   real single-cell data profiling and appended to the existing
   collection. In subsequent semi-profiling iterations, certain target
   samples will be assigned with more analogous representatives, enhancing
   in silico single-cell inference accuracy. This iterative procedure
   persists until either the budget is exhausted or a sufficient number of
   representatives are ascertained, ensuring optimal semi-profiling
   results.

Semi-profiling pipeline stop criteria

   For our semi-profiling pipeline (Supplementary Fig. [286]S51) that
   iteratively selects representatives and performs in silico single-cell
   inference, we recommend two types of stop criteria for the users to
   choose according to their own needs. First, in the case that the user
   knows their budget and aims to achieve the best possible semi-profiling
   performance, they can run the pipeline and keep choosing
   representatives until they run out of budget. Second, in the case that
   the user aims to achieve an acceptable semi-profiling performance with
   the least amount of budget, the iterative cycle persists until an
   acceptable performance is reached. The performance is measured by
   comparing the representatives’ actual single-cell data and the inferred
   data for them in previous iterations. Based on the user’s needs, we
   recommend three levels of criteria. First, when the user is mainly
   studying the cell type proportion, we recommend stopping when the
   Pearson correlation between the ground truth and the estimated
   proportion reaches 0.85. Second, when the user’s study includes more
   detailed analysis, such as biomarker discovery, we recommend stopping
   when the Pearson correlations between the biomarker expression patterns
   (dot colors and sizes in the dot plots) discovered in real-profiled
   data and those in semi-profiled data reach 0.85. Finally, for studies
   requiring more intricate single-cell level analysis, it is advised to
   halt when the Pearson correlation between data matrices, generated from
   advanced analysis tasks like trajectory inference and cell-cell
   interaction, for semi-profiled and real-profiled data, reaches 0.85.
   These guidelines are merely suggestions, and users have the flexibility
   to modify the stopping criteria to suit their specific requirements.
   The effectiveness of the semi-profiling process can be consistently
   evaluated by comparing the semi-profiled single-cell data (from the
   previous round) with the actual single-cell profiling (in the
   subsequent round). If the actual single-cell profiling aligns with the
   semi-profiling generated by our model from the previous round, this
   indicates a robust semi-profiling model has been effectively learned,
   and the iterative process can thus be concluded.

Global mode representative selection

   Multiple rounds of representative selection could be challenging for
   some labs and could potentially lead to batch effects. To accommodate
   users who prefer all-in-one-batch representative profiling,
   scSemiProfiler offers a “global mode”. This mode enables users to
   assess data heterogeneity using Silhouette scores and bulk data
   visualization, helping to determine the approximate number of
   representative clusters required for their dataset. Once the number of
   clusters is established, users can select these representatives based
   on bulk data clustering and sequence all representatives in one batch
   to minimize potential batch effects. When using scSemiProfiler in
   real-world scenarios, users are advised to choose between the two
   modes—global and active learning-based on their specific needs. If
   budget conservation is paramount, then multiple-round selection using
   active learning is recommended. However, if the goal is to minimize
   batch effects and maximize convenience, then the global mode is more
   suitable. Even when the global mode is preferred, active learning
   remains crucial for evaluating the quality of in silico inference
   (semi-profiling). Unlike the active learning mode, where in silico
   inference from the previous round can be compared with actual
   single-cell profiling in the next round to assess semi-profiling
   accuracy, the accuracy of in silico inference in global mode cannot be
   directly examined. To overcome this limitation, a portion of the
   samples selected in global mode (e.g., N = 28 split into 24 training +
   4 test samples) can be used. This method involves using part of the
   samples (e.g., 24 samples) to train the model and performing in silico
   inference on the remaining samples (e.g., 4 samples). By comparing the
   in silico inferred single-cell data with the true single-cell
   measurements, the semi-profiling accuracy using N samples can be
   assessed. If the results are not satisfactory, the active learning
   module can then be used to select the next batch of representatives for
   improved semi-profiling in subsequent rounds. Therefore, a combination
   of the two modes can be employed in practical applications.

Semi-profiling performance evaluation

   Cell type annotation for semi-profiled data: Utilizing the annotated
   single-cell data from the representatives, we trained an MLP classifier
   available in the Scikit-learn package^[287]86. This classifier was then
   employed to categorize the cell types for cells in the semi-profiled
   data. In addition, beyond the supervised cell type annotation,
   scSemiProfer is capable of performing de novo cell type labeling for
   the generated cells according to the cell type marker expression using
   existing tools, such as Cellar^[288]52, CellMarker^[289]53,[290]54, or
   Panglao DB^[291]55.

   Dimensionality reduction: In order to juxtapose the real-profiled and
   semi-profiled cohorts, we merged them and executed Principal Component
   Analysis (PCA) to reduce it to 100 principal components. This
   PCA-processed data serves as the foundation for computing single-cell
   level difference metrics. For visualization purposes, we first used
   SCANPY^[292]106 to compute the neighbor graph using the PCA-reduced
   data, with the local neighborhood size parameter set to 50.
   Subsequently, the Uniform Manifold Approximation and Projection (UMAP)
   functionality within SCANPY was used for the visualization process.

   Semi-profiling error quantification: To assess the accuracy of the
   semi-profiled cohort, we performed PCA to reduce its dimensionality, in
   conjunction with the real-profiled data, to 100 dimensions. For each
   sample within the semi-profiled dataset, we calculated a single-cell
   difference metric in comparison to the real-profiled single-cell data
   corresponding to that sample. This involves determining the Euclidean
   distance of each cell to its K-nearest neighbors in another dataset
   within the PCA-reduced space, followed by averaging these distances
   across all cells. The formula below outlines the difference computation
   between single-cell matrices m[a] and m[b].
   [MATH: <msub><mrow><mi mathvariant="normal">Dif</mi></mrow><mrow><mi
   mathvariant="normal">sc</mi></mrow></msub><mrow><mo>(</mo><mrow><msub><
   mrow><mi mathvariant="bold">m</mi></mrow><mrow><mi
   mathvariant="bold">a</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">m</mi></mrow><mrow><mi
   mathvariant="bold">b</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>=</m
   o><mfrac><mrow><mn>1</mn></mrow><mrow><msub><mrow><mi>N</mi></mrow><mro
   w><mi>a</mi></mrow></msub><mo>+</mo><msub><mrow><mi>N</mi></mrow><mrow>
   <mi>b</mi></mrow></msub></mrow></mfrac><mfenced close=")"
   open="("><mrow><munderover accent="false" accentunder="false"><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><msub><mrow><mi>N</mi></mrow><mrow><mi>a</mi></mrow></msub></mrow
   ></munderover><munder><mrow><mo mathsize="big">
   ∑</mo></mrow><mrow><mi>j</mi><mo>∈</mo><mi
   mathvariant="normal">KNN</mi><mrow><mo>(</mo><mrow><mi>i</mi></mrow><mo
   >)</mo></mrow></mrow></munder><mi
   mathvariant="normal">ED</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">v</mi></mrow><mrow><mi
   mathvariant="bold">a,i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">v</mi></mrow><mrow><mi
   mathvariant="bold">b,j</mi></mrow></msub></mrow><mo>)</mo></mrow><mo>+<
   /mo><munderover accent="false" accentunder="false"><mrow><mo
   mathsize="big">
   ∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><msub><mr
   ow><mi>N</mi></mrow><mrow><mi>b</mi></mrow></msub></mrow></munderover><
   munder><mrow><mo mathsize="big">
   ∑</mo></mrow><mrow><mi>j</mi><mo>∈</mo><mi
   mathvariant="normal">KNN</mi><mrow><mo>(</mo><mrow><mi>i</mi></mrow><mo
   >)</mo></mrow></mrow></munder><mi
   mathvariant="normal">ED</mi><mrow><mo>(</mo><mrow><msub><mrow><mi
   mathvariant="bold">v</mi></mrow><mrow><mi
   mathvariant="bold">b,i</mi></mrow></msub><mo>,</mo><msub><mrow><mi
   mathvariant="bold">v</mi></mrow><mrow><mi
   mathvariant="bold">a,j</mi></mrow></msub></mrow><mo>)</mo></mrow></mrow
   ></mfenced> :MATH]
   17

   Where Dif[sc] denotes single-cell level difference. N[a] and N[b]
   signify the cell counts in matrices m[a] and m[b] respectively. The vs
   represents the PCA-reduced matrices. The terms v[a,i], v[b,j] represent
   the ith and jth cell (rows) in the PCA-reduced matrices of sample a and
   b respectively. The function KNN identifies the K nearest neighbors
   within the alternate sample’s PCA-reduced data matrix. We chose K = 1
   for our experiments. ED computes the Euclidean distance between two
   vectors.

   Referring to the comprehensive performance curves depicted in (Figs.
   [293]2f, [294]4f, [295]6f, the normalized error E[N] displayed on the
   left y-axis represents the mean single-cell difference between each
   sample’s real-profiled version and semi-profiled version. This error is
   normalized using a theoretical upper bound (UB) and a lower bound (LB)
   of an expected single-cell difference between samples in the study.
   [MATH:
   <msub><mrow><mi>E</mi></mrow><mrow><mi>N</mi></mrow></msub><mo>=</mo><m
   frac><mrow><mn>1</mn></mrow><mrow><mi
   mathvariant="normal">UB</mi><mo>−</mo><mi
   mathvariant="normal">LB</mi></mrow></mfrac><mfenced close=")"
   open="("><mrow><mfrac><mrow><mn>1</mn></mrow><mrow><mi>P</mi></mrow></m
   frac><munderover accent="false" accentunder="false"><mrow><mo
   mathsize="big">
   ∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>P</mi
   ></mrow></munderover><msub><mrow><mi
   mathvariant="normal">Dif</mi></mrow><mrow><mi
   mathvariant="normal">sc</mi></mrow></msub><mrow><mo>(</mo><mrow><msub><
   mrow><mi mathvariant="bold">m</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub><mo>,</mo><mover
   accent="true"><mrow><msub><mrow><mi
   mathvariant="bold">m</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mrow><mo>~</mo></mrow></
   mover></mrow><mo>)</mo></mrow><mo>−</mo><mi
   mathvariant="normal">LB</mi></mrow></mfenced> :MATH]
   18

   and if i itself is a representative, the single-cell difference equals
   the lower bound.
   [MATH: <msub><mrow><mi mathvariant="normal">Dif</mi></mrow><mrow><mi
   mathvariant="normal">sc</mi></mrow></msub><mrow><mo>(</mo><mrow><msub><
   mrow><mi mathvariant="bold">m</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub><mo>,</mo><mover
   accent="true"><mrow><msub><mrow><mi
   mathvariant="bold">m</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mrow><mo>~</mo></mrow></
   mover></mrow><mo>)</mo></mrow><mo>=</mo><mi
   mathvariant="normal">LB</mi><mo>,</mo><mspace
   width="1.0em"></mspace><mi mathvariant="normal">if</mi><mspace
   width="0.25em"></mspace><mi>i</mi><mo>∈</mo><mi
   mathvariant="normal">Rs</mi> :MATH]

   where P stands for the total count of samples within the cohort and
   [MATH: <mover accent="true"><mrow><msub><mrow><mi
   mathvariant="bold">m</mi></mrow><mrow><mi
   mathvariant="bold">i</mi></mrow></msub></mrow><mrow><mo>~</mo></mrow></
   mover> :MATH]
   refers to the semi-profiled version of m[i]. Rs denotes the set of
   representatives. The upper bound UB was derived by averaging the
   single-cell difference of randomly paired samples based on their
   PCA-reduced real-profiled data. This represents the performance of the
   worst possible representative selection, which is random selection. For
   computing the lower bound LB, we first divided the comprehensive PCA
   matrix, which contains all single-cell samples, into two random halves.
   Due to the randomness of the split, these two halves have the same data
   probabilistic distribution. Furthermore, given the large sample size,
   these halves are expected to have approximately the same actual cell
   distribution and thus can be regarded as two replicates of the same
   study. The difference between the replicates should be regarded as the
   lower bound, as it represents the best performance that an optimal
   selection can achieve^[296]107.

   Deconvolution benchmarking: We benchmarked the accuracy of our
   deconvolution method by comparing its performance against established
   methods: CIBERSORTx, Bisque, TAPE, Scaden, MuSiC, NNLS, and EPIC. We
   ran CIBERSORTx using the official web portal. Bisque, TAPE, MuSiC, and
   EPIC were installed according to the official GitHub repositories
   provided in the corresponding publications. We used the NNLS included
   in the MuSiC package. We used the PyTorch version of Scaden implemented
   in TAPE’s GitHub repository for its testing. First, we annotated our
   cells generated by the deep learning model using the MLP classifier,
   which was trained on the representatives’ annotated data, as discussed
   previously. Using this annotated data, we computed the cell type
   proportion. Subsequently, we employed the CCC, Pearson correlation
   coefficient (calculated using the Python package SciPy^[297]108) and
   RMSE to assess the consistency with the ground truth. However, when
   applying the aforementioned methods to the three datasets, we
   encountered both time and space complexity constraints for many
   benchmarked methods. Specifically, CIBERSORTx and Bisque were unable to
   utilize as many representative samples for single-cell references as