Graphical abstract

   graphic file with name fx1.jpg
   [29]Open in a new tab

Highlights

     * •
       Experimental constraints hinder cell-level measurement of molecules
       such as proteins
     * •
       DESP is an algorithm that maps bulk omics data to the underlying
       cell states
     * •
       DESP overcomes technical resolution limitations to enable
       cell-state-level proteomics

Motivation

   The dynamic expression of molecules drives phenotype remodeling
   critical to normal tissue development and pathobiology, yet due to
   technical challenges, these systems remain largely unexplored at the
   level of the underlying cell states. We present a bioinformatics
   algorithm that leverages independent readouts of cellular proportions
   using established techniques, such as single-cell RNA-seq, to resolve
   cell-state-level differences from bulk omics data, such as quantitative
   proteomics. Our algorithm provides a simple but effective computational
   solution to overcome experimental challenges associated with
   single-cell measurements of molecular data such as proteomics and
   metabolomics, enabling researchers to contextualize observed changes in
   global molecular measurements based on the underlying constituent cell
   states driving those changes.
     __________________________________________________________________

   Youssef et al. present DESP, an algorithm addressing the limited
   exploration of quantitative proteomics data at the cell-state level due
   to experimental constraints. DESP accurately deciphers cell-state
   contributions to parallel quantitative proteomics, providing a
   generalizable computational framework for studying cell-state-level
   proteomes within conventional bulk-level workflows.

Introduction

   The proteome occupies a crucial position in the functional landscape of
   dynamic cell states, as changes in protein expression underly
   biochemical processes driving phenotypic responses and altered cellular
   functions. In contrast to rapid advances in single-cell nucleic acid
   profiling, experimental challenges limit the vast majority of existing
   proteome studies to bulk measurements of mixed cell populations, mixing
   together protein abundance contributions from the individual
   constituent cell states and obscuring the native cellular contexts of
   the proteome[30]^1^,[31]^2^,[32]^3 ([33]Figure 1A). Going beyond this
   global birds-eye view generated by standard proteomics datasets to
   resolve the foundational cell-level proteomes is crucial to improve
   mechanistic understanding of dynamic cell states with the potential to
   enhance the prioritization of susceptible cell subpopulations for
   targeted therapeutics. While the ideal scenario is to directly quantify
   cellular proteomes, current state-of-the-art single-cell proteomics
   technologies face technical challenges including the difficulty of
   capturing individual cells from mixtures while maintaining proteome
   integrity, the small amount of extracted protein, and the infeasibility
   of efficiently scaling up mass spectrometry runs comparable to
   sequencing methods.[34]^4^,[35]^5 Emerging technologies aim to address
   these challenges, but significant limitations remain, including the
   extent of proteome coverage and the inability to quantify functional
   information such as macromolecular interactions and post-translational
   modifications.[36]^4^,[37]^5

Figure 1.

   [38]Figure 1
   [39]Open in a new tab

   Overview of DESP algorithm

   (A) Motivation and rationale behind DESP. Standard omic techniques such
   as mass spectrometry-based quantitative proteomics are often limited to
   bulk measurements of mixed cell populations and fail to resolve sample
   heterogeneity at the cell-state level.

   (B) Analysis workflow. DESP takes as an input paired bulk omics data,
   e.g., proteomics, and matching single-cell data and predicts the
   contributions of the underlying cell states to the observed bulk
   measurements.

   (C) Mathematical model. DESP models the observed bulk profiles, e.g.,
   proteomics measurements, as the sum of unknown underlying cell-state
   profiles. DESP applies several constraints to this formulation to
   predict the unknown cell-state profiles.

   (D) Output. Schematic of the DESP output showcasing its ability to
   demix and associate bulk sample measurements to the underlying
   contributing cell states for any given gene/protein.

   Advances in single-cell molecular profiling, most notably single-cell
   RNA sequencing (scRNA-seq), offer a powerful approach for the
   identification and characterization of heterogeneous cell states in
   dynamic biological processes, such as cell differentiation and tissue
   development, leading to breakthroughs in emerging domains of biomedical
   research.[40]^6^,[41]^7 Related computational methods can define the
   quantitative transcriptomic profiles underlying these discrete cell
   states.[42]^8 However, these transcriptomic profiles have not proven to
   be accurate predictors of the corresponding protein expression patterns
   due to multiple biological and technical factors that lead to poor
   correlation between the cognate mRNA and protein levels of genes,
   preventing accurate inference of protein abundance using gene
   expression values such as those derived from
   scRNA-seq.[43]^9^,[44]^10^,[45]^11 This represents an important gap in
   our understanding of the specialized molecular programs that define
   cell states in heterogeneous biological contexts.

   As a generalizable computational solution to this challenge, we present
   a novel demixing algorithm titled DESP (Demixing Cell State Profiles in
   Bulk Omics) that leverages information on cell-state proportions
   derived from single-cell techniques to deconvolute standard bulk
   proteomics measurements of heterogeneous cell mixtures by
   distinguishing the underlying contributions of the involved cell
   states. DESP’s mathematical model is designed to circumvent the poor
   mRNA-protein correlation, which represents a major challenge for
   multi-omics integration efforts, by leveraging the ability of
   single-cell readouts, such as scRNA-seq or fluorescence-activated cell
   sorting (FACS), to identify the proportions of distinct cell states
   within dynamic heterogeneous samples for which bulk proteomics
   measurements were generated. DESP’s model combines these observed
   changes in cell-state proportions with the corresponding variations in
   bulk-level protein abundance to estimate the relative protein abundance
   levels associated with each cell state. While we demonstrate DESP’s
   utility using proteomics data in this paper, it can be applied to demix
   any type of bulk omics data, such as metabolomics or phopshoproteomics.

   We applied DESP to a multi-omics time course dataset investigating the
   molecular changes of pre-cancerous cells undergoing transforming growth
   factor (TGF)-beta-induced epithelial-to-mesenchymal transition (EMT)
   in vitro.[46]^12^,[47]^13 EMT is a complex biological process by which
   epithelial cells lose their adhesion and gradually differentiate into
   mobile mesenchymal cells and is of particular importance to the
   biomedical community due to its integral role in wound healing, tissue
   development, and cancer progression.[48]^12 We analyzed bulk
   quantitative proteomics and scRNA-seq data generated at eight
   consecutive time points representing MCF10A mammary epithelial cells
   undergoing EMT,[49]^13 and we show that DESP is able to identify the
   corresponding cellular proteomes and recover expected patterns of
   canonical EMT markers. Additionally, DESP goes further by identifying
   the proteomic profile of transient cell states key to the
   differentiation process that were not detected at the transcriptomics
   level and whose signal is significantly diluted in the bulk proteomics
   due to the confounding presence of other cell states in the mixture.
   These results demonstrate a compelling use case of DESP while also
   shedding light on the transient nature of intermediate molecular
   mechanisms driving EMT.

Results

DESP algorithm overview

   DESP is a tool for deconvoluting bulk omics data to the cell-state
   level by estimating the contributions of the underlying cell states to
   the observed bulk measurements. [50]Figure 1B shows an overview of DESP
   as applied to proteomics data as an example. DESP combines bulk
   proteomics data with single-cell data, such as scRNA-seq, acquired for
   a set of biological samples representing different physiological
   contexts. Examples of potential applications for DESP include time
   course experiments, cross-tissue omics atlases, or studies comparing
   disease subtypes. The single-cell data is used solely to delineate the
   proportions of the cell states that make up the samples under study
   and, optionally, their quantitative similarities to each other. The
   resultant cell-state composition matrix is fed into DESP alongside
   corresponding bulk protein measurements ([51]Figure 1C), where the
   objective is to find the most likely possible combination of
   cell-state-level protein abundances whose sum would lead to the
   observed bulk measurements. The output of DESP is a matrix containing
   the predicted protein levels for each of the pre-defined cell states,
   effectively demixing the measured bulk proteomics matrix into the
   underlying cell-state proteomes ([52]Figure 1D).

   The premise of DESP is that the bulk proteome can be modeled as the sum
   of the underlying single-cell proteomes. Near-complete sequencing
   coverage of current single-cell technologies like scRNA-seq make it
   routinely feasible to obtain the identity and proportions of cell
   states in given biological samples,[53]^6^,[54]^7 and advances in mass
   spectrometry make acquiring comprehensive measurements of bulk
   proteomes similarly feasible.[55]^6^,[56]^7 We thus sought to integrate
   these readily available and comprehensive data types within a
   computational framework that could predict the experimentally
   challenging single-cell proteomes. DESP achieves this goal by applying
   the bulk-as-sum-of-single-cells principle through modeling the measured
   global proteome as the product of the observed cell-state proportions
   and the unknown corresponding cell-state proteomes, providing a simple
   but effective formulation that can be solved mathematically to estimate
   the cell-state proteomes ([57]Figure 1C).

   The DESP model solves a typically under-determined problem where the
   number of samples is less than the number of cell states. For example,
   an experiment that surveys a handful of cancer subtypes will likely
   contain dozens of cell states involved, as such giving many possible
   cell-level proteomes that could lead to the observed bulk proteome. We
   thus apply a set of both biologically and mathematically motivated
   constraints to guide the algorithm toward the optimal solution in a
   deterministic manner ([58]Figure 1C). One notable constraint leverages
   correlations between cell-state profiles in the input single-cell data
   to maintain the global cell-state structure within DESP’s predictions
   by preserving the original state-state similarities. Notably, DESP’s
   robustness to input parameters allows it to efficiently make
   predictions without requiring parameter tuning ([59]Figure S1). A
   detailed description of the mathematical model can be found in the
   [60]STAR Methods section.

   DESP does not aim to define cell states or their proportions from omics
   data, but rather it predicts the proteome of pre-defined cell states of
   interest based on the bulk proteomics data. The utility of DESP thus
   lies in the presence of established routine workflows for identifying
   the global proteome and cell-state composition of heterogeneous
   samples, effectively bypassing the lack of methods for deep measurement
   of individual cell proteomes. It is worth noting that DESP is agnostic
   of the technique used to identify cell-state proportions and thus does
   not directly use the single-cell measurements to infer the protein
   levels of the given cell states, allowing DESP to avoid issues
   regarding poor correlations between different data types such as
   RNA-seq and proteomics.

Multi-omics profiling of EMT

   We applied DESP to a time course study examining pre-cancerous cells
   undergoing TGF-beta-induced EMT.[61]^13 Bulk quantitative proteomics
   and scRNA-seq data for over 5,600 genes were generated in parallel for
   eight consecutive time points as MCF10A human mammary epithelial cells
   underwent the EMT differentiation process ([62]Figure 2A). Analysis of
   the scRNA-seq data using the Seurat tool[63]^14 revealed ten discrete
   cell states among the 1,913 captured cells, with intermediate states
   emerging from the predominantly epithelial cultures before transforming
   into the mesenchymal end state over the time course ([64]Figures 2C and
   2D). The availability of both bulk protein measurements and single-cell
   proportion information for the samples along with the dynamic nature of
   the samples’ cellular compositions render it an excellent testbed for
   DESP ([65]Figure 2F). This dataset was thus used for both algorithm
   validation and as a case study of a real-life application of DESP
   ([66]Figures 2G and 2H).

Figure 2.

   [67]Figure 2
   [68]Open in a new tab

   Case study: Epithelial-to-mesenchymal transition

   (A) Case study: Epithelial-to-mesenchymal transition (EMT) dynamic
   multi-omics profiling. In this experiment, bulk proteomics and
   scRNA-seq data were generated for human MCF10A cells undergoing
   TGF-beta-induced EMT across eight consecutive time points, generating
   molecular profiles of transient intermediate states.

   (B) Dynamic scRNA-seq profiles. Heatmap visualizes the scaled relative
   expression of genes (rows) across cells (columns).

   (C) Identification of intermediate cell states occurring during EMT.
   Clustering the scRNA-seq data reveals ten intermediate cell states as
   visualized on UMAP plot.

   (D) Cell-state fluctuations during EMT. Bar plot visualizing cell-state
   composition at each time point.

   (E) Mass spectrometry-based quantitative proteomics profiles. Heatmap
   visualizes scaled TMT reporter ion intensity, a proxy for relative
   abundance, of proteins (rows) across time points (columns).

   (F) Applying DESP to EMT bulk proteomic profiles and scRNA-derived
   cell-state composition matrix to predict cell-state-level protein
   profiles.

   (G) Prediction of intermediate cell-state profiles using DESP. Heatmap
   visualizes the predicted relative protein abundances for each of the
   cell states. One putative marker of a transient cell state is
   highlighted as an example finding from these predictions.

   (H) Identifying contribution of individual cell states to time point
   bulk proteomics measurements. Schematic shows DESP demixing of a given
   protein to distinct underlying cell states, highlighting a transient
   cell-state signal that is lost by examining bulk measurements alone due
   to the presence of confounding signals from multiple cell states within
   each time point.

Validation using “pseudobulk” RNA data

   To validate the algorithm and assess performance, we investigated
   DESP’s ability to recover the RNA profiles of individual EMT cell
   states from corresponding bulk RNA profiles. We utilized the scRNA-seq
   data, and not the proteomics, for the mathematical validation, as they
   represent the “true” experimentally derived single-cell profiles to
   compare against DESP’s bulk-inferred predictions. Since DESP’s
   predictions are made at the resolution of cell states rather than
   single cells, we constructed representative cell-state profiles from
   the scRNA data by averaging the expression of each gene in each cell
   state’s cells to represent the ground truth.[69]^15 [70]Figure 3A shows
   a schematic illustrating the validation strategy, where a pseudobulk
   RNA matrix is first created by summing the scRNA measurements within
   each of the eight time points. We then used DESP to demix these
   pseudobulk profiles to the level of the ten cell states present in this
   dataset, as guided by the scRNA-derived changes in cell-state
   proportions across the time points ([71]STAR Methods). These predicted
   cell-state profiles were then compared to the ground-truth profiles.

Figure 3.

   [72]Figure 3
   [73]Open in a new tab

   Mathematical validation of DESP using pseudobulk RNA data

   (A) Validation strategy. Pseudobulk profiles, generated by summing the
   scRNA profiles in each time point, were demixed by DESP guided by the
   cell-state compositions. The predicted DESP profiles were compared to
   the averaged per-cluster gene expression values in the input scRNA
   data.

   (B) UMAP showing overlap of the real and predicted cell-state profiles,
   labeled as “centroids,” within the global structure of the scRNA-seq
   data.

   (C) Bar plots showing the average Euclidean distance between each cell
   state’s constituent single-cell measurements to their real centroid
   compared to that of the predicted centroid.

   (D) Mapping predicted centroids to the closest real centroid based on
   solving the linear sum assignment problem to minimize total Euclidean
   distances.

   (E) Relative per-gene prediction error when varying the number of cell
   clusters input to DESP, highlighting the chosen 10 cluster scenario.
   Standard error bars are not visible due to small error range.

   (F) Relative per-gene prediction error when systematically introducing
   noise to the input cell cluster proportions. The noise is introduced as
   relative deviations from the true cell proportions matrix.

   Several metrics were used to compare the real and predicted cell-state
   profiles, henceforth referred to as “centroids,” to determine whether
   DESP’s predictions were reasonable approximations of the ground truth.
   To visualize the similarity of the predicted and real centroids within
   the global structure of the single-cell data, we created a uniform
   manifold approximation and projection (UMAP) projection of the scRNA
   data with the centroids overlayed, showing strong overlap between the
   real and predicted centroids ([74]Figure 3B). We also quantitively
   assessed DESP’s ability to recover the true centroids by computing the
   average Euclidean distance between each state’s constituent single-cell
   measurements to their real centroid compared to that of the predicted
   centroid, again showing that the predictions fall within the correct
   intra-cell-state range ([75]Figure 3C). Mapping the real and predicted
   centroids in a one-to-one manner based on minimizing their Euclidean
   distances showed that DESP correctly assigned centroid labels with a
   near-zero total distance ([76]Figure 3D). Additionally, the mean
   per-gene relative prediction error was only 15% (interquartile range
   [IQR] 6%–17%), suggesting reasonable approximations of the real data
   even down to the level of individual genes ([77]Figure 3E).

   Accurate estimation of cell proportions from molecular profiling data
   is crucial in understanding the cellular composition of complex
   tissues, but experimentally determined proportions often deviate from
   the true underlying ones. To assess the impact of errors in
   user-provided cell-state proportions on DESP’s predictions, we
   systematically introduced random errors to the true matrix, ranging
   from 5% to 50% in increments of 5% ([78]STAR Methods). This introduced
   variability in the cell-state proportions, mimicking real-world
   experimental uncertainties. The results reveal a nuanced relationship
   between the level of noise and the performance of DESP ([79]Figure 3F).
   At lower levels of noise up to 25%, only a slight increase in
   prediction error (2%–3%) was observed. Interestingly, the algorithm
   demonstrated a more noticeable decrease in performance beyond this
   threshold but remained within 10% of its original prediction errors
   even when subjected to up to 50% introduced noise.

   Cell-state assignments by tools such as Seurat can be imprecise. Hence,
   to ensure the robustness of DESP to variations in the number of input
   cell states, we repeated the validation with numbers of input cell
   states varying between 5 and 20 and found that while the prediction
   error expectedly increased with the number of states, it remained
   within a modest error range ([80]Figure 3E). We also show that DESP is
   robust to a wide range of its two input parameters, with negligible
   effects on performance ([81]Figure S1).

   Since DESP performs predictions on a feature-by-feature basis, we also
   examined whether the accuracy of predictions varied with the expression
   of particular genes and observed that more cross-cell stable
   “housekeeping” genes had lower prediction errors than the more variable
   genes, while expression magnitudes were not influential
   ([82]Figures S2A and S2B). In a similar vein, we examined the
   relationship between the prediction errors and cell-state properties
   and found that the larger and more homogeneous states had the lowest
   prediction errors ([83]Figure S2C).

Validation using the Human Cell Atlas

   To further validate the effectiveness of DESP, an additional case study
   was conducted using scRNA-seq data obtained from the Human Cell
   Atlas.[84]^16 The Human Cell Atlas provides a comprehensive collection
   of scRNA-seq data, encompassing 48,852 genes expressed across 24 human
   tissues. This dataset serves as a global reference atlas of human cell
   expression profiles, offering an ideal opportunity to assess DESP’s
   performance on large-scale datasets ([85]Figure 4A).

Figure 4.

   [86]Figure 4
   [87]Open in a new tab

   Benchmarking DESP using Human Cell Atlas and CITE-seq data

   (A) UMAP plot visualizing the Human Cell Atlas scRNA-seq data color
   coded by cell type.

   (B) Bar plot visualizing cell-type composition of the 24 tissues from
   the Human Cell Atlas database.

   (C) UMAP plot showing overlap of averaged cross-tissue cell-type
   profiles as computed from the original scRNA-seq data and as predicted
   by DESP. Lines connect the points corresponding to the real and
   predicted profiles of the same cell type.

   (D) Histogram showing distribution of DESP-relative prediction errors
   for each of the 48,846 genes from the Human Cell Atlas database.

   (E) Histogram showing distribution of DESP-relative prediction errors
   for each of the 36 cell types from the Human Cell Atlas database.

   (F) (Left) UMAP plot visualizing the CITE-seq single-cell data color
   coded by cell type. (Middle) Bar plot visualizing cell-type composition
   of samples. (Right) Boxplot comparing DESP’s per-gene relative
   prediction error for the RNA and protein data separately.

   For the validation process, we focused on the 36 most prevalent cell
   types, as determined by the number of tissues in which each cell type
   was present. Initially, we constructed a matrix that represented the
   proportion of each cell type within each tissue ([88]Figure 4B).
   Subsequently, we evaluated DESP’s ability to reconstruct the
   single-cell profiles by applying it to a pseudobulk dataset. This
   dataset was generated by summarizing the scRNA-seq data at the tissue
   level by averaging the expression of each gene within each cell type
   and multiplying these averaged expressions by the corresponding
   cell-type proportions across tissues ([89]STAR Methods).

   We performed a qualitative evaluation of DESP’s performance by
   combining the real and predicted cell-type profiles into one matrix and
   constructing a UMAP plot that visualizes these profiles in two
   dimensions, revealing strong overlap between the corresponding real and
   predicted cell-type profiles within this reduced dimensional space
   ([90]Figure 4C). As a quantitative assessment, we computed the relative
   prediction error on a gene-by-gene basis by summing the absolute
   difference between each gene’s real and DESP-predicted averaged
   cell-type profiles divided by the real cell-type profiles. The
   resultant median per-gene error was 15%, and the mean was 20% (IQR
   6%–30%) ([91]Figure 4D). We repeated the same analysis for each cell
   type separately, finding a median per-cell-type error of 7% and a mean
   of 10% (IQR 2%–13%) ([92]Figure 4E). In our experiment, it took only
   6 s to perform the demixing for all 48,852 genes on a standard laptop
   machine, demonstrating DESP’s ability to swiftly produce accurate
   predictions on large-scale datasets.

Performance comparison between transcriptomics and proteomics

   DESP provides a mathematical framework for charting the relationship
   between single-cell and bulk omics datasets, which depends on the
   assumption that the bulk data can be modeled as the sum of the
   single-cell data. To ensure that this relationship holds for different
   types of omics data, we performed mathematical validation using both
   single-cell transcriptomics and proteomics datasets as described in the
   corresponding sections of this paper. Furthermore, to test the scenario
   where both datasets were obtained from the same samples, we applied
   DESP to matched single-cell transcriptome and protein data generated by
   the cellular indexing of transcriptomes and epitopes (CITE)-seq
   technology from the Hao et al. study.[93]^17 This dataset comprised
   161,764 peripheral blood mononuclear cells (PBMCs)isolated from 8
   individuals at three time points after receiving an HIV vaccine. The
   authors grouped the cells into 31 annotated immune cell types. Notably,
   there were 187 genes that had both RNA and protein measurements
   available for analysis.

   We used DESP to independently demix two pseudobulk datasets created
   using the same underlying cell-type composition matrix: the aggregated
   RNA data and the aggregated protein data per sample ([94]STAR Methods).
   Comparing the 31 DESP-predicted cell-type profiles to the true ones
   measured by CITE-seq revealed a median per-gene relative prediction
   error of 27% (IQR 21%–34%) for the protein data compared to 46% (IQR
   28%–70%) for the RNA data ([95]Figure 4F). These results indicate that
   DESP can predict single-cell protein measurements from
   transcriptomics-derived cell-type structures with comparable, or even
   superior, accuracy to its predictions for transcriptomics data. These
   observations align with previous research highlighting the consistency
   of high-level cell-type assignments between RNA and protein
   measurements, despite the discordance in per-gene correlations between
   RNA-seq and proteomics data.[96]^5^,[97]^17

Comparison with experimentally derived profiles

   We applied DESP to demix the EMT dynamic bulk proteomics data by
   separating the contributions of the ten cell states to each of the
   eight time point bulk measurements, revealing the underlying proteome
   profiles of the cell states ([98]STAR Methods). To benchmark these
   predicted protein profiles, we analyzed single-cell proteomics data
   from an analogous model of EMT generated by the SCoPE2[99]^4 platform,
   which quantified >1,700 proteins across 420 cells at three key
   transition time points.[100]^18 The similarity of this independent
   experimental model to our EMT multi-omics case study combined with the
   presence of more than 1,000 overlapping proteins between them make it
   ideal for comparing DESP’s computationally predicted protein profiles
   to independently acquired, experimentally derived ones.

   We first used a k-nearest neighbor strategy to impute missing values in
   the single-cell proteomics data followed by clustering the cells
   ([101]STAR Methods). The clusters in both datasets were labeled as
   epithelial (E), intermediate (I), or mesenchymal (M) based on their
   temporal dynamics and expression of canonical EMT markers ([102]STAR
   Methods; [103]Figure 5A). To mitigate potential confounding factors
   arising from technical experimental variations, our benchmarking
   strategy focused on the comparative analysis of the three EMT cell
   states (E, I, and M). Specifically, we assessed whether the protein
   fold changes inferred from DESP’s predicted profiles exhibited greater
   resemblance to the experimentally derived fold changes compared to
   those derived solely from the scRNA data ([104]Figure 5B).

Figure 5.

   [105]Figure 5
   [106]Open in a new tab

   Comparing DESP predictions to experimentally derived profiles

   (A) Dynamic EMT single-cell proteomics data used for biological
   validation.

   (B) Heatmap of computed protein relative abundance fold changes between
   cell states in single-proteomics data, scRNA data, and bulk-demixed
   DESP predictions.

   (C) Pearson correlation of the protein relative abundance fold changes
   between cell states in each dataset. p values are based on Fisher’s z
   transformation of the Pearson correlation coefficients.

   (D) UMAP plot showing overlap of the real and DESP-predicted cell-state
   profiles, labeled as centroids, within the global structure of the
   single-cell proteomics data. Inset plot: results of mapping each
   predicted centroid profile to the closest real centroid profile in a
   one-to-one manner based on solving the linear sum assignment problem.

   (E) Bar plots showing the average Euclidean distance between each cell
   state’s constituent single-cell measurements to their real centroid
   compared to that of the DESP-predicted centroid.

   For this benchmark, we selected the proteins with high correlation
   (Pearson R^2 > 0.5) between the bulk proteomics and a pseudobulk matrix
   constructed by summing the single-cell proteomics data in each time
   point to address technical variation between the single-cell and
   bulk proteomics measurements ([107]STAR Methods). We also focused our
   comparison on the genes that differ the most (Pearson R^2 < −0.5)
   between the bulk proteomics and pseudobulk scRNA, representing the
   genes that are the best at distinguishing the RNA and protein levels in
   these data and thus are appropriate for benchmarking DESP. By computing
   the Pearson correlation between the resulting fold changes, we observed
   that DESP’s predictions not only exhibited significant correlation
   (p < 0.05) with the single-cell proteomics data but were also
   significantly closer to the single-cell proteomics fold changes
   compared to the scRNA-derived fold changes, suggesting that DESP’s
   predictions align well with the experimentally derived profiles
   ([108]Figure 5C).

   As an additional validation of DESP’s performance with proteomics data
   and the consistency of the bulk-single cell relationship across the
   protein and RNA layers, we also created a pseudobulk proteomics matrix
   by summing the experimentally derived single-cell proteomics profiles
   within each time point. We then used DESP to demix these per-time point
   pseudobulk profiles to the level of the identified cell states, guided
   by the single-cell-derived changes in cell-state proportions across the
   time points. We used the same input parameters used for applying DESP
   to the transcriptomics data (λ = 1^−7, β = 1^−4). These predicted
   cell-state profiles were then compared to the true cell-state profiles
   obtained from the single-cell proteomics data by averaging the relative
   abundance of each protein in each cell state. The average per-protein
   relative prediction error was only 2.7% (IQR 0.5%–3.8%), and the
   cell-state-level results visualized in [109]Figures 5D and 5E
   quantitively and qualitatively demonstrate the ability of DESP to
   accurately recapture the real cell-state profiles from the pseudobulk
   quantifications. Put together, these results demonstrate the
   generalizability of DESP and the potential for applications to other
   types of omics data without need for custom configurations.

Detection of transient EMT mechanisms using DESP

   Applying DESP to demix the bulk EMT proteomics profiles allowed us to
   contextualize the cross-time point proteome changes according to the
   underlying cell states and to characterize the distinct protein
   programs that occur during EMT that are lost in the bulk measurements,
   as elaborated upon below.

   To evaluate the extent to which DESP’s predictions provided information
   beyond what the bulk proteomics or scRNA data provided alone, we
   compared the results of performing differential expression analysis
   between the cell states from the DESP predictions versus those two
   original data types by themselves and found that DESP detected hundreds
   of markers for the EMT stages that were not deemed differential in the
   other techniques ([110]Figure S3). Several of these markers are known
   to be involved in EMT, such as TPM2 in the E state and CD44 in the M
   state.

   Gene set enrichment analysis also revealed more than 200 pathways that
   were significantly enriched among DESP-identified markers but not
   revealed by the other standard approaches, demonstrating DESP’s ability
   to glean novel insight into the involved molecular mechanisms rather
   than just recapitulating what existing methods would have found
   ([111]Figure 6A). These pathways included both canonical EMT-related
   pathways such as signaling, migration, and wound healing, as well as
   potentially novel biological processes that had not been associated
   with EMT before, such as cellular stress response, spliceosome
   assembly, and protein transport ([112]Figure 6B).

Figure 6.

   [113]Figure 6
   [114]Open in a new tab

   Detection of transient EMT mechanisms using DESP

   (A) Venn diagrams showing overlap of significantly enriched pathways
   within each EMT stage detected by scRNA-seq data alone, bulk proteomics
   data alone, and DESP predictions.

   (B) Heatmap of relative enrichment scores of selected GO terms among
   cell-stage markers that were enriched among DESP’s predictions compared
   to results obtained from scRNA-seq or bulk proteomics alone.

   (C) PRC2 core histone methylation complex. The EZH2 and H2AFX proteins
   are highlighted based on their inferred influential transient profiles
   detected by DESP.

   (D) DESP demixing of H2AFX across transient cell states, with
   intermediate time point T2 highlighting the time-sensitive
   disproportionate contribution by transient cell state 8.

   (E) Heatmap of temporal phosphorylation of EZH2 and H2AFX in global
   quantitative phosphoproteomics analysis of the same EMT samples.

   The intermediate cell states are of high interest due to their
   transient nature and expected influential role in driving EMT. We thus
   sought to investigate if unique insights could be gleaned by DESP on
   the transient molecular mechanisms governing EMT. Notably, eight
   proteins, including three encoded by histone genes, were identified as
   being key to the transition due to their relative enrichment in the
   predicted intermediate cell-state proteome compared to their diluted
   signal in the bulk proteomics data, representing a signal that likely
   would have been missed if analyzing bulk proteomics alone
   ([115]Figure 6D).

   Of these eight proteins, H2AFX stands out as a marker (99th percentile)
   of the predicted intermediate proteomics profile yet with low abundance
   (9th percentile) in the corresponding scRNA profile, suggesting
   extensive post-transcriptional regulation. H2AFX is a core histone
   protein variant that both contributes to chromatin remodeling and is
   post-translationally modified during EMT.[116]^19^,[117]^20 Notably,
   H2AFX is hyper-phosphorylated during the earliest time points of
   EMT,[118]^13 suggesting a change in this protein’s functional activity
   during the initial intermediate cell states ([119]Figure 6E).

   The STRING database[120]^21 shows that H2AFX and two other
   DESP-identified markers are functionally linked with nine proteins
   annotated as EMT associated in the GO[121]^22 and MSigDB Hallmark gene
   sets.[122]^23 One of these associated proteins is EZH2, which is part
   of the PRC2 chromatin remodeling complex that performs
   transcriptionally inhibitory histone methylation ([123]Figure 6C). EZH2
   has emerged in recent years as a promising target for cancer
   treatments,[124]^24^,[125]^25 with an FDA-approved drug (tazemetostat)
   targeting EZH2 as a cancer treatment[126]^26 and another EZH2-targeting
   compound currently in clinical trials.[127]^27 In contrast to its
   target H2AFX, EZH2 shows elevated phosphorylation during later time
   points of the experiment, suggesting the activation of a delayed
   regulatory “switch” after 3 days of TGF-beta induction.

   Taken together, these findings generated by DESP-based proteomic
   demixing pinpoint specific transient events as plausible mechanisms
   governing EMT progression despite the RNA-protein discord.

Discussion

   Here, we present, evaluate, and apply DESP as an algorithm for
   resolving the contributions of cell states to quantitative global
   proteomics measurements and potentially other bulk omics profiles. DESP
   enables researchers to gain insights into dynamic cell-state contexts
   from standard bulk omics workflows, helping to integrate underserved
   omics layers like proteomics into the rapidly evolving single-cell
   analysis toolkit. We validated DESP mathematically and obtained
   independent experimental evidence supporting the validity of DESP’s
   predictions. We also applied DESP to tease out the proteome composition
   of transient intermediate cell states in an in vitro model of EMT,
   shining a light on key intermediate molecular mechanisms.

   We note that as a new type of demixing algorithm, DESP does not aim to
   define, nor does it require, the optimal cell-state structures for the
   bulk data to be demixed. Rather, DESP solves the problem of mapping
   bulk omics measurements, such as proteomics, to a pre-defined
   cell-state structure of interest to the user, independent of how that
   structure was originally defined. In contrast to existing omics
   deconvolution algorithms,[128]^28 DESP does not aim to predict
   cell-state proportions, but instead it predicts the proteome of given
   cell states based on demixing the bulk data. DESP also offers a
   generalizable framework applicable to all types of omics data unlike
   existing single-cell inference algorithms, such as TCA[129]^29 and
   BMIND,[130]^30 that are tailored to specific data types. DESP provides
   a generalizable computational framework for bulk omics demixing, with
   planned future applications including demixing protein interaction
   networks and spatial transcriptomics data. To facilitate adoption and
   usability by the community, DESP is available as an R package at
   [131]https://github.com/AhmedYoussef95/DESP.

Limitations of the study

   DESP is especially well suited for biological scenarios characterized
   by dynamic shifts in cell-state/-type composition across a range of
   samples, as exemplified in the presented case studies. However, its
   accuracy diminishes when applied to “static” biological contexts where
   cell proportions or the corresponding bulk omics measurements exhibit
   minimal changes. Notably, DESP’s performance is significantly
   influenced by the ratio between the number of cell states and the
   number of bulk samples in the system under study, displaying diminished
   accuracy when the former greatly exceeds the latter. Furthermore, DESP
   lacks the capability to handle missing values and address batch effects
   in input data and thus relies on the user to perform appropriate
   imputation and data pre-processing when necessary. Related to this,
   acknowledging that log transformation is a common pre-processing step
   for omics data, we recommend reversing log transformations before
   applying DESP to align with its additive model (bulk as sum of single
   cells), followed by log transformation of the resultant
   cell-state-level predictions for downstream analyses as needed.

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Deposited data
     __________________________________________________________________

   Raw and analyzed data This paper
   [132]https://doi.org/10.1101/2023.01.19.524460
   Human Cell Atlas scRNA-seq data Regev et al.[133]^16
   [134]https://doi.org/10.7554/eLife.27041
   CITE-Seq single-cell proteomics and transcriptomics Hao et al.[135]^17
   [136]https://doi.org/10.1016/j.cell.2021.04.048
   EMT proteomics and scRNA-Seq data Paul et al.[137]^13
   [138]https://doi.org/10.1038/s41467-023-36122-x
   EMT single-cell proteomics data Khan et al.[139]^18
   [140]https://doi.org/10.1101/2023.12.21.572913
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   DESP algorithm This paper [141]https://github.com/AhmedYoussef95/DESP
   [142]https://doi.org/10.5281/zenodo.10633315
   STRING database Szklarczyk et al.[143]^21
   [144]https://doi.org/10.1093/nar/gkaa1074
   Gene Ontology (GO) database GO Consortium[145]^23
   [146]https://doi.org/10.1093/nar/gkaa1113
   MSigDB gene set collection Liberzon et al.[147]^23
   [148]https://doi.org/10.1016/j.cels.2015.12.004
   DrugBank database Wishart et al.[149]^27
   [150]https://doi.org/10.1093/nar/gkx1037
   OMIM database Amberger et al.[151]^31
   [152]https://doi.org/10.1093/nar/gku1205
   KEGG database Kanehisa et al.[153]^32
   [154]https://doi.org/10.1093/nar/gkv1070
   MaxQuant software Tyanova et al.[155]^33
   [156]https://doi.org/10.1038/nprot.2016.136
   [157]Open in a new tab

Resource availability

Lead contact

   Further information and requests for resources should be directed to
   and will be fulfilled by the lead contact, Andrew Emili
   (emili@ohsu.edu).

Materials availability

   This study is purely computational and did not generate any unique
   reagents.

Data and code availability

     * •
       All supporting raw and processed data along with the associated
       analysis scripts are publicly-available and can be found on Zenodo
       at [158]https://zenodo.org/records/10345000
       ([159]https://doi.org/10.1101/2023.01.19.524460).
     * •
       DESP is available as an R package with all original code and
       detailed documentation publicly-available at
       [160]https://github.com/AhmedYoussef95/DESP. A persistent archived
       version of record of the repository is also publicly-available on
       Zenodo ([161]https://doi.org/10.5281/zenodo.10633315).
     * •
       Any additional information required to reanalyze the data reported
       in this paper is available from the [162]lead contact upon request.

Method details

DESP algorithm

Background

   We formalize DESP as recovery of cell state profiles under the mixing
   equation
   [MATH: <mrow><mi>A</mi><mi>X</mi><mo linebreak="goodbreak"
   linebreakstyle="after">=</mo><mi>Y</mi></mrow> :MATH]
   , with mixing matrix
   [MATH: <mrow><mi>A</mi></mrow> :MATH]
   (samples
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after">×</mo></mrow> :MATH]
   states) applied to unobserved state profiles
   [MATH: <mrow><mi>X</mi></mrow> :MATH]
   (states
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after">×</mo></mrow> :MATH]
   genes) yielding observed bulk data
   [MATH: <mrow><mi>Y</mi></mrow> :MATH]
   (samples
   [MATH: <mrow><mo linebreak="goodbreak"
   linebreakstyle="after">×</mo></mrow> :MATH]
   genes). Each pair of corresponding columns of
   [MATH: <mrow><mi>X</mi></mrow> :MATH]
   and
   [MATH: <mrow><mi>Y</mi></mrow> :MATH]
   is an independent equation; in what follows we denote a column of
   [MATH: <mrow><mi>X</mi></mrow> :MATH]
   as
   [MATH: <mrow><mi>x</mi></mrow> :MATH]
   and the corresponding column of
   [MATH: <mrow><mi>Y</mi></mrow> :MATH]
   as
   [MATH: <mrow><mi>y</mi></mrow> :MATH]
   .

   This problem is challenging when there are fewer samples than states,
   leading to an ill-posed linear inverse problem. To constrain the
   solution toward meaningful and biologically likely profiles, DESP
   applies a series of regularizations as described next.

Nonnegativity

   First, DESP imposes nonnegativity on the solutions. This leads to the
   nonnegative least squares problem:
   [MATH: <mrow><mover accent="true"><mi>x</mi><mo>ˆ</mo></mover><mo
   linebreak="badbreak">=</mo><mi mathvariant="italic">arg</mi><munder><mi
   mathvariant="italic">min</mi><mrow><mi>x</mi><mo
   linebreak="badbreak">≥</mo><mn>0</mn></mrow></munder><mspace
   width="0.25em"></mspace><msup><mrow><mo>‖</mo><mrow><mi>A</mi><mi>x</mi
   ><mo>−</mo><mi>y</mi></mrow><mo>‖</mo></mrow><mn>2</mn></msup><mtext>.<
   /mtext></mrow> :MATH]

   which is equivalent to the quadratic program:
   [MATH: <mrow><mover accent="true"><mi>x</mi><mo>ˆ</mo></mover><mo
   linebreak="badbreak">=</mo><mi mathvariant="italic">arg</mi><munder><mi
   mathvariant="italic">min</mi><mrow><mi>x</mi><mo
   linebreak="badbreak">≥</mo><mn>0</mn></mrow></munder><mspace
   width="0.25em"></mspace><mo>−</mo><mn>2</mn><msup><mi>y</mi><mi>T</mi><
   /msup><mi>A</mi><mi>x</mi><mo
   linebreak="goodbreak">+</mo><msup><mi>x</mi><mi>T</mi></msup><msup><mi>
   A</mi><mi>T</mi></msup><mi>A</mi><mi>x</mi><mtext>.</mtext></mrow>
   :MATH]

Tikhonov regularization

   Next, DESP incorporates Tikhonov regularization (ridge regression) to
   favor small-norm solutions:
   [MATH:
   <mrow><msup><mrow><mrow><msup><mrow><mo>‖</mo><mrow><mi>A</mi><mi>x</mi
   ><mo>−</mo><mi>y</mi></mrow><mo>‖</mo></mrow><mn>2</mn></msup><mo
   linebreak="badbreak">+</mo><msub><mi>λ</mi><mi>x</mi></msub><mrow><mo>‖
   </mo><mi>x</mi></mrow></mrow><mo>‖</mo></mrow><mn>2</mn></msup></mrow>
   :MATH]

   We use the notation
   [MATH: <mrow><msub><mi>λ</mi><mi>x</mi></msub></mrow> :MATH]
   to indicate that the proper amount of regularization can be chosen
   based on the properties of
   [MATH: <mrow><mi>x</mi></mrow> :MATH]
   . (As shown in [163]Figure S2, statistical properties of individual
   gene expressions are correlated with estimation accuracy.) However, our
   results are based on using the same value of
   [MATH: <mrow><msub><mi>λ</mi><mi>x</mi></msub></mrow> :MATH]
   for all genes. As shown in [164]Figure S1, using a single value of
   [MATH: <mrow><msub><mi>λ</mi><mi>x</mi></msub></mrow> :MATH]
   across all genes yields good recovery in general with results that are
   relatively insensitive to the specific value of
   [MATH: <mrow><msub><mi>λ</mi><mi>x</mi></msub></mrow> :MATH]
   (chosen within a broad range).

   Expressed as a quadratic program this becomes:
   [MATH: <mrow><mover accent="true"><mi>x</mi><mo>ˆ</mo></mover><mo
   linebreak="badbreak">=</mo><mi mathvariant="italic">arg</mi><munder><mi
   mathvariant="italic">min</mi><mrow><mi>x</mi><mo
   linebreak="badbreak">≥</mo><mn>0</mn></mrow></munder><mspace
   width="0.25em"></mspace><mo>−</mo><mn>2</mn><msup><mi>y</mi><mi>T</mi><
   /msup><mi>A</mi><mi>x</mi><mo
   linebreak="goodbreak">+</mo><msup><mi>x</mi><mi>T</mi></msup><mrow><mo
   stretchy="true">(</mo><mrow><msup><mi>A</mi><mi>T</mi></msup><mi>A</mi>
   <mo
   linebreak="badbreak">+</mo><msub><mi>λ</mi><mi>x</mi></msub><mi>I</mi><
   /mrow><mo stretchy="true">)</mo></mrow><mi>x</mi></mrow> :MATH]

Similarity regularization

   Finally, we make the following observation: the process used to
   construct
   [MATH: <mrow><mi>A</mi></mrow> :MATH]
   may have access to additional information. For example, when scRNA
   clusters are used to define the cell states leading to the cluster
   mixtures in
   [MATH: <mrow><mi>A</mi></mrow> :MATH]
   , DESP can extract useful information from the cluster RNA expression
   profiles. We emphasize that this information can be useful even though
   no assumption is made about any particular relationship between, for
   example, RNA expression levels and protein abundance. DESP employs this
   strategy by incorporating a state-state similarity measure. While in
   principle DESP can use any similarity or dissimilarity matrix, in our
   results DESP operates using scRNA data as follows:

   Define the matrix
   [MATH: <mrow><mi>C</mi></mrow> :MATH]
   to consist of cell state profiles with cell state profiles on the rows
   and genes on the columns, where entries in
   [MATH: <mrow><mi>C</mi></mrow> :MATH]
   correspond to average gene expression in the cell state. (Processing of
   scRNA data to form cell state profiles is discussed below.) Rows of
   [MATH: <mrow><mi>C</mi></mrow> :MATH]
   are standardized to zero mean and unit norm, to obtain
   [MATH: <mrow><mover accent="true"><mi>C</mi><mo>˜</mo></mover></mrow>
   :MATH]
   , which is used to form the correlation matrix of clusters
   [MATH: <mrow><mi>M</mi><mo linebreak="badbreak">=</mo><mover
   accent="true"><mi>C</mi><mo>˜</mo></mover><msup><mover
   accent="true"><mi>C</mi><mo>˜</mo></mover><mi>T</mi></msup></mrow>
   :MATH]

   [MATH: <mrow><mi>M</mi></mrow> :MATH]
   is typically invertible due to the independence of cell states, so DESP
   performs similarity regularization by seeking to minimize
   [MATH:
   <mrow><msup><mi>x</mi><mi>T</mi></msup><msup><mi>M</mi><mrow><mo>−</mo>
   <mn>1</mn></mrow></msup><mi>x</mi></mrow> :MATH]

   Hence to include cell state similarity as a form of regularization,
   DESP minimizes:
   [MATH: <mrow><mover accent="true"><mi>x</mi><mo>ˆ</mo></mover><mo
   linebreak="badbreak">=</mo><mi mathvariant="italic">arg</mi><munder><mi
   mathvariant="italic">min</mi><mrow><mi>x</mi><mo
   linebreak="badbreak">≥</mo><mn>0</mn></mrow></munder><mrow><mo>‖</mo><m
   row><mi>A</mi><mi>x</mi><mo>−</mo><mi>y</mi></mrow></mrow><msup><mo>‖</
   mo><mn>2</mn></msup><mo
   linebreak="goodbreak">+</mo><msub><mi>λ</mi><mi>x</mi></msub><mrow><mo>
   ‖</mo><mi>x</mi></mrow><msup><mo>‖</mo><mn>2</mn></msup><mo
   linebreak="goodbreak">+</mo><mi>β</mi><msup><mi>x</mi><mi>T</mi></msup>
   <msup><mi>M</mi><mrow><mo>−</mo><mn>1</mn></mrow></msup><mi>x</mi></mro
   w> :MATH]

   which is equivalent to the quadratic program:
   [MATH: <mrow><mover accent="true"><mi>x</mi><mo>ˆ</mo></mover><mo
   linebreak="badbreak">=</mo><mi mathvariant="italic">arg</mi><munder><mi
   mathvariant="italic">min</mi><mrow><mi>x</mi><mo
   linebreak="badbreak">≥</mo><mn>0</mn></mrow></munder><mo>−</mo><mn>2</m
   n><msup><mi>y</mi><mi>T</mi></msup><mi>A</mi><mi>x</mi><mo
   linebreak="goodbreak">+</mo><msup><mi>x</mi><mi>T</mi></msup><mrow><mo
   stretchy="true">(</mo><mrow><msup><mi>A</mi><mi>T</mi></msup><mi>A</mi>
   <mo
   linebreak="badbreak">+</mo><msub><mi>λ</mi><mi>x</mi></msub><mi>I</mi><
   mo
   linebreak="badbreak">+</mo><mi>β</mi><msup><mi>M</mi><mrow><mo>−</mo><m
   n>1</mn></mrow></msup></mrow><mo
   stretchy="true">)</mo></mrow><mi>x</mi><mspace
   width="0.25em"></mspace><mtext>.</mtext></mrow> :MATH]
   (Equation 1)

   As for
   [MATH: <mrow><msub><mi mathvariant="bold-italic">λ</mi><mi
   mathvariant="bold-italic">x</mi></msub></mrow> :MATH]
   , we find that DESP obtains good recovery in general for
   [MATH: <mrow><mi mathvariant="bold-italic">β</mi></mrow> :MATH]
   values across a fairly broad range ([165]Figure S1).

Implementation

   The R function solve.QP from the quadprog package can solve any of the
   above quadratic programs. So for each feature (e.g., gene/protein),
   DESP invokes solve.QP to solve the constrained minimization (1) to
   estimate cell state profiles. DESP also has minimal running time; in
   our experiments, we found that DESP performed predictions for >8,000
   features within less than a second on a standard laptop machine.

Robustness to noise in user-provided cell state proportions matrix

   We systematically introduced noise to a cell state proportions matrix
   to assess the DESP algorithm’s stability and robustness to variations
   in the user-provided cell state proportions matrix. As described in the
   section '[166]identifying intermediate cell states from scRNA-Seq
   data", we generated a cell state proportions matrix representing the
   cell state composition of the 10 different EMT cell states across the 8
   timepoints profiled in that experiment. This matrix served as our
   ground truth proportions matrix. We also constructed a ‘pseudobulk’
   matrix representing the summed gene expression levels across all cell
   states within each timepoint, representing our ground truth bulk data,
   along with the ground truth averaged expression profiles of each cell
   state. We systematically introduced random noise to the proportions,
   ranging in increments of 5% from 5% to 50% of the original values, by
   drawing multiplicative noise values from a uniform distribution and
   subsequently normalizing them to ensure cumulative alignment with the
   selected noise level. For each noise level, we computed DESP’s relative
   per-gene prediction error for demixing the same pseudobulk matrix to
   predict the cell state expression profiles. We repeated this experiment
   1,000 times for each noise level to account for the random nature of
   noise generation and ensured reproducibility by setting a seed for the
   random noise generation. The averaged prediction error for each noise
   level provided a nuanced understanding of DESP’s behavior in response
   to input variations and contributed valuable insights for its practical
   application in noisy biological datasets.

Testing robustness of DESP predictions with different numbers of cell states

   In the EMT case study detailed in the manuscript, we identified ten
   transitional cell states occurring during EMT. These cell states were
   identified by clustering the single-cell RNA-Seq data to reveal groups
   of transcriptionally-correlated cells using the Seurat R package
   ([167]STAR Methods). Since determining the optimal number of cell
   states form scRNA-Seq data is not an objective process, we repeated our
   mathematical validation of DESP using different numbers of input cell
   states to examine the robustness of DESP’s predictions.

   To vary the number of cell states, we used different clustering
   resolutions within Seurat’s FindClusters algorithm followed by
   repeating the process for constructing a cell state proportions matrix
   detailing the proportion of each cell state within each of the
   timepoints. We tested DESP using numbers of clusters varying between 5
   and 20 and computed mean per-gene relative prediction error for each
   case ([168]Figure 3E). For cases where there were 8 or less cell
   states, the problem was overdetermined due to the presence of eight
   timepoints in the data and thus the prediction error was zero. Starting
   from nine clusters and above, the problem is underdetermined meaning
   that multiple possible combinations of cell state protein profiles
   could lead to the observed pseudobulk measurements. As expected, we
   found that the prediction error increased as the number of cell states
   increased, with the error ranging from 10 to 25% for cases with 9–20
   clusters respectively.

Testing robustness of DESP predictions to input parameters

   DESP expects two input parameters denoted as λ and β ([169]STAR
   Methods). In the EMT case study described in the manuscript, we set the
   values of these parameters to 1^−7 and 1^−4 respectively. To test the
   robustness of DESP’s predictions to variations in the values of the
   input parameters, we repeated the mathematical validation of DESP using
   combinations of the input parameters ranging from 1^−10 to 1^10, with
   increasing steps of an order of magnitude each, for each of the two
   parameters and quantified the mean per-gene relative prediction error
   for deconvoluting the scRNA-Seq ‘pseudobulk’ data ([170]Figure S1).

   We found that the error remained stable at 15% for a wide range of
   combinations of the input parameters, with predictions made with values
   of β ranging between 1^−7 and 0.01 and corresponding values of λ of the
   same or less value all giving nearly identical results with errors of
   15%. Furthermore, the error never exceeded 22% for any combination of
   the parameters with values between 1^−10 and 1 for each parameter. The
   prediction errors were smaller for deconvoluting the pseudobulk
   single-cell proteomics data ([171]Figures 5D and 5E), with prediction
   errors at 2% for any value of β between 1^−10 and 0.1 and a
   corresponding λ of the same or less value, and never exceeding 16% for
   any combination of values between 1^−10 and 1 for either parameter.
   These results demonstrate the robustness of DESP to variations in the
   input parameters and that there is no need to hypertune the input
   parameters for different input datasets.

Effect of gene/protein and cell cluster properties on DESP’s predictions

   DESP performs predictions on a feature-by-feature basis. Depending on
   the input data, these features will typically be genes or proteins. We
   examined the correlation between mathematical properties of the
   features in the bulk data and their corresponding prediction error
   based on applying DESP to the scRNA-derived pseudobulk matrix as
   described in the mathematical validation chapter of the [172]STAR
   Methods section. To explore the relationship between properties of the
   genes and their corresponding predicted values, for each of the 8,528
   genes in the dataset we computed the following measures.
     * (1)
       Expression specificity: This metric looked at the relative
       specificity of a given gene’s expression to the clusters. The
       coefficient of variation (CV) of the gene’s cluster-specific
       expression values is computed as the standard deviation of the
       per-cluster expression values divided by their mean. Genes with a
       lower CV can be interpreted as being relatively stable
       ‘housekeeping’ genes due to the smaller variance in their
       expressions across clusters.
     * (2)
       Average expression: This metric is concerned with the relative
       quantity of each gene’s transcripts and is simply computed as the
       mean expression of the gene across clusters.

   Each of the above properties was correlated to the gene’s relative
   prediction error, defined as the absolute difference between the real
   and predicted per-cluster values divided by the real value. A Pearson
   correlation coefficient of 0.6 was observed between the per-gene
   prediction errors and their expression specificity, indicating that the
   less variable ‘housekeeping’ genes tended to have higher prediction
   accuracy ([173]Figure S2A). Meanwhile, the Pearson correlation
   coefficient was only (−0.01) with the genes’ average expression values,
   suggesting that the magnitude of a gene’s expression is less
   influential to DESP’s predictions ([174]Figure S2B).

   In a similar vein, we computed the Pearson correlation coefficient
   between the following mathematical properties of the cell clusters and
   their corresponding prediction errors measured as the Euclidean
   distance between the predicted centroids and the real ones.
     * (1)
       Cluster heterogeneity: For each cluster, the average Euclidean
       distance between each member cell’s scRNA expression profile and
       that of its corresponding cluster centroid, i.e., the average
       expression profile of all cells belonging to the cluster.
     * (2)
       Cluster size: The number of cells within each cluster.
     * (3)
       Cluster fluctuation across samples: The coefficient of variation of
       each cluster’s proportion in each timepoint.

   We found a positive correlation between cluster heterogeneity and the
   prediction error (Pearson R^2 = 0.48) indicating that the prediction
   error was lower for the more homogeneous clusters, a negative
   correlation with size (Pearson R^2 = −0.47) suggesting better
   prediction for larger clusters, and near-zero correlation with
   cross-sample cluster fluctuations (Pearson R^2 = 0.04)
   ([175]Figures S2C–S2E). These results are potentially confounded
   however by the fact that cluster size and heterogeneity were strongly
   negatively-correlated to each other (Pearson R^2 = −0.65).

Quantification and statistical analysis

EMT multi-omics experiment summary

   We used data from (Paul et al.).[176]^13 In that study, cells of the
   human mammary epithelial cell-line MCF10A were treated with TGF-Beta to
   induce epithelial-to-mesenchymal transition (EMT) and samples were
   aliquoted at ten timepoints for multi-omics profiling. Details on the
   experiment and the downloadable datasets can be accessed at (Paul
   et al.).[177]^13 Relevant data is also provided in the [178]data and
   code availability section.

Proteomics data processing

   Quantitative profiles of 6,967 proteins were generated by nanoLC-MS/MS
   across ten timepoints (0, 4 h, 1–6, 8 & 12 days post-TGF-Beta
   injection) with three biological replicates per timepoint. Raw MS files
   were processed using MaxQuant (version 1.6).[179]^33 Tandem mass
   spectra were searched against the reference proteome of Homo sapiens
   (Taxonomy ID 9606) downloaded from UniProt in April 2017. Peptides of
   minimum seven amino acids and maximum of two missed cleavages were
   allowed, and a false discovery rate of 1% was used for the
   identification of peptides and proteins. Contaminants and reverse
   decoys were removed prior to downstream analysis. The average intensity
   across replicates was computed for each protein in each timepoint for
   downstream analysis. Timepoints 3 and 9 were not used in this study
   since they are not present in the scRNA-Seq data generated from these
   samples and subsequently used in downstream analysis. All data
   pre-processing of the MaxQuant files was performed using the R
   statistical software (version 4.1.0).

Single-cell RNA-Seq data processing

   Single-cell RNA-Seq quantified the expression of 9,785 genes in 1,913
   cells at eight consecutive timepoints, with roughly 200 cells per
   timepoint. Details on the experiment and the downloadable data can be
   found at (Paul et al., 2023).[180]^13 Low-quality genes detected in
   less than 5% of all cells were discarded (17 genes). Lowly-expressed
   genes with less than three counts in at least three cells were also
   removed (1,240 genes). On average, each cell expressed ∼3,600 genes
   after filtering. Finally, the data was normalized such that each cell’s
   expression values sum to one. Since DESP ultimately combines
   information derived from the scRNA-Seq data with the proteomics data,
   we filtered the scRNA-Seq data to the set of 5,606 genes for which
   quantitative bulk protein measurements were also present in this
   dataset for the identification of co-expressed cell states prior to
   applying DESP to the proteomics data. All processing of the scRNA-Seq
   data was performed using the R statistical software (version 4.1.0).

Identifying intermediate cell states from scRNA-seq data

   The cell states in the dataset were identified in an unsupervised
   manner based on the similarity of gene expression profiles in the
   scRNA-Seq data. All 1,913 cells from all the timepoints were pooled
   together for this analysis. The Seurat[181]^14 R package was used to
   identify the cell states using their recommended workflow as follows.
   The data was normalized by dividing the gene counts in each cell by the
   total counts in the cell and multiplied by a scale factor of 10,000.
   The normalized counts were log-transformed after adding a pseudocount
   of one. The data was then centered by subtracting each gene’s
   expression by its average expression and scaled by dividing the
   centered expression levels by their standard deviations. The top 2,000
   most variable genes were used to perform PCA dimensionality reduction
   and subsequently construct a k-nearest neighbor graph with k = 20 and
   10 principal components used. Finally, the data was clustered using
   Seurat’s FindClusters function with a resolution of 1.1, defining ten
   co-expressed cell clusters ranging in size from 68 to 303 cells. The
   choice of ten clusters was made based on manual evaluation of the
   clustering results at different clustering resolutions. [182]Table S1
   contains information on the markers enriched among each of these
   identified clusters. In the rest of the manuscript, we use the term
   ‘cell states’ to refer to these ten clusters.

   A matrix containing the cell state profiles (denoted ‘X’) was
   constructed by averaging the expression of each gene in each cell
   state. A cell state proportions matrix (denoted ‘A’) was also
   constructed by counting the number of cells from each state in each
   timepoint. These matrices were used in the downstream validation and
   application of DESP to the EMT multi-omics data. We also tested our
   approach on different pre-defined numbers of states ranging between
   five and fifteen by varying the clustering resolution to examine the
   effect of varying the number of cell populations on DESP’s predictions
   ([183]Figure S1).

   Each of the ten states identified in the previous step were also
   classified into one of three EMT stages based on their change in
   proportions across time: Epithelial (E), Mesenchymal (M), or
   Intermediate (I). The cell states with their maximum count in the
   earlier timepoints were labeled E, those in the later timepoints were
   labeled M, and those in the middle timepoints being I.

   The following classifications were assigned.
     * (1)
       Epithelial states: 2, 5, 9
     * (2)
       Intermediate states: 8
     * (3)
       Mesenchymal states: 7

   Any cell states not included in the above list were deemed not to show
   a distinct cross-timepoint change in proportion pattern that would best
   fall into one of the three biological states of interest.

Validation using RNA pseudobulk

   Prior to making inferences from the proteomics data, we first
   investigated DESP’s ability to recover the scRNA data from the bulk
   data at the RNA-level where we have the true single-cell profiles to
   compare against. To represent the bulk data that DESP expects as an
   input, we constructed a ‘pseudobulk’ matrix representing the summed
   gene expression levels across all cell states within each of the eight
   timepoints. This matrix is constructed as the product of multiplying
   the cell state proportions matrix ‘
   [MATH: <mrow><mi mathvariant="bold-italic">A</mi></mrow> :MATH]
   ’ and the cell state profile matrix ‘
   [MATH: <mrow><mi mathvariant="bold-italic">X</mi></mrow> :MATH]
   ’. The resultant matrix ‘
   [MATH: <mrow><mi mathvariant="bold-italic">Y</mi></mrow> :MATH]
   ’ had one value per gene in each timepoint. The cluster similarity
   matrix ‘
   [MATH: <mrow><mi mathvariant="bold-italic">M</mi><mo>’</mo></mrow>
   :MATH]
   was created by computing the Pearson correlation between each pair of
   cell state profiles, i.e., the rows of the matrix ‘
   [MATH: <mrow><mi mathvariant="bold-italic">X</mi><mo>’</mo></mrow>
   :MATH]
   , and used to guide the algorithm toward a solution that preserves the
   relationship between the input cell states. We then use DESP to solve
   the under-determined problem
   [MATH: <mrow><mi mathvariant="bold-italic">Y</mi><mo
   linebreak="goodbreak" linebreakstyle="after">=</mo><mi
   mathvariant="bold-italic">A</mi><mi
   mathvariant="bold-italic">X</mi><mo>’</mo></mrow> :MATH]
   to predict the cell state profile matrix
   [MATH: <mrow><mi mathvariant="bold-italic">X</mi><mo>’</mo></mrow>
   :MATH]
   as described in the ‘[184]DESP algorithm overview’ section. Finally,
   the predicted cell state profile matrix
   [MATH: <mrow><mi mathvariant="bold-italic">X</mi><mo>’</mo></mrow>
   :MATH]
   is then compared to the real cell state profile matrix
   [MATH: <mrow><mi mathvariant="bold-italic">X</mi></mrow> :MATH]
   to assess DESP’s performance. We utilized four different methods of
   assessing the prediction accuracy, each of which is detailed below.

   We computed the mean per-gene relative absolute error between
   [MATH: <mrow><mi>X</mi></mrow> :MATH]
   and
   [MATH: <mrow><mi>X</mi><mo>’</mo></mrow> :MATH]
   by calculating the absolute difference between each gene’s real
   cross-state expression values in
   [MATH: <mrow><mi>X</mi></mrow> :MATH]
   and the predicted ones in
   [MATH: <mrow><mi>X</mi><mo>’</mo></mrow> :MATH]
   and taking the mean across all such gene-level errors. This metric was
   motivated by the fact that DESP performs its predictions independently
   for each gene. This metric was also used to evaluate the robustness of
   DESP to variations in its two input parameters λ and β by examining the
   results for different values of the parameters ([185]Figure S1).

   As a visualization of the similarity of the predicted profiles in
   [MATH: <mrow><mi>X</mi><mo>’</mo></mrow> :MATH]
   to the real ones in
   [MATH: <mrow><mi>X</mi></mrow> :MATH]
   within the global structure of the single-cell data, we created a UMAP
   projection using the R package uwot on the original scRNA-Seq gene
   expression matrix with the addition of the real and predicted cell
   state profiles, i.e., the rows of
   [MATH: <mrow><mi>X</mi></mrow> :MATH]
   and
   [MATH: <mrow><mi>X</mi><mo>’</mo></mrow> :MATH]
   . The profiles were labeled on the subsequent UMAP plots to distinguish
   them from the individual cells ([186]Figure 3B).

   To determine if DESP’s predictions are reasonable approximations of the
   real data, we also compared the predictions to the real data by
   computing the Euclidean distance between individual cells’ measurements
   and their real state average as opposed to the distance between thee
   predictions and the same state average. More specifically, we iterated
   over each of the 1,913 cells in the original single-cell expression
   matrix and computing the Euclidean distance between the cell’s profile
   and that of its cell state’s profile. Next, for each cell state we
   compute the average distance of its cells to the averaged cell state
   profile, as well as the distance between the predicted profile and the
   same corresponding cell state profile. We then created bar plots
   comparing these two distances side-by-side to determine whether the
   predicted profiles fall within the correct intra-state range
   ([187]Figure 3C).

   Finally, we mapped the predicted cell state profiles to the real ones
   by solving the linear sum assignment problem (LSAP) using the Hungarian
   method as implemented in the solve_LSAP function in the clue R
   package[188]^34 ([189]Figure 3D). In summary, the LSAP algorithm
   expects similarities between cell state profiles as the entries in the
   input matrix. The idea of the matching is that given two sets A and B,
   we find the matching that maximizes
   [MATH:
   <mrow><mtext>sum</mtext><mrow><mo>(</mo><mrow><mtext>similarity</mtext>
   <mrow><mo>(</mo><mrow><mi mathvariant="normal">a</mi><mo>_</mo><mi
   mathvariant="normal">i</mi><mo>,</mo><mi
   mathvariant="normal">b</mi><mo>_</mo><mi
   mathvariant="normal">j</mi></mrow><mo>)</mo></mrow></mrow><mo>)</mo></m
   row></mrow> :MATH]
   where
   [MATH: <mrow><mi mathvariant="normal">a</mi><mo>_</mo><mi
   mathvariant="normal">i</mi></mrow> :MATH]
   is matched with
   [MATH: <mrow><mi mathvariant="normal">b</mi><mo>_</mo><mi
   mathvariant="normal">j</mi></mrow> :MATH]
   , and each member of the set is matched with exactly one member of the
   other set. The algorithm was run twice; once with the Euclidean
   distance as the similarity metric for the state profiles where the
   algorithm minimized the sum of assigned Euclidean distances, and once
   with the Pearson correlation as the similarity metric where the
   algorithm maximized the summed correlations across cell state
   assignments.

Applying DESP to demix EMT bulk proteomics data

   We used DESP to demix the EMT bulk proteomics data down to the level of
   the ten scRNA-defined cell states as described in the ‘[190]DESP
   algorithm overview’ section. The inputs to DESP were the processed
   protein-by-timepoint proteomics matrix, the cell state proportions
   matrix indicating the proportion of each cell state in each timepoint,
   and the cell state similarity matrix containing the Pearson
   correlations between the cell state scRNA profiles. The parameters λ
   and β were set to 1^−7 and 1^−4 respectively based on the tuning
   results obtained from demixing the ‘pseudobulk’ as described in the
   previous section. The output was a protein-by-state matrix which was
   used for the analysis described in the [191]results section of the
   manuscript.

Identifying cell state markers

   The R package Seurat’s[192]^14 FindMarkers function was used to find
   the genes that distinguish each cell state in the scRNA-Seq data based
   on a one-vs-all differential expression analysis. An FDR cutoff of 0.05
   and log fold-change cutoff of 0.5 were applied to determine the marker
   genes for each state. These markers are listed in [193]Table S1. This
   analysis was performed once at the level of the ten individual cell
   states, to confirm the classification of the cell states into the E/I/M
   stages, and once at the level of the three stages for pathway
   enrichment and comparing marker genes across datasets. This analysis
   identified 142 marker genes for the E state, 36 for the I state, and 81
   for the M state in the scRNA-Seq data ([194]Table S2). For the
   predicted cell state proteomics profiles from DESP, since we have only
   one measurement per cell state as opposed to the multiple per-cluster
   cell measurements in the scRNA-Seq data, we define the marker proteins
   for each cell state as those with an average log2 fold-change greater
   than 1 between each state and the average of all the other ones
   ([195]Table S1). As with the scRNA-Seq data, this analysis was repeated
   for the three EMT stages (E, I, and M) instead of the ten cell states
   after the initial cell states were classified into each of the three
   stages. This identified 90 marker proteins for the E state, 215 for the
   I state, and 155 for the M state in the DESP predictions
   ([196]Table S2).

Identifying timepoint markers in bulk proteomics

   We grouped the eight timepoints in the bulk proteomics data into the E
   (first two timepoints), I (middle three timepoints), and M (last three
   timepoints) stages. Markers of each stage were defined as the proteins
   with an average log2 fold-change greater than 1 between each stage’s
   timepoints and the average of all the other ones in the bulk proteomics
   data (i.e., one-vs-all differential expression analysis). The data was
   log2-transformed and quantile-normalized prior to identifying the
   markers. This analysis identified 118 marker genes for the E state, 21
   for the I state, and 56 for the M state in the scRNA-Seq data. These
   markers are listed in [197]Table S2.

Identifying intermediate EMT markers

   To focus on the mechanisms occurring exclusively during the
   intermediate stage, we decided to focus on the third timepoint (T2) and
   cell cluster 8. Cell cluster 8 is strongly-associated with this
   timepoint as it represents 43% of all cells in T2 but only a tiny
   percentage in the other timepoints. We converted the bulk protein
   abundances in T2 to proportions of the total T2 abundance and did the
   same for the predicted protein profile of cluster 8. We then divided
   these two percentages by each other to find the ratio between them
   (C8/T2), removing the few cases (∼20 proteins) where the protein has an
   abundance of zero in either dataset. Next, we ranked the proteins based
   on this ratio, and focused on proteins who have the highest/lowest
   ratio. The idea is that these proteins are important drivers of the
   transition since they are relatively important to the transient cluster
   8 but whose signal seems diluted in the bulk proteomics based on this
   ratio, probably due to the presence of other clusters contributing to
   the bulk signal. The resultant ratios had a mean and median of ∼1, and
   we decided to focus on the 99th percentile, representing the 70
   proteins whose ratio was above 1.32. In parallel, we also took the 208
   proteins defined as upregulated markers of the predicted protein
   profile of cell cluster 8 based on a one-vs-all differential expression
   analysis as a second set of intermediate markers. The two analyses
   described above thus produced two sets of proteins of interest for
   their inferred role in the transitional phase of EMT, and we found
   eight proteins that overlapped between them. The [198]results section
   of the manuscript elaborates on some of the inferred roles of these
   proteins in the context of EMT.

Recovering associations with EMT-related genes

   We curated a set of 320 EMT-associated genes and interrogated the set
   of predicted intermediate markers for physical and/or functional
   associations with those genes. The EMT-associated genes came from two
   sources: 200 genes from the MSigDB Hallmark geneset,[199]^23 and 132
   genes associated with at least one of the 11 GO Biological Process
   terms with “epithelial mesenchymal transition” in the title.[200]^22 12
   genes overlapped between the two sets, while none of our predicted
   markers were present in this set, suggesting potential novelty. We used
   the STRING database[201]^21 to check for reported physical interactions
   and/or functional associations between our marker genes and the
   EMT-associated genes. To cast a wide net, we considered associations
   reported based on all of STRING’s scoring methods, such as RNA
   co-expression and experimentally-determined interactions, but filtered
   them to include only high-confidence interactions (STRING combined
   score >70%).

Pathway enrichment analysis

   We used the enrichR[202]^35 tool to identify biological pathways
   significantly enriched for each cell state’s protein and RNA markers as
   well as the bulk proteomics timepoint markers based on a hypergeometric
   test with a p value cutoff of 0.01. The pathway databases interrogated
   included GO Biological Process, GO Molecular Function, and GO Cellular
   Component from the 2021 version of the GO database,[203]^22 in addition
   to MSigDB,[204]^23 OMIM,[205]^31 and KEGG.[206]^32 We also used the
   enrichR tool to compute and compare the enrichment scores of gene sets
   of interest across the markers ([207]Figure 6B). enrichR’s combined
   score, computed as log(p) ∗z where p is the Fisher test p value and z
   is the Z score for deviation from expected rank, was used for this
   analysis as a measure of the relative enrichment of a given geneset
   among a given set of cell state markers. [208]Table S3 lists the
   detected significantly-enriched pathways from the DESP predictions,
   scRNA-Seq, and bulk proteomics data.

Single-cell proteomics data processing

   Mass spectrometry-based single-cell proteomics data for cells
   undergoing TGF-Beta-induced EMT was obtained from (Khan
   et al.).[209]^18 Their experimental setup consisted of three timepoints
   (days 0, 3, and 9), and the data has 1,776 proteins x 420 cells, out of
   which 1,064 proteins overlapped with our EMT case study bulk proteomics
   and scRNA-Seq data. For all downstream analysis, only this set of
   ∼1,000 genes that overlap between the single-cell proteomics, bulk
   proteomics, and scRNA-Seq datasets are retained. Notably, 65% of the
   original single-cell proteomics matrix consists of missing values.
   Specht et al.[210]^4 proposed a strategy to impute the missing values
   by k-nearest neighbor imputation using Euclidean distance as a
   similarity measure between the cells. Briefly, for a given missing
   value, the expression of that gene in that cell is taken as the mean of
   its expression in the 3 most similar cells for which its expression is
   present. The key parameter for this KNN algorithm is the number of
   neighbors
   [MATH: <mrow><mi>K</mi></mrow> :MATH]
   . We selected
   [MATH: <mrow><mi>K</mi></mrow> :MATH]
   based on a 5-fold cross-validation within the training dataset and
   computing the resultant mean absolute prediction error on the held out
   non-missing values. The optimal value for
   [MATH: <mrow><mi>K</mi></mrow> :MATH]
   was found to be 10 and subsequently used to impute the missing data.

Identifying cell states in single-cell proteomics data

   We clustered the imputed single-cell proteomics data to four clusters
   using the K-means clustering algorithm based on cell-cell similarity in
   a 2-D UMAP projection of the data. The choice of four clusters was due
   to its manually-observed improved ability at distinguishing clusters
   that exhibited noticeable cross-time trends in sample proportion.
   [211]Figure 5A visualizes the clusters using a UMAP projection, as well
   the proportion of each cluster within each timepoint. Based on the
   change in cluster proportions across time, the clusters were mapped to
   the Epithelial (E), Intermediate (I), and Mesenchymal (M) states, with
   one cluster seemingly consistently present across the timecourse and as
   such was discarded from downstream analyses.

Comparing predictions to experimental data

   Our experimental validation strategy focused on comparing cell state
   fold-changes by examining whether the fold-changes derived from DESP’s
   prediction resembled the experimentally-derived fold-changes more than
   those derived from the scRNA data alone. Focusing on the relative
   fold-changes helps alleviate potential issues caused by technical
   differences between experiments and pre-processing strategies that
   would hinder attempts to map cell clusters between the datasets. We
   computed the gene/protein fold-changes between each pair of cell-states
   in each of the three single-cell matrices: scRNA-Seq, single-cell
   proteomics, and DESP’s single-cell proteomics predictions. Since there
   are three cell states defined here, this leads to three sets of
   fold-changes for each dataset: E vs. I, E vs. M, and I vs. M. These
   fold-changes are computed as log2 fold-changes using Seurat’s
   foldChanges function. The resultant fold-changes were centered and
   scaled prior to downstream analysis. Only the genes/proteins that
   overlap between the three datasets were used for this analysis.

   This analysis presented us with two main challenges to overcome.
     * (1)
       The effects of technical variation between the single-cell and bulk
       proteomics experiments.
     * (2)
       Only some genes show a difference between their RNA and protein
       profiles in this dataset.

   We addressed challenge 1 by selecting the set of proteins that showed
   high agreement (Pearson R^2 > 0.5) between the bulk proteomics dataset
   and a ‘pseudobulk’ matrix constructed by summing the single-cell
   proteomics measurements in each timepoint. We addressed challenge 2 by
   selecting the set of proteins that differ the most (Pearson R^2 < −0.5)
   between the bulk proteomics data and a ‘pseudobulk’ matrix constructed
   by summing the scRNA data in each timepoint, representing the set of
   genes that are the best at distinguishing the RNA and protein levels in
   this dataset and are thus most appropriate for our benchmarking
   analysis. These correlations only factored in the set of proteins and
   timepoints that overlapped between the two datasets. Ultimately, this
   led to a set of 92 proteins used in this analysis. We computed the
   Pearson correlation between the fold-changes derived from the different
   datasets as a measure of similarity of the results derived from each of
   the three matrices to determine whether DESP’s predicted profiles
   resembled the experimentally-determined ones more than the input scRNA
   ones ([212]Figure 6C). Note: Relaxing the correlation thresholds to 0
   instead of +/− 0.5 led to a larger set of 203 proteins (19% of total
   overlapping proteins) meeting the criteria and had equivalent final
   results.

Phosphoproteomics data processing

   Phosphoproteomics data for the same set of cells was downloaded from
   (Paul et al.),[213]^13 with three replicates per timepoint. The dataset
   consists of phosphorylation data for 8,727 phosphosites across 1,776
   unique proteins. The intensity was averaged across replicates for each
   phosphosite.

Human Cell Atlas case study

   Publicly-available scRNA-Seq data was downloaded from the Human Cell
   Atlas.[214]^16 The data consisted of gene expression values for 58,870
   genes across 24 tissues encompassing 177 annotated cell types. The
   following pre-processing steps were applied to the raw scRNA-Seq
   counts.
     * (1)
       Removed genes with zero variance across cells (removed 1,619 genes)
     * (2)
       Removed lowly-expressed genes by retaining genes with more than 3
       counts in at least 3 cells (removed 8,399 genes)
     * (3)
       Normalized the data such that each cell sums to 1
     * (4)
       Retained the top 36 cell types based on the number of tissues each
       cell type is expressed in (led to a minimum of 3 out of 24 tissues
       per cell type)

   To perform DESP demixing, we constructed a matrix with the average
   expression of each gene in each cell type. This matrix represented the
   true cell type profiles, i.e., the ground truth to compare DESP’s
   predictions against. We also constructed a matrix containing the
   proportion of each cell type among the cells belonging to each tissue
   ([215]Figure 4B). We then applied DESP to these two matrices, using a λ
   parameter value of 0 and β value of 1^−10, to predict the expression of
   the 48,852 retained genes across the 36 cell types.

Analysis of CITE-Seq data

   Publicly-available CITE-Seq data was downloaded from Hao et al.[216]^17
   The dataset consisted of parallel RNA and protein measurements for
   161,764 PBMCs isolated from 8 volunteers at three timepoints (days 0,
   3, and 7) after receiving an HIV vaccine, with a set of 187 overlapping
   genes between the RNA and protein measurements. Gene mapping between
   the RNA and protein measurements were carried out based on common
   Ensembl identifiers.[217]^36 The authors assigned cell types at 3
   hierarchical levels, and we decided to focus our analysis on the
   intermediate level which grouped the cells into 31 annotated immune
   cell types. The single-cell data was normalized using the Seurat
   package such that each cell sums to 10,000. We then summarized both the
   RNA and protein data into the following matrices.
     * (1)
       A (cell type composition): Number of cells from each cell type in
       each sample. This is consistent for both the RNA and protein data.
     * (2)
       X (cell type profiles): average expression of each gene/protein in
       each cell type.
     * (3)
       Y (pseudobulk): Matrix multiplication of A and X, representing
       aggregated single-cell data in each sample.

   We quantitatively evaluated DESP’s ability to predict the cell type
   profiles in the matrix X given the A and Y matrices, comparing
   performance for predicting the RNA and protein profiles separately.
   Since DESP predictions are performed on a feature-by-feature basis, we
   computed the prediction error as the absolute relative difference
   between the true and predicted cell state values for each feature, and
   created boxplots comparing DESP’s per-gene relative prediction error
   for the RNA and protein data separately ([218]Figure 4F).

Acknowledgments