Abstract

Background

   Towards discovering robust cancer biomarkers, it is imperative to
   unravel the cellular heterogeneity of patient samples and comprehend
   the interactions between cancer cells and the various cell types in the
   tumor microenvironment. The first generation of ‘partial’ computational
   deconvolution methods required prior information either on the
   cell/tissue type proportions or the cell/tissue type-specific
   expression signatures and the number of involved cell/tissue types. The
   second generation of ‘complete’ approaches allowed estimating both of
   the cell/tissue type proportions and cell/tissue type-specific
   expression profiles directly from the mixed gene expression data, based
   on known (or automatically identified) cell/tissue type-specific marker
   genes.

Results

   We present Deblender, a flexible complete deconvolution tool operating
   in semi−/unsupervised mode based on the user’s access to known marker
   gene lists and information about cell/tissue composition. In case of no
   prior knowledge, global gene expression variability is used in
   clustering the mixed data to substitute marker sets with cluster sets.
   In addition, we integrate a model selection criterion to predict the
   number of constituent cell/tissue types. Moreover, we provide a
   tailored algorithmic scheme to estimate mixture proportions for
   realistic experimental cases where the number of involved cell/tissue
   types exceeds the number of mixed samples. We assess the performance of
   Deblender and a set of state-of-the-art existing tools on a
   comprehensive set of benchmark and patient cancer mixture expression
   datasets (including TCGA).

Conclusion

   Our results corroborate that Deblender can be a valuable tool to
   improve understanding of gene expression datasets with implications for
   prediction and clinical utilization. Deblender is implemented in MATLAB
   and is available from ([31]https://github.com/kondim1983/Deblender/).

Electronic supplementary material

   The online version of this article (10.1186/s12859-018-2442-5) contains
   supplementary material, which is available to authorized users.

   Keywords: Gene expression, Cellular heterogeneity, Deconvolution,
   Matrix factorization, Particle swarm, Quadratic programming,
   Clustering, Model selection

Background

   In the era of Systems Medicine, the comprehension of disease etiology
   and pathogenesis has undergone a paradigm shift with the integration of
   multiple omics manifestations playing the leading role [[32]1–[33]3].
   The impact is more evident in cancer research where integromics
   approaches have already shown their potential to provide a more
   effective and accurate means for cancer biomarker discovery [[34]4,
   [35]5]. A key component in these studies is the transcriptome data
   descending from microarrays or RNA sequencing. However, standard
   approaches for the analyses of expression data are highly affected by
   the cellular heterogeneity present in tissue samples and the variations
   in cell type composition [[36]6, [37]7]. Tumor bulk tissue samples are
   still often analyzed without considering their complexity and the
   interactions among the cell types forming the tumor microenvironment
   [[38]8]. The microenvironment has been suggested to alter under
   different pathophysiological states, contributing to the comprehension
   of diverse diseases [[39]9]. Thus, in order to detect the true
   expression differences related to the different pathophysiological
   states rather than alterations in cell/tissue type composition, it is
   imperative to deconvolve the recorded mixed expression measurements
   into the component expression profiles of each cell/tissue type.

   Although experimental approaches like cell sorting, laser-capture
   microdissection and single cell sequencing can be used to unravel
   cellular heterogeneity, there is also a growing interest in in silico
   deconvolution. The advances in this field have shown that computational
   prediction has its advantages such as low time consumption and the
   ability to analyze expression responses from multiple cell types
   simultaneously, and importantly, avoiding experimentally perturbing the
   samples [[40]10]. The majority of computational deconvolution
   approaches employ the linearity assumption in which the gene expression
   level in a mixture of cell populations/tissues is modeled as the sum of
   gene expression of the constituent cell/tissue types weighted by their
   proportion in the mixture [[41]6, [42]11]. Methods for deconvolution
   can be classified in two major types [[43]12]: (a) partial
   deconvolution methods requiring as input either cell/tissue
   type-specific expression profiles or mixture proportions
   [[44]13–[45]18]; (b) complete deconvolution methods estimating both the
   cell/tissue type reference profiles and the proportions directly from
   the global mixed gene expression data. In the second type, the methods
   can be further divided into semi-supervised and unsupervised. The first
   assume that a set of marker genes is given for each cell/tissue type
   [[46]9, [47]19] while the latter require no such information. In the
   latter case, one makes the assumption that the variation in gene
   expression levels to a large extent is explained by the variation in
   mixture proportions across samples and marker genes are derived from a
   set of genes showing high variability in expression across the mixed
   samples [[48]7, [49]20, [50]21].

   Computational tools can be further classified based on the type of the
   gene expression data used as input with most tools being designed for
   and tested on microarray data and fewer for RNA-Seq data [[51]6]. It
   has been questioned and analyzed whether methods developed for
   microarray-based gene expression data can also be applied to RNA-seq
   data. There are studies stating that there are no confounding factors
   that make current methods inappropriate for analysis of RNA-Seq data,
   since they have observed a significantly linear association between RNA
   concentrations and sequence reads [[52]10, [53]22], compared to the
   not-so-linear microarrays [[54]10, [55]23]. Other studies like Liebner
   et al. [[56]7] suggest adaptation and incorporation of statistical
   models appropriate for analysis of RNA-seq data. Recently, for
   DeconRNASeq [[57]24] it has been shown that the established linear
   latent model widely used in microarray-based techniques can also be
   used for deconvolution of data from RNA-Seq applied to mixed samples.

   Here, we propose Deblender, a novel complete semi−/unsupervised
   deconvolution tool for “deblending” heterogeneous microarray and
   RNA-Seq data. The tool covers many usage scenarios with respect to what
   information is known, like available marker gene lists, number of
   constituent cell/tissue types in the mixture, and whether the mixed
   samples being studied outnumber the cell/tissue types. One feature
   distinguishing it from other published methods is that it utilizes as
   information source the global gene expression differences across
   cell/tissue types directly from the mixed dataset instead of genes with
   the highest variability often regarded as cell type-specific markers.
   These differences across cell types have been observed in recent
   expression studies [[58]25, [59]26]. Based on this assumption, we
   employ clustering as means for distinguishing cell/tissue type-specific
   gene groups and use those (along with their cluster exemplars) to
   substitute marker genes. Similar ideas have also been explored by an
   unpublished method, ClusDec R package [[60]27]. In this way we
   alleviate the need for marker genes known a priori and for setting
   arbitrary thresholds for detecting genes showing highly variable
   expression.

   In contrast to most other existing methods, Deblender requires no
   information about the number of cell/tissue types present in the
   samples under analysis. Similar to Wang et al. [[61]21], we apply an
   information theoretic model selection criterion based on the Minimum
   Description Length (MDL) principle. Other information theoretic model
   selection criteria, like Bayesian and Akaike Information Criterion, and
   principal component analysis have also been used to estimate number of
   cell subpopulations in the mixture based on copy number aberrations,
   DNA methylation or expression data [[62]8, [63]24, [64]28].
   Furthermore, Deblender is able to analyze datasets where the number of
   cell/tissue types exceeds the number of samples. Only a few relevant
   deconvolution methods can be applied to analyze such datasets [[65]7,
   [66]21, [67]29].

   The performance of Deblender has been assessed and compared to those of
   a set of partial and complete state-of-the art deconvolution methods
   summarized in Table [68]1. For this, we used several benchmark and
   patient mixture datasets with known mixture proportions or
   approximations of those. The results show that Deblender, when executed
   in complete unsupervised mode, performs on par with methods that
   require additional information to perform deconvolution. Therefore, we
   believe that Deblender can serve successfully at least as a tool for
   preprocessing mixed datasets providing an initial approximation of the
   cell composition. This can be used to seed other experimental or in
   silico reference-based techniques that may provide a more accurate
   deconvolution. The use of such techniques can thus be broadened to
   cases where no external information is available.

Table 1.

   Summary of the features of the proposed Deblender and the methods
   examined for comparison. With the term ‘input’ we refer to any type of
   information other than the mixed expression data
   Classification: Partial or Complete Mathematical/ Statistical method
   Input: reference/ signature expression data Input: marker gene lists No
   input Output:
   Cell/tissue type-specific proportions (A) Output:
   Cell/tissue type-specific expression profiles (S) Output: number of
   cell/tissue types
   MMAD [[69]7] C/P Maximum likelihood/ Conjugate gradient Yes Yes Yes^a
   Yes Yes No
   DSA [[70]9] C Quadratic programming No Yes No Yes Yes No
   NMF-CELLMIX [[71]12] C Non-negative Matrix Factorization No Yes No Yes
   Yes No
   CIBERSORT [[72]16] P v-Support Vector Regression Yes No No Yes No No
   DeconRNASeq [[73]24] P Quadratic programming Yes No No Yes No Yes
   Deblender C Least Squares/ Quadratic programming/ Non-negative Matrix
   Factorization/ Unified Particle Swarm Optimization No Yes Yes Yes Yes
   Yes
   [74]Open in a new tab

   ^aMMAD requires as input the number of cell/tissue types

Results

   Deblender offers four pipelines (two semi- and two unsupervised) for
   estimating the mixture proportions and cell/tissue type-specific
   profiles from mixed microarray or RNA-Seq data (Fig. [75]1). The
   appropriate pipeline is determined by the availability of marker gene
   lists, whether the number of constituent cell/tissue types is assumed
   to be known, and whether the mixed samples outnumber the cell/tissue
   types.

Fig. 1.

   [76]Fig. 1
   [77]Open in a new tab

   Overview of Deblender. Deblender is a flexible tool operating both on
   semi- and unsupervised mode based on the availability of marker gene
   lists. More problem-specific pipelines are also available depending on
   the number of samples relative to the number of cell/tissue types
   (under-determined refers to the case where the number of samples is
   lower than the number of involved cell/tissue types, otherwise
   over-determined) and on information about the number of participating
   cell/tissue types

   To examine the efficiency of Deblender for estimating the mixture
   proportions, we employed several benchmark mixture datasets (including
   both microarray and RNA-Seq data) with well-defined cell subpopulations
   (see Additional file [78]1) comparing the performance of Deblender with
   a set of other deconvolution tools. For use in partial and complete
   semi-supervised methods, we derived marker gene lists or gene
   expression signatures for the cell/tissues in the datasets, either from
   the literature [[79]9, [80]30] or by use of other tools [[81]16,
   [82]24]. Also, two patient cancer expression datasets (one microarray
   and one RNA-Seq) were explored and estimates of cell type proportions
   obtained using deconvolution approaches were compared to analogous
   estimates from flow cytometry and histology.

   For completeness, we compared Deblender not only to other complete
   semi-supervised and unsupervised techniques, but also to two robust
   partial methods, CIBERSORT and DeconRNASeq. To assess the accuracy of
   each method, we calculated the Root Mean Squared Error (RMSE) and the
   Pearson correlation coefficient to compare the estimated mixture
   proportions and the known (or otherwise measured) cell type
   proportions. RMSE was calculated both based on the full proportion
   matrices as well as on the proportions of each cell/tissue type
   separately with the arithmetic mean value reported (mRMSE).

Estimating mixture proportions in benchmark expression data

   All methods were applied to the set of probe/gene expression data that
   fits to their operational mode (details are provided in Additional file
   [83]1). Deblender operates under two algorithmic schemes applied
   consecutively in two stages (stages I and II). Here, we report
   primarily the results from stage I (S1) while the result from stages I
   and II (S1 & S2) is reported in cases where improved performance was
   achieved. First, Deblender in the unsupervised mode (S1) was tested on
   all recorded probes/genes (annotated and un-annotated) and the
   performance in terms of correlation with known mixing proportions
   ranged in [0.78 − 0.96], while on the preprocessed datasets (i.e.,
   using only annotated probes, one selected per gene identifier) ranged
   in [0.72 − 0.89] (Additional file [84]1: Tables S1-S2). Second, we
   report the estimated proportions based on a ‘default’ filtering setting
   (retains 53 − 74% of the genes) that performs well across all datasets
   and is likely to work well on most real datasets. Other settings are
   described in detail in Additional file [85]1.

   The [86]GSE19830 microarray dataset includes 33 mixed samples composed
   of known proportions of pure rat brain, liver and lung tissue. Partial
   and complete semi-supervised methods were evaluated both on a set of
   237 marker probes and a 171 probe signature matrix. Figure [87]2
   summarizes the results obtained on the 237 marker probes. In terms of
   how accurately they estimate mixture proportions, Deblender in its
   complete semi-supervised mode and NMF-CELLMIX performed similar to each
   other and to the partial CIBERSORT algorithm. Notably, in S1
   semi-supervised mode, Deblender applies the deconvolution method
   implemented in DSA tool and thus reproduces the same results therefore
   both tools are reported. Similar results were obtained when the set of
   171 signature probes was examined (see Additional file [88]1: Table
   S3). When switching to unsupervised mode, denoted by ‘*’, Deblender*
   outperformed MMAD* (for other settings see Additional file [89]1:
   Tables S4-S7). This dataset serves as good example for exploring the
   estimation of proportions with varying tissue ratios across samples (a
   side-by-side comparison of Deblender* estimates relative to real
   proportions is provided in Additional file [90]1: Table S8). The
   performance of Deblender* was high (r = 0.89) also when all probes were
   utilized (without dataset preprocessing - see Additional file [91]1:
   Table S1). Similar observations can be drawn for the [92]GSE11058 and
   [93]GSE19380 benchmark microarray datasets (Additional file [94]1:
   Figures S1-S2, Tables S1-S2, S4-S7, S9-S10).

Fig. 2.

   [95]Fig. 2
   [96]Open in a new tab

   [97]GSE19830 dataset with 33 mixed samples including 3 tissues (brain,
   liver, lung). Evaluation of methods relative to real mixture
   proportions based on known markers (partial: CIBERSORT, complete
   semi-supervised: Deblender, DSA, MMAD, NMF-CELLMIX) or without prior
   information (complete unsupervised: Deblender*, MMAD*). Deblender*
   results are reported with (A) default preprocessing – S1 (B) default
   preprocessing – S1&S2. mRMSE: arithmetic mean of the RMSE calculated
   for each tissue separately

   Next, we analyzed an RNA-seq dataset which includes 10 mixed samples
   composed of human brain, muscle, lung, liver and heart tissue, in known
   proportions [[98]24]. A set of 1520 signature genes was extracted by
   DeconRNASeq. CIBERSORT and DeconRNASeq were run with the full signature
   matrix while the complete semi-supervised methods were run with the
   5-fold differentially expressed genes (i.e., genes expressed at least
   5-fold higher in the respective cell type relative to any of the other
   cell types). The results are in agreement with those for the other
   datasets (Fig. [99]3). Complete semi-supervised techniques like
   Deblender/DSA and MMAD showed high performance and in some cases even
   higher than partial techniques. When switching to complete unsupervised
   mode, Deblender* outperformed MMAD* (see also Additional file [100]1:
   Tables S2, S4, S6, S7). A side-by-side comparison of Deblender*
   estimates against real proportions is provided (Additional file [101]1:
   Table S11). Moreover, we checked how common preprocessing steps in
   RNA-seq analysis – such as adding a pseudo-count offset to avoid zero
   values in downstream analyses – affected the performance of Deblender*
   and MMAD*. We checked two offsets, 0.0001 and 1. MMAD* did not perform
   well with the offset of 0.0001 since it affected the identification of
   highly variable genes by causing an inflation of the variance of low
   abundance genes after log transformation. When tested with an offset of
   1, MMAD* improved and performed better than Deblender* in certain
   parameter settings whereas Deblender* preserved its good performance as
   with the offset of 0.0001 (see Additional file [102]1: Table S12).

Fig. 3.

   [103]Fig. 3
   [104]Open in a new tab

   RNA-Seq dataset with 10 mixed samples including 5 tissues (brain,
   muscle, lung, liver, heart). Evaluation of methods relative to ground
   truth mixture proportions based on a set of signature genes (partial:
   CIBERSORT, DeconRNASeq), on markers extracted from the signature set
   (complete semi-supervised: Deblender, DSA, MMAD, NMF-CELLMIX) or
   without prior information (complete unsupervised: Deblender*, MMAD*).
   Deblender* result is reported with default preprocessing – S1

   The unsupervised methods were also run with different parameters,
   dataset filtering settings and with noise added (see Additional file
   [105]1: Tables S4-S7, S10, S13, S14).

Estimating mixture proportions in under-determined cases based on benchmark
expression data

   We examined the under-determined case relevant for semi-supervised and
   unsupervised methods where the number of samples is less than the
   number of involved cell/tissue types. We evaluated Deblender and MMAD
   on a subset dataset from [106]GSE11058 (i.e., 3 samples containing 4
   cell types) and on a subset dataset from the RNA-Seq dataset (i.e., 4
   samples containing 5 tissues). For the first dataset, we checked the
   performance based on known marker genes, while in the latter we checked
   the unsupervised mode. For MMAD* we applied default percentile (see
   also Additional file [107]1: Table S6). Deblender outperformed MMAD in
   both cases (Fig. [108]4).

Fig. 4.

   Fig. 4
   [109]Open in a new tab

   Evaluation of Deblender and MMAD (A) on a subset dataset extracted from
   [110]GSE11058 with 3 samples including 4 cell types (semi-supervised
   mode) and (B) a subset dataset extracted from the RNA-Seq with 4
   samples including 5 tissues (unsupervised mode)

Estimating the number of cell/tissue types in benchmark expression datasets

   We evaluated the efficiency of MDL criterion integrated in Deblender*
   for estimating the number of cell/tissue types present in the mixture.
   For this, we selected the [111]GSE19830, [112]GSE11058 and RNA-Seq
   datasets in which all cell/tissue types are present in all mixed
   samples. For all datasets we applied the unsupervised mode S1&S2 and
   recorded the MDL value with k ranging from 2 to 8 after filtering the
   5% of genes with the lowest expression vector norm and the 5% of genes
   with the highest expression vector norm. We also used a cutoff of
   CV ≥ 0.4, since we observed that highly variable genes improve MDL
   computation. As seen in Fig. [113]5, for all datasets the minimum of
   the MDL curve predicted successfully the correct number of cell/tissue
   types.

Fig. 5.

   [114]Fig. 5
   [115]Open in a new tab

   Performance evaluation of Minimum Description Length (MDL) criterion
   for detecting the number of cell/tissue types present in three mixture
   datasets. The boxes mark the true number of involved cell/tissue types

Estimating mixture proportions in patient cancer expression data

   We evaluated Deblender* and MMAD* on patient cancer expression datasets
   for which only estimates of the real proportions are available. First,
   we examined [116]GSE65135 microarray dataset which contains 14
   follicular lymphoma samples consisting of CD4+ T cells, CD8+ T cells
   and B cells, with proportions estimated based on flow cytometry data.
   As shown in Fig. [117]6, Deblender* (default setting – S1) performed
   better in terms of correlation with the flow cytometry proportions than
   did MMAD* (default percentile, see also Additional file [118]1: Tables
   S6 and S7, Figure S3 and Additional file [119]2). Further, the MDL
   estimation indicated K = 2 as the number of involved cell types with
   K = 3 being close to this minimal value (Additional file [120]1: Figure
   S4).

Fig. 6.

   Fig. 6
   [121]Open in a new tab

   [122]GSE65135 dataset with 14 mixed samples including 3 cell types (CD4
   T cells, CD8 T cells and B cells). Evaluation of the unsupervised mode
   of Deblender* (default preprocessing – S1) and MMAD* (default
   percentile) relative to flow cytometry proportions. Pearson correlation
   (r) results per cell type, Deblender*: CD4 T cells: -0.178, CD8 T
   cells: 0.535, B cells: 0.266, MMAD: CD4 T cells: 0.099, CD8 T cells:
   0.301, B cells: 0.352

   We also evaluated the performance of Deblender* and MMAD* on the TCGA
   RNA-Seq data of 1093 breast cancer primary solid tumor samples
   [[123]31]. We used a simplified model with three major tissue
   components for which histological estimates are available for the main
   types of tissue components recognized on the tissue slides (normal,
   stromal and tumor). Of note, the MDL estimation of Deblender showed
   that the number of involved tissue components ranges between 15 and 26
   (Additional file [124]1: Figure S4) and this observation accords well
   with the prediction of 23 cell types in the TCGA samples by relevant
   study [[125]8]. Deblender* and MMAD* were tested both with their
   ‘default’ settings (as used for the benchmark datasets), which include
   in the analysis many of the lowly expressed genes, but also with a
   ‘customized’ setting that discards them. Finally, ~76% of the complete
   gene set was retained and for Deblender* no other filtering was
   applied. We performed pathway enrichment analysis on the three clusters
   identified by Deblender* in the customized setting and checked the
   enriched Gene Ontology (GO) categories and Kyoto Encyclopedia of Genes
   and Genomes (KEGG) pathways. The enriched terms were matched to each
   tissue component after considering the cluster order configuration that
   led to the highest correlation with the known histology estimates (see
   Additional file [126]3). GO categories reflecting immune response
   activity were top ranked amongst the categories enriched in the cluster
   corresponding to ‘Normal’. GO terms reflecting various metabolic
   processes were significantly enriched in the cluster that corresponded
   to ‘Stromal’, as also reflected by the enriched KEGG pathway terms. GO
   terms reflecting metabolism at different levels were also enriched in
   the cluster that corresponded to ‘Cancer’. Further, various gene sets
   reflecting different cancer associated pathways and insulin related
   signaling were enriched in the ‘Cancer’ cluster.

   In Fig. [127]7, we show the results based on the ‘customized’ setting
   where Deblender* (S1) performed better than MMAD* (default percentile)
   in terms of correlation with the histological estimates (see also
   Additional file [128]1: Tables S6 and S7 and Additional file [129]2).
   Similar were the results for Deblender* when the ‘default’ setting was
   applied (r = 0.74). Notably, when looking each tissue component
   independently, no significant correlation was found between real
   proportions and MMAD* and Deblender* estimates (Additional file [130]1:
   Figure S5), respectively. The overall better correlation of Deblender*
   reflects its ability to better recover for each sample the relative
   abundance of each tissue component. The Deblender* mean proportion
   value of the tumor component was lower (S1:μ = 0.48, σ = 0.08) than the
   respective histology-based estimates (μ = 0.74, σ = 0.18). When zooming
   into subsets of samples with known molecular subtype (Luminal A,
   Luminal B, Basal-like, Her2-enriched, Normal breast-like), Deblender*
   (S1) showed higher performance for Basal-like, Her2-enriched and Normal
   breast-like groups.

Fig. 7.

   Fig. 7
   [131]Open in a new tab

   TCGA breast cancer RNA-Seq dataset with 1093 primary tumor mixed
   samples including 3 defined tissue components (normal, stromal, tumor).
   Evaluation of the unsupervised mode of Deblender* (customized setting –
   S1) and MMAD* methods (customized setting – default percentile)
   relative to histological estimates. Correlation is also reported for
   subsets of samples with known molecular subtype. Pearson correlation
   (r) results per tissue component, Deblender*: Normal: − 0.051, Stromal:
   − 0.022, Tumor: -0.061, MMAD: Normal: − 0.041, Stromal: 0.071, Tumor:
   − 0.026

   Since Deblender* used the full set of expression data to assign mixture
   proportions, its results may depend strongly on the set of samples
   included in the analysis. To assess this dependence, we also applied
   Deblender* on a bigger dataset including, in addition to the 1093
   primary tumor samples, also 7 metastasis and 112 normal samples. The
   tissue composition is assumed to be highly different between primary
   tumor, metastasis and normal samples. We observed that only when
   applying the customized setting and CV ≥ 1 (analyzing ~23% of the
   complete gene set), the overall correlation of primary tumor samples
   relative to known mixture proportions decreased (S1: r = 0.54) but the
   normal samples achieved higher mean proportion value for the ‘normal’
   tissue component (S1: μ = 0.54, σ = 0.10) relative to the mean value of
   the primary and metastasis tumor samples respectively (S1: μ = 0.19,
   σ = 0.06). Also, the small cohort of metastasis samples displayed mean
   values for all components similar to those observed in the primary
   tumor samples.

   We further evaluated the agreement of Deblender* estimated proportions
   relative to the tumor purity estimates (i.e., the proportion of cancer
   cells in the mixture) assigned by other relevant methods that used gene
   expression or other TCGA genomic data such as somatic copy-number
   variation, somatic mutations and DNA methylation. For this, we
   downloaded the results from the different methods from Aran et al.
   [[132]32], where a systematic analysis of a set of methods (ESTIMATE,
   ABSOLUTE, LUMP, IHC) as well as an additional consensus method (CPE) is
   presented. In this case, Deblender* was run in the cohort of primary
   tumor samples both with three tissue components and in a constrained
   fashion with two – corresponding to tumor and non-tumor component. We
   ran Deblender* using three tissue types (S1) as aforementioned and also
   with two tissue types using two different settings (S1, setting 1: no
   filtering, setting 2: CV ≥ 3). We checked across all samples the
   correlation of Deblender* tumor purity estimates relative to those
   estimated by each method and found in general a low but positive
   correlation with respect to the consensus method (CPE) (S1: r = 0.24,
   for the three-tissue-component model, r = 0.15 and r = 0.35, for the
   two-tissue-component model) (see also Additional file [133]1: Figures
   S6-S11). However, when restricting our analysis to the set of samples
   where Deblender* proportions deviated in the range [−0.2, +0.2] from
   CPE results, we found moderate or high correlations (S1: r = 0.56,
   r = 0.80 for 64% and 43% of the samples) in the two-tissue-component
   model. In the pairwise comparisons, Deblender* results correlated more
   with ABSOLUTE and LUMP scores.

Runtime

   We recorded the elapsed time (tic-toc result in Matlab) of Deblender*
   unsupervised mode for calculating mixture proportions (S1 & S2) and MDL
   in two benchmark datasets ([134]GSE19830 and RNA-Seq) with the
   respective default settings (Additional file [135]1: Figure S12). All
   tests were run on a 2.40 GHz Intel Core i7 with 7.65G RAM running
   Windows 7.

Discussion

   In silico deconvolution modeling started with partial approaches
   requiring as available, either the cell/tissue type-specific expression
   profiles or mixture proportions, but gradually complete semi-supervised
   and more importantly unsupervised methods have gained ground. The
   unsupervised ones showed that representative profiles - sufficient to
   decompose the signal - can be extracted directly from the mixed data
   alleviating the need for additional experiments or borrowing reference
   data from other sources.

   In this work, we present Deblender, a new flexible complete
   deconvolution tool with semi- and unsupervised operational modes which
   integrates features introduced by recent approaches [[136]7, [137]9,
   [138]12, [139]19, [140]21, [141]33] and proposes several novel
   concepts. First, Deblender adopts the deconvolution model and
   constraints proposed by others [[142]9, [143]12, [144]19], based solely
   on marker gene lists, and extends this concept into unsupervised by
   employing a flexible assumption about the gene cell/tissue
   type-specific expression. In particular, we assume that many genes show
   differences in terms of relative expression among the different
   cell/tissue types [[145]25]. Lately, single-cell sequencing studies
   like Dueck et al. [[146]26] have shown that gene expression differs
   globally across tissues in terms of the number of genes expressed, the
   average expression pattern and the within-cell-type variation patterns.
   Although each cell type exhibits a characteristic transcriptome profile
   enriched in marker genes, the marker gene expression is rarely if ever
   limited to the relevant cell type [[147]26]. Moreover, some marker
   genes show significant variability within the relevant cell type,
   indicating that these genes are not sufficient to determine the
   cellular phenotype [[148]26]. Under this notion, we apply clustering to
   identify groups of genes prone to be more expressed in a specific
   cell/tissue type and employ those clusters (along with their cluster
   exemplars) with the constraints others use for the marker sets. In this
   way, we overcome all the arbitrary cutoffs/criteria that most partial
   and complete methods face when selecting the small cohort of
   signature/marker genes. The cluster-based concept was adapted to two
   different algorithmic approaches. First, we adapted the unsupervised
   algorithm of Zhong et al. [[149]9] for estimating mixture proportions
   after isolating the marker mixed gene expression profiles and
   subsequently estimating cell/tissue type-specific expression profiles
   for all recorded genes based on the estimated proportion result.
   Second, we adapted the Non-negative Matrix Factorization scheme of
   Gaujoux and Seoighe [[150]12, [151]19] which estimates mixture
   proportions and cell/tissue type-specific profiles based on all genes
   with the marker genes constrained to express only in the relevant
   tissue/cell type. Deblender runs primarily the first algorithm
   (referred to as S1) which we have found to work well on a set of
   benchmark datasets. The results of this algorithm can further be used
   to initialize the second (referred to as S2). We suggest using S1&S2
   after evaluating the clustering result where cluster exemplars that do
   not differentiate well from each other might not serve as good
   candidates for deconvolution with S1. Third, we extend our unsupervised
   method by incorporating an information theoretic criterion to predict
   the number of cell/tissue types. In the work of Wang et al. [[152]21],
   this criterion evaluated the number of cell/tissue types based on
   predicted small-sized marker gene sets. Here, we show that this
   criterion can perform equally well in our proposed cluster-based
   approach. Finally, we introduce an adapted Non-negative Matrix
   Factorization (NMF) scheme to deal with the challenging
   under-determined cases for proportion estimation in semi- /unsupervised
   mode, i.e., cases where the number of cell/tissue types exceeds the
   number of samples in the dataset.

   We assessed the performance of Deblender on a set of benchmark datasets
   where the real proportions of cell/tissue types are available and
   cancer patient datasets where only flow cytometry or histological
   estimates are available. For comparison, we recruited several partial
   and complete state-of-the-art methods (see Additional file [153]1 for
   short description). The results on benchmark datasets showed that both
   the semi-supervised and the unsupervised mode of Deblender performed in
   both the over- and under-determined cases similarly to the comparative
   reference-based methods included in the analyses. At this point it is
   worth commenting that the extra information included in the partial
   deconvolution methods as compared to the complete semi-supervised ones
   (that is reference expression profiles in partial methods compared to
   marker gene sets for the semi-supervised ones) is not always translated
   into better performance. With respect to unsupervised mode, we showed
   that cluster sets (and their cluster exemplars) descending from the
   mixed gene expression dataset can serve as a successful alternative to
   externally defined lists of marker genes. This indicates that large
   part of the transcriptome carries considerable cell/tissue
   type-specific information, i.e., many genes have cell/tissue-type
   dependent expression levels. Therefore, clustering using expression
   across most genes can lead to successful signal decomposition. We have
   also seen that in some cases, it is beneficial to include only the
   highly variable genes in the clustering. This may depend on which
   cell/tissue types are included and also on how much cell type
   proportions vary among the samples analyzed. If all samples have highly
   similar cell/tissue-type proportions, the approach will not work.
   Similarly, the method is not appropriate for cases where the sample
   cohort includes both mixture and one-cell-type samples due to the
   clustering involved. Moreover, the results showed that Deblender* is
   applicable on RNA-Seq data and overcomes preprocessing challenges that
   one faces when dealing with RNA-seq data, e.g., choosing the right
   pseudo-count offset prior to log transformation. However, for some of
   the benchmark datasets and also on both real datasets Deblender
   produced estimates that considerably deviated from known (or otherwise
   approximated) proportions. This can be attributed to (a) the clustering
   which may determine cluster exemplars that most likely do not represent
   well distinct cell/tissue types (each cluster having contributions from
   multiple underlying components), as compared to marker genes, resulting
   so in higher errors in the estimates of cell type proportions, (b)
   deconvolution results are affected by the inherent measurement noise,
   biological variability and the fact that the mixed data we have
   available has been processed and normalized (e.g., quantile normalized)
   as described in the original articles. The effect of data processing is
   debated in the literature, and it has been argued that the choice of
   normalization procedures affects the deconvolution results [[154]34,
   [155]35], (c) the multicollinearity issue of having correlated cell
   types in the mixture [[156]34] and (d) the fact that the flow cytometry
   and histological estimations in real datasets are similarly an
   approximation of the real proportions.

   Moving forward, Deblender* performed better than MMAD* in the cancer
   patient expression datasets. The improved performance was more evident
   in the TCGA breast cancer RNA-Seq dataset, where we checked
   independently the sets of samples with known molecular subtype
   identities, although deconvolution was realized on the full set of
   samples irrespective of this information. We observed the greatest
   differences for the Basal-like, Her2-enriched and Normal breast-like
   subtypes whereas similar performance was achieved for Luminal A and
   Luminal B. In our simplified model with three tissue types for the TCGA
   dataset, we observed that the mean proportion value for the tumor
   component was lower (S1:μ = 0.48, σ = 0.08) compared to the mean
   estimated by histology (μ = 0.74, σ = 0.18). However, this difference
   may have other reasons. First, there is more than one cell type in each
   tissue component, and some of the signals from the tumor cell component
   might resemble signaling from normal epithelial cells or the stromal
   tissue. Second, we applied clustering with the number of clusters being
   lower than the actual number of tissue components [[157]8]. In this
   way, the identified cluster exemplars most likely do not represent well
   distinct cell/tissue components (each cluster having contributions from
   multiple underlying components), thus affecting deconvolution
   efficiency. Also, immune cells might be dispersed in the different
   components, complicating the differentiation of signals between the
   tissue components. Lately, in the study of Onuchik et al. [[158]8] the
   intra-tumoral heterogeneity was explored by deconvolving in silico the
   methylation data of mixed samples into five different cancer epithelial
   cell types along with one normal epithelial, one immune and one stromal
   tissue. Third, differences between transcriptome-based and
   pathology-based estimates have been reported [[159]36]. Here, they
   hypothesized that reasons for the differences include interobserver
   bias and differences between the tissue sections of the samples
   examined and those used for nucleic acid extraction [[160]29]. Further,
   we checked the performance of Deblender on a cohort of samples with
   increased biological variability in the inherent component profiles
   including, in addition to primary tumor samples, metastasis and normal
   samples. Our unsupervised analysis indicates that lowly or moderately
   variable genes can reveal more information for the samples with
   increased heterogeneity, like primary tumor and metastasis samples.
   However, it is the subset of highly variable/marker genes that can
   assist in more accurate proportion estimation of samples with one or
   few tissue components involved, like the normal samples.

   Also, we showed in the TCGA breast cancer data that the tumor
   proportions estimated by Deblender* in a two-tissue-component model
   (tumor/non-tumor) using 76% of the complete gene set correlated well
   (S1: Pearson’s r = 0.56), for a good proportion of the samples (~64%),
   with the consensus tumor proportions descending from methods that use
   genomic information other than gene expression, such as somatic
   copy-number variation, somatic mutations and DNA methylation [[161]32].
   A confounding reason behind comparisons is the fact that Deblender*
   uses expression information for the whole set of samples before
   assigning mixture proportions to each sample. Moreover, in this first
   round of analysis we did not consider the intrinsic molecular subtypes
   [[162]37] and assumed simplistically that all samples share common
   tumor and non-tumor profiles. On the other side, the ESTIMATE method
   [[163]36], which is the method conceptually closest to Deblender, uses
   reference gene expression data to assign tumor purity in each sample
   independently. Furthermore, with respect to ABSOLUTE which uses copy
   number data [[164]38], it has already been commented by Wang et al.
   [[165]20] that such values are generally ‘static’, while gene
   expression values are intrinsically ‘dynamic’, a realistic assumption
   especially in cases where the samples are pooled during dynamic complex
   diseases like cancer. Also the technical variability, i.e., noise, is
   significantly different between copy number and gene expression
   signals.

Conclusions

   The Deblender tool represents a methodological framework with a broad
   repertoire of operational modes for datasets with limited or no access
   to reference cell/tissue type-specific expression data, marker gene
   lists and information about the cell/tissue type composition. As such,
   we believe that it will be of particular interest to studies taking
   place in a realistic setting, i.e., highly complex tissues and small
   patient cohorts. Deblender extracts information from the whole
   transcriptome rather than searching features serving as hallmarks of
   marker genes. As such, Deblender is a good candidate tool for providing
   an initial virtual decomposition of the tumor and its microenvironment,
   situation where the definition of marker genes is not straightforward.

   The increasing number of published partial and complete semi-supervised
   computational deconvolution methods and their recent achievements as
   predictive and clinical tools [[166]16] have indicated to a great
   extent that in silico decomposition has great explanatory power and can
   serve as an appealing low-cost alternative to physical cell separation
   techniques. Thus, it can facilitate a more extensive adoption of
   microarray/RNA-seq technology across the continuum from prevention to
   detection, staging, prognostication and treatment development in
   complex diseases. Nevertheless, our notion is that this next generation
   of complete unsupervised techniques - like Deblender - will be a step
   towards integrating such tools into clinical decision pipelines and
   allowing researchers to re−/meta-analyze public transcriptome data in a
   manner that demands little time and performs in a robust fashion. In
   this way, researchers will gain increased insight into the true
   disease-dependent expression and mixture proportion differences and be
   able to generalize results across studies.

Methods

Linear model on gene expression deconvolution

   We adopt the linear model described in several deconvolution algorithms
   [[167]9, [168]12, [169]19]. Let S be a n × k cell/tissue-type-specific
   gene expression matrix that contains k cell/tissue types and n genes, A
   be a k × p matrix with each column representing the mixture proportions
   of k cell/tissue types in each sample, and X be a n × p mixed
   expression matrix with n genes and p samples (containing the measured
   gene expression levels for the mixed samples). The mixing process can
   be described in terms of matrix notation as:
   [MATH: <mi>X</mi><mo>=</mo><mi mathvariant="italic">SA</mi> :MATH]
   1

   In complete deconvolution both S and A are unknown. In our approach,
   first the proportion matrix A is calculated based either on marker gene
   lists (semi-supervised) or cluster sets (unsupervised mode). Then, we
   estimate the cell/tissue type-specific expression values for each gene
   independently to finally formulate the whole S cell/tissue
   type-specific gene expression profile matrix.

Estimating mixture proportions

   Deblender employs the assumption that a subset of genes called markers,
   i.e., those that are highly expressed in a specific cell/tissue type
   and lowly expressed in all other cell/tissue types known to be included
   in the mixture, are appropriate to predict mixture proportions
   (semi-supervised mode) [[170]9, [171]12, [172]19]. In case this
   information is not available, we extend these methods and apply
   clustering on the gene expression dataset and subsequently use
   representative profiles – cluster exemplars – as an alternative.
   Deblender operates in two stages: in stage I (S1), the proportions are
   estimated using an approach based on that described by Zhong et al.
   [[173]9] and in stage II (S2), the proportions are calculated using the
   method proposed by Gaujoux and Seoighe [[174]12, [175]19] after using
   as initialization (in one of many replicated runs of stage II) the
   proportion matrix (optionally also the cell/tissue type-specific
   expression profiles based on this proportion result) produced in S1.

   Stage I: We first assume S[m], a m × k matrix representing for each of
   the k cell/tissue types a set of marker genes (or a cluster) highly
   expressed in a respective cell/tissue type and zero in the remaining
   cell/tissue types.
   [MATH:
   <msub><mi>X</mi><mi>m</mi></msub><mo>=</mo><msub><mi>S</mi><mi>m</mi></
   msub><mi>A</mi> :MATH]
   2

   We simplify further the model (eq. [176]3) and assume that a
   meta-marker average expression profile for each set (or cluster
   exemplar) exists and formulate
   [MATH: <msub><mover accent="true"><mi mathvariant="normal">S</mi><mo
   stretchy="true">~</mo></mover><mi mathvariant="normal">m</mi></msub>
   :MATH]
   (eq.[177]4) as follows:
   [MATH: <msub><mrow><mi>S</mi></mrow><mi>m</mi></msub><mo>=</mo><mfenced
   close=")" open="("><mrow><mtable columnalign="left" columnspacing="1em"
   rowspacing="1pt"><mtr><mtd><msub><mrow><mi>g</mi></mrow><mrow><mn>11</m
   n></mrow></msub><mspace width="2em"></mspace><mn>0</mn><mspace
   width="1.74em"></mspace><mo>…</mo><mspace
   width="1.98em"></mspace><mn>0</mn></mtd></mtr><mtr><mtd><mrow></mrow><m
   sub><mrow><mi>g</mi></mrow><mrow><mn>21</mn></mrow></msub><mspace
   width="2em"></mspace><mn>0</mn><mspace
   width="1.74em"></mspace><mo>…</mo><mspace
   width="1.99em"></mspace><mn>0</mn></mtd></mtr><mtr><mtd><mrow></mrow><m
   space width="thickmathspace"></mspace><mn>0</mn><mspace
   width="2.1em"></mspace><msub><mrow><mi>g</mi></mrow><mrow><mn>32</mn></
   mrow></msub><mspace width=".85em"></mspace><mo>…</mo><mspace
   width="2em"></mspace><mn>0</mn></mtd></mtr><mtr><mtd><mrow></mrow><mspa
   ce width="thickmathspace"></mspace><mn>0</mn><mspace
   width="2.12em"></mspace><msub><mrow><mi>g</mi></mrow><mrow><mn>42</mn><
   /mrow></msub><mspace width="1.45em"></mspace><mo>…</mo><mspace
   width="2em"></mspace><mn>0</mn></mtd></mtr><mtr><mtd><mrow></mrow><mspa
   ce width="thickmathspace"></mspace><mn>0</mn><mspace
   width="2.12em"></mspace><msub><mrow><mi>g</mi></mrow><mrow><mn>52</mn><
   /mrow></msub><mspace width="1.5em"></mspace><mo>…</mo><mspace
   width="1.9em"></mspace><mn>0</mn></mtd></mtr><mtr><mtd><mrow></mrow><ms
   pace width="thickmathspace"></mspace><mn>0</mn><mspace
   width="2.55em"></mspace><mn>0</mn><mspace
   width="1.62em"></mspace><mo>⋱</mo><mspace
   width="1.9em"></mspace><mn>0</mn></mtd></mtr><mtr><mtd><mrow></mrow><ms
   pace width="thickmathspace"></mspace><mn>0</mn><mspace
   width="2.57em"></mspace><mn>0</mn><mspace
   width="1.8em"></mspace><mo>…</mo><mspace
   width="1.9em"></mspace><msub><mrow><mi>g</mi></mrow><mrow><mi>m</mi><mi
   >k</mi></mrow></msub></mtd></mtr></mtable></mrow></mfenced> :MATH]
   3
   [MATH:
   <mover><msub><mi>S</mi><mi>m</mi></msub><mo>~</mo></mover><mo>=</mo><mf
   enced close=")" open="("><mtable
   columnalign="left"><mtr><mtd><msub><mi>m</mi><mn>1</mn></msub><mspace
   width="1.12em"></mspace><mn>0</mn><mspace
   width="1em"></mspace><mo>…</mo><mspace
   width="1.12em"></mspace><mn>0</mn></mtd></mtr><mtr><mtd><mspace
   width="0.12em"></mspace><mn>0</mn><mspace
   width="1.5em"></mspace><msub><mi>m</mi><mrow><mn>2</mn><mspace
   width="0.5em"></mspace></mrow></msub><mo>…</mo><mspace
   width="1.1em"></mspace><mn>0</mn></mtd></mtr><mtr><mtd><mspace
   width="0.12em"></mspace><mn>0</mn><mspace
   width="1.72em"></mspace><mn>0</mn><mspace
   width=".82em"></mspace><mo>⋱</mo><mspace
   width="1.13em"></mspace><mn>0</mn></mtd></mtr><mtr><mtd><mspace
   width="0.12em"></mspace><mn>0</mn><mspace
   width="1.72em"></mspace><mn>0</mn><mspace
   width="0.98em"></mspace><mo>…</mo><mspace
   width="1.13em"></mspace><msub><mi>m</mi><mi>k</mi></msub></mtd></mtr></
   mtable></mfenced> :MATH]
   4

   Similar to
   [MATH: <msub><mover accent="true"><mi mathvariant="normal">S</mi><mo
   stretchy="true">~</mo></mover><mi mathvariant="normal">m</mi></msub>
   :MATH]
   , we create the
   [MATH: <msub><mover accent="true"><mi mathvariant="normal">X</mi><mo
   stretchy="true">~</mo></mover><mi mathvariant="normal">m</mi></msub>
   :MATH]
   of the mixed data by taking for each set the average of the mixed
   expression profiles of the respective marker genes (or cluster
   exemplars) and then eq. [178]2 becomes:
   [MATH: <msub><mover accent="true"><mi>X</mi><mo
   stretchy="true">˜</mo></mover><mi>m</mi></msub><mo>=</mo><msub><mover
   accent="true"><mi>S</mi><mo
   stretchy="true">˜</mo></mover><mi>m</mi></msub><mi>A</mi> :MATH]
   5

   Since
   [MATH: <msub><mover accent="true"><mi>S</mi><mo
   stretchy="true">~</mo></mover><mi>m</mi></msub> :MATH]
   is diagonal matrix, we can multiple both sides with
   [MATH: <msubsup><mover accent="true"><mi>S</mi><mo
   stretchy="true">~</mo></mover><mi>m</mi><mrow><mo>−</mo><mn>1</mn></mro
   w></msubsup> :MATH]
   and obtain:
   [MATH:
   <mover><msubsup><mi>S</mi><mi>m</mi><mrow><mo>−</mo><mn>1</mn></mrow></
   msubsup><mo>~</mo></mover><mspace
   width="0.24em"></mspace><mover><msub><mi>X</mi><mi>m</mi></msub><mo>~</
   mo></mover><mo>=</mo><mi>A</mi> :MATH]
   6

   Based on the model constraint that the sum of proportions for each
   sample sums to 1, a new system of equations is formed with k unknown
   variables, i.e., the diagonal elements of
   [MATH: <msubsup><mover accent="true"><mi mathvariant="normal">S</mi><mo
   stretchy="true">~</mo></mover><mi
   mathvariant="normal">m</mi><mrow><mo>−</mo><mn>1</mn></mrow></msubsup>
   :MATH]
   :
   [MATH: <munderover><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>k</mi></munderover><msub><mfenced close=")"
   open="("><mrow><mover><msubsup><mi>S</mi><mi>m</mi><mrow><mo>−</mo><mn>
   1</mn></mrow></msubsup><mo>~</mo></mover><msub><mi>X</mi><mi>m</mi></ms
   ub></mrow></mfenced><mi
   mathvariant="italic">ij</mi></msub><mo>=</mo><mn>1</mn> :MATH]
   7

   When the number of mixed samples is greater than the number of
   cell/tissue types involved, i.e., when p > k, we can solve the
   overdetermined system of equations (eq. [179]7) with k unknown
   parameters, non-negative constraints and as objective function the
   minimization of the squared norm of the residual. After estimating the
   unknown variables of eq. [180]7 we can return to eq. [181]6 and
   estimate the A mixture proportion matrix. Throughout all operation
   stages of Deblender, in case the sum of estimated proportions for a
   given sample does not sum to 1, the proportions are scaled accordingly
   before subsequent analysis.

   Stage II: In this stage, we adopt a Nonnegative Matrix Factorization
   scheme proposed for describing the deconvolution model based on marker
   genes [[182]12, [183]19]. The general algorithmic structure of NMF
   approaches is given a non-negative n × p matrix X, to find, after
   minimizing common objective functions like the Frobenius norm or the
   Kullback-Leibler divergence, an approximation
   [MATH: <mi>X</mi><mo>≈</mo><mi mathvariant="italic">WH</mi> :MATH]
   8

   where W (i.e., basis components), H (i.e., mixture coefficients) are
   n × r and r × p non-negative matrices, respectively, and the
   factorization rank r is often such that r <  < min(n, p).This can be
   adapted to the deconvolution model if the columns of H are constrained
   to sum to one. X represents the mixed gene expression matrix, the
   columns of the matrix W (i.e., S) correspond to the cell/tissue
   type-specific expression profiles and each column of the matrix H
   (i.e., A) provides the sample-specific mixture proportions. In the work
   of Gaujoux and Seoighe [[184]19], more constraints were imposed on the
   cell/tissue type-specific expression matrix S that refer to the marker
   genes. They demand both during initialization and in each of the
   iterations of the algorithm, that the rows of S that correspond to the
   markers of each cell/tissue type to be zero-valued in the irrelevant
   cell/tissue types. During each iteration, the remaining non-zero values
   of S across all genes are updated. In Deblender, in an analogous way we
   adapt the multiplicative update algorithm (‘nnmf’ Matlab function
   [[185]39]). In the unsupervised mode, we employ cluster sets instead of
   marker sets. Specifically, for each cluster set we choose the n %
   (user-defined) of genes closest to exemplars and use these cluster
   subsets likewise with markers. Then, the NMF-adapted process can be run
   either on the complete gene set of the mixture dataset (pseudocode is
   available in Additional file [186]1) where only the cluster subsets are
   treated as markers or on the set that includes only genes belonging to
   the cluster subsets. In both cases the rows of S that correspond to
   each cluster subset will be zero-valued in the irrelevant cell/tissue
   types throughout the NMF iterations. Of note, as initialization A[0],
   in one of the replicated experimental runs, the proportion matrix
   A obtained in stage I (S1) is used (optionally S[0 ]can be the
   S produced based on the proportion result of S1) and the objective
   function to be minimized is the root-mean-squared residual between X
   and S ∗ A (as in ‘nnmf’ Matlab function). Similar to [[187]19], we
   apply the sum-to-one constraint on the final A result.

Estimating cell/tissue type-specific expression profiles

   Once the A mixture proportion matrix is known (either from S1 or
   S1&S2), one can solve the system of equations independently for each
   gene i, as proposed in Zhong et al. [[188]9], and calculate the S[i] by
   minimizing:
   [MATH: <msup><mfenced close="‖"
   open="‖"><mrow><mi>x</mi><mo>−</mo><msub><mi
   mathvariant="italic">AS</mi><mi>i</mi></msub></mrow></mfenced><mn>2</mn
   ></msup><mo>,</mo><msub><mi>S</mi><mi>i</mi></msub><mo>≺</mo><mi
   mathvariant="italic">ub</mi><mspace width="0.5em"></mspace><mtext
   mathvariant="italic">and</mtext><mspace
   width="0.5em"></mspace><msub><mi>S</mi><mi>i</mi></msub><mo>≻</mo><mn
   mathvariant="italic">1</mn><mi>b</mi> :MATH]

   where ub and lb set the constraint that the resulting values are within
   the boundaries set by the maximum and minimum measurable gene
   expression levels. Based on this least squares optimization, the
   average gene expression values of each cell/tissue type are estimated.
   Similar to Onuchik et al. [[189]8], we extend this so that one can also
   calculate in this multiple linear regression problem the standard error
   (SE) for each of the coefficients (i.e., expression level of each gene
   i) in each cell/tissue type j:
   [MATH: <mi mathvariant="italic">SE</mi><mfenced close="}"
   open="{"><msub><mi>S</mi><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow></m
   sub></mfenced><mo>=</mo><msqrt><msub><msup><mfenced close="]"
   open="["><mrow><msub><mi
   mathvariant="italic">MSE</mi><mi>i</mi></msub><mfenced close=")"
   open="("><mrow><mi>A</mi><msup><mi>A</mi><mi>t</mi></msup></mrow></mfen
   ced></mrow></mfenced><mrow><mo>−</mo><mn>1</mn></mrow></msup><mrow><mi>
   i</mi><mo>,</mo><mi>j</mi></mrow></msub></msqrt> :MATH]
   9

   where MSE[i] (mean squared error for gene i):
   [MATH: <msub><mi
   mathvariant="italic">MSE</mi><mi>i</mi></msub><mo>=</mo><mfrac><mrow><m
   underover><mo>∑</mo><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow><mi>k</m
   i></munderover><msubsup><mi>R</mi><mrow><mi>i</mi><mo>,</mo><mi>j</mi><
   /mrow><mn>2</mn></msubsup></mrow><mrow><mi>p</mi><mo>−</mo><mi>k</mi></
   mrow></mfrac> :MATH]
   10

   and p is the number of samples, k the number of cell/tissue types and R
   the residual matrix (R = X − SA).

Unsupervised mode

Data preprocess

   Deblender applies deconvolution in the linear space, as established by
   others [[190]9, [191]40]. As such, data need to be raw normalized
   values (without log transformation). Also, as preprocessing step prior
   applying Deblender, we propose filtering of un-annotated probes and in
   case of multiple probes per gene identifier (e.g., Entrez ID) selecting
   the probe with the highest variance across dataset. Deblender applies
   further filtering steps that have been adopted by other similar
   approach [[192]21]. The user can first define the percentage cutoffs to
   filter genes with high (outliers) and/or low (noise) expression vector
   norms. Subsequently, genes can be further filtered by setting a
   threshold on the coefficient of variation (CV), defined as the ratio of
   the standard deviation of the raw gene expression profile to the mean
   of the raw gene expression profile.

Clustering

   The user can choose between k-means or k-medoids algorithms for
   clustering with ‘correlation’ as distance function in order to assign
   genes to a putative cell type. Data prior to clustering can be either
   in linear or log scale; we observed that the latter leads to increased
   performance in some cases. Regardless of the cluster option setting,
   the cluster exemplars used in S1 are defined in linear scale and can
   represent for k-means the average mean profile of the cluster members
   and for k-medoids either the medoid or the average mean profile of the
   cluster members. In S2, in order to define the cluster subsets, the
   genes closest to the centroids/medoids (exemplars), as defined in the
   respective cluster option setting, are considered. In the ‘kmeans’
   Matlab function [[193]41], setup parameters like ‘OnlinePhase’ have
   been turned on to guarantee that the provided solution is a local
   minimum of the distance criterion. Also, the ‘Replicates’ parameter,
   i.e., the number of times to repeat clustering using new initial
   cluster centroid positions, has been set to 500 to achieve optimal
   performance and stability. We recorded (100 runs) the clustering
   objective function value as well as the produced proportions in
   multiple datasets (unsupervised mode – S1) and observed no significant
   fluctuations. Also, with respect to k-medoids (‘kmedoids’ Matlab
   function [[194]42]), we chose the ‘large’ algorithm [[195]43] (similar
   to k-means) due to the large size of the datasets.

Cell/tissue type identity

   When Deblender operates on unsupervised mode, there is no prior
   information in which distinct cell/tissue type each cluster exemplar
   should be assigned to. With respect to benchmark datasets, we checked
   all possible combinations and preserved the result giving the highest
   correlation when comparing estimated with real proportions. We
   validated this concept based on the reference cell/tissue type-specific
   profiles of the benchmark datasets and observed that in most cases the
   best order configuration was the one in which the majority of genes of
   a given cluster showed maximal expression in the assigned cell/tissue
   type.

Model selection

   In case where the user has no knowledge about the number of cell/tissue
   types involved in the mixture, Deblender can automatically estimate, in
   the unsupervised mode, the number k of cell/tissue types present in the
   mixture. For this, we employ the Minimum Description Length (MDL)
   information criterion as recently proposed by others for use in
   expression deconvolution [[196]21]. In particular, Deblender is applied
   multiple times with a different k each time as candidate and the model
   that explains the observed data, while avoiding unnecessarily complex
   models, is preferred. The formula of MDL is defined as:
   [MATH: <mi>M</mi><mi>D</mi><mi>L</mi><mo
   stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo><mo>=</mo><mo>−</mo><mi>log</mi><mrow><mo>(</mo>
   <mrow><mi>λ</mi></mrow><mrow><mo>(</mo><msub><mrow><mi>X</mi></mrow><mi
   >m</mi></msub><mrow><mo stretchy="false">|</mo></mrow><mi>θ</mi><mo
   stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo><mo>)</mo></mrow><mo>)</mo></mrow><mo>+</mo><mfr
   ac><mrow><mrow><mo>(</mo><mi>k</mi><mo>−</mo><mn>1</mn><mo>)</mo></mrow
   ><mi>p</mi></mrow><mn>2</mn></mfrac><mi>log</mi><mo
   stretchy="false">(</mo><mi>m</mi><mo
   stretchy="false">)</mo><mo>+</mo><mfrac><mrow><mi>k</mi><mi>m</mi></mro
   w><mn>2</mn></mfrac><mi>log</mi><mo
   stretchy="false">(</mo><mi>p</mi><mo stretchy="false">)</mo> :MATH]
   11

   where the first part refers to the joint likelihood function with X[m]
   being the mixed profiles of the genes after filtering process and θ(k)
   the set of freely adjustable parameters in the model. The remaining
   part is the penalty function where in case of A proportion matrix
   calculation for a given sample m genes participate, thus in total
   contributing (k − 1)p log(m)/2 bits, while in the case of S[m]
   calculation p scalar entries for a given gene are used, thus
   contributing in total k m log(p)/2 bits.

Estimating proportions in under-determined cases

   In case the number of cell/tissue types is higher than the number of
   available mixed samples, Deblender offers the option (when the number
   of samples is ≥2) to apply an adapted NMF scheme to approximate the
   proportions based on available marker gene lists (semi-supervised) or
   cluster sets (unsupervised). More specifically, we solve the system of
   eq. [197]2 after decomposing it into k subsystems, i.e., subsets of
   equation systems, each one corresponding to a cell/tissue type-specific
   marker gene or cluster subset. The cluster subsets are defined after
   clustering the preprocessed dataset in k cluster sets and retaining for
   each cluster set n% of genes that are closest to the cluster exemplar.

   Zooming into each subsystem (eq. [198]8), X is a c × p mixed expression
   matrix with c cell-specific marker or cluster subset genes and p
   samples, W a c × 1 vector and H a 1 × p vector representing the
   cell/tissue type-specific S and A vectors (pseudocode and extended
   description about the formulation of the subsystem is available in
   Additional file [199]1). These vectors can be computed with adapted NMF
   schemes that can optionally include normalization of A and space bound
   constraints. Specifically, one can apply the multiplicative update
   algorithmic scheme (as implemented by ‘nnmf’ Matlab function) with or
   without normalization of A or a UPSO-NMF scheme we designed motivated
   by others [[200]12, [201]19, [202]33, [203]44, [204]45] so as to offer
   to the user the option to include space bounds during the calculation
   of the S and A vectors. Finally, from each subsystem only the vector of
   the cell/tissue type-specific proportions is preserved, all of which
   are subsequently concatenated into one matrix and rescaled with the
   sum-to-one constraint to form the final proportion matrix A.

Choosing a solver

   Deblender provides different algorithms for solving the series of
   equations describing the deconvolution model. There is option to employ
   algorithms like ‘active-set’ and ‘trust region reflective’ that solve
   linear least-squares problems with bounds or linear constraints
   (implemented with ‘lsqnonneg’ and ‘lsqlin’ functions in Matlab
   [[205]46, [206]47]) or use algorithms like ‘interior point convex’ to
   solve the problem with quadratic objectives and linear constraints
   (implemented with ‘quadprog’ function in Matlab [[207]48]). Further,
   Deblender offers the user the opportunity to explore particle swarm
   optimization (PSO) as an alternative for solving the series of
   equations. PSO was introduced by Eberhart and Kennedy [[208]49] and is
   inspired by behavior patterns seen in socially organized colonies to
   probe the search space so as to find the optimal solution. In PSO the
   population is called ‘swarm’ and the individuals (search points) are
   called ‘particles’. Each particle moves in the search space with a
   velocity adapted iteratively. Also, each particle keeps history of its
   best position during iterations, i.e., the position with the best
   objective function value and simultaneously shares this information
   with the other particles of the swarm. In the global variants of PSO,
   the neighborhood of each particle is the whole swarm. In the local
   variants, the neighborhoods are substantially smaller consisting of few
   particles. In Deblender, we adapted and tailored to the deconvolution
   problem the Unified Particle Swarm Optimization (UPSO) scheme [[209]50]
   that combines the exploration and exploitation properties of both the
   local and global PSO variants. Finally, with regard to Non-negative
   Matrix Factorization (NMF), we apply the multiplicative update
   algorithm (‘nnmf’ Matlab function) and also versions of this adapted to
   use in Deblender and an adapted UPSO-NMF scheme.

Additional files

   [210]Additional file 1:^ (2.2MB, docx)

   Supplementary results, discussion and methods. (DOCX 2.18 mb)
   [211]Additional file 2:^ (98.1KB, xlsx)

   Deblender mixture proportions estimates for GSE65135 and TCGA. (XLSX 98
   kb)
   [212]Additional file 3:^ (691KB, xlsx)

   Gene Ontology and KEGG pathway enrichment results. (XLSX 690 kb)

Acknowledgements