Abstract Motivation Single-cell Chromatin ImmunoPrecipitation DNA-Sequencing (scChIP-seq) analysis is challenging due to data sparsity. High degree of sparsity in biological high-throughput single-cell data is generally handled with imputation methods that complete the data, but specific methods for scChIP-seq are lacking. We present SIMPA, a scChIP-seq data imputation method leveraging predictive information within bulk data from the ENCODE project to impute missing protein-DNA interacting regions of target histone marks or transcription factors. Results Imputations using machine learning models trained for each single cell, each ChIP protein target, and each genomic region accurately preserve cell type clustering and improve pathway-related gene identification on real human data. Results on bulk data simulating single cells show that the imputations are single-cell specific as the imputed profiles are closer to the simulated cell than to other cells related to the same ChIP protein target and the same cell type. Simulations also show that 100 input genomic regions are already enough to train single-cell specific models for the imputation of thousands of undetected regions. Furthermore, SIMPA enables the interpretation of machine learning models by revealing interaction sites of a given single cell that are most important for the imputation model trained for a specific genomic region. The corresponding feature importance values derived from promoter-interaction profiles of H3K4me3, an activating histone mark, highly correlate with co-expression of genes that are present within the cell-type specific pathways in 2 real human and mouse datasets. The SIMPA’s interpretable imputation method allows users to gain a deep understanding of individual cells and, consequently, of sparse scChIP-seq datasets. Availability and implementation Our interpretable imputation algorithm was implemented in Python and is available at [30]https://github.com/salbrec/SIMPA. Introduction The discovery of protein-DNA interactions of histone marks and transcription factors is of great importance in biomedical studies because of their impact on the regulation of core cellular processes such as chromatin structure organization and gene expression. These interactions are measured by chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq). Public data from the ENCODE portal, which provides a large collection of experimental bulk ChIP-seq data, has been used for comprehensive investigations providing insights into epigenomic processes that affect chromatin 3D-structure, chromatin state, and gene expression, to name just a few [[31]1, [32]2]. Recently developed protocols for scChIP-seq are powerful techniques that will enable in-depth characterization of those processes at single-cell resolution. ChIP-seq was successfully performed on single cells with sequencing depth as low as 1,000 unique reads per cell, reflecting the low amount of cellular material that can be obtained from only one single cell [[33]3]. Even though this low coverage leads to sparse datasets, scChIP-seq data has enabled the study of biological systems that cannot be investigated with bulk ChIP-seq applied for many cells, for example, the differences between drug-sensitive and drug-resistant breast cancer cells [[34]4]. The analysis of single-cell assays is strongly affected by the sparsity of the data. In the context of ChIP-seq, sparsity means no signal observed for numerous genomic regions without the possibility to explain whether this is real or due to low sequencing coverage. Notably, sparsity may disable the investigation of functional genomic elements that could be of crucial interest. Hence, an imputation method is needed to complete sparse scChIP-seq datasets while preserving the identity of each individual cell. The first published imputation method for bulk NGS epigenomic signals was ChromImpute [[35]5], later followed by [[36]6], an improved method for the imputation of signal tracks for several molecular assays in a biosample-specific manner (biosample refers to the specific tissue or cell-type, not to single cell). The challenge of transcription factor binding site prediction was approached, for example, using deep learning algorithms on sequence position weight matrices [[37]7] and more recently by the embedding of transcription factor labels and k-mers [[38]8]. With the aim to complete the ENCODE portal with imputed bulk experiments, Schreiber et al. implemented the method Avocado, which extends the basic concept of PREDICTD by deep neural networks [[39]9]. Avocado was also validated on ChIP-seq data from both histone marks and transcription factors [[40]10]. Such methods show the successful application of machine learning algorithms and mathematical approaches in predicting epigenomic signals such as transcription factor binding activity. However, their scope, being limited to either imputation of missing bulk experiments or sequence-specific binding site prediction, hampers their application to single-cell data. The challenge of imputation for sparse datasets from single-cell assays has been extensively approached for single-cell RNA-seq (scRNA-seq) used to quantify gene expression at single-cell resolution e.g. [[41]11–[42]19]. In this context, similarly to scChIP-seq data, sparsity is described by dropout events, which are transcripts having a transcription rate of zero without knowing if the corresponding gene is not expressed at all or if the expression rate is not detected due to technical limitations [[43]16]. The question arises if these methods can be easily adapted for imputation of scChIP-seq data. However, there are crucial differences between the application of RNA-seq and ChIP-seq techniques that must be considered regarding the development of a method for scChIP-seq imputation. First, in RNA-seq the set of relevant genomic regions, defined by the species-specific transcripts, is more limited. For a ChIP-seq profile, the regions of potential interest may originate from any position in the genome and cannot be defined in advance. To simplify the analysis, in scChIP-seq imputation the genome can be organized in non-overlapping genomic windows (bins) of a certain size. At 5 kb resolution, this binning concept results in more than 600,000 possible bins in the human genome, a number that is much higher than the number of transcripts considered in the RNA-seq context. The second main difference is that scChIP-seq interactions are usually represented by a Boolean value (the value can be either “True” or “False”) describing the presence or absence of a significant enrichment of sequencing reads defining a peak, while RNA-seq datasets contain transcription rates. Consequently, the application of scRNA-seq imputation methods on scChIP-seq data might be less appropriate. In contrast, imputation methods for chromatin accessibility profiles from single-cell ATAC-seq (single-cell Assay for Transposase-Accessible Chromatin using sequencing, scATAC-seq) are potentially more transferable to scChIP-seq imputation as their data representation is more similar. A few methods exist that implement imputation for scATAC-seq, though none of them was tested on scChIP-seq data so far. Methods such as SCALE which uses a combination of Gaussian mixture models and variational autoencoder [[44]20] have been shown to outperform scRNA-seq methods to impute scATAC-seq data (see also FITs [[45]21] and scOpen [[46]22]). These methods complement each other with respect to the different approaches they implement, however, they share the common concept of imputing the missing values within a sparse matrix defined by the single cells (rows) and the genomic bins (columns) of a given experiment, and only bins are considered that were detected by at least one single cell if no further filtering is applied. Consequently, such methods can offer imputation only on regions that were observed in the single-cell dataset and it is likely that many important regions along the whole genome will be missed. To overcome this limitation, we developed SIMPA, an algorithm for Single-cell ChIP-seq iMPutAtion, that uses bulk ChIP-seq datasets of the ENCODE project to train its machine learning models [[47]2, [48]23]. It was already shown that an additional bulk RNA-seq dataset can be used to improve the imputation for a sparse scRNA-seq dataset [[49]14]. Within SIMPA, publicly available bulk ChIP-seq data is turned into a reference set used to define potential bins to be imputed. Bins observed in the single-cell dataset are then used to derive training sets specified by both the bulk reference data and single-cell data. This training set is then leveraged by machine learning models to compute specific imputation probabilities. Moreover, these models are interpretable and can be used to gain insights into a given single-cell dataset, allowing a more detailed investigation of individual cells. The interpretability is implemented by InterSIMPA, an extension which takes a single cell as input together with a genomic position of interest and trains one classification model for the position to derive a probability which can be seen as an imputation score. More importantly, InterSIMPA ranks the genomic bins from the sparse single-cell profile by their relevance for the model. The ranked bins are enriched by detailed information and an importance score describing the strength of their relationship with the given genomic position of interest. These relationships can be interpreted as dependencies between genomic regions that could be part of the gene regulatory network (e.g., between enhancers and promoters). The basic reference-based imputation concept of SIMPA was first validated for human on simulated data and then on a real scChIP-seq dataset of immune cells [[50]4]. The simulated data was used to demonstrate the single-cell specificity of the imputations on full data profiles. The real dataset allowed us to investigate the algorithm’s capability of retaining the cell-type clustering and furthermore to assess the biological relevance of the imputed regions based on a pathway enrichment analysis. Results from InterSIMPA were validated using promoter regions related to genes of the B- and T-cell signaling pathways in the human dataset and regions related to brain functions in a mouse dataset [[51]24]. Methods Datasets Preparation of the reference data (bulk ChIP-seq datasets) To create the reference set that is used by SIMPA we downloaded all ChIP-seq experiments from the ENCODE portal that comply with the following criteria: the status is released, the experiment is replicated (isogenic or anisogenic), no treatment to the biosample, without genetic modification, and the organism is Homo Sapiens (human) or Mus Musculus (mouse). Experiments represent different ChIP protein targets (antibody targets within the ChIP) and biosamples (either tissue or an immortalized cell line). For all experiments, we downloaded fully preprocessed sets of protein-DNA interacting regions as peak files: the replicated peaks for histone mark ChIP and the optimal IDR thresholded peaks for transcription factor ChIP. For human, 2251 peak files of the hg19 or hg38 assembly were downloaded and preprocessed as hg38 after converting hg19 files with the UCSC LiftOver tool [[52]25]. For mouse, 848 peak files of the mm10 assembly were downloaded and preprocessed (data from other assemblies did not complement the selection). For downloading, preprocessing, and updating the datasets we used a semi-automatic, SQL-backend procedure that was already used in a previous project [[53]26]. A list with all 3099 reference experiments is provided in [54]S1 Table including information about the protein target, the cell-type or tissue, the assembly, and the exact ENCODE accession IDs. Data preprocessing In order to limit computational complexity, all reference experiments were converted from ChIP-seq peak sets to genomic bin sets (bins are defined as non-overlapping regions on a genome). We provide on github the reference data in bin sizes (or resolution) of 5 kb and 50 kb for hg38 ([55]https://github.com/salbrec/simpa). We also provide the reference data for mm10 for some of the main histone marks in different sizes. However, the github repository also provides scripts and descriptions that enable the preparation of any target in the desired resolution (bin sizes). Bin sets for the reference ChIP-seq experiment are in binary format to be more efficiently integrated by the main Python scripts. Given one reference experiment, a bin is said to be “present” if there is at least one ChIP-seq peak that overlaps this bin, “absent” otherwise. Preprocessing of scChIP-seq data for human (Grosselin et al., 2019) We downloaded the count matrices for H3K4me3 and H3K27me3 available in GEO under accession number [56]GSE117309 in 5 kb and 50 kb binning resolution, respectively. From the matrices, we derived bed files for every single cell excluding gender-specific chromosomes. SIMPA, InterSIMPA and other imputation methods were then applied on 25% of the single cells randomly sampled, 1520 bed files for H3K4me3 and 1128 bed files for H3K27me3. Preprocessing of scChIP-seq data for mouse (Zhu et al., 2021) We downloaded the count matrices for H3K4me3 available in GEO under accession number [57]GSE152020 in 1kb binning resolution. From the matrices, we derived bed files for every single cell excluding gender-specific chromosomes. The initial dataset provides 7465 cells for this histone mark; we applied InterSIMPA on a subset of 1000 randomly sampled cells. The SIMPA algorithm SIMPA is an algorithm implemented in Python 3.7.3 for “Single-cell ChIP-seq iMPutAtion”, which is applied to one single cell represented by a sparse set of scChIP-seq genomic regions (or peaks) provided by the user in bed format. Within the algorithm, the given single-cell bed file is converted into a set of bins SC describing the single-cell input ([58]Fig 1A). The user also provides the protein target name for the histone mark or transcription factor targeted by the antibody within the single-cell immunoprecipitation. The target name is needed to specify the training set, which consists of experiments from the ENCODE reference set. Fig 1. SIMPA’s algorithm and cross validations. [59]Fig 1 [60]Open in a new tab A. Identified ChIP-seq regions from bulk experiments were downloaded from ENCODE and mapped to bins defined as non-overlapping and contiguous genomic regions of a defined length (5 kb for H3K4me3 and 50 kb for H3K27me3 in the human dataset) and covering the whole genome (the table). A bin is given a value of 1 for a particular experiment if there is at least one ChIP-seq region in this experiment that overlaps the bin, 0 otherwise. In total 2251 human and 848 mouse ChIP-seq experiments for several targets (histone marks or transcription factors) performed in several biosamples (tissues and cell-lines) were downloaded from ENCODE portal and preprocessed. Depending on the target specified by the user, the target-specific reference set RS is then created and contains all experiments related to this target (rows in red; H3K4me3 is given as example) and all bins observed for at least one of those experiments. B. The single-cell specific training feature matrix TF is created as a subset of RS by selecting only bins observed within the given single cell (green columns). All other bins from RS are the candidate bins (c; blue columns) and define the class vectors consisting of the corresponding values in RS. For each candidate bin, a classification model is trained based on the training features and the class vector identifying associated experiments. C. Cross-validated evaluations of SIMPA’s Random Forest performances to predict values of candidate bins defined for real human single-cell data related to H3K4me3. For each candidate bin, a ten-fold cross-validation was applied and summarized as the mean Area under ROC-Curve (AUROC) or Area under Precision-Recall Curve (AUPRC) (y-axes). Results for all bins are represented by boxplots subdivided by class balance in the candidate bins (percentage of “1” values in the bin) (x-axis). The dashed lines describe the baseline performance expected from a random classification model: 0.5 for AUROC and equal to the class balance for AUPRC. Given the sparse profile of one single cell as input, SIMPA aims to impute missing bins based on predictive information within bulk data specified by the selected ChIP protein target and further by the regions taken from the input single-cell profile. In order to make the bulk data informative, first SIMPA collects all the ENCODE reference experiments available for the given target that defines the rows of the reference set matrix (RS) where columns represent bins ([61]Fig 1A): [MATH: RS=ai,j,1in ,1jm :MATH] with [MATH: ai,j0,1describingacellofthematrixwithvalue=1whenbinjinreferenceexperimentiispresent,0otherwise, :MATH] and where n is the number of experiments available for the given target, and m is the number of bins that are present in at least one of the target specific experiments. As the rows are defined by the given target, the target-specificity is induced within this step. Second, a subset of RS is created by selecting only the columns for bins that are present in the given single-cell profile SC to create the training features TF: [MATH: TFRS, :MATH] [MATH: TF=ai,k,1in ,1ks , :MATH] where k indexes a selection of bins from RS that are present in SC and with s the number of bins in SC ([62]Fig 1B). Bins present in RS but not in SC are collected and named as candidate bins c that are potentially imputed bins. Third, SIMPA takes each candidate bin c[g] in c (c is the set of candidate bins; the number of bins in c is P; thus 1 ≤ g ≤ P) separately to compute an individual imputed probability ρ[g] that c[g] is present in the single cell. Given c[g], SIMPA trains a classification model cm[g] based on TF defining the features and c[g] the class vector. Because an individual model is trained for each individual candidate bin, bin-specificity is induced for the whole approach. The imputed probability ρ[g] is finally computed by cm[g] which takes as input an artificial instance vector b = (b[k]), b[k] = 1, 1 ≤ k ≤ s. Consequently, ρ[g] is the probability of c[g] to be predicted for the imputed single-cell result, given the fact that all bins in SC are observed. As we use a Random Forest implementation from the scikit-learn (version 0.21.3) Python’s library [[63]27, [64]28] with default settings to build classification models, the imputed probability is then the mean predicted class probability of the trees (by default 100) in the forest while the class probability of a single tree is calculated by the fraction of samples of the same class in a leaf. When applying this algorithmic strategy on a real sparse single-cell profile to impute candidate bins, the final step of receiving the imputed probability differs from any cross-validation scenario as there is no hold-out sample that could be used to apply the model on. Instead, the model is applied on a synthetic vector containing only 1s to receive the imputed probability. Providing this vector in which the interaction is present for each bin, differs to the nature of the reference data which usually describes a mixture of interaction and non-interaction. However, this strategy is applied in the same way for all candidate bins excluding a potential bias regarding the ranking of candidate bins by the imputed probability. More crucially, by this final step we force the Random Forest model to return an imputed probability based on the knowledge that all bins captured by a single cell are present, thus, the outcome is highly specific to the given single cell. Moreover, this strategy allows us to keep focus on a cell while interpreting the underlying model to gain more insights with high specificity to the given individually single cell. Finally, SIMPA creates two files: one file in bed format and the other in SIMPA format described as a table listing the single-cell bins first, followed by the imputed bins sorted by the imputed probability. A line represents a bin described by its ID, its genomic coordinates, its frequency according to the target-specific reference experiments, and the imputed probability. Note that the first bins on top of this file have no imputed probabilities as they represent the original sparse single-cell input (a default value of -1 is assigned). The second file created by SIMPA is the imputed bed file containing the original single-cell bins and the additional imputed bins selected among those with the highest imputed probability. The number of bins within this bed file is defined by the average number of bins present in the target-specific bulk experiments, e.g. 32,584 for H3K4me3 (5 kb bin size) and 12,598 for H3K27me3 (50 kb bin size) on the hg38 samples. Cross validations Stratified ten-fold cross-validations were done to verify if the Random Forest Classifier used by SIMPA when applied to the real single-cell dataset is able to make use of statistical patterns from the bulk data to train accurate models predicting the presence or absence of a protein-DNA interaction in candidate bins. Hence, within this analysis, the performance of Random Forest was cross-validated on the candidate bin values not used in the training set but still defined in the reference set. We chose Random Forest as, by default, the algorithm never uses all given features for training one decision tree and consequently smaller sets of genomic bins are considered for a tree which agrees with the biological assumption that not all regions, captured by an individual single cell, are relevant for the imputation of a candidate bin. Given a candidate bin, SIMPA trains a Random Forest with 100 decision trees (number of estimators) and aggregating the votes from all trees results in a probability we use as imputed probability. We applied the following cross-validation approach (results shown in [65]Fig 1C and Fig S1 in [66]S1 File): given a protein target (either H3K4me3 or H3K27me3), 10 single cells were randomly chosen from the real dataset for each cell type; for each single cell the training feature matrix TF for SIMPA was created as explained above together with the set of candidate bins. Then, for each candidate bin that defines the class vector, a Random Forest classification model was trained and evaluated by the area under the ROC-curve within a stratified ten-fold cross-validation. In addition, we used the area under precision-recall curve to better study the class vector imbalance. Results Algorithmic concept and cross-validations Unlike many other single-cell imputation methods, SIMPA leverages predictive information within bulk ChIP-seq data by combining the sparse input of one single cell and a collection of bulk ChIP-seq experiments from ENCODE. In order to better compare bulk and single-cell data, ChIP-seq regions (or significant signal/noise ChIP-seq peaks) are mapped to genomic bins ([67]Fig 1A; see [68]Methods for details about bulk and single-cell data retrieval and processing). SIMPA produces results for each single cell of a scChIP-seq dataset by using machine learning models trained on a subset of the ENCODE data related to a selected ChIP protein target, that is the histone mark or transcription factor used in the single-cell experiment. Derived from this target-specific subset, the classification features are defined by genomic bins detected in the single cell, while the class to predict is defined by a bin observed in at least one target-specific bulk ENCODE experiment, but not in the single cell ([69]Fig 1B). In other words, by using this particular data selection strategy, SIMPA searches relevant statistical patterns linking (i) the bulk ChIP-seq data across single-cell-related bins and target-related experiments for different cell types to (ii) the potential presence or absence of a bin in the given single cell. Within a cross-validation scenario that compared the predicted probabilities to corresponding candidate bin values, results show that the machine learning-concept of SIMPA is able to provide accurate predictions ([70]Fig 1C, Supplementary Note 1 in [71]S1 File, and Fig S1 in [72]S1 File). Moreover, on the high-resolution H3K4me3 human dataset [[73]4], SIMPA achieved high recall rates for bins removed from single-cell profiles (Supplementary Note 2 and Fig S2 in [74]S1 File). Validation on simulated data In order to evaluate the algorithm’s ability from few input bins (hundreds) to complete full data profiles (thousands of bins) of different protein targets and cell-types, we simulated sparse protein-DNA interaction profiles from the bulk ENCODE ChIP-seq experiments that are used as reference data by SIMPA. For the simulation, we took human bulk experiments for different cell-type-target combinations to define them as full single-cell profiles (origin) and down-sampled those profiles to simulate sparse single-cell profiles (from 100 to 1600 randomly selected bins) (see Supplementary Note 3 in [75]S1 File for details). Each simulated sparse profile was used as input for SIMPA and the output was compared to the origin (excluding bins used as input). For the model training, the full origin profile was excluded from the reference training set in order to apply the default validation, called leave-out origin (LOO). Additionally, a more challenging validation strategy was applied in which all reference profiles for the same cell-type (biosample) were excluded, called leave-out cell-type (LOCT). For H3K4me3, the most frequently investigated target in ENCODE, high area under ROC-curve values confirm that SIMPA is able to accurately recapitulate the original data from the simulated sparse profiles ([76]Fig 2A). Even if the cell-type-specific information is completely removed from the training set (LOCT), the performance is still high. Furthermore, these observations are confirmed when using precision-recall curves as performance measure ([77]Fig 2B), a relevant analysis given the imbalance in the validation sets (containing far fewer positive than negative samples). We made similar observations in a ROC-curve and precision-recall curve analysis for other cell-type-target combinations (Figs S4 and S5 in [78]S1 File). Fig 2. Performance on simulated sparse profiles in different cell-types. [79]Fig 2 [80]Open in a new tab A. For simulation, a full human single-cell profile (origin profile) is defined by a full bulk profile and the corresponding sparse single-cell profile is defined by the down-sampled bulk profile. Compared to real data, the simulations allows us to test SIMPA on full profiles related to a large variety of ChIP protein targets and biosamples. Using the origin profile as the validation set of true binding interactions, the area under ROC-curve (AUROC in y-axis) describes the capability of SIMPA to accurately impute the sparse profile and recapitulate the origin. The bars describe the mean AUROC and the error bars describe the standard deviation across multiple applications on sparse sets with different sizes. SIMPA was validated with two strategies, the default leave-out origin (LOO; origin profile excluded from the training set) and the extreme leave-out cell-type (LOCT; all experiments with the same cell type than the origin profile are excluded from the training set). The x-axis labels indicate the cell-type of the origin profile and additionally the ENCODE accession to show which of the experimental dataset was used as origin. B. Same as in A but using the area under precision-recall curve (AUPRC in y-axis) as performance measure. The pink bars show the class balance (fraction of positives in the class feature) representing the random assumption as baseline to be expected from a primitive classifier that randomly assigns the class values (according to [[81]29]). Note, the sampled bins, that simulate a sparse single-cell profile and are expected to be known before imputation, were completely excluded before computing area under the curve values. In order to assess the single-cell specificity of SIMPA in this simulation, we compared each fully imputed profile to its origin profile and also to a consensus profile representing experimental datasets that are most similar to the origin (experimental profiles with same protein target and same cell-type, see Supplementary Note 3 in [82]S1 File for details). All comparisons excluded bins used as input for SIMPA. Results show that for most of the simulations (>95%) the imputed profile is closer to the origin profile, hence single-cell specific ([83]Fig 3A). Moreover, we observed that the origin profiles can be more similar to the consensus profile (less specific) or less similar (more specific). When the origin profiles are less specific, it is harder for SIMPA to achieve an imputed profile specific to the origin (single-cell specific). However, for such cases in which the origin is quite close to the consensus (Jaccard-Index > 0.65) the imputation is still single-cell specific, although with a lower single-cell specificity value ([84]Fig 3B). Fig 3. Single-cell specificity analysis. [85]Fig 3 [86]Open in a new tab A. The Jaccard-Index is used to compare the imputed profiles from SIMPA with the origin human bulk profile used to create a simulated sparse profile and the consensus profile representing the remaining experiments available for the same biosample-target combination as the origin profile. The dashed line shows the balance line at which the imputed profile from SIMPA is neither closer to the origin nor to the consensus. Cases above the dashed line are those in which the imputed profile is single-cell specific, hence, closer to the origin than to the consensus. B. “Single-Cell Specificity” on the y-axis is defined as the difference between the imputed-to-origin similarity (y-axis in A) and imputed-to-consensus similarity (x-axis in A). Having the similarity between the origin and the consensus on the x-axis, this plot allows the visualization of the single-cell specificity in relation to how specific the origin is. The higher the similarity between the origin and consensus, the less specific is the origin profile and the harder the challenge to capture its specificity. Profiles above the 0 line are single-cell specific as they are closer to the origin than to the consensus. Before computing the Jaccard-Index values, the sampled bins, which simulate a sparse single-cell profile and are expected to be known before imputation, were removed from all the sets, origin, imputed, and consensus. Taken together, the simulation results show that models trained from a few bins accurately impute thousands of bins and show that completed profiles can be single-cell specific on real data even if the investigated cell-type is not represented by any of the bulk datasets in the reference set (leave-out cell-type validation). Model interpretability on real data Addressing one main aim of this study to make models interpretable, we implemented an extension called InterSIMPA. Here we define interpretability as the possibility of obtaining information of potential biological relevance from the relationships observed between the training features (genomic bins) and a genomic position of interest. These relations can be expected to be part of the genomic regulatory network. The training features are derived by InterSIMPA in the same way as for SIMPA but a single machine learning model is trained for a selected genomic position of interest defined by the user. Accordingly, one imputed probability is returned for genomic position of interest with information about the genomic bins most important for the machine learning model. Finally, the algorithm reports the genes closest to these bins (Supplementary Note 4 in [87]S1 File). To demonstrate how interpretable imputation models can be used to expose more information from the sparse ChIP-seq profile of individual single cells, we use the human single-cell ChIP-seq dataset of H3K4me3 interactions in B-cells and T-cells from Grosselin et al. [[88]4]. According to the given cell types, we focused on promoter regions of genes that are involved within the B-cell and T-cell receptor signaling pathways. The two gene sets contain 67 and 97 genes for the B-cell receptor and T-cell receptor signaling pathways, respectively, with an overlap of 44 genes. To focus on the genes that could be more specific to the cell-types under investigation, from the union of the two gene sets we selected 24 genes with frequency of their promoter regions lower than 20% in the corresponding H3K4me3-specific reference set, which means that their promoter has no detected interaction site for more than 80% of the ENCODE reference experiments for H3K4me3 in different cell-types and tissues (Supplementary Note 4 in [89]S1 File). As H3K4me3 is an activating histone mark, we expected to observe interaction sites in the promoter regions of these genes. However, for many of those promoter regions, the H3K4me3 binding is missing for most of the single cells in the sparse data ([90]Fig 4A). Our expectation that SIMPA is able to impute such regions at least in a cell-type-specific manner, is confirmed by comparing the imputed probabilities calculated by SIMPA for promoter regions of the 24 selected genes in single B- and T-cells ([91]Fig 4B). For most of the genes, the imputed probability is higher when SIMPA is applied on single cells that are from the pathway-related cell-type. Finally, we evaluated the interpretability of the 24 imputation models by comparing feature importance values and co-expression values of the feature-related genes with the gene of the imputed promoter ([92]Fig 4C). Co-expression data from the STRING database was used [[93]30]. The observed high correlations suggest that InterSIMPA is capable to describe biologically relevant promoter-promoter relations by the predictive information hidden within sparse histone mark profiles of an activating mark. Consequently, our approach not only completes the sparse scChIP-seq dataset, but its interpretability-extension is even capable of providing deeper insights into the data. Fig 4. Pathway related gene analysis using the interpretation of imputation models. [94]Fig 4 [95]Open in a new tab A. Fraction of single cells for which H3K4me3 binding is observed within the gene’s promoter region in the human single-cell dataset (orange and blue bars representing B-cells or T-cells, respectively). Y-axis labels show the gene names and if the gene belongs to the B-cell or T-cell receptor signaling pathway or to both. B. Imputed probability computed by SIMPA for the gene-related promoter regions shown in A. The imputation was applied on numerous single cells from the cell-types B-cell (orange) and T-cell (blue). The error bars represent the standard deviation across the imputation runs on different cells. For the majority of genes, the imputed probability is higher within the cell-type that corresponds to the gene’s pathway. C. Correlation of feature importance and co-expression values. For each model used to impute a promoter (y-axis), the training features (genomic bins) were extracted together with their importance value provided by the Random Forest algorithm and annotated with the nearest gene on the genome. Co-expression values, derived from transcriptomic and proteomic measurements, of those genes with the gene related to the imputed promoter were retrieved from the STRING database. The Pearson correlation coefficient of feature importance and co-expression values is shown (x-axis). Performance on cell-type clustering and functional analysis After the evaluation of the InterSIMPA extension, here, we evaluate how SIMPA enhances single-cell data corresponding to different cell types. For the imputation of a full single-cell dataset, SIMPA was applied for each cell individually. The resulting imputed profiles were then analyzed within two validations, to examine if (i) cell-type clustering was retained after imputation and (ii) if the imputed single-cell profiles are significantly associated with genes of the corresponding cell-type-specific pathway. Following the investigations of Schreiber et al. [[96]31], we also compared bin probabilities from SIMPA to a simple imputation approach that uses bin frequencies in the reference set (experiments with same protein target) as a probabilistic model without using any machine learning model, called the average interaction method. Additional imputations and randomization tests were applied and compared to better analyze the basic concept of SIMPA (see Supplementary Note 5 in [97]S1 File). Because of the better resolution available for H3K4me3 processed as genomic bins of size 5 kb in the human dataset, we present below results on this histone mark and refer to supplementary material for H3K27me3 processed at 50 kb bins (Fig S6 in [98]S1 File). For benchmarking, we used a single-cell ATAC-seq imputation method, SCALE, solely based on the single-cell dataset itself (reference-free) in contrast to SIMPA, which takes advantage of information from the reference bulk dataset. After applying a two-dimensional projection on the sparse and imputed datasets, we observed that the separation between the cell types was retained by SIMPA and by the reference-free method, contrary to the average interaction method ([99]Fig 5A). Moreover, three T-cell outliers were successfully associated to the related cell-type cluster by SIMPA, which achieved a slightly better homogeneity of the clusters in comparison to SCALE which did not handled correctly the outliers (Fig S6 in [100]S1 File). Dimensionality reduction was done by a combination of principal component analysis (PCA) and t-stochastic neighbor embedding (t-SNE) as suggested by Grosselin et al. on their analysis of sparse data [[101]4]. Unlike the suggested procedure, we did not perform cell filtering, as we were interested to observe outliers after imputation. Fig 5. Cell-type specificity validation. [102]Fig 5 [103]Open in a new tab A + B Separation of single cells according to cell type. A. Dimensionality reduction analysis applied on the human H3K4me3 data derived from (i) the sparse single-cell data and three different imputation methods, (ii) SIMPA, (iii) reference-free imputation, and (iv) average interaction based on expected frequencies in the reference set. Results from SIMPA and from the reference-free method achieve the best clustering by separating the single cells (points) by cell types (colors). B. Effects of input modification on SIMPA, (i) using a shuffled reference set or (ii) randomized sparse input data, or using other histone marks as reference instead of H3K4me3, either (iii) the functionally different histone mark H3K36me3, or (iv) the functionally similar histone marks H3K9ac and H3K27ac. Only SIMPA used with relevant protein targets was able to correctly cluster all single cells (no outliers). C. Pathway enrichment analysis. Boxplots show the significance of pathway enrichment analyses of genes annotated by single-cell regions as log-transformed false discovery rate (FDR; x-axis). Each dot represents the FDR of one single cell from the results of the different analysis experiments shown in A+B (y-axis). The dashed lines represent the log-transformed significance threshold of an FDR equal to 0.001. Only SIMPA achieves significant results by imputing preferably genomic regions associated with relevant pathway-related genes. In order to validate further the algorithmic concept of SIMPA, we implemented two randomization tests in which either the ENCODE reference information was shuffled (Shuffled Reference) or the sparse single-cell input was randomly sampled (Randomized Sparse Input). Additionally, we applied SIMPA on the same data but with models trained for different related or unrelated histone marks. The selected histone marks were H3K36me3, a histone mark functionally different to H3K4me3, and H3K9ac and H3K27ac, a group of two histone marks functionally related to H3K4me3. These two marks were used together to increase the training data size. From this comparison, we observed that (i) the separation on the projection is lost after removing statistical patterns through shuffling or randomization, (ii) separation quality is moderate with an input mark functionally different from the real mark, and (iii) separation quality stays high using SIMPA with target histone marks functionally similar to the real mark ([104]Fig 5B). Thus, the most relevant statistical patterns from the reference dataset are identified by both the selection of single-cell-specific regions and the selection of target-specific experiments. Similar observations were made for H3K27me3 although a more compact clustering could be achieved on the SIMPA profiles compared to those from the reference-free method (Fig S6 in [105]S1 File). Across several dimensionality reduction procedures applied for the H3K4me3 dataset, SIMPA and the reference-free method were both stable in retaining the cell-type clustering (Fig S7 in [106]S1 File). From these analyses we additionally conclude that the UMAP method using the Jaccard-Index distance achieves reasonable results when applied directly on the sparse data (in comparison to the common approach in single-cell analysis that uses first a PCA to select dimensions). As pathway enrichment analysis is a common step in ChIP-seq data exploration, we next investigated if enrichment analyses of cell-type-specific pathways for individual single cells improve after applying imputation. We analyzed the sparse profiles and different imputed results with the KEGG pathway analysis function of the Cistrome-GO tool [[107]32]. As reported in [108]Fig 5C, the original sparse data did not provide enough interaction sites to show a significant pathway enrichment for any of the two cell types. Results from the reference-free strategy showed an improvement but not significant. However, with regions imputed by SIMPA, it was indeed possible to achieve significant enrichment scores and recover the cell-type-specific pathways for most of the cells. These results show that SIMPA is able to integrate functionally relevant information from the reference data in order to impute additional biologically meaningful bins, in contrast to the reference-free method, which is limited to the single-cell dataset. Optimal size of the imputation sets As described, SIMPA computes the imputed probabilities for numerous genomic bins, and sorts and prioritizes those accordingly for imputation. In the previous validations, as a default, we imputed a number of bins equivalent to the average number of bins observed across all bulk ChIP-seq profiles from the target-specific reference set. On 5 kb resolution, the average number of bins of the H3K4me3 experiments is 32,584. However, once the bins are ranked by the imputed probability it is up to the user to alternatively create imputed sets of different sizes. With the next analysis we address the question about the optimal number of bins needed to improve the cell-type clustering at the same time enabling the detection of the relevant biological function by a significant enrichment of the correct pathway (for details see Supplementary Note 6 in [109]S1 File). The best cell-type clustering quality, evaluated by the Davies-Bouldin score [[110]33], is reached when adding ~11,000 bins ([111]Fig 6A). At this level, SIMPA slightly improves the clustering quality compared to the reference-free method. Fig 6. Clustering quality and pathway enrichment for different sizes. [112]Fig 6 [113]Open in a new tab A. Clustering quality (y-axis) evaluated with the Davies-Bouldin score (the lower the better) applied on the imputed data after dimensionality reduction as described for [114]Fig 5A+5B. While the reference-free method derived only one imputed set for all the single cells (dashed black line), we could derive several imputed sets of different sizes using the imputed probabilities from SIMPA (x-axis). B. The pathways under investigation are the B-cell and T-cell receptor signaling pathways. In this way we analyze two pathways surely related or unrelated with the cell-types present in the dataset, B-cell and T-cell. The y-axis describes the percentage of imputed profiles for which a significant enrichment of the aforementioned pathways could be achieved. The dashed lines represent cases for which the unrelated pathway is significantly enriched, which is the T-cell receptor signaling pathway when analyzing a B-cell and vice versa. The significance level used is 0.001 similar to the analysis shown in [115]Fig 5C. To reduce the computational resources spent on the pathway enrichment, this was done for ten imputation set sizes (highlighted in blue on the x-axis). Considering the amount of cells in which the cell-type related pathway is significantly enriched, we observed that in ~50% of the cells the related pathway is associated when adding ~28,000 bins ([116]Fig 6B). After adding more than 32,000 bins, almost all cells have a significant enrichment for the cell-type related pathway, however, it seems to be also the limit for avoiding the association of the unrelated pathway. For this analysis the same pathways and settings are used as in [117]Fig 5C; the unrelated pathway is the T-cell receptor signaling pathway when analyzing a B-cell and vice versa. InterSIMPA applied on mouse scChIP-seq data (Zhu et al., 2021) New technologies allow to obtain the joint profiling of histone modifications and transcriptome in single cells as used by [[118]24]. From this dataset we chose H3K4me3 profiles from mouse brain cells available on 1kb binning resolution to further validate the concept of InterSIMPA. As each single-cell dataset is very specific to the investigated cell-types, we focused on four genes also used within the original study to analyze the difference between excitatory and inhibitory neurons, as well as non-neurons. We first annotated the promoter regions of these genes based on 1000 randomly selected sparse single-cell profiles and observed a very low coverage of the promoter regions ranging from 0.0 to ~2.0% ([119]Fig 7A). Despite the low single-cell coverage, the patterns agree with those from the original study (see Fig 1F in [[120]24]); note that we focus on the promoter only and do not consider the full gene body. We then analyzed the imputed probabilities from InterSIMPA for these four genes ([121]Fig 7B). For the three genes Snap25, Neurod6, and Gad2, the imputed probability is higher for the neurons compared with the non-neurons, and no difference can be observed between cell types for Slc1a3. Similarly for Snap25, Neurod6, and Gad2, a moderate positive correlation could be observed between feature importance values and STRING co-expression data, but not for Slc1a3 ([122]Fig 7C). By using the appropriate parameter of the tool, we restricted the InterSIMPA output to regions within the proximity of 10kb or 5kb, respectively. As shown in [123]Fig 7C this has a positive impact on the results from InterSIMPA as it increased substantially the correlation calculated for the three genes. Fig 7. Gene analysis using the interpretation of imputation models. [124]Fig 7 [125]Open in a new tab A. Fraction of single cells for which H3K4me3 binding is observed within the gene’s promoter region in the mouse single-cell dataset. Dark green and yellow bars representing excitatory (ExNeu) and inhibitory (InNeu) neurons, respectively. Red bars represent non-neurons (NonNeu). y-axis labels show the gene names. B. Imputed probability computed by SIMPA for the gene-related promoter regions shown in A. The imputation was applied on 1000 single cells in total and the error bars represent the standard deviation across the imputation runs on different cells. C. Correlation of feature importance and co-expression values. For each model used to impute a promoter (y-axis), the training features (genomic bins) were extracted together with their importance value provided by the Random Forest algorithm and annotated with the nearest gene on the genome. As indicated by the figure legend, the selection of genomic bins was either not restricted (none), or restricted by a maximum distance to the transcription start site (TSS) of the nearest gene. Co-expression values, derived from transcriptomic and proteomic measurements of those genes with the gene related to the imputed promoter were retrieved from the STRING database. The Pearson’s correlation coefficient of feature importance and co-expression values is shown (x-axis). Discussion After confirming the presence of statistical patterns within the ENCODE bulk ChIP-seq reference data, we show that machine learning models can leverage those patterns for the accurate inference of interaction sites in sparse single-cell ChIP-seq profiles from individual single cells. The investigation of protein-DNA interactions on single-cell resolution emerged more recently compared to gene expression (single-cell RNA-seq) or chromatin accessibility (single-cell ATAC-seq) and consequently, less datasets are available for scChIP-seq. The human dataset we chose for our analysis includes profiles for H3K4me3 in 5kb resolution and H3K27me3 in 50kb resolution in human B-cells and T-cells. This dataset is appropriate as it provides a clear cell-type annotation which enabled an analysis based on the pathways we would expect to be assigned to B-cells and T-cells, respectively. Contrary to the human dataset from Grosselin et al., we observed that a very complex procedure was applied in the study presenting the mouse dataset from Zhu et al. to reveal the different cell clusters [[126]24]. From this study we chose the only profile available on a higher resolution (1kb), which describes the histone mark H3K4me3. Using common dimension reduction methods for single-cell datasets, we were able to reproduce clustering results for the Grosselin dataset (Fig S7 in [127]S1 File) but it was not possible to recapitulate the mouse cell types, neither before nor after imputation (results not shown). This could be explained by the complexity of the joint profiling of different histone marks within single cells and by the fact that the Zhu and colleagues could not apply a barcoding strategy to annotate the cell-types as it was done for the Grosselin dataset. Therefore, we extensively used the human dataset together with simulations to validate SIMPA and limited the use of the mouse dataset to the validation of the model interpretability offered by the InterSIMPA extension. Based on the simulations, we could show that the imputed results obtained by SIMPA are single-cell-specific for several cell-type-target combinations even if the experiments related to the cell-type were completely excluded from the training set. In both types of validation (leave-out origin and leave-out cell-type validations), SIMPA was able to capture cell-type-specific patterns even though the reference set was composed of profiles from many different cell-types and tissues, or the cell-type related data was completely excluded. Because the number of available bulk experimental profiles (ENCODE datasets) differs between targets, different training set sizes are available for different targets, with the smallest training set for H3K9ac (49 biosamples). Even for training sets of smaller size, the predictive performance remained high, although we expect models to be more reliable the larger the training set. Given that data portals such as ENCODE are still growing, we expect that the model reliability will increase in the future for many targets with a growing number of available reference datasets. The interpretation of the SIMPA models, done with InterSIMPA applied on a real scChIP-seq dataset published by Grosselin and colleagues, allows us to reveal additional information from the ChIP-seq profiles measured within individual cells regarding regions responsible for the imputation. Importantly, leveraging reference data allows us to impute regions that were not present in the single-cell dataset at all, in contrast to a reference-free strategy. Considering for example the promoter regions of the T-cell receptor signaling pathway genes CTLA4 and ICOS, these promoter-regions are not detected in any of the cells from the Grosselin et al. dataset, however, both have a high imputed probability from SIMPA. Moreover, for both promoters a high correlation coefficient was achieved within the validation by STRING co-expression values, confirming that our implementation not only answers the question about whether these promoters should be imputed or not, but it additionally reveals valuable information about regulatory relations implied by the single-cell dataset. Regarding the full data imputation analysis, we observed further advantages of SIMPA’s reference-based imputation strategy compared to the reference-free imputation method. While both algorithms achieve a good separation of the cell types, only with the imputed profiles from SIMPA it was possible to determine the relevant biological function of single cells as shown by the pathway enrichment analysis. This suggests that SIMPA imputes biologically meaningful genomic bins which are of functional relevance and confirms that, even though the training set involves a variety of different tissues and cell-types, SIMPA can find statistical patterns that belong to the correct cell-type. For single-cell datasets which reveal unknown subpopulations of cells, SIMPA could be used to identify active pathways for those cells after imputation. Interestingly, the quality of those results was maintained to some extent when not exactly the same scChIP-seq histone-mark target but functionally related targets were used to define the reference set. This suggests a valuable strategy to be applied for targets with a low availability of public bulk reference profiles. Given the results achieved by InterSIMPA on H3K4me3 single-cell profiles for mouse brain cells, we see a difference in performance compared to those achieved on human B-cells and T-cells that might be explained by the more complex cell type identification in the original study from Zhu et al. [[128]24]. However, we still see higher imputed probabilities of Snap25, Neurod6 and Gad2 in excitatory and inhibitory neurons compared to non-neurons, which confirms the relevance of those genes to a cell-type-specific study. This could not be observed for Slc1a3. The validation analysis based on the STRING co-expression data also highlighted the potential involvement of the three genes in regulatory networks. This analysis, which is also integrated within InterSIMPA, provides information about the reliability of the imputed probabilities. The gene Slc1a3 serves as an example for which the non-specificity of the imputed probabilities agrees with the low correlation coefficient between the InterSIMPA feature importance and STRING co-expression. SIMPA integrates solely datasets from bulk ChIP-seq in order to build the reference set. However, in the future, it will be relevant to integrate other types of data in order to complementarily extend the reference set. For example, SCRAT is an analysis tool that summarizes single-cell regulome data using different types of public datasets such as genome annotations or motif databases that could be of interest for the application of SIMPA to transcription factor scChIP-seq profiles [[129]34]. The scATAC-seq analysis tool SCATE performs imputation of missing regions integrating different types of public datasets (e.g. co-activated cis-regulatory elements and bulk DNase-seq profiles) [[130]35]. For future work, such approaches suggest the development of a reference-based method, allowing the imputation for both scChIP-seq and scATAC-seq data, integrating both types of reference data from the corresponding bulk assays and further complementary datasets. In the current version, SIMPA and InterSIMPA use all bulk profiles as selected by the target(s) specified by the user. Yet, there is no parameter that allows a user to further restrict the reference set by cell-types or tissues. We expect that this is not necessary as the machine learning algorithm can find the relevant patterns in the training set due to the additional specification induced by the given single-cell profile, which provides enough information about its cell-type despite its sparsity. Especially the pathway enrichment analysis supports this assumption since the imputed regions are related to the true biological function of the given cell-types. Nevertheless, for a future extension of the algorithm, it might be beneficial to investigate the impact of further specifying the reference set by tissues or cell-types related to the single-cell dataset. We demonstrate that the usage of bulk reference data can be beneficial in scChIP-seq imputation, especially when the single-cell data set leaves genomic regions uncovered. Given a single cell as input, its profile is used to specify the underlying training set to receive a result individual for the cell. Thus, single-cell and bulk information are combined in the imputation concept we propose. However, the imputation might improve for one individual cell by incorporating information from other, maybe similar, cells from the single-cell dataset. Recently, it was shown by the method scAND that the concept of network diffusion can successfully be applied in scATAC-seq imputation [[131]36]. In this study a bipartite network is created in which the edges describe if a region is accessible in a cell; no bulk reference is involved in this concept. However, considering the integration of bulk data, it might be possible to expand such a network by patterns detected in a bulk reference set. Edges could then describe different states for genomic regions depending on their occurrence in the single-cell and bulk datasets. Such networks would describe a complex composition of information and concepts from multi-graph theory might be helpful to extract patterns from these networks relevant for imputation [[132]37]. SIMPA’s strategy, to train a model for each candidate bin and each single cell, results in its capability to produce highly relevant results and at the same time in its main limitation which is the requirement of a large amount of computational resources. Using a high-performance cluster, the results presented in this manuscript could be obtained within 1–2 days. However, if computational resources are limited, SIMPA offers the opportunity to run the imputation for a selection of cells which, for instance, represent a certain cluster to be analyzed. As shown, cell clusters can be identified even on the sparse profiles using the appropriate method for dimensionality reduction. Importantly, InterSIMPA can also be applied for individual cells, providing interpretable results within seconds of runtime. On two datasets we demonstrated the potential of the concept of involving bulk data especially for ChIP-seq imputation, a topic less covered by other studies so far. The application of the proposed methods is possible as we provide the source code with comprehensive explanations in a github repository. However, our study might additionally serve as a guideline to be considered for the development of new imputation methods. As applied in the study from Zhu and colleagues [[133]24], it is possible to profile several histone marks and the transcriptome of single cells simultaneously and we expect that such techniques will be further improved in the near future. The resulting new datasets are expected to come up with a high level of sparsity, and imputation methods will be needed; though, there will be new requirements for imputation methods as single cells are then described by a variety of profiles describing a variety of biological functions. The complexity of information available for each single cell should be included in a novel imputation method applicable for such datasets. We therefore expect that future steps in single-cell imputation will involve the development of methods able to incorporate several protein-DNA interaction profiles and the transcriptome of single cells to integrate all these data for more robust imputation results. Based on the findings of this study we suggest to consider the integration of corresponding bulk data as well. Moreover, interpretability concepts for the imputation models should also be included within the development of future methods as it reveals detailed insights into the single-cell dataset under investigation. Conclusion The strategy of SIMPA leveraging bulk ChIP-seq datasets for single-cell sequencing data imputation, is able to complete specifically sparse scChIP-seq data of individual single cells. In comparison to the non-imputed data and a reference-free imputation method, SIMPA was better at recovering cell-type-specific pathways. Furthermore, the interpretability of the machine learning models trained for the imputation can be used to reveal biologically important information from a sparse single-cell dataset. Conclusively, we developed an ensemble of computational methods to extract more information from a sparse dataset and impute missing data to better handle data sparsity of scChIP-seq datasets. Supporting information S1 File. Contains supplementary notes 1 to 7 and supplementary figures S1 to S7. (DOCX) [134]Click here for additional data file.^ (2.8MB, docx) S1 Table (XLSX) [135]Click here for additional data file.^ (134.1KB, xlsx) Acknowledgments