Abstract
Motivation
Single-cell Chromatin ImmunoPrecipitation DNA-Sequencing (scChIP-seq)
analysis is challenging due to data sparsity. High degree of sparsity
in biological high-throughput single-cell data is generally handled
with imputation methods that complete the data, but specific methods
for scChIP-seq are lacking. We present SIMPA, a scChIP-seq data
imputation method leveraging predictive information within bulk data
from the ENCODE project to impute missing protein-DNA interacting
regions of target histone marks or transcription factors.
Results
Imputations using machine learning models trained for each single cell,
each ChIP protein target, and each genomic region accurately preserve
cell type clustering and improve pathway-related gene identification on
real human data. Results on bulk data simulating single cells show that
the imputations are single-cell specific as the imputed profiles are
closer to the simulated cell than to other cells related to the same
ChIP protein target and the same cell type. Simulations also show that
100 input genomic regions are already enough to train single-cell
specific models for the imputation of thousands of undetected regions.
Furthermore, SIMPA enables the interpretation of machine learning
models by revealing interaction sites of a given single cell that are
most important for the imputation model trained for a specific genomic
region. The corresponding feature importance values derived from
promoter-interaction profiles of H3K4me3, an activating histone mark,
highly correlate with co-expression of genes that are present within
the cell-type specific pathways in 2 real human and mouse datasets. The
SIMPA’s interpretable imputation method allows users to gain a deep
understanding of individual cells and, consequently, of sparse
scChIP-seq datasets.
Availability and implementation
Our interpretable imputation algorithm was implemented in Python and is
available at [30]https://github.com/salbrec/SIMPA.
Introduction
The discovery of protein-DNA interactions of histone marks and
transcription factors is of great importance in biomedical studies
because of their impact on the regulation of core cellular processes
such as chromatin structure organization and gene expression. These
interactions are measured by chromatin immunoprecipitation followed by
high-throughput sequencing (ChIP-seq). Public data from the ENCODE
portal, which provides a large collection of experimental bulk ChIP-seq
data, has been used for comprehensive investigations providing insights
into epigenomic processes that affect chromatin 3D-structure, chromatin
state, and gene expression, to name just a few [[31]1, [32]2].
Recently developed protocols for scChIP-seq are powerful techniques
that will enable in-depth characterization of those processes at
single-cell resolution. ChIP-seq was successfully performed on single
cells with sequencing depth as low as 1,000 unique reads per cell,
reflecting the low amount of cellular material that can be obtained
from only one single cell [[33]3]. Even though this low coverage leads
to sparse datasets, scChIP-seq data has enabled the study of biological
systems that cannot be investigated with bulk ChIP-seq applied for many
cells, for example, the differences between drug-sensitive and
drug-resistant breast cancer cells [[34]4].
The analysis of single-cell assays is strongly affected by the sparsity
of the data. In the context of ChIP-seq, sparsity means no signal
observed for numerous genomic regions without the possibility to
explain whether this is real or due to low sequencing coverage.
Notably, sparsity may disable the investigation of functional genomic
elements that could be of crucial interest. Hence, an imputation method
is needed to complete sparse scChIP-seq datasets while preserving the
identity of each individual cell.
The first published imputation method for bulk NGS epigenomic signals
was ChromImpute [[35]5], later followed by [[36]6], an improved method
for the imputation of signal tracks for several molecular assays in a
biosample-specific manner (biosample refers to the specific tissue or
cell-type, not to single cell). The challenge of transcription factor
binding site prediction was approached, for example, using deep
learning algorithms on sequence position weight matrices [[37]7] and
more recently by the embedding of transcription factor labels and
k-mers [[38]8]. With the aim to complete the ENCODE portal with imputed
bulk experiments, Schreiber et al. implemented the method Avocado,
which extends the basic concept of PREDICTD by deep neural networks
[[39]9]. Avocado was also validated on ChIP-seq data from both histone
marks and transcription factors [[40]10]. Such methods show the
successful application of machine learning algorithms and mathematical
approaches in predicting epigenomic signals such as transcription
factor binding activity. However, their scope, being limited to either
imputation of missing bulk experiments or sequence-specific binding
site prediction, hampers their application to single-cell data.
The challenge of imputation for sparse datasets from single-cell assays
has been extensively approached for single-cell RNA-seq (scRNA-seq)
used to quantify gene expression at single-cell resolution e.g.
[[41]11–[42]19]. In this context, similarly to scChIP-seq data,
sparsity is described by dropout events, which are transcripts having a
transcription rate of zero without knowing if the corresponding gene is
not expressed at all or if the expression rate is not detected due to
technical limitations [[43]16]. The question arises if these methods
can be easily adapted for imputation of scChIP-seq data. However, there
are crucial differences between the application of RNA-seq and ChIP-seq
techniques that must be considered regarding the development of a
method for scChIP-seq imputation.
First, in RNA-seq the set of relevant genomic regions, defined by the
species-specific transcripts, is more limited. For a ChIP-seq profile,
the regions of potential interest may originate from any position in
the genome and cannot be defined in advance. To simplify the analysis,
in scChIP-seq imputation the genome can be organized in non-overlapping
genomic windows (bins) of a certain size. At 5 kb resolution, this
binning concept results in more than 600,000 possible bins in the human
genome, a number that is much higher than the number of transcripts
considered in the RNA-seq context.
The second main difference is that scChIP-seq interactions are usually
represented by a Boolean value (the value can be either “True” or
“False”) describing the presence or absence of a significant enrichment
of sequencing reads defining a peak, while RNA-seq datasets contain
transcription rates. Consequently, the application of scRNA-seq
imputation methods on scChIP-seq data might be less appropriate.
In contrast, imputation methods for chromatin accessibility profiles
from single-cell ATAC-seq (single-cell Assay for Transposase-Accessible
Chromatin using sequencing, scATAC-seq) are potentially more
transferable to scChIP-seq imputation as their data representation is
more similar. A few methods exist that implement imputation for
scATAC-seq, though none of them was tested on scChIP-seq data so far.
Methods such as SCALE which uses a combination of Gaussian mixture
models and variational autoencoder [[44]20] have been shown to
outperform scRNA-seq methods to impute scATAC-seq data (see also FITs
[[45]21] and scOpen [[46]22]). These methods complement each other with
respect to the different approaches they implement, however, they share
the common concept of imputing the missing values within a sparse
matrix defined by the single cells (rows) and the genomic bins
(columns) of a given experiment, and only bins are considered that were
detected by at least one single cell if no further filtering is
applied. Consequently, such methods can offer imputation only on
regions that were observed in the single-cell dataset and it is likely
that many important regions along the whole genome will be missed.
To overcome this limitation, we developed SIMPA, an algorithm for
Single-cell ChIP-seq iMPutAtion, that uses bulk ChIP-seq datasets of
the ENCODE project to train its machine learning models [[47]2,
[48]23]. It was already shown that an additional bulk RNA-seq dataset
can be used to improve the imputation for a sparse scRNA-seq dataset
[[49]14]. Within SIMPA, publicly available bulk ChIP-seq data is turned
into a reference set used to define potential bins to be imputed. Bins
observed in the single-cell dataset are then used to derive training
sets specified by both the bulk reference data and single-cell data.
This training set is then leveraged by machine learning models to
compute specific imputation probabilities. Moreover, these models are
interpretable and can be used to gain insights into a given single-cell
dataset, allowing a more detailed investigation of individual cells.
The interpretability is implemented by InterSIMPA, an extension which
takes a single cell as input together with a genomic position of
interest and trains one classification model for the position to derive
a probability which can be seen as an imputation score. More
importantly, InterSIMPA ranks the genomic bins from the sparse
single-cell profile by their relevance for the model. The ranked bins
are enriched by detailed information and an importance score describing
the strength of their relationship with the given genomic position of
interest. These relationships can be interpreted as dependencies
between genomic regions that could be part of the gene regulatory
network (e.g., between enhancers and promoters).
The basic reference-based imputation concept of SIMPA was first
validated for human on simulated data and then on a real scChIP-seq
dataset of immune cells [[50]4]. The simulated data was used to
demonstrate the single-cell specificity of the imputations on full data
profiles. The real dataset allowed us to investigate the algorithm’s
capability of retaining the cell-type clustering and furthermore to
assess the biological relevance of the imputed regions based on a
pathway enrichment analysis. Results from InterSIMPA were validated
using promoter regions related to genes of the B- and T-cell signaling
pathways in the human dataset and regions related to brain functions in
a mouse dataset [[51]24].
Methods
Datasets
Preparation of the reference data (bulk ChIP-seq datasets)
To create the reference set that is used by SIMPA we downloaded all
ChIP-seq experiments from the ENCODE portal that comply with the
following criteria: the status is released, the experiment is
replicated (isogenic or anisogenic), no treatment to the biosample,
without genetic modification, and the organism is Homo Sapiens (human)
or Mus Musculus (mouse). Experiments represent different ChIP protein
targets (antibody targets within the ChIP) and biosamples (either
tissue or an immortalized cell line). For all experiments, we
downloaded fully preprocessed sets of protein-DNA interacting regions
as peak files: the replicated peaks for histone mark ChIP and the
optimal IDR thresholded peaks for transcription factor ChIP. For human,
2251 peak files of the hg19 or hg38 assembly were downloaded and
preprocessed as hg38 after converting hg19 files with the UCSC LiftOver
tool [[52]25]. For mouse, 848 peak files of the mm10 assembly were
downloaded and preprocessed (data from other assemblies did not
complement the selection). For downloading, preprocessing, and updating
the datasets we used a semi-automatic, SQL-backend procedure that was
already used in a previous project [[53]26]. A list with all 3099
reference experiments is provided in [54]S1 Table including information
about the protein target, the cell-type or tissue, the assembly, and
the exact ENCODE accession IDs.
Data preprocessing
In order to limit computational complexity, all reference experiments
were converted from ChIP-seq peak sets to genomic bin sets (bins are
defined as non-overlapping regions on a genome). We provide on github
the reference data in bin sizes (or resolution) of 5 kb and 50 kb for
hg38 ([55]https://github.com/salbrec/simpa). We also provide the
reference data for mm10 for some of the main histone marks in different
sizes. However, the github repository also provides scripts and
descriptions that enable the preparation of any target in the desired
resolution (bin sizes). Bin sets for the reference ChIP-seq experiment
are in binary format to be more efficiently integrated by the main
Python scripts. Given one reference experiment, a bin is said to be
“present” if there is at least one ChIP-seq peak that overlaps this
bin, “absent” otherwise.
Preprocessing of scChIP-seq data for human (Grosselin et al., 2019)
We downloaded the count matrices for H3K4me3 and H3K27me3 available in
GEO under accession number [56]GSE117309 in 5 kb and 50 kb binning
resolution, respectively. From the matrices, we derived bed files for
every single cell excluding gender-specific chromosomes. SIMPA,
InterSIMPA and other imputation methods were then applied on 25% of the
single cells randomly sampled, 1520 bed files for H3K4me3 and 1128 bed
files for H3K27me3.
Preprocessing of scChIP-seq data for mouse (Zhu et al., 2021)
We downloaded the count matrices for H3K4me3 available in GEO under
accession number [57]GSE152020 in 1kb binning resolution. From the
matrices, we derived bed files for every single cell excluding
gender-specific chromosomes. The initial dataset provides 7465 cells
for this histone mark; we applied InterSIMPA on a subset of 1000
randomly sampled cells.
The SIMPA algorithm
SIMPA is an algorithm implemented in Python 3.7.3 for “Single-cell
ChIP-seq iMPutAtion”, which is applied to one single cell represented
by a sparse set of scChIP-seq genomic regions (or peaks) provided by
the user in bed format. Within the algorithm, the given single-cell bed
file is converted into a set of bins SC describing the single-cell
input ([58]Fig 1A). The user also provides the protein target name for
the histone mark or transcription factor targeted by the antibody
within the single-cell immunoprecipitation. The target name is needed
to specify the training set, which consists of experiments from the
ENCODE reference set.
Fig 1. SIMPA’s algorithm and cross validations.
[59]Fig 1
[60]Open in a new tab
A. Identified ChIP-seq regions from bulk experiments were downloaded
from ENCODE and mapped to bins defined as non-overlapping and
contiguous genomic regions of a defined length (5 kb for H3K4me3 and 50
kb for H3K27me3 in the human dataset) and covering the whole genome
(the table). A bin is given a value of 1 for a particular experiment if
there is at least one ChIP-seq region in this experiment that overlaps
the bin, 0 otherwise. In total 2251 human and 848 mouse ChIP-seq
experiments for several targets (histone marks or transcription
factors) performed in several biosamples (tissues and cell-lines) were
downloaded from ENCODE portal and preprocessed. Depending on the target
specified by the user, the target-specific reference set RS is then
created and contains all experiments related to this target (rows in
red; H3K4me3 is given as example) and all bins observed for at least
one of those experiments. B. The single-cell specific training feature
matrix TF is created as a subset of RS by selecting only bins observed
within the given single cell (green columns). All other bins from RS
are the candidate bins (c; blue columns) and define the class vectors
consisting of the corresponding values in RS. For each candidate bin, a
classification model is trained based on the training features and the
class vector identifying associated experiments. C. Cross-validated
evaluations of SIMPA’s Random Forest performances to predict values of
candidate bins defined for real human single-cell data related to
H3K4me3. For each candidate bin, a ten-fold cross-validation was
applied and summarized as the mean Area under ROC-Curve (AUROC) or Area
under Precision-Recall Curve (AUPRC) (y-axes). Results for all bins are
represented by boxplots subdivided by class balance in the candidate
bins (percentage of “1” values in the bin) (x-axis). The dashed lines
describe the baseline performance expected from a random classification
model: 0.5 for AUROC and equal to the class balance for AUPRC.
Given the sparse profile of one single cell as input, SIMPA aims to
impute missing bins based on predictive information within bulk data
specified by the selected ChIP protein target and further by the
regions taken from the input single-cell profile. In order to make the
bulk data informative, first SIMPA collects all the ENCODE reference
experiments available for the given target that defines the rows of the
reference set matrix (RS) where columns represent bins ([61]Fig 1A):
[MATH:
RS=ai,j,1≤i≤n
,1≤j≤m
:MATH]
with
[MATH:
ai,j∈0,1describingacellofthematrixwithvalue=1whenbinjinreferenceexperimentiispresent,0otherwise, :MATH]
and where n is the number of experiments available for the given
target, and m is the number of bins that are present in at least one of
the target specific experiments. As the rows are defined by the given
target, the target-specificity is induced within this step.
Second, a subset of RS is created by selecting only the columns for
bins that are present in the given single-cell profile SC to create the
training features TF:
[MATH: TF⊂RS,
:MATH]
[MATH:
TF=ai,k,1≤i≤n
,1≤k≤s
, :MATH]
where k indexes a selection of bins from RS that are present in SC and
with s the number of bins in SC ([62]Fig 1B). Bins present in RS but
not in SC are collected and named as candidate bins c that are
potentially imputed bins.
Third, SIMPA takes each candidate bin c[g] in c (c is the set of
candidate bins; the number of bins in c is P; thus 1 ≤ g ≤ P)
separately to compute an individual imputed probability ρ[g] that c[g]
is present in the single cell. Given c[g], SIMPA trains a
classification model cm[g] based on TF defining the features and c[g]
the class vector. Because an individual model is trained for each
individual candidate bin, bin-specificity is induced for the whole
approach. The imputed probability ρ[g] is finally computed by cm[g]
which takes as input an artificial instance vector b = (b[k]), b[k] =
1, 1 ≤ k ≤ s. Consequently, ρ[g] is the probability of c[g] to be
predicted for the imputed single-cell result, given the fact that all
bins in SC are observed. As we use a Random Forest implementation from
the scikit-learn (version 0.21.3) Python’s library [[63]27, [64]28]
with default settings to build classification models, the imputed
probability is then the mean predicted class probability of the trees
(by default 100) in the forest while the class probability of a single
tree is calculated by the fraction of samples of the same class in a
leaf.
When applying this algorithmic strategy on a real sparse single-cell
profile to impute candidate bins, the final step of receiving the
imputed probability differs from any cross-validation scenario as there
is no hold-out sample that could be used to apply the model on.
Instead, the model is applied on a synthetic vector containing only 1s
to receive the imputed probability. Providing this vector in which the
interaction is present for each bin, differs to the nature of the
reference data which usually describes a mixture of interaction and
non-interaction. However, this strategy is applied in the same way for
all candidate bins excluding a potential bias regarding the ranking of
candidate bins by the imputed probability. More crucially, by this
final step we force the Random Forest model to return an imputed
probability based on the knowledge that all bins captured by a single
cell are present, thus, the outcome is highly specific to the given
single cell. Moreover, this strategy allows us to keep focus on a cell
while interpreting the underlying model to gain more insights with high
specificity to the given individually single cell.
Finally, SIMPA creates two files: one file in bed format and the other
in SIMPA format described as a table listing the single-cell bins
first, followed by the imputed bins sorted by the imputed probability.
A line represents a bin described by its ID, its genomic coordinates,
its frequency according to the target-specific reference experiments,
and the imputed probability. Note that the first bins on top of this
file have no imputed probabilities as they represent the original
sparse single-cell input (a default value of -1 is assigned). The
second file created by SIMPA is the imputed bed file containing the
original single-cell bins and the additional imputed bins selected
among those with the highest imputed probability. The number of bins
within this bed file is defined by the average number of bins present
in the target-specific bulk experiments, e.g. 32,584 for H3K4me3 (5 kb
bin size) and 12,598 for H3K27me3 (50 kb bin size) on the hg38 samples.
Cross validations
Stratified ten-fold cross-validations were done to verify if the Random
Forest Classifier used by SIMPA when applied to the real single-cell
dataset is able to make use of statistical patterns from the bulk data
to train accurate models predicting the presence or absence of a
protein-DNA interaction in candidate bins. Hence, within this analysis,
the performance of Random Forest was cross-validated on the candidate
bin values not used in the training set but still defined in the
reference set. We chose Random Forest as, by default, the algorithm
never uses all given features for training one decision tree and
consequently smaller sets of genomic bins are considered for a tree
which agrees with the biological assumption that not all regions,
captured by an individual single cell, are relevant for the imputation
of a candidate bin. Given a candidate bin, SIMPA trains a Random Forest
with 100 decision trees (number of estimators) and aggregating the
votes from all trees results in a probability we use as imputed
probability.
We applied the following cross-validation approach (results shown in
[65]Fig 1C and Fig S1 in [66]S1 File): given a protein target (either
H3K4me3 or H3K27me3), 10 single cells were randomly chosen from the
real dataset for each cell type; for each single cell the training
feature matrix TF for SIMPA was created as explained above together
with the set of candidate bins. Then, for each candidate bin that
defines the class vector, a Random Forest classification model was
trained and evaluated by the area under the ROC-curve within a
stratified ten-fold cross-validation. In addition, we used the area
under precision-recall curve to better study the class vector
imbalance.
Results
Algorithmic concept and cross-validations
Unlike many other single-cell imputation methods, SIMPA leverages
predictive information within bulk ChIP-seq data by combining the
sparse input of one single cell and a collection of bulk ChIP-seq
experiments from ENCODE. In order to better compare bulk and
single-cell data, ChIP-seq regions (or significant signal/noise
ChIP-seq peaks) are mapped to genomic bins ([67]Fig 1A; see [68]Methods
for details about bulk and single-cell data retrieval and processing).
SIMPA produces results for each single cell of a scChIP-seq dataset by
using machine learning models trained on a subset of the ENCODE data
related to a selected ChIP protein target, that is the histone mark or
transcription factor used in the single-cell experiment. Derived from
this target-specific subset, the classification features are defined by
genomic bins detected in the single cell, while the class to predict is
defined by a bin observed in at least one target-specific bulk ENCODE
experiment, but not in the single cell ([69]Fig 1B). In other words, by
using this particular data selection strategy, SIMPA searches relevant
statistical patterns linking (i) the bulk ChIP-seq data across
single-cell-related bins and target-related experiments for different
cell types to (ii) the potential presence or absence of a bin in the
given single cell.
Within a cross-validation scenario that compared the predicted
probabilities to corresponding candidate bin values, results show that
the machine learning-concept of SIMPA is able to provide accurate
predictions ([70]Fig 1C, Supplementary Note 1 in [71]S1 File, and Fig
S1 in [72]S1 File). Moreover, on the high-resolution H3K4me3 human
dataset [[73]4], SIMPA achieved high recall rates for bins removed from
single-cell profiles (Supplementary Note 2 and Fig S2 in [74]S1 File).
Validation on simulated data
In order to evaluate the algorithm’s ability from few input bins
(hundreds) to complete full data profiles (thousands of bins) of
different protein targets and cell-types, we simulated sparse
protein-DNA interaction profiles from the bulk ENCODE ChIP-seq
experiments that are used as reference data by SIMPA. For the
simulation, we took human bulk experiments for different
cell-type-target combinations to define them as full single-cell
profiles (origin) and down-sampled those profiles to simulate sparse
single-cell profiles (from 100 to 1600 randomly selected bins) (see
Supplementary Note 3 in [75]S1 File for details). Each simulated sparse
profile was used as input for SIMPA and the output was compared to the
origin (excluding bins used as input). For the model training, the full
origin profile was excluded from the reference training set in order to
apply the default validation, called leave-out origin (LOO).
Additionally, a more challenging validation strategy was applied in
which all reference profiles for the same cell-type (biosample) were
excluded, called leave-out cell-type (LOCT).
For H3K4me3, the most frequently investigated target in ENCODE, high
area under ROC-curve values confirm that SIMPA is able to accurately
recapitulate the original data from the simulated sparse profiles
([76]Fig 2A). Even if the cell-type-specific information is completely
removed from the training set (LOCT), the performance is still high.
Furthermore, these observations are confirmed when using
precision-recall curves as performance measure ([77]Fig 2B), a relevant
analysis given the imbalance in the validation sets (containing far
fewer positive than negative samples). We made similar observations in
a ROC-curve and precision-recall curve analysis for other
cell-type-target combinations (Figs S4 and S5 in [78]S1 File).
Fig 2. Performance on simulated sparse profiles in different cell-types.
[79]Fig 2
[80]Open in a new tab
A. For simulation, a full human single-cell profile (origin profile) is
defined by a full bulk profile and the corresponding sparse single-cell
profile is defined by the down-sampled bulk profile. Compared to real
data, the simulations allows us to test SIMPA on full profiles related
to a large variety of ChIP protein targets and biosamples. Using the
origin profile as the validation set of true binding interactions, the
area under ROC-curve (AUROC in y-axis) describes the capability of
SIMPA to accurately impute the sparse profile and recapitulate the
origin. The bars describe the mean AUROC and the error bars describe
the standard deviation across multiple applications on sparse sets with
different sizes. SIMPA was validated with two strategies, the default
leave-out origin (LOO; origin profile excluded from the training set)
and the extreme leave-out cell-type (LOCT; all experiments with the
same cell type than the origin profile are excluded from the training
set). The x-axis labels indicate the cell-type of the origin profile
and additionally the ENCODE accession to show which of the experimental
dataset was used as origin. B. Same as in A but using the area under
precision-recall curve (AUPRC in y-axis) as performance measure. The
pink bars show the class balance (fraction of positives in the class
feature) representing the random assumption as baseline to be expected
from a primitive classifier that randomly assigns the class values
(according to [[81]29]). Note, the sampled bins, that simulate a sparse
single-cell profile and are expected to be known before imputation,
were completely excluded before computing area under the curve values.
In order to assess the single-cell specificity of SIMPA in this
simulation, we compared each fully imputed profile to its origin
profile and also to a consensus profile representing experimental
datasets that are most similar to the origin (experimental profiles
with same protein target and same cell-type, see Supplementary Note 3
in [82]S1 File for details). All comparisons excluded bins used as
input for SIMPA. Results show that for most of the simulations (>95%)
the imputed profile is closer to the origin profile, hence single-cell
specific ([83]Fig 3A). Moreover, we observed that the origin profiles
can be more similar to the consensus profile (less specific) or less
similar (more specific). When the origin profiles are less specific, it
is harder for SIMPA to achieve an imputed profile specific to the
origin (single-cell specific). However, for such cases in which the
origin is quite close to the consensus (Jaccard-Index > 0.65) the
imputation is still single-cell specific, although with a lower
single-cell specificity value ([84]Fig 3B).
Fig 3. Single-cell specificity analysis.
[85]Fig 3
[86]Open in a new tab
A. The Jaccard-Index is used to compare the imputed profiles from SIMPA
with the origin human bulk profile used to create a simulated sparse
profile and the consensus profile representing the remaining
experiments available for the same biosample-target combination as the
origin profile. The dashed line shows the balance line at which the
imputed profile from SIMPA is neither closer to the origin nor to the
consensus. Cases above the dashed line are those in which the imputed
profile is single-cell specific, hence, closer to the origin than to
the consensus. B. “Single-Cell Specificity” on the y-axis is defined as
the difference between the imputed-to-origin similarity (y-axis in A)
and imputed-to-consensus similarity (x-axis in A). Having the
similarity between the origin and the consensus on the x-axis, this
plot allows the visualization of the single-cell specificity in
relation to how specific the origin is. The higher the similarity
between the origin and consensus, the less specific is the origin
profile and the harder the challenge to capture its specificity.
Profiles above the 0 line are single-cell specific as they are closer
to the origin than to the consensus. Before computing the Jaccard-Index
values, the sampled bins, which simulate a sparse single-cell profile
and are expected to be known before imputation, were removed from all
the sets, origin, imputed, and consensus.
Taken together, the simulation results show that models trained from a
few bins accurately impute thousands of bins and show that completed
profiles can be single-cell specific on real data even if the
investigated cell-type is not represented by any of the bulk datasets
in the reference set (leave-out cell-type validation).
Model interpretability on real data
Addressing one main aim of this study to make models interpretable, we
implemented an extension called InterSIMPA. Here we define
interpretability as the possibility of obtaining information of
potential biological relevance from the relationships observed between
the training features (genomic bins) and a genomic position of
interest. These relations can be expected to be part of the genomic
regulatory network.
The training features are derived by InterSIMPA in the same way as for
SIMPA but a single machine learning model is trained for a selected
genomic position of interest defined by the user. Accordingly, one
imputed probability is returned for genomic position of interest with
information about the genomic bins most important for the machine
learning model. Finally, the algorithm reports the genes closest to
these bins (Supplementary Note 4 in [87]S1 File).
To demonstrate how interpretable imputation models can be used to
expose more information from the sparse ChIP-seq profile of individual
single cells, we use the human single-cell ChIP-seq dataset of H3K4me3
interactions in B-cells and T-cells from Grosselin et al. [[88]4].
According to the given cell types, we focused on promoter regions of
genes that are involved within the B-cell and T-cell receptor signaling
pathways. The two gene sets contain 67 and 97 genes for the B-cell
receptor and T-cell receptor signaling pathways, respectively, with an
overlap of 44 genes. To focus on the genes that could be more specific
to the cell-types under investigation, from the union of the two gene
sets we selected 24 genes with frequency of their promoter regions
lower than 20% in the corresponding H3K4me3-specific reference set,
which means that their promoter has no detected interaction site for
more than 80% of the ENCODE reference experiments for H3K4me3 in
different cell-types and tissues (Supplementary Note 4 in [89]S1 File).
As H3K4me3 is an activating histone mark, we expected to observe
interaction sites in the promoter regions of these genes. However, for
many of those promoter regions, the H3K4me3 binding is missing for most
of the single cells in the sparse data ([90]Fig 4A). Our expectation
that SIMPA is able to impute such regions at least in a
cell-type-specific manner, is confirmed by comparing the imputed
probabilities calculated by SIMPA for promoter regions of the 24
selected genes in single B- and T-cells ([91]Fig 4B). For most of the
genes, the imputed probability is higher when SIMPA is applied on
single cells that are from the pathway-related cell-type. Finally, we
evaluated the interpretability of the 24 imputation models by comparing
feature importance values and co-expression values of the
feature-related genes with the gene of the imputed promoter ([92]Fig
4C). Co-expression data from the STRING database was used [[93]30]. The
observed high correlations suggest that InterSIMPA is capable to
describe biologically relevant promoter-promoter relations by the
predictive information hidden within sparse histone mark profiles of an
activating mark. Consequently, our approach not only completes the
sparse scChIP-seq dataset, but its interpretability-extension is even
capable of providing deeper insights into the data.
Fig 4. Pathway related gene analysis using the interpretation of imputation
models.
[94]Fig 4
[95]Open in a new tab
A. Fraction of single cells for which H3K4me3 binding is observed
within the gene’s promoter region in the human single-cell dataset
(orange and blue bars representing B-cells or T-cells, respectively).
Y-axis labels show the gene names and if the gene belongs to the B-cell
or T-cell receptor signaling pathway or to both. B. Imputed probability
computed by SIMPA for the gene-related promoter regions shown in A. The
imputation was applied on numerous single cells from the cell-types
B-cell (orange) and T-cell (blue). The error bars represent the
standard deviation across the imputation runs on different cells. For
the majority of genes, the imputed probability is higher within the
cell-type that corresponds to the gene’s pathway. C. Correlation of
feature importance and co-expression values. For each model used to
impute a promoter (y-axis), the training features (genomic bins) were
extracted together with their importance value provided by the Random
Forest algorithm and annotated with the nearest gene on the genome.
Co-expression values, derived from transcriptomic and proteomic
measurements, of those genes with the gene related to the imputed
promoter were retrieved from the STRING database. The Pearson
correlation coefficient of feature importance and co-expression values
is shown (x-axis).
Performance on cell-type clustering and functional analysis
After the evaluation of the InterSIMPA extension, here, we evaluate how
SIMPA enhances single-cell data corresponding to different cell types.
For the imputation of a full single-cell dataset, SIMPA was applied for
each cell individually. The resulting imputed profiles were then
analyzed within two validations, to examine if (i) cell-type clustering
was retained after imputation and (ii) if the imputed single-cell
profiles are significantly associated with genes of the corresponding
cell-type-specific pathway. Following the investigations of Schreiber
et al. [[96]31], we also compared bin probabilities from SIMPA to a
simple imputation approach that uses bin frequencies in the reference
set (experiments with same protein target) as a probabilistic model
without using any machine learning model, called the average
interaction method. Additional imputations and randomization tests were
applied and compared to better analyze the basic concept of SIMPA (see
Supplementary Note 5 in [97]S1 File).
Because of the better resolution available for H3K4me3 processed as
genomic bins of size 5 kb in the human dataset, we present below
results on this histone mark and refer to supplementary material for
H3K27me3 processed at 50 kb bins (Fig S6 in [98]S1 File). For
benchmarking, we used a single-cell ATAC-seq imputation method, SCALE,
solely based on the single-cell dataset itself (reference-free) in
contrast to SIMPA, which takes advantage of information from the
reference bulk dataset. After applying a two-dimensional projection on
the sparse and imputed datasets, we observed that the separation
between the cell types was retained by SIMPA and by the reference-free
method, contrary to the average interaction method ([99]Fig 5A).
Moreover, three T-cell outliers were successfully associated to the
related cell-type cluster by SIMPA, which achieved a slightly better
homogeneity of the clusters in comparison to SCALE which did not
handled correctly the outliers (Fig S6 in [100]S1 File). Dimensionality
reduction was done by a combination of principal component analysis
(PCA) and t-stochastic neighbor embedding (t-SNE) as suggested by
Grosselin et al. on their analysis of sparse data [[101]4]. Unlike the
suggested procedure, we did not perform cell filtering, as we were
interested to observe outliers after imputation.
Fig 5. Cell-type specificity validation.
[102]Fig 5
[103]Open in a new tab
A + B Separation of single cells according to cell type. A.
Dimensionality reduction analysis applied on the human H3K4me3 data
derived from (i) the sparse single-cell data and three different
imputation methods, (ii) SIMPA, (iii) reference-free imputation, and
(iv) average interaction based on expected frequencies in the reference
set. Results from SIMPA and from the reference-free method achieve the
best clustering by separating the single cells (points) by cell types
(colors). B. Effects of input modification on SIMPA, (i) using a
shuffled reference set or (ii) randomized sparse input data, or using
other histone marks as reference instead of H3K4me3, either (iii) the
functionally different histone mark H3K36me3, or (iv) the functionally
similar histone marks H3K9ac and H3K27ac. Only SIMPA used with relevant
protein targets was able to correctly cluster all single cells (no
outliers). C. Pathway enrichment analysis. Boxplots show the
significance of pathway enrichment analyses of genes annotated by
single-cell regions as log-transformed false discovery rate (FDR;
x-axis). Each dot represents the FDR of one single cell from the
results of the different analysis experiments shown in A+B (y-axis).
The dashed lines represent the log-transformed significance threshold
of an FDR equal to 0.001. Only SIMPA achieves significant results by
imputing preferably genomic regions associated with relevant
pathway-related genes.
In order to validate further the algorithmic concept of SIMPA, we
implemented two randomization tests in which either the ENCODE
reference information was shuffled (Shuffled Reference) or the sparse
single-cell input was randomly sampled (Randomized Sparse Input).
Additionally, we applied SIMPA on the same data but with models trained
for different related or unrelated histone marks. The selected histone
marks were H3K36me3, a histone mark functionally different to H3K4me3,
and H3K9ac and H3K27ac, a group of two histone marks functionally
related to H3K4me3. These two marks were used together to increase the
training data size. From this comparison, we observed that (i) the
separation on the projection is lost after removing statistical
patterns through shuffling or randomization, (ii) separation quality is
moderate with an input mark functionally different from the real mark,
and (iii) separation quality stays high using SIMPA with target histone
marks functionally similar to the real mark ([104]Fig 5B). Thus, the
most relevant statistical patterns from the reference dataset are
identified by both the selection of single-cell-specific regions and
the selection of target-specific experiments. Similar observations were
made for H3K27me3 although a more compact clustering could be achieved
on the SIMPA profiles compared to those from the reference-free method
(Fig S6 in [105]S1 File). Across several dimensionality reduction
procedures applied for the H3K4me3 dataset, SIMPA and the
reference-free method were both stable in retaining the cell-type
clustering (Fig S7 in [106]S1 File). From these analyses we
additionally conclude that the UMAP method using the Jaccard-Index
distance achieves reasonable results when applied directly on the
sparse data (in comparison to the common approach in single-cell
analysis that uses first a PCA to select dimensions).
As pathway enrichment analysis is a common step in ChIP-seq data
exploration, we next investigated if enrichment analyses of
cell-type-specific pathways for individual single cells improve after
applying imputation. We analyzed the sparse profiles and different
imputed results with the KEGG pathway analysis function of the
Cistrome-GO tool [[107]32]. As reported in [108]Fig 5C, the original
sparse data did not provide enough interaction sites to show a
significant pathway enrichment for any of the two cell types. Results
from the reference-free strategy showed an improvement but not
significant. However, with regions imputed by SIMPA, it was indeed
possible to achieve significant enrichment scores and recover the
cell-type-specific pathways for most of the cells. These results show
that SIMPA is able to integrate functionally relevant information from
the reference data in order to impute additional biologically
meaningful bins, in contrast to the reference-free method, which is
limited to the single-cell dataset.
Optimal size of the imputation sets
As described, SIMPA computes the imputed probabilities for numerous
genomic bins, and sorts and prioritizes those accordingly for
imputation. In the previous validations, as a default, we imputed a
number of bins equivalent to the average number of bins observed across
all bulk ChIP-seq profiles from the target-specific reference set. On 5
kb resolution, the average number of bins of the H3K4me3 experiments is
32,584. However, once the bins are ranked by the imputed probability it
is up to the user to alternatively create imputed sets of different
sizes. With the next analysis we address the question about the optimal
number of bins needed to improve the cell-type clustering at the same
time enabling the detection of the relevant biological function by a
significant enrichment of the correct pathway (for details see
Supplementary Note 6 in [109]S1 File).
The best cell-type clustering quality, evaluated by the Davies-Bouldin
score [[110]33], is reached when adding ~11,000 bins ([111]Fig 6A). At
this level, SIMPA slightly improves the clustering quality compared to
the reference-free method.
Fig 6. Clustering quality and pathway enrichment for different sizes.
[112]Fig 6
[113]Open in a new tab
A. Clustering quality (y-axis) evaluated with the Davies-Bouldin score
(the lower the better) applied on the imputed data after dimensionality
reduction as described for [114]Fig 5A+5B. While the reference-free
method derived only one imputed set for all the single cells (dashed
black line), we could derive several imputed sets of different sizes
using the imputed probabilities from SIMPA (x-axis). B. The pathways
under investigation are the B-cell and T-cell receptor signaling
pathways. In this way we analyze two pathways surely related or
unrelated with the cell-types present in the dataset, B-cell and
T-cell. The y-axis describes the percentage of imputed profiles for
which a significant enrichment of the aforementioned pathways could be
achieved. The dashed lines represent cases for which the unrelated
pathway is significantly enriched, which is the T-cell receptor
signaling pathway when analyzing a B-cell and vice versa. The
significance level used is 0.001 similar to the analysis shown in
[115]Fig 5C. To reduce the computational resources spent on the pathway
enrichment, this was done for ten imputation set sizes (highlighted in
blue on the x-axis).
Considering the amount of cells in which the cell-type related pathway
is significantly enriched, we observed that in ~50% of the cells the
related pathway is associated when adding ~28,000 bins ([116]Fig 6B).
After adding more than 32,000 bins, almost all cells have a significant
enrichment for the cell-type related pathway, however, it seems to be
also the limit for avoiding the association of the unrelated pathway.
For this analysis the same pathways and settings are used as in
[117]Fig 5C; the unrelated pathway is the T-cell receptor signaling
pathway when analyzing a B-cell and vice versa.
InterSIMPA applied on mouse scChIP-seq data (Zhu et al., 2021)
New technologies allow to obtain the joint profiling of histone
modifications and transcriptome in single cells as used by [[118]24].
From this dataset we chose H3K4me3 profiles from mouse brain cells
available on 1kb binning resolution to further validate the concept of
InterSIMPA. As each single-cell dataset is very specific to the
investigated cell-types, we focused on four genes also used within the
original study to analyze the difference between excitatory and
inhibitory neurons, as well as non-neurons. We first annotated the
promoter regions of these genes based on 1000 randomly selected sparse
single-cell profiles and observed a very low coverage of the promoter
regions ranging from 0.0 to ~2.0% ([119]Fig 7A). Despite the low
single-cell coverage, the patterns agree with those from the original
study (see Fig 1F in [[120]24]); note that we focus on the promoter
only and do not consider the full gene body. We then analyzed the
imputed probabilities from InterSIMPA for these four genes ([121]Fig
7B). For the three genes Snap25, Neurod6, and Gad2, the imputed
probability is higher for the neurons compared with the non-neurons,
and no difference can be observed between cell types for Slc1a3.
Similarly for Snap25, Neurod6, and Gad2, a moderate positive
correlation could be observed between feature importance values and
STRING co-expression data, but not for Slc1a3 ([122]Fig 7C). By using
the appropriate parameter of the tool, we restricted the InterSIMPA
output to regions within the proximity of 10kb or 5kb, respectively. As
shown in [123]Fig 7C this has a positive impact on the results from
InterSIMPA as it increased substantially the correlation calculated for
the three genes.
Fig 7. Gene analysis using the interpretation of imputation models.
[124]Fig 7
[125]Open in a new tab
A. Fraction of single cells for which H3K4me3 binding is observed
within the gene’s promoter region in the mouse single-cell dataset.
Dark green and yellow bars representing excitatory (ExNeu) and
inhibitory (InNeu) neurons, respectively. Red bars represent
non-neurons (NonNeu). y-axis labels show the gene names. B. Imputed
probability computed by SIMPA for the gene-related promoter regions
shown in A. The imputation was applied on 1000 single cells in total
and the error bars represent the standard deviation across the
imputation runs on different cells. C. Correlation of feature
importance and co-expression values. For each model used to impute a
promoter (y-axis), the training features (genomic bins) were extracted
together with their importance value provided by the Random Forest
algorithm and annotated with the nearest gene on the genome. As
indicated by the figure legend, the selection of genomic bins was
either not restricted (none), or restricted by a maximum distance to
the transcription start site (TSS) of the nearest gene. Co-expression
values, derived from transcriptomic and proteomic measurements of those
genes with the gene related to the imputed promoter were retrieved from
the STRING database. The Pearson’s correlation coefficient of feature
importance and co-expression values is shown (x-axis).
Discussion
After confirming the presence of statistical patterns within the ENCODE
bulk ChIP-seq reference data, we show that machine learning models can
leverage those patterns for the accurate inference of interaction sites
in sparse single-cell ChIP-seq profiles from individual single cells.
The investigation of protein-DNA interactions on single-cell resolution
emerged more recently compared to gene expression (single-cell RNA-seq)
or chromatin accessibility (single-cell ATAC-seq) and consequently,
less datasets are available for scChIP-seq. The human dataset we chose
for our analysis includes profiles for H3K4me3 in 5kb resolution and
H3K27me3 in 50kb resolution in human B-cells and T-cells. This dataset
is appropriate as it provides a clear cell-type annotation which
enabled an analysis based on the pathways we would expect to be
assigned to B-cells and T-cells, respectively. Contrary to the human
dataset from Grosselin et al., we observed that a very complex
procedure was applied in the study presenting the mouse dataset from
Zhu et al. to reveal the different cell clusters [[126]24]. From this
study we chose the only profile available on a higher resolution (1kb),
which describes the histone mark H3K4me3. Using common dimension
reduction methods for single-cell datasets, we were able to reproduce
clustering results for the Grosselin dataset (Fig S7 in [127]S1 File)
but it was not possible to recapitulate the mouse cell types, neither
before nor after imputation (results not shown). This could be
explained by the complexity of the joint profiling of different histone
marks within single cells and by the fact that the Zhu and colleagues
could not apply a barcoding strategy to annotate the cell-types as it
was done for the Grosselin dataset. Therefore, we extensively used the
human dataset together with simulations to validate SIMPA and limited
the use of the mouse dataset to the validation of the model
interpretability offered by the InterSIMPA extension.
Based on the simulations, we could show that the imputed results
obtained by SIMPA are single-cell-specific for several cell-type-target
combinations even if the experiments related to the cell-type were
completely excluded from the training set. In both types of validation
(leave-out origin and leave-out cell-type validations), SIMPA was able
to capture cell-type-specific patterns even though the reference set
was composed of profiles from many different cell-types and tissues, or
the cell-type related data was completely excluded. Because the number
of available bulk experimental profiles (ENCODE datasets) differs
between targets, different training set sizes are available for
different targets, with the smallest training set for H3K9ac (49
biosamples). Even for training sets of smaller size, the predictive
performance remained high, although we expect models to be more
reliable the larger the training set. Given that data portals such as
ENCODE are still growing, we expect that the model reliability will
increase in the future for many targets with a growing number of
available reference datasets.
The interpretation of the SIMPA models, done with InterSIMPA applied on
a real scChIP-seq dataset published by Grosselin and colleagues, allows
us to reveal additional information from the ChIP-seq profiles measured
within individual cells regarding regions responsible for the
imputation. Importantly, leveraging reference data allows us to impute
regions that were not present in the single-cell dataset at all, in
contrast to a reference-free strategy. Considering for example the
promoter regions of the T-cell receptor signaling pathway genes CTLA4
and ICOS, these promoter-regions are not detected in any of the cells
from the Grosselin et al. dataset, however, both have a high imputed
probability from SIMPA. Moreover, for both promoters a high correlation
coefficient was achieved within the validation by STRING co-expression
values, confirming that our implementation not only answers the
question about whether these promoters should be imputed or not, but it
additionally reveals valuable information about regulatory relations
implied by the single-cell dataset.
Regarding the full data imputation analysis, we observed further
advantages of SIMPA’s reference-based imputation strategy compared to
the reference-free imputation method. While both algorithms achieve a
good separation of the cell types, only with the imputed profiles from
SIMPA it was possible to determine the relevant biological function of
single cells as shown by the pathway enrichment analysis. This suggests
that SIMPA imputes biologically meaningful genomic bins which are of
functional relevance and confirms that, even though the training set
involves a variety of different tissues and cell-types, SIMPA can find
statistical patterns that belong to the correct cell-type. For
single-cell datasets which reveal unknown subpopulations of cells,
SIMPA could be used to identify active pathways for those cells after
imputation. Interestingly, the quality of those results was maintained
to some extent when not exactly the same scChIP-seq histone-mark target
but functionally related targets were used to define the reference set.
This suggests a valuable strategy to be applied for targets with a low
availability of public bulk reference profiles.
Given the results achieved by InterSIMPA on H3K4me3 single-cell
profiles for mouse brain cells, we see a difference in performance
compared to those achieved on human B-cells and T-cells that might be
explained by the more complex cell type identification in the original
study from Zhu et al. [[128]24]. However, we still see higher imputed
probabilities of Snap25, Neurod6 and Gad2 in excitatory and inhibitory
neurons compared to non-neurons, which confirms the relevance of those
genes to a cell-type-specific study. This could not be observed for
Slc1a3. The validation analysis based on the STRING co-expression data
also highlighted the potential involvement of the three genes in
regulatory networks. This analysis, which is also integrated within
InterSIMPA, provides information about the reliability of the imputed
probabilities. The gene Slc1a3 serves as an example for which the
non-specificity of the imputed probabilities agrees with the low
correlation coefficient between the InterSIMPA feature importance and
STRING co-expression.
SIMPA integrates solely datasets from bulk ChIP-seq in order to build
the reference set. However, in the future, it will be relevant to
integrate other types of data in order to complementarily extend the
reference set. For example, SCRAT is an analysis tool that summarizes
single-cell regulome data using different types of public datasets such
as genome annotations or motif databases that could be of interest for
the application of SIMPA to transcription factor scChIP-seq profiles
[[129]34]. The scATAC-seq analysis tool SCATE performs imputation of
missing regions integrating different types of public datasets (e.g.
co-activated cis-regulatory elements and bulk DNase-seq profiles)
[[130]35]. For future work, such approaches suggest the development of
a reference-based method, allowing the imputation for both scChIP-seq
and scATAC-seq data, integrating both types of reference data from the
corresponding bulk assays and further complementary datasets.
In the current version, SIMPA and InterSIMPA use all bulk profiles as
selected by the target(s) specified by the user. Yet, there is no
parameter that allows a user to further restrict the reference set by
cell-types or tissues. We expect that this is not necessary as the
machine learning algorithm can find the relevant patterns in the
training set due to the additional specification induced by the given
single-cell profile, which provides enough information about its
cell-type despite its sparsity. Especially the pathway enrichment
analysis supports this assumption since the imputed regions are related
to the true biological function of the given cell-types. Nevertheless,
for a future extension of the algorithm, it might be beneficial to
investigate the impact of further specifying the reference set by
tissues or cell-types related to the single-cell dataset.
We demonstrate that the usage of bulk reference data can be beneficial
in scChIP-seq imputation, especially when the single-cell data set
leaves genomic regions uncovered. Given a single cell as input, its
profile is used to specify the underlying training set to receive a
result individual for the cell. Thus, single-cell and bulk information
are combined in the imputation concept we propose. However, the
imputation might improve for one individual cell by incorporating
information from other, maybe similar, cells from the single-cell
dataset. Recently, it was shown by the method scAND that the concept of
network diffusion can successfully be applied in scATAC-seq imputation
[[131]36]. In this study a bipartite network is created in which the
edges describe if a region is accessible in a cell; no bulk reference
is involved in this concept. However, considering the integration of
bulk data, it might be possible to expand such a network by patterns
detected in a bulk reference set. Edges could then describe different
states for genomic regions depending on their occurrence in the
single-cell and bulk datasets. Such networks would describe a complex
composition of information and concepts from multi-graph theory might
be helpful to extract patterns from these networks relevant for
imputation [[132]37].
SIMPA’s strategy, to train a model for each candidate bin and each
single cell, results in its capability to produce highly relevant
results and at the same time in its main limitation which is the
requirement of a large amount of computational resources. Using a
high-performance cluster, the results presented in this manuscript
could be obtained within 1–2 days. However, if computational resources
are limited, SIMPA offers the opportunity to run the imputation for a
selection of cells which, for instance, represent a certain cluster to
be analyzed. As shown, cell clusters can be identified even on the
sparse profiles using the appropriate method for dimensionality
reduction. Importantly, InterSIMPA can also be applied for individual
cells, providing interpretable results within seconds of runtime.
On two datasets we demonstrated the potential of the concept of
involving bulk data especially for ChIP-seq imputation, a topic less
covered by other studies so far. The application of the proposed
methods is possible as we provide the source code with comprehensive
explanations in a github repository. However, our study might
additionally serve as a guideline to be considered for the development
of new imputation methods. As applied in the study from Zhu and
colleagues [[133]24], it is possible to profile several histone marks
and the transcriptome of single cells simultaneously and we expect that
such techniques will be further improved in the near future. The
resulting new datasets are expected to come up with a high level of
sparsity, and imputation methods will be needed; though, there will be
new requirements for imputation methods as single cells are then
described by a variety of profiles describing a variety of biological
functions. The complexity of information available for each single cell
should be included in a novel imputation method applicable for such
datasets. We therefore expect that future steps in single-cell
imputation will involve the development of methods able to incorporate
several protein-DNA interaction profiles and the transcriptome of
single cells to integrate all these data for more robust imputation
results. Based on the findings of this study we suggest to consider the
integration of corresponding bulk data as well. Moreover,
interpretability concepts for the imputation models should also be
included within the development of future methods as it reveals
detailed insights into the single-cell dataset under investigation.
Conclusion
The strategy of SIMPA leveraging bulk ChIP-seq datasets for single-cell
sequencing data imputation, is able to complete specifically sparse
scChIP-seq data of individual single cells. In comparison to the
non-imputed data and a reference-free imputation method, SIMPA was
better at recovering cell-type-specific pathways. Furthermore, the
interpretability of the machine learning models trained for the
imputation can be used to reveal biologically important information
from a sparse single-cell dataset. Conclusively, we developed an
ensemble of computational methods to extract more information from a
sparse dataset and impute missing data to better handle data sparsity
of scChIP-seq datasets.
Supporting information
S1 File. Contains supplementary notes 1 to 7 and supplementary figures
S1 to S7.
(DOCX)
[134]Click here for additional data file.^ (2.8MB, docx)
S1 Table
(XLSX)
[135]Click here for additional data file.^ (134.1KB, xlsx)
Acknowledgments