Abstract
Single-cell RNA sequencing is a powerful tool to investigate the
cellular makeup of tumor samples. However, due to the sparse data and
the complex tumor microenvironment, it can be challenging to identify
neoplastic cells that play important roles in tumor growth and disease
progression. This is especially relevant for blood cancers, where
neoplastic cells may be highly similar to normal cells. To address this
challenge, we have developed partCNV and partCNVH, two methods for
rapid and accurate detection of aneuploid cells with local copy number
deletion or amplification. PartCNV uses an expectation-maximization
(EM) algorithm with mixtures of Poisson distributions and incorporates
cytogenetic information to guide the classification. PartCNVH further
improves partCNV by integrating a hidden Markov model for feature
selection. We have thoroughly evaluated the performance of partCNV and
partCNVH through simulation studies and real data analysis using three
scRNA-seq datasets from blood cancer patients. Our results show that
partCNV and partCNVH have favorable accuracy and provide more
interpretable results compared to existing methods. In the real data
analysis, we have identified multiple biological processes involved in
the oncogenesis of myelodysplastic syndromes and acute myeloid
leukemia.
Subject terms: Cancer, Computational biology and bioinformatics
Introduction
Single-cell RNA sequencing (scRNA-seq) has greatly improved our ability
to understand the cellular composition of the tissues and organs of
interest, identify phenotype-associated cell groups, and elucidate the
mechanisms behind many biological processes^[34]1–[35]3. These
advantages make scRNA-seq a powerful tool for studying a wide range of
human diseases, including Alzheimer’s disease^[36]4, cardiovascular
disease^[37]5, and cancer^[38]6. In cancer research, a crucial step of
scRNA-seq data analysis is to delineate tumor cells or neoplastic cells
from other cell types^[39]7,[40]8. The tumors of each patient have
their unique tumoral and microenvironmental evolution, and thus the
scRNA-seq data from cancer patients tend to be more heterogeneous. Such
heterogeneity is an exciting opportunity for improving our
understanding of cancer with scRNA-seq, but it also imposes
computational challenges to dissect composing cell types^[41]9.
Neoplastic cells are abnormal cells that are undergoing excessive and
uncontrolled proliferation^[42]10. These cells, which may or may not be
malignant, can be extracted experimentally through cell sorting,
although this is not always possible due to a lack of suitable markers
or the high cost and labor requirements associated with these
experiments^[43]11. In fact, it may be more useful to study all cells
from a sequencing experiment simultaneously in order to understand the
characteristics of neoplastic cells within their surrounding
microenvironment. Neoplastic cells in certain types of cancer often
have distinct features compared to non-neoplastic cells, such as high
expression of certain cell type markers or genes belonging to some
oncogenic pathways^[44]12. However, identifying neoplastic cells based
on these markers or pathways can be difficult due to inter-individual
heterogeneity, technical artifacts, and noise from the tumor
microenvironment^[45]13.
Recently, computational methods have been developed to identify
large-scale copy number variations (CNVs) by comparing the smoothed
scRNA-seq data against an internal or external normal reference, such
as inferCNV, HoneyBADGER, and copyKAT^[46]3,[47]14,[48]15. For example,
inferCNV is a popular visualization method for identifying large-scale
CNVs. It uses smoothed averages over gene windows and compares the
expression magnitude to the average over a set of reference ‘normal’
cells. CopyKat is a recently developed tool serving a similar
purpose^[49]15. CopyKAT uses an integrative Bayesian segmentation
approach combining CNV inference and hierarchical clustering, which has
been shown to achieve high accuracy in distinguishing cancer cells from
normal cells in multiple cancer types. Both inferCNV and CopyKAT
generally work well with tumor cells that demonstrate extensive
chromosomal alterations, but they do not work well for cancer types
that have fewer and shorter CNVs. This is often the case in hematologic
cancers such as myelodysplastic syndromes and acute myeloid
leukemia^[50]16. Moreover, they couldn’t incorporate additional
clinical information for detecting specific CNVs. There are also
methods that sought to integrate scRNA-seq data with bulk DNA
sequencing (DNA-seq) or single-cell DNA-seq (scDNA-seq) data, such as
CONGAS, clonealign, and CCNMF^[51]17–[52]19. These methods serve a
different purpose: to cluster cells based on CNV or mutation
information.
In this paper, we propose to exploit cytogenetic information to improve
the sensitivity and specificity for CNV identification. Cytogenetic
data are routinely measured and recorded for patients with hematologic
cancers^[53]20 and they provide useful information to identify
CNVs^[54]21. In the case of myelodysplastic syndromes, cytogenetic
features, along with other factors such as morphology, immunophenotype,
and clinical features, are included in the World Health Organization
(WHO)-classification-based Prognostic Score System (WPSS) for
myelodysplastic syndromes^[55]22 and its revised version^[56]23.
Similar risk scoring systems also exist for other types of hematologic
malignancies^[57]24,[58]25. For example, Leukemia patients with certain
cytogenetic features, such as deletion of chromosome 7 or 7q, deletion
of 3q, or amplification of chromosome 8, have been shown to have a poor
prognosis^[59]26. Cytogenetic data provide location information of each
CNV and the proportion of cells with specific CNVs based on the
analysis of 20 metaphases. For example, if a patient has cytogenetic
data as “46,XY,del(20)(q11.1q13.1)[5]/46,XY[15],” this means that
approximately
[MATH: 25% :MATH]
percent of cells (5 out of 20) have a deletion of chromosome 20 in the
region q11.1 to q13.1, while the rest of the cells have normal
chromosomal features. Cytogenetic data are typically cheaper and more
readily available in clinical settings compared to DNA-seq or scDNA-seq
experiments. While the proportion in cytogenetic data is a crude
estimate of the aberrant cells, they can still be useful in classifying
cell status and identifying cells with chromosomal abnormalities, which
may be markers for neoplastic cells^[60]27. None of the existing
computational methods is able to incorporate such cytogenetic
information in the analysis of scRNA-seq data.
Here, we introduce two methods, partCNV and partCNVH, for identifying
cells with regional chromosomal abnormalities from scRNA-seq data by
integrating cytogenetic information. Both methods are based on a
statistical framework that models the count expression matrix of
scRNA-seq data using a mixture of Poisson distributions while
incorporating the cytogenetic information through prior specification.
PartCNVH is built on partCNV and it further includes a hidden Markov
model (HMM) to improve feature selection and clustering accuracy. It
should be noted that our proposal is complementary to the existing
methods such as copyKAT and inferCNV, as they focus on identifying
large-scale CNVs while we detect smaller variations with the
incorporation of external information. We implement our proposed
methods in a computationally efficient expectation-maximization (EM)
algorithm^[61]28 and evaluate their performance through extensive
simulation studies. We then apply them to three scRNA-seq data sets
from patients with hematologic malignancies and show that they can
identify cells with chromosomal deletions or amplifications in specific
regions suggested by the cytogenetic data. We also perform additional
analysis to understand the changes in the pathways and biologic
processes in the identified aneuploid cells. Compared to existing
methods, partCNV and partCNVH provide more interpretable results and
additional findings. With the widespread use of single-cell technology
in hematologic cancer research and clinical care of cancer patients,
our methods offer a useful solution for fully leveraging cytogenetic
data to identify cells with specific chromosomal abnormalities.
Method overview
PartCNV is a statistical framework that uses a hierarchical Poisson
mixture model to differentiate two mixture components corresponding to
normal and aberrant cells. PartCNVH is an extension of partCNV with the
addition of HMM when there is a sufficient number of genes that allow
feature selection. Figure [62]1 provides a schematic overview of the
proposed methods. Our methods start with the normalized expression
counts from the region with a known chromosomal deletion or
amplification and explicitly incorporate the prior knowledge from
cytogenetic data through imposing a Bernoulli prior on the cell status
(i.e., normal or aberrant). We develop an EM algorithm that treats cell
status as the missing variable and efficiently solves the mixture
model. The inferred cell status from this step is the output of
partCNV.
Fig. 1.
[63]Fig. 1
[64]Open in a new tab
Schematic of PartCNV and PartCNVH. With the input of normalized
expression counts from scRNA-seq experiments and the cytogenetic
information from the patient, we develop an EM algorithm with mixtures
of two Poisson distributions to infer the aneuploid/diploid status for
the regions of interest. We further include a hidden Markov model to
improve feature selection and the classification accuracy.
Taking the ouptut from partCNV, partCNVH further refines it by a HMM.
Specifically, a group average is taken for the inferred two groups of
cells and the rolling average of the ratios between the two groups is
used to infer the hidden status of the regions by a HMM. There are two
reasons that we adopt this combination of rolling average and HMM in
partCNVH. First, as shown in our later results, the group mean and
rolling average can effectively magnify the signal of the regional
deletion or amplification on the expression level. This is especially
important when the signal of copy number alternations is weak related
to noise in gene expression measurement. Second, it is possible that
only a subset of the regions of interest has copy number changes. HMM
can identify regions that are more likely to contain the chromosomal
changes, which in turn improves the performance of aneuploid/diploid
cell classification. After this HMM-based feature selection step,
partCNVH performs a second round of the EM algorithm using the Poisson
mixture model and reports the inferred cell status.
Simulations
We design comprehensive simulation settings to evaluate the performance
of partCNV and partCNVH. As comparisons, we also consider existing
methods using dimension reduction by principal component analysis (PCA)
followed by Louvain or Leiden clustering. Previous literature has
reported that the Leiden algorithm can generate better connected
communities through including an extra refinement step and run faster
than Louvain^[65]29. Additionally, we include two widely used machine
learning clustering algorithms, K-means clustering and hierarchical
clustering. All the previous mentioned methods can be applied to detect
locally aneuploid cells. Although our proposed method is not directly
comparable to existing methods that classify cells based on
whole-genome CNV inference, we still design a separate simulation study
to compare the proposed methods versus the two large-scale CNV
detection-based methods, inferCNV and copyKAT^[66]14,[67]15.
We consider two settings where the first one studies aneuploid cells
with deletions and the second one studies amplifications: redSimulation
data 1 and 2. The mean expression for these genes is generated by
taking a ratio of the normal expression. This ratio (or log fold
change) is randomly drawn from a Uniform distribution with different
base levels (0.5/0.6 for Setting 1 and 1.5/1.4 in Setting 2) and
different noise levels. A larger noise level makes the expression from
the aneuploid cells similar to the normal cells, and thus creates
harder scenarios for the methods. The evaluation criteria is the
accuracy of the classification results of the proposed method evaluated
by the true cell status (i.e., being aneuploid or normal). More details
of the simulation settings are provided in the Methods section.
First consider simulation data 1, where 500 out of 3000 cells have
deletions. Our proposed methods partCNV and partCNVH have the highest
accuracy among all the methods in all scenarios (Figure [68]2A-B).
Using the same normalized gene expression counts input as our methods,
K-means and hierarchical clustering have the lowest accuracy ranging
from 0.5 to 0.7. PCA plus Louvain and Leiden have higher accuracy than
K-means and hierarchical clustering. When the signal is strong (ratio =
0.5 in panel A) and noise is small, PCA plus Louvain/Leiden also have
similar high accuracy as the proposed methods. But with the increase of
the noise level, the accuracy of PCA plus Louvain or Leiden decreases.
The advantage of the proposed methods becomes more obvious when the
ratio is 0.6 and the noise level is high. For example, the mean
accuracy of PCA plus Leiden is around 0.75 while partCNVH can achieve a
high accuracy of 0.9. This is understandable since the proposed methods
specifically model the data through two components for normal and
aneuploid cells, and they allow mixtures of regions with and without
deletions. Second, partCNV and partCNVH have similarly good performance
with accuracy higher than 0.9, and partCNVH generally has higher
accuracy than partCNV. To better understand the role of the HMM step of
partCNVH and the result of feature selection, we use one simulation
data set as an example and visualize the mean gene expression across
the region (Figure [69]3A), ratios of the mean expressions of the two
groups inferred by partCNV (Figure [70]3B), the rolling average of the
mean expression ratios (Figure [71]3C), and the inferred status from
HMM (Figure [72]3D). It can be seen that the rolling average of mean
expression ratios between the two groups can effectively magnify the
signal, and a majority of the HMM selected genes are located in the
region with deletion. Figure [73]3E shows that our proposed method
partCNVH has greater accuracy in classifying cells than the other
methods.
Fig. 2.
[74]Fig. 2
[75]Open in a new tab
Results of simulation Setting 1 with deletions. The methods that are
compared include K-means clustering (Kmeans), hierarchical clustering
(HClust), dimension reduction using PCA plus Louvain clustering
(PCA+Louv), and dimension reduction using PCA and Leiden clustering
(PCA+Leid). Each simulation dataset contains a total of 3000 cells:
2500 normal cells and 500 with deletions. (A) The accuracy of these
methods when the ratio of gene expression in a deletion region versus
normal expression is 0.5 at different noise levels (0.1: low, 0.2:
medium, 0.3: high). (B) The results of the setting with ratio = 0.6 at
different noise levels. All results are summarized over 100 Monte Carlo
iterations.
Fig. 3.
[76]Fig. 3
[77]Open in a new tab
Illustrating the procedure of the feature selection step using HMM in a
simulation data set. (A) The log-transformed mean gene expression
levels for the genes located in regions without and with deletion. Each
dot is a gene. (B) The log-transformed ratio of the mean expression
levels for cells without versus with deletion inferred by partCNV. (C)
The expression mean ratio for the two groups of cells by partCNV after
applying the rolling average with a bandwidth of 50. (D) The latent
states inferred by HMM based on the rolling average from panel (C). (E)
The results of classification accuracy of partCNVH, PCA plus Leiden,
hierarchical clustering, and K-means clustering. The red dots are the
cells incorrectly classified and the blue dots are the correct ones.
Evaluation of PartCNV & PartCNVH for different prior information:
amplification regions
Next we evaluate different methods in simulation data 2 with
amplifications (Figure [78]4). Note that for cells with deletion, the
expression change odds is 2 (from 1 to 0.5) while in cells with
amplification the odds is 0.67 (from 1 to 1.5) or its inverse 1.5. Thus
the signals from the amplified regions can be harder to detect than the
first setting with deleted regions. In Figure [79]4(A), we observe the
performance of all methods decrease, especially for PCA plus Louvain
and Leiden. When the noise level is high (0.3), PCA plus Leiden only
reaches an accuracy of 0.74. In comparison, our proposed method still
has a high accuracy of around 0.95. With the ratio level set as 1.4 in
panel B, the signal level becomes lower and the existing methods have
even lower accuracy ranging from 0.5 to 0.6, while the proposed methods
still stay at a reasonable accuracy level around 0.9. These demonstrate
the robustness of the proposed methods and highlight the importance of
applying partCNV or partCNVH instead of existing methods when the
region of interest has amplifications.
Fig. 4.
[80]Fig. 4
[81]Open in a new tab
Results of simulation Setting 2 with amplifications. Each simulation
dataset contains a total of 3000 cells: 2500 normal cells and 500 with
amplifications. (A) The accuracy of the methods compared when the ratio
= 1.5 at different noise levels (0.1: low, 0.2: medium, 0.3: high). (B)
The results when the ratio = 1.4 at different noise levels. All results
are summarized over 100 Monte Carlo iterations.
Evaluation of PartCNV & PartCNVH for different prior information: cell
numbers
Our current simulation design considers a total of 3000 cells. When we
study a region of interest suggested by cytogenetic data, more cells
generally provide more information, and thus identifying signals from
fewer cells can be more challenging. To evaluate the proposed methods
under this scenario, we generate simulation data 3 by fixing the cell
number as 1300, where 1000 cells are normal and 300 are aneuploid
cells. Figure S1 shows the simulation results with 1300 total cells and
the region of interest (200 out of 600 genes) has a deletion in the
aneuploid cells. Compared with the results in Figure [82]2, all the
existing methods have worse performance for the same ratio and noise
combinations. For example, both PCA plus Louvain and PCA plus Leiden
have a high accuracy of around 0.90 when the ratio is 0.6 with a medium
noise of 0.2 using 3000 total cells, while the accuracy decreases to
around 0.8 using 1300 cells. The variation of the classification
results also increases. Although our proposed methods have slightly
decreased performance when the ratio is 0.6 and the noise is 0.3, their
overall performance remains similar in other scenarios (ratio = 0.5 at
all noise levels, ratio = 0.6 with low and medium noise levels).
Similar patterns can be observed in amplification settings by comparing
the results in Figure S2 versus the results in Figure [83]4.
Surprisingly, PCA plus Louvain or Leiden has even worse median accuracy
than the hierarchical clustering, even though they all have quite low
classification accuracy. These results suggest that our proposed
methods tend to have more robust performance even with fewer cell
numbers in the analyzed dataset, while the existing methods have
decreased accuracy and more varied results, especially when the noise
level is high.
Evaluation of PartCNV & PartCNVH for different prior information: proportions
of aneuploid cells
As a methodology advantage, partCNV and partCNVH are able to
incorporate the prior knowledge of an estimated proportion of aneuploid
cells. If the prior is misspecified, we seek to understand the impact
on the results. We generate simulation data 4 with the same data
generation procedure but the prior information is specified as correct
(0.17 for total cell number 3000 and 0.23 for total cell number 1300)
and incorrect (0.5), and we examine the results of our proposed
methods.
Figures [84]5 and S3 illustrate the accuracy for the total cell numbers
3000 and 1300. First, it is clear that the correct prior knowledge
improves the classification accuracy than a non-informative prior of
0.5. This improvement is small when the ratio is 0.5 or 1.5, but it can
be substantial when the signal is harder to detect (ratio = 0.6 or 1.4)
and the noise level is high. For example, when the ratio is 1.4 and the
noise is 0.3, the improvement of the accuracy for both partCNV and
partCNVH using a correct prior can be about
[MATH: 10% :MATH]
compared to using the incorrect prior. Second, both Figures 5 and S3
demonstrate the robustness of the proposed methods against incorrect
prior information, especially in panels A and C where the ratios are
0.5 and 1.5, respectively. In these experiments, we choose 0.5 as the
incorrect prior knowledge, which is far from the true proportion 0.17
and illustrate a worst scenario that the prior is completely
non-informative. In reality, when a closer prior such as 0.20 or 0.15
is used, the impact would be much smaller. Even in the worst scenario,
with an amplification ratio 1.4 and a high noise level 0.3, our method
with an incorrect prior still reaches a median accuracy above 0.75,
which is better than the existing methods under the same scenario.
These results highlight the advantage of the proposed methods in
accurately identifying aneuploid cells.
Fig. 5.
[85]Fig. 5
[86]Open in a new tab
Simulation results of evaluating the impact of different prior
information on the classification accuracy of the proposed methods. The
dark and light blue bars correspond to the results with and without
correct specification of the prior information (true prior: 0.17).
Panel (A) and (B) are the simulation results based on the first
simulation setting with ratios = 0.5, 0.6, respectively. Panel (C) and
(D) are based on the second simulation setting with ratios = 1.5, 1.4,
respectively. All results are summarized over 100 Monte Carlo
iterations.
Comparison with genome-wide CNV detection methods
Lastly, we compare the proposed methods with two widely used
genome-wide CNV detection methods, inferCNV, copyKAT as well as the
regional versions of inferCNV and copyKAT. For the regional versions,
we applied hierarchical clustering on the normalized expression matrix
by inferCNV/copyKAT for the regions of interest only. One example of
simulation data 4 is visualized in Figure [87]6A. As inferCNV requires
the input of normal cells, we use an additional 100 normal cells as the
reference for inferCNV. Both inferCNV and copyKAT generally are much
more computationally intensive. We summarize the accuracy of 20 Monte
Carlo simulations. It can be observed that neither inferCNV/copyKat nor
the regional versions of these methods is able to accurately infer
aneuploid/diploid cell status (Figure [88]6B-D), since the regional
aneuploid signal is too weak compared to genome-wide copy number
alterations that inferCNV and copyKAT are designed to detect. It is
also interesting that when we re-cluster the cells based on the
normalized expression from the sub-regions of chromosome 20, the
performance of copyKAT increased but not inferCNV. These findings
highlight the need for region-specific detection tools with the
considerations of cell type mixtures to distinguishing between
aneuploid and diploid cells.
Fig. 6.
[89]Fig. 6
[90]Open in a new tab
Simulation results to compare the proposed method versus existing
whole-genome based methods, InferCNV and CopyKAT. (A) Heatmap of the
simulated expression values for genes in region Chr20(q11.1-q13.1),
where the rows are the cells and columns are the genes. Logarithmic of
the expression values plus one are used for visualization. Rows are
labeled by the true aneuploid/diploid status and the inferred status by
partCNV and partCNVH. (B) CopyKAT output of the aneuploid/diploid
prediction versus the true cell status. The heatmap includes all the
chromosomes. (C) Copy number results using InferCNV. (D) Boxplot of the
aneuploid/diploid inference of the methods averaged over 20 Monte Carlo
simulations. Regional InferCNV and Regional CopyKat are the
hierarchical clustering results based on the normalized expression of
the region of interest from InferCNV and CopyKAT.
Real data application
We demonstrate the usage of the proposed methods on three real data
applications. For each application, the scRNA-seq data from one patient
were collected by the 10X genomics platform and the cytogenetic data
were collected from patients’ medical records. All three patients have
a subset of the cells with regional copy number variations, and the
rest are normal cells. The three applications have different complexity
levels. The first subject (patient 1) is the most straightforward one,
as a very long region (the whole chromosome Y) was reported as lost in
a subset of the cells according to the patient’s cytogenetic data. The
second subject (patient 2) has a subset of cells with partial
chromosome 20 lost. The third subject (patient 3) has a complicated
situation as this patient has cells with partial deletions, as well as
cells with partial amplifications.
Patient 1: MDS with loss of chromosome Y
We obtain the scRNA-seq data of a bone marrow sample from an MDS
patient treated at MD Anderson Cancer Center. The data, after alignment
and quality control, contains a total of 33,538 genes and 655 cells.
For this patient, the bone marrow sample has been specifically sorted
for CD34+ cells to enrich hematopoietic stem and progenitor cells
(HSPCs). As a result, the cell number is smaller than regular scRNA-seq
experiments. Based on the clinically obtained cytogenetic data, this
patient has around
[MATH: 35% :MATH]
of cells with the loss of chromosome Y and our goal is to identify
these aneuploid cells. We first apply copyKAT to the whole
transcriptome data from this sample to infer copy number variations. As
shown in Figure S4, copyKAT clusters the cells based on the inferred
CNV statuses across the whole genome, but it could not take regional
data or the cytogenetic information into consideration when identifying
the neoplastic cells.
To specifically identify cells with the chromosome Y loss, we apply the
proposed methods and the five existing methods (PCA plus Louvain, PCA
plus Leiden, K-means clustering, hierarchical clustering, and CopyKAT)
on this dataset. The input data are the normalized counts from the
genes located on chromosome Y. In this application, the HMM step from
partCNVH selects the whole set of genes so we only present the results
from partCNV. Since the CNV in this dataset encompasses the entire
chromosome Y, we expect some of the existing methods also work well for
this analysis. We find that partCNV, PCA plus Louvain, PCA plus Leiden,
and K-means clustering all have proportions of aneuploid cells close to
35% (40.9%, 39.5%, and 39.5%, respectively) (Figure [91]7 A-E).
Hierarchical clustering identifies a much smaller number of aneuploid
cells (9.31%). From visualizing the pairwise ARI values of these
results, we find that partCNV, PCA plus Louvain/Leiden have very
similar results, while K-means clustering and hierarchical clustering
have very different results (Figure [92]7 F). As the UMAP coordinates
in Figure [93]7(A-E) are obtained using the whole transcriptome data,
the fact that copyKAT-identified aneuploid cells cluster together
suggests that copyKAT captures the whole transcriptome pattern instead
of chromosome Y specific changes.
Fig. 7.
[94]Fig. 7
[95]Open in a new tab
Results of applying different methods to the scRNA-seq data using the
bone marrow CD34 positive cells from patient 1. Cytogenetic information
shows that
[MATH: ∼35% :MATH]
of the cells from the sample have chromosome Y loss. (A-E) The cell
classification based on the inference of chromosome Y loss using
different methods. Panel F shows the heatmap of the ARI values from
comparing the classification results using different methods. (G) The
expression of chromosome Y genes for cells that are labeled as normal
in partCNV (partCNV:0) and aneuploid in CopyKAT (CopyKat:1), and three
similar groups. (H,I) The expression of chromosome Y genes comparing
partCNV versus k-means clustering and comparing partCNV and
Hierarchical clustering.
We further examine the average gene expression of chromosome Y genes
among the aneuploid/normal cells identified by partCNV, copyKAT,
K-means, and hierarchical clustering (Figure [96]7 G-I). It is apparent
that partCNV-labeled aneuploid cells have much lower expression than
the cells labeled as aneuploid by other methods but normal per partCNV,
confirming that partCNV correctly identifies the cells with deletion on
chromosome Y.
In summary, these results suggest that partCNV and the PCA plus Louvain
or Leiden clustering have identified the cells with the chromosome Y
loss. In contrast, K-means clustering, hierarchical clustering, and
copyKAT failed to do so.
Patient 2: MDS with partial deletion of chromosome 20
We obtain the scRNA-seq data from the bone marrow sample for a
different MDS patient. The data were also generated by the 10X genomics
scRNA-seq technology. This sample was sequenced directly without the
cell sorting step, and thus both HSPC and immune cells can be
potentially identified. After alignment, preprocessing, and quality
control, a total of 24,519 genes and 3,643 cells are kept for the
analysis. Based on the cytogenetic data, about 20% of cells in the
sample have deletions in chromosome 20 at regions q11.1 to q13.1, which
is about 24.2 Mb long. We also apply CopyKAT to this data and present
the heatmap result in Figure S5. Although cytogenetics reported
deletions in chromosome 20, the log copy number ratio heatmap does not
have an obvious deletion pattern in the suggested regions. Based on the
whole genome copy number inference, copyKAT only reports about 500
aneuploid cells (
[MATH: ∼ :MATH]
5.5%).
We analyze the whole-transcriptome data using Seurat^[97]2 through
identifying highly variable genes, extracting top principal components
(PCs) based on these genes, and we perform UMAP and clustering analysis
(Figure [98]8). UMAP is used for dimension reduction and unsupervised
clustering is performed with the default Louvain clustering using the
top PCs. A total of 10 clusters are identified, and the cluster
specific markers are used for annotating the cell type labels based on
biological knowledge by our MDS biologist. We also apply the proposed
methods and existing methods targeting the region on chromosome 20 with
known chromosomal deletions. Figure [99]8 A-E shows cell type labels
and the aneuploid/diploid inference result using partCNVH, PCA plus
Louvain, K-means clustering, and hierarchical clustering. The results
for partCNV and PCA plus Leiden are presented in Figure S6. We find
that the proposed methods have the closest proportion of aneuploid
cells to the cytogenetics reported proportion; the other methods all
have much lower or higher proportions. The major difference between our
proposed method and PCA plus Louvain is in the cycling RBC progenitors
(Figure [100]8 B, C, and F). Our method reports high proportions in all
three MDS-related cell groups (i.e., HSPCs, RBC progenitors, and
cycling RBC progenitors), while PCA plus Louvain only reports aneuploid
cells in the former two clusters. Previous literature found impaired
erythroid-proliferating capacities to be a prominent characteristic in
patients with MDS^[101]30,[102]31. Both RBC progenitors and cycling RBC
progenitors are major cell types involved in the
erythroid-proliferating function, and thus it makes sense to identify
neoplastic cells in both cell types.
Fig. 8.
[103]Fig. 8
[104]Open in a new tab
Application results of the proposed method and existing methods using
the scRNA-seq data from patient 2. (A) The cell type annotation based
on marker genes’ expression and biological knowledge. RBC progenitor:
red blood cell progenitors. HSPC: hematopoietic stem and progenitor
cells. NK: natural killer. (B-E) The inferred aneuploid cells with
deletion chr20(q11.1q13.1) using different methods (B, partCNV; C,
PCA+Louvian; D, K Means clustering; E, Hierarchical clustering). The
proportions in the title bracket are the proportions of cells that are
inferred as cells with this specific deletion. (F) The comparisons of
results by partCNVH and PCA plus Louvain versus cell type labels. (G)
Gene set enrichment analysis results for the genes that are
differential expressed between the cells with and without deletion
chr(20)(q11.1q13.1). Gene sets are defined using the Reactome Pathway
Database.
To understand the differences between the identified locally aneuploid
cells and the normal cells, we conduct differential expression analysis
for each cluster to compare the aneuploid versus diploid cells
identified by our method^[105]32. A total of 177 cluster-specific
differentially expressed genes were identified using a cutoff of 0.05
for adjusted p values. We also perform over-representation analysis
using GO Biological Process database^[106]33,[107]34 and identify
several functional categories over-represented by differential
expression signals, including neutrophil degranulation, a few immune
system related terms, and interleukin 4 and interleukin 13 signals
(Figure [108]8 G). Many of these terms have been reported in previous
literature to be related with MDS. For example, neutrophil
degranulation and migration has been reported to be associated with MDS
compared with normal controls^[109]35. The important role of the immune
system and innate immune signaling in MDS has also been reported in
multiple publications^[110]36–[111]38. The over-representation results
using the Reactome pathway and Hallmark database are presented in
Figure S7^[112]39,[113]40. In the Reactome pathway enrichment results,
we also find several immune response-related and lymphocyte activation
related terms. In Hallmark analysis, allograft rejection, complement,
interferon gamma response, and TNFA signaling via NFKB are the top
findings, which also have been associated with MDS
pathogenesis^[114]41–[115]43. There are also some terms that have not
been reported in previous MDS studies, such as generation of second
messenger molecules, hemostasis, defense response, and cell activation,
which could be promising targets for future research.
Patient 3: AML with partial gain and partial deletion of chromosome 8
Lastly, we study the scRNA-seq data from an AML patient with
complicated chromosomal variations^[116]44. Specifically, this patient
has amplifications of the whole chromosome 8 in 25% of the cells,
deletion of chromosome 8 at region q21.2 to q24.3 in 40% of the cells,
and normal karyotype in the rest of the cells. We are interested in
identifying which cells contain the chromosome 8 gain and which have
del(8)(q21.2q24.3). The scRNA-seq data were generated using the 10X
genomics technology platform. After preprocessing, the data contain a
total of 20,521 genes from 4294 cells. Since there are overlaps between
the deletion and amplification regions, we apply partCNVH through a
two-step approach. We first focus on the region of chromosome 8 before
q21.2 where about 25% of the cells have amplification; the rest of the
cells are normal in the area. After the cells with chromosome 8 gain
are detected, we apply partCNVH again to the rest of the cells, which
contain del(8)(q21.2q24.3) in around 53.3% (
[MATH: =40%/(1-25%) :MATH]
) of the cells. In the two steps,
[MATH: 25% :MATH]
and
[MATH: 53.3% :MATH]
are used as the prior knowledge of the aneuploid cell proportions in
partCNVH.
We find that the majority of the inferred cells with chromosome 8
amplification are from the erythroid cells, while the cells with
del(8)(q21.1q24.3) are mostly AML cells (Figure [117]9 A-B). The
proportions of cells with chromosome 8 amplification and
del(8)(q21.1.q24.3) are about 13% and 60%, which are close to the prior
knowledge from cytogenetics. We visualize the normalized expression for
the region of interest on chromosome 8 in Figure S8A. We also present
the results by inferCNV and copyKAT in Figure S8B and Figure S9,
respectively. We observe that the patterns for chromosome 8 varies
between the three cell groups in Figure S8A. However, whether the
pattern shows gain or loss is not easy to identify due to data
sparsity. Neither CopyKAT nor InferCNV can be applied to regional data,
and thus they couldn’t reproduce the patterns we observed in Figure
S8A.
Fig. 9.
[118]Fig. 9
[119]Open in a new tab
Results of applying the proposed method partCNVH to the scRNA-seq data
from an AML patient with
[MATH: ∼25% :MATH]
cells having amplification of chromosome 8 and
[MATH: ∼40% :MATH]
cells having chromosome 8 deletion at region q21.2-q24.3. (A) The true
cell type labels annotated by a clinician-scientist. (B) The inferred
cells with amplification of 8 (“+8”) and the deletion of chromosome 8
at q21.2-q24.3 (“del(8)(q21.2q24.3)”). (C, D) The Reactome pathway
enrichment analysis results using the DEGs by comparing the AML cells
with del(8)(q21.2q24.3) versus diploid cells, and by comparing the
erythroid cells with +8 versus diploid cells, respectively.
To understand the different molecular mechanisms related to chromosome
changes, we perform differential analysis for comparing the AML cells
with del(8)(q21.2q24.3) versus diploid AML cells and obtain 266
differentially expressed genes (DEGs). Similarly, we compare the
erythroid cells with a gain of chromosome 8 versus diploid erythroid
cells and obtain 426 DEGs. Gene set enrichment analysis identify a few
shared significant terms between AML and erythroid cells (Figure
[120]9C-D), including multiple immune system related terms, neutrophil
degranulation, and cellular responses to stimuli. Some unique terms in
AML are hemostasis, platelet activation signaling and aggregation, and
arachidonic acid metabolism. Hemostatic and thrombotic complications
are prevalent symptoms in AML patients and hemostasis has been studied
before for AML pathogenesis-related mechanisms^[121]45. The term
platelet activation signaling and aggregation is also consistent with
the previous literature that the platelet defects and other hemorrhagic
symptoms are widely observed in AML patients^[122]46. The arachidonic
acid metabolism is a process highlighted in a few cancer research
publications, but the evidence for their involvement in AML is still
accumulating^[123]47,[124]48. Overall, our results are consistent with
literature and provide some novel disease-related biological processes
for future research.
Discussion
We introduce partCNV/partCNVH, a statistical framework that
distinguishes neoplastic cells with copy number alterations from normal
diploid cells based on regional chromosomal deletions or
amplifications. Unlike existing methods, our statistical framework can
incorporate prior knowledge from cytogenetic data that includes both
chromosomal locations of aberrations and the observed proportion. As
demonstrated in our simulation study, this prior information can
effectively improve the classification accuracy when the signal is
weak. Our framework also includes a feature selection step using the
hidden Markov model, which is able to filter the genes when part(s) of
the region are diploid. This step further improves the signals of
chromosomal changes and results in higher accuracy to detect neoplastic
cells. We have illustrated the benefits of partCNV and partCNVH through
extensive simulation studies and in-depth analysis of three scRNA-seq
datasets from MDS or AML patients.
Cytogenetic information is a key component of the scoring system of
risk assessment, treatment selection, and outcome prediction for
patients with hematologic malignancies^[125]49,[126]50. In this work,
we exemplify the use of the proposed methods in patients with MDS and
AML. These methods can also be applied to other hematologic
malignancies or even other cancer types if similar problems are
encountered. As cytogenetic analysis is a mature technology and is
widely used in clinical settings, the cytogenetic data should be fairly
accessible from clinical collaborators who provide patient samples.
Such cytogenetic data-guided analysis can be a useful tool for
identifying subgroups of cells with the chromosomal changes of
interest.
Implemented in an EM algorithm, our proposed methods have favorable
computational performance. For a simulation dataset with 3000 cells and
600 genes, it takes 12 seconds and 21.5 seconds for partCNV and
partCNVH to complete the analysis, respectively. This computation cost
is similar to existing methods PCA plus Louvain (
[MATH: ∼ :MATH]
6.8 seconds) and PCA plus Leiden (
[MATH: ∼ :MATH]
30.7 seconds). In comparison, both inferCNV and CopyKAT take more than
10 minutes, sometimes more than an hour, to process a scRNA-seq dataset
with a few thousand cells. Additionally, our proposed methods scale
almost linearly to large datasets. If we increase the cell number from
3000 to 6000, partCNV takes about 26.5 seconds and partCNVH 41.5
seconds to complete the computation. Since cytogenetic data are
generally unique for each subject, we expect that our methods are
applied person by person in real applications. Thus, it is reasonable
to assume a few thousand cells in the dataset, for which our methods
can complete the computation within one minute.
For future work, the methods can be further extended to incorporate
additional biological knowledge, such as the marker or mutation
information mentioned in Fan et al^[127]9. Our method has the potential
to be applied on multiple samples in parallel or even to borrow
information across different samples from the same subject, such
extension is not trivial and needs further evaluations. Moreover, due
to the complexity of sequencing depth, gene expression variations,
number of genes impacted by the aneuploid event, we have not evaluated
the power of the proposed methods or the minimum required size for CNV
events. These should be carefully evaluated in future works.
Previous work also developed machine learning models to predict the
neoplastic/non-neoplastic status of the cells by splitting the
annotated data and training a random forest model^[128]16. Such models
usually rely on the training dataset and may not generalize well to
other studies. With the accumulation of annotated single cell data, it
is also possible to harness the power of deep learning algorithms to
further improve existing models and achieve more accurate predictions.
Our current methods assume a fixed prior based on the cytogenetic data.
It is possible that different proportions of aneuploid cells have
different confidence levels. The current method can be further extended
to incorporate such confidence into the model to improve accuracy. When
scDNA-seq data is available, it may be another prior knowledge to
replace the cytogenetic information of clinical data as the input to
PartCNV and PartCNHV. The method can be further extended to incorporate
regions where the scDNA-seq data shows enriched hierarchical aneuploid
CNVs with high resolution.
Methods
Details of partCNV
We aim to identify the cells with the known chromosomal deletion or
amplification from the scRNA-seq data with incorporation of the prior
cytogenetic knowledge. Assume a total of N cells were sequenced by
scRNA-seq and G genes fall in the region with deletion or
amplification. The count matrix is denoted by
[MATH: Y=(ygi) :MATH]
, which is a
[MATH: G×N :MATH]
matrix with rows being the genes and columns being the samples. Without
loss of generality, we assume the genes in
[MATH: Y :MATH]
are ordered by their locations on the chromosome. As the observed data
contain the mixture of cells with and without chromosomal changes,
denote the underlying status of the cell i by
[MATH: ci :MATH]
. Assume the prior proportion of the aneuploid cells is
[MATH: q0 :MATH]
for the region of interest. In our motivating problem,
[MATH: q0 :MATH]
is calculated based on the number of metaphases observing the
chromosomal changes divided by the total number of metaphases from an
cytogenetics test. As shown in some recent literature regarding the
distribution of the scRNA-seq count^[129]51,[130]52, the scRNA-seq data
may not be zero-inflated and the excessive zeros are due to low
expression level of each single cell. Additionally, since the region of
interest generally contains a limited number of genes, the estimated
dispersion parameters are not accurate enough if we use a negative
binomial distribution. Thus, we assume the expression count of gene i
follows a Poisson distribution with mean
[MATH: θg1 :MATH]
if the cell is aneuploid or mean
[MATH: θg0 :MATH]
if diploid, i.e.,
[MATH: Pr(ygi|θg1,ci=1
)=θg1giexp(-θg1<
/mn>)ygi!andPr(ygi|θg0,ci=0
)=θg0giexp(-θg0<
/mn>)ygi!
. :MATH]
We assume the cell status variable
[MATH: ci :MATH]
follows a Bernoulli distribution
[MATH: Pr(ci|qi)=qi(1-qi
msub>) :MATH]
where
[MATH: qi :MATH]
denotes the probability of cell i having the chromosomal changes at the
region of interest, i.e.,
[MATH:
qi=Pr(ci=1
) :MATH]
. The prior knowledge of the aneuploid cell proportion is best
described through a beta distribution, which we approximate through a
normal distribution. Though cytogenetic information is obtained based
on 20 metaphases, the involved cells can number in the hundreds of
thousands, and thus a normal distribution can adequately approximate
the underlying beta distribution. We assume
[MATH: qi :MATH]
follows a prior Normal distribution with mean
[MATH: q0 :MATH]
and variance
[MATH: λ2 :MATH]
:
[MATH: Pr(qi|q0,λ
)=1λ2πex
p-(qi-q0)22λ2.
:MATH]
The variance
[MATH: λ2 :MATH]
represents the confidence about the prior information. Smaller variance
indicates stronger confidence in the prior knowledge. However, in our
experiments, we found that the actual value of
[MATH: λ :MATH]
has minimal impact on the classification results as long as the value
is within a decent range (e.g.,
[MATH: λ :MATH]
between 0.01 and 1). Throughout our experiments, we use
[MATH: λ=0.1 :MATH]
. Together, the full likelihood of the problem can be written as
[MATH: L(θ1,θ0,c,q|Y,q0<
/mn>,λ)=∏i[Pr(ygi|θg1)]1(ci=1
)[Pr(ygi|θg0)]1(ci=0
)Pr(ci|qi)Pr(qi|q0,λ
)
:MATH]
1
and the detailed log-likelihood is
[MATH: l(θ1,θ0,c,q|Y,q0<
/mn>,λ)=∑i1(ci=1
){ygilog(θg1)-θg1}+∑i1(ci=0
){ygilog(θg0)-θg0}+∑icilog(qi)+∑i(1-ci
msub>)log(1-qi
msub>)-∑i(qi-q0)22λ2.
mtable> :MATH]
Directly solving the likelihood ([131]1) may not be feasible, and thus
we use the EM algorithm by treating cell status
[MATH: {ci} :MATH]
as the missing variables. The objective function of the EM algorithm is
[MATH: Q(θ1,θ0,c,q)=∑ipi{ygilog(θg1)-θg1}+∑i(1-pi
msub>){ygilog(θg0)-θg0}+∑icilog(qi)+∑i(1-ci
msub>)log(1-qi
msub>)-∑i(qi-q0)22λ2.
mtable> :MATH]
2
In the M step of the t-th iteration, we solve for
[MATH: θg0(t) :MATH]
,
[MATH: θg1(t) :MATH]
, and
[MATH: qi(t) :MATH]
as follows:
[MATH: θg1(t)=∑i
mi>qi(t-1)ygi∑
iqi(t-1),θg
mi>0(t)=∑i
mi>1-qi(t-1)ygiN-<
/mo>∑iqi(t-1),qi(t)=argmaxqi[qi(t-1)∑ici
1-qi(t-1)N-
∑ici
exp-n(qi(t-1)-q0
)22<
msup>λ2]. :MATH]
In the E step, we estimate the conditional expectation of the cell
status by
[MATH: pi(t)=Pr(ci=1
|ygi)=Pr(ygi|ci=1
)Pr(ci=1
)Pr(ygi,ci)=∏g
mi>Pr(ygi|θg1)·qi(t)∏gPr(ygi|θg0)·qi(t)+∏g
Pr(ygi|θg0)·(1-qi(t)). :MATH]
The algorithm repeats the M step and E step until convergence. The
convergence criteria is defined as the absolute difference between
[MATH: p(t)=(p1(t),⋯,<
msubsup>pN(t)) :MATH]
and
[MATH: p(t-1) :MATH]
smaller than
[MATH: 10-5 :MATH]
. The core of partCNV is the described EM algorithm and the output is
the inferred cell status. With the estimated probability
[MATH: p^ :MATH]
, the cells with
[MATH: p^i≥0.5
mrow> :MATH]
are assigned as aneuploid cells and the rest as diploid cells.
Details of partCNVH
Limited by the technology, cytogenetic data only provide crude
information about the chromosomal deletion or amplification. When we
include all the genes from the regions of interest, it is likely that
not all of the genes have chromosomal changes in the aneuploid cells.
It is helpful to select the genes that demonstrate chromosomal changing
patterns. Motivated by this idea, we design partCNVH, the combination
of partCNV and HMM for improving classification performance. Denote the
underlying status for the observed G genes by
[MATH: Z=(z1,⋯
,zG) :MATH]
. The first step of partCNVH is to apply partCNV on the scRNA-seq data
from the region of interest. Denote the obtained cell status from
partCNV by
[MATH: {c^i} :MATH]
. Then for each gene g, we compute the mean expression of this gene for
the two groups, i.e.,
[MATH: y¯g(1)=∑j∈{i:ci
msub>=1}ygiandy¯g(0)=∑j∈{i:ci
msub>=0}ygi. :MATH]
Based on
[MATH: y¯g(1) :MATH]
and
[MATH: y¯g(0) :MATH]
, we obtain the mean expression ratio for all the genes by
[MATH: rg=y¯g(0)/y¯g(1) :MATH]
if the region has deletion and
[MATH: rg=y¯g(1)/y¯g(0) :MATH]
if the region has amplification. As shown in Figure [132]3, the signals
from the mean expression ratio are weak, and thus we apply a rolling
average on the mean expression ratios to strengthen the signals. Denote
the window size for the rolling average by K, and the rolling average
at gene g becomes
[MATH: r¯grolling=
1K∑i=g-[K/2]g+[K/2]ri
msub>. :MATH]
In R, we implement the rolling average computation by function
frollmean from the data.table package. We then build HMM using the
rolling average of the genes. Denote the underlying state of gene g by
[MATH: zg :MATH]
, where
[MATH:
zg=1
:MATH]
is gene g with deletion or amplification in the aneuploid cells, and
[MATH:
zg=0
:MATH]
otherwise. HMM aims to infer the hidden status
[MATH: Z={zg} :MATH]
through the observed sequence
[MATH: R¯rolling=(r¯grolling) :MATH]
by solving the likelihood
[MATH: Pr(R¯rolling,Z)=Pr(R¯rolling|Z)×Pr(Z)=∏g=1GPr(r¯grolling|zg)×∏g=1GPr(zg|zg-1<
/mn>).
:MATH]
We solve HMM using the depmix function from the depmixS4 package in R
by specifying the initial transition matrix between the two states
(“genes in aneuploid region” or “A”, “genes in diploid region” or “D”)
as
[MATH: A→AA→DD→AD→D=0.90.10.1
0.9
mrow>. :MATH]
The initial states of the genes are decided as
[MATH:
zg=1
:MATH]
if
[MATH: r¯grolling≥m<
mi>edian{r¯grolling,g<
mo>=1,⋯,G} :MATH]
, and
[MATH:
zg=0
:MATH]
otherwise. After the hidden states are inferred by HMM, we identify the
states corresponding to the “genes in diploid region” by comparing the
mean expression ratio of the two states, and the state with larger mean
expression ratio is labeled as state “D”. Denote the expression count
matrix for the selected genes by
[MATH: Y~ :MATH]
. We apply the EM algorithm described in partCNV to
[MATH: Y~ :MATH]
and infer the final cell states.
Simulation designs
To mimic the real data analysis, we generate the simulation datasets
based on existing scRNA-seq data from patients with triple negative
breast cancer (TNBC). The data were downloaded from the Gene Expression
Omnibus (GEO) with accession number [133]GSE148673. We compared the
data characteristics between the TNBC dataset and the scRNA-seq data
from our MDS patients. In Supplementary Figure S10, we presented the
mean expression levels of genes and cells, as well as the dropout rate
for the two datasets. We found that the data characteristics are
similar between the two data sources. As a data processing step, genes
with zero expression in all cells in the TNBC data are removed. We keep
all of the normal cells based on the cell type annotation from the
original study^[134]15. As our method focuses on the region with known
deletion or amplification from the cytogenetic data and the region
often covers tens or hundreds of genes, we generate the expression
count for
[MATH: n0 :MATH]
normal cells and
[MATH: n1 :MATH]
aneuploid cells with a total of G genes. For each iteration, we
randomly draw the expression of G genes from the normal cells of the
TNBC dataset. We compute the mean expression of these G genes and
denote it by
[MATH: ξ^=(ξ1,⋯
,ξG) :MATH]
. For normal cells, the expression is generated from Poisson
[MATH: (ξg^) :MATH]
for
[MATH:
g=1,⋯,G
:MATH]
. For aneuploid cells, assume
[MATH: G1 :MATH]
genes are located at the deleted or amplified regions and
[MATH:
G0=G-G1 :MATH]
are from the normal regions.
We assume half of the
[MATH: G0 :MATH]
genes are in the regions that are left-adjacent to a copy-number
alteration and the other half are in regions right-adjacent that have
normal expression in all cells. This partial chromosomal variation
often happens in practice as cytogenetic data only provide a crude
observation of the changed regions. Without loss of generality, we
assume the beginning
[MATH: G1 :MATH]
genes are from the aneuploid region. For a gene g from this region for
an aneuploid cell i, we generate the expression from
[MATH: ygi∼Poisson{λg·(r+η)}and the noiseη∼Uniform(0,τ), :MATH]
where r is the ratio controlling the impact level of chromosomal
deletion or amplification on the expression and
[MATH: η :MATH]
is the noise, in the first setting with deletion in aneuploid cells. r
takes value 0.5 or 0.6 in this setting (Simulation data 1). In the
second setting, we consider amplification in the aneuploid cells. The
expression is generated from
[MATH: ygi∼Poisson{λg·(r-η)}and the noiseη∼Uniform(0,τ), :MATH]
and r takes value 1.5 or 1.4. (Simulation data 2) The noise parameter
[MATH: η :MATH]
controls the heterogeneity of the impacts among all the interested
genes, mimicking the fact that gene expressions are heterogeneous when
the genes are located in deleted or amplified regions. Through our
experiments, we specify
[MATH:
η=0.1,0.2
:MATH]
, and 0.3 for low, medium, and high noise levels, respectively.
To understand the impact of different sample sizes, we consider the
combination of
[MATH:
n1=500
:MATH]
and
[MATH:
n0=2500
:MATH]
in one scenario and
[MATH:
n1=300
:MATH]
and
[MATH:
n0=1000
:MATH]
in the other (Simulation data 3). We also evaluate the impact of prior
specification in all the scenarios. In the first scenario, the true
prior proportion of the aneuploid cell is
[MATH: 17% :MATH]
and we evaluate the method with both
[MATH: 17% :MATH]
and
[MATH: 50% :MATH]
. In the second scenario, the true proportion is
[MATH: 23% :MATH]
and we also compare the results with both
[MATH: 23% :MATH]
and
[MATH: 50% :MATH]
(Simulation data 4). To compare the proposed methods versus existing
ones that only perform on whole genome data, we randomly sample 2000
normal cells from an existing scRNA-seq dataset (Simulation data 5). We
then replace the expression of the genes located at the region
chromosome 20 q11.1 to q13.1 using the simulated data as described
above to mimic situations that 400 out of 2000 (20%) cells are locally
aneuploid. The gene expressions in the region of interest are generated
in the similar way as Simulation data 1. We then apply inferCNV and
copyKAT using the suggested arguments as suggested by the original
methods. All the simulation results are summarized over 100 Monte Carlo
datasets.
Real data analysis
The data for the AML patient were downloaded from European
Genome-phenome Archive (EGA) with accession EGAD00001007672^[135]44.
The data from sample 7A were used for the analysis due to the available
cytogenetic information. The raw data for the two MDS patients are
currently not publicly available due to confidentiality regulations.
Data preprocessing
The raw sequencing data were preprocessed (demultiplexed cellular
barcodes, read alignment, and generation of gene count matrix) using
Cell Ranger Single Cell Software Suite (version 3.0,
https://support.10xgenomics.com/single-cell-gene-expression/software/pi
pelines/latest/what-is-cell-ranger) provided by 10X Genomics using
Human Genome GRCh38. Genes were detected in
[MATH: <0.1% :MATH]
of total sequenced cells and cells where
[MATH: <200 :MATH]
genes had nonzero counts were filtered out and not included in the
analysis. Low quality cells where
[MATH: >20% :MATH]
of the counts were derived from the mitochondrial genome were also
discarded. The doublet cell status was inferred using
DoubletFinder^[136]53 and only the singlet cells were kept for further
analysis. Data were normalized using the total sequencing count per
cell to adjust for differences in sequencing depth. The chromosome
database associated with cytogenetic location is downloaded via the
UCSC genome website.
Functional over-representation analysis
In the real data analysis section, after the aneuploid/diploid status
has been inferred by the proposed or existing method, we perform cell
type specific differential analysis using “MAST”^[137]32 (available in
Seurat package^[138]2) to compare the diploid versus aneuploid cells.
The genes with an adjusted p value smaller than 0.05 are included as
the DEGs. We then perform functional over-representation analysis using
the MSigDB platform^[139]33
(http://www.gsea-msigdb.org/gsea/msigdb/annotate.jsp) with the Reactome
pathway^[140]39, GO Biological Process^[141]54, and Hallmark^[142]40
databases. We present the top 10 pathways or biological process terms
regardless of the significance level.
Implementation of existing methods
All of the existing methods take the normalized expression counts for
the genes located in the region of interest as input. K-means and
hierarchical clustering are implemented using the functions kmeans and
hclust from the stat package in R, respectively. The PCA plus
Louvain^[143]55 and Leiden methods^[144]29 are implemented using the
Seurat package in function FindClusters with arguments algorithm
[MATH: =1 :MATH]
and 4. Since the Seurat clustering function does not allow
specification of cluster numbers, we design a loop with a precision
parameter ranging from 0.001 to 0.5 with distance 0.005 until the
algorithm identifies exactly two clusters. All the analyses and
plotting are performed in R v4.0.3.
Supplementary Information
[145]Supplementary Information.^ (20.7MB, pdf)
Acknowledgements