Abstract

   Single-cell RNA sequencing is a powerful tool to investigate the
   cellular makeup of tumor samples. However, due to the sparse data and
   the complex tumor microenvironment, it can be challenging to identify
   neoplastic cells that play important roles in tumor growth and disease
   progression. This is especially relevant for blood cancers, where
   neoplastic cells may be highly similar to normal cells. To address this
   challenge, we have developed partCNV and partCNVH, two methods for
   rapid and accurate detection of aneuploid cells with local copy number
   deletion or amplification. PartCNV uses an expectation-maximization
   (EM) algorithm with mixtures of Poisson distributions and incorporates
   cytogenetic information to guide the classification. PartCNVH further
   improves partCNV by integrating a hidden Markov model for feature
   selection. We have thoroughly evaluated the performance of partCNV and
   partCNVH through simulation studies and real data analysis using three
   scRNA-seq datasets from blood cancer patients. Our results show that
   partCNV and partCNVH have favorable accuracy and provide more
   interpretable results compared to existing methods. In the real data
   analysis, we have identified multiple biological processes involved in
   the oncogenesis of myelodysplastic syndromes and acute myeloid
   leukemia.

   Subject terms: Cancer, Computational biology and bioinformatics

Introduction

   Single-cell RNA sequencing (scRNA-seq) has greatly improved our ability
   to understand the cellular composition of the tissues and organs of
   interest, identify phenotype-associated cell groups, and elucidate the
   mechanisms behind many biological processes^[34]1–[35]3. These
   advantages make scRNA-seq a powerful tool for studying a wide range of
   human diseases, including Alzheimer’s disease^[36]4, cardiovascular
   disease^[37]5, and cancer^[38]6. In cancer research, a crucial step of
   scRNA-seq data analysis is to delineate tumor cells or neoplastic cells
   from other cell types^[39]7,[40]8. The tumors of each patient have
   their unique tumoral and microenvironmental evolution, and thus the
   scRNA-seq data from cancer patients tend to be more heterogeneous. Such
   heterogeneity is an exciting opportunity for improving our
   understanding of cancer with scRNA-seq, but it also imposes
   computational challenges to dissect composing cell types^[41]9.

   Neoplastic cells are abnormal cells that are undergoing excessive and
   uncontrolled proliferation^[42]10. These cells, which may or may not be
   malignant, can be extracted experimentally through cell sorting,
   although this is not always possible due to a lack of suitable markers
   or the high cost and labor requirements associated with these
   experiments^[43]11. In fact, it may be more useful to study all cells
   from a sequencing experiment simultaneously in order to understand the
   characteristics of neoplastic cells within their surrounding
   microenvironment. Neoplastic cells in certain types of cancer often
   have distinct features compared to non-neoplastic cells, such as high
   expression of certain cell type markers or genes belonging to some
   oncogenic pathways^[44]12. However, identifying neoplastic cells based
   on these markers or pathways can be difficult due to inter-individual
   heterogeneity, technical artifacts, and noise from the tumor
   microenvironment^[45]13.

   Recently, computational methods have been developed to identify
   large-scale copy number variations (CNVs) by comparing the smoothed
   scRNA-seq data against an internal or external normal reference, such
   as inferCNV, HoneyBADGER, and copyKAT^[46]3,[47]14,[48]15. For example,
   inferCNV is a popular visualization method for identifying large-scale
   CNVs. It uses smoothed averages over gene windows and compares the
   expression magnitude to the average over a set of reference ‘normal’
   cells. CopyKat is a recently developed tool serving a similar
   purpose^[49]15. CopyKAT uses an integrative Bayesian segmentation
   approach combining CNV inference and hierarchical clustering, which has
   been shown to achieve high accuracy in distinguishing cancer cells from
   normal cells in multiple cancer types. Both inferCNV and CopyKAT
   generally work well with tumor cells that demonstrate extensive
   chromosomal alterations, but they do not work well for cancer types
   that have fewer and shorter CNVs. This is often the case in hematologic
   cancers such as myelodysplastic syndromes and acute myeloid
   leukemia^[50]16. Moreover, they couldn’t incorporate additional
   clinical information for detecting specific CNVs. There are also
   methods that sought to integrate scRNA-seq data with bulk DNA
   sequencing (DNA-seq) or single-cell DNA-seq (scDNA-seq) data, such as
   CONGAS, clonealign, and CCNMF^[51]17–[52]19. These methods serve a
   different purpose: to cluster cells based on CNV or mutation
   information.

   In this paper, we propose to exploit cytogenetic information to improve
   the sensitivity and specificity for CNV identification. Cytogenetic
   data are routinely measured and recorded for patients with hematologic
   cancers^[53]20 and they provide useful information to identify
   CNVs^[54]21. In the case of myelodysplastic syndromes, cytogenetic
   features, along with other factors such as morphology, immunophenotype,
   and clinical features, are included in the World Health Organization
   (WHO)-classification-based Prognostic Score System (WPSS) for
   myelodysplastic syndromes^[55]22 and its revised version^[56]23.
   Similar risk scoring systems also exist for other types of hematologic
   malignancies^[57]24,[58]25. For example, Leukemia patients with certain
   cytogenetic features, such as deletion of chromosome 7 or 7q, deletion
   of 3q, or amplification of chromosome 8, have been shown to have a poor
   prognosis^[59]26. Cytogenetic data provide location information of each
   CNV and the proportion of cells with specific CNVs based on the
   analysis of 20 metaphases. For example, if a patient has cytogenetic
   data as “46,XY,del(20)(q11.1q13.1)[5]/46,XY[15],” this means that
   approximately
   [MATH: <mrow><mn>25</mn><mo>%</mo></mrow> :MATH]
   percent of cells (5 out of 20) have a deletion of chromosome 20 in the
   region q11.1 to q13.1, while the rest of the cells have normal
   chromosomal features. Cytogenetic data are typically cheaper and more
   readily available in clinical settings compared to DNA-seq or scDNA-seq
   experiments. While the proportion in cytogenetic data is a crude
   estimate of the aberrant cells, they can still be useful in classifying
   cell status and identifying cells with chromosomal abnormalities, which
   may be markers for neoplastic cells^[60]27. None of the existing
   computational methods is able to incorporate such cytogenetic
   information in the analysis of scRNA-seq data.

   Here, we introduce two methods, partCNV and partCNVH, for identifying
   cells with regional chromosomal abnormalities from scRNA-seq data by
   integrating cytogenetic information. Both methods are based on a
   statistical framework that models the count expression matrix of
   scRNA-seq data using a mixture of Poisson distributions while
   incorporating the cytogenetic information through prior specification.
   PartCNVH is built on partCNV and it further includes a hidden Markov
   model (HMM) to improve feature selection and clustering accuracy. It
   should be noted that our proposal is complementary to the existing
   methods such as copyKAT and inferCNV, as they focus on identifying
   large-scale CNVs while we detect smaller variations with the
   incorporation of external information. We implement our proposed
   methods in a computationally efficient expectation-maximization (EM)
   algorithm^[61]28 and evaluate their performance through extensive
   simulation studies. We then apply them to three scRNA-seq data sets
   from patients with hematologic malignancies and show that they can
   identify cells with chromosomal deletions or amplifications in specific
   regions suggested by the cytogenetic data. We also perform additional
   analysis to understand the changes in the pathways and biologic
   processes in the identified aneuploid cells. Compared to existing
   methods, partCNV and partCNVH provide more interpretable results and
   additional findings. With the widespread use of single-cell technology
   in hematologic cancer research and clinical care of cancer patients,
   our methods offer a useful solution for fully leveraging cytogenetic
   data to identify cells with specific chromosomal abnormalities.

Method overview

   PartCNV is a statistical framework that uses a hierarchical Poisson
   mixture model to differentiate two mixture components corresponding to
   normal and aberrant cells. PartCNVH is an extension of partCNV with the
   addition of HMM when there is a sufficient number of genes that allow
   feature selection. Figure [62]1 provides a schematic overview of the
   proposed methods. Our methods start with the normalized expression
   counts from the region with a known chromosomal deletion or
   amplification and explicitly incorporate the prior knowledge from
   cytogenetic data through imposing a Bernoulli prior on the cell status
   (i.e., normal or aberrant). We develop an EM algorithm that treats cell
   status as the missing variable and efficiently solves the mixture
   model. The inferred cell status from this step is the output of
   partCNV.

Fig. 1.

   [63]Fig. 1
   [64]Open in a new tab

   Schematic of PartCNV and PartCNVH. With the input of normalized
   expression counts from scRNA-seq experiments and the cytogenetic
   information from the patient, we develop an EM algorithm with mixtures
   of two Poisson distributions to infer the aneuploid/diploid status for
   the regions of interest. We further include a hidden Markov model to
   improve feature selection and the classification accuracy.

   Taking the ouptut from partCNV, partCNVH further refines it by a HMM.
   Specifically, a group average is taken for the inferred two groups of
   cells and the rolling average of the ratios between the two groups is
   used to infer the hidden status of the regions by a HMM. There are two
   reasons that we adopt this combination of rolling average and HMM in
   partCNVH. First, as shown in our later results, the group mean and
   rolling average can effectively magnify the signal of the regional
   deletion or amplification on the expression level. This is especially
   important when the signal of copy number alternations is weak related
   to noise in gene expression measurement. Second, it is possible that
   only a subset of the regions of interest has copy number changes. HMM
   can identify regions that are more likely to contain the chromosomal
   changes, which in turn improves the performance of aneuploid/diploid
   cell classification. After this HMM-based feature selection step,
   partCNVH performs a second round of the EM algorithm using the Poisson
   mixture model and reports the inferred cell status.

Simulations

   We design comprehensive simulation settings to evaluate the performance
   of partCNV and partCNVH. As comparisons, we also consider existing
   methods using dimension reduction by principal component analysis (PCA)
   followed by Louvain or Leiden clustering. Previous literature has
   reported that the Leiden algorithm can generate better connected
   communities through including an extra refinement step and run faster
   than Louvain^[65]29. Additionally, we include two widely used machine
   learning clustering algorithms, K-means clustering and hierarchical
   clustering. All the previous mentioned methods can be applied to detect
   locally aneuploid cells. Although our proposed method is not directly
   comparable to existing methods that classify cells based on
   whole-genome CNV inference, we still design a separate simulation study
   to compare the proposed methods versus the two large-scale CNV
   detection-based methods, inferCNV and copyKAT^[66]14,[67]15.

   We consider two settings where the first one studies aneuploid cells
   with deletions and the second one studies amplifications: redSimulation
   data 1 and 2. The mean expression for these genes is generated by
   taking a ratio of the normal expression. This ratio (or log fold
   change) is randomly drawn from a Uniform distribution with different
   base levels (0.5/0.6 for Setting 1 and 1.5/1.4 in Setting 2) and
   different noise levels. A larger noise level makes the expression from
   the aneuploid cells similar to the normal cells, and thus creates
   harder scenarios for the methods. The evaluation criteria is the
   accuracy of the classification results of the proposed method evaluated
   by the true cell status (i.e., being aneuploid or normal). More details
   of the simulation settings are provided in the Methods section.

   First consider simulation data 1, where 500 out of 3000 cells have
   deletions. Our proposed methods partCNV and partCNVH have the highest
   accuracy among all the methods in all scenarios (Figure [68]2A-B).
   Using the same normalized gene expression counts input as our methods,
   K-means and hierarchical clustering have the lowest accuracy ranging
   from 0.5 to 0.7. PCA plus Louvain and Leiden have higher accuracy than
   K-means and hierarchical clustering. When the signal is strong (ratio =
   0.5 in panel A) and noise is small, PCA plus Louvain/Leiden also have
   similar high accuracy as the proposed methods. But with the increase of
   the noise level, the accuracy of PCA plus Louvain or Leiden decreases.
   The advantage of the proposed methods becomes more obvious when the
   ratio is 0.6 and the noise level is high. For example, the mean
   accuracy of PCA plus Leiden is around 0.75 while partCNVH can achieve a
   high accuracy of 0.9. This is understandable since the proposed methods
   specifically model the data through two components for normal and
   aneuploid cells, and they allow mixtures of regions with and without
   deletions. Second, partCNV and partCNVH have similarly good performance
   with accuracy higher than 0.9, and partCNVH generally has higher
   accuracy than partCNV. To better understand the role of the HMM step of
   partCNVH and the result of feature selection, we use one simulation
   data set as an example and visualize the mean gene expression across
   the region (Figure [69]3A), ratios of the mean expressions of the two
   groups inferred by partCNV (Figure [70]3B), the rolling average of the
   mean expression ratios (Figure [71]3C), and the inferred status from
   HMM (Figure [72]3D). It can be seen that the rolling average of mean
   expression ratios between the two groups can effectively magnify the
   signal, and a majority of the HMM selected genes are located in the
   region with deletion. Figure [73]3E shows that our proposed method
   partCNVH has greater accuracy in classifying cells than the other
   methods.

Fig. 2.

   [74]Fig. 2
   [75]Open in a new tab

   Results of simulation Setting 1 with deletions. The methods that are
   compared include K-means clustering (Kmeans), hierarchical clustering
   (HClust), dimension reduction using PCA plus Louvain clustering
   (PCA+Louv), and dimension reduction using PCA and Leiden clustering
   (PCA+Leid). Each simulation dataset contains a total of 3000 cells:
   2500 normal cells and 500 with deletions. (A) The accuracy of these
   methods when the ratio of gene expression in a deletion region versus
   normal expression is 0.5 at different noise levels (0.1: low, 0.2:
   medium, 0.3: high). (B) The results of the setting with ratio = 0.6 at
   different noise levels. All results are summarized over 100 Monte Carlo
   iterations.

Fig. 3.

   [76]Fig. 3
   [77]Open in a new tab

   Illustrating the procedure of the feature selection step using HMM in a
   simulation data set. (A) The log-transformed mean gene expression
   levels for the genes located in regions without and with deletion. Each
   dot is a gene. (B) The log-transformed ratio of the mean expression
   levels for cells without versus with deletion inferred by partCNV. (C)
   The expression mean ratio for the two groups of cells by partCNV after
   applying the rolling average with a bandwidth of 50. (D) The latent
   states inferred by HMM based on the rolling average from panel (C). (E)
   The results of classification accuracy of partCNVH, PCA plus Leiden,
   hierarchical clustering, and K-means clustering. The red dots are the
   cells incorrectly classified and the blue dots are the correct ones.

Evaluation of PartCNV & PartCNVH for different prior information:
amplification regions

   Next we evaluate different methods in simulation data 2 with
   amplifications (Figure [78]4). Note that for cells with deletion, the
   expression change odds is 2 (from 1 to 0.5) while in cells with
   amplification the odds is 0.67 (from 1 to 1.5) or its inverse 1.5. Thus
   the signals from the amplified regions can be harder to detect than the
   first setting with deleted regions. In Figure [79]4(A), we observe the
   performance of all methods decrease, especially for PCA plus Louvain
   and Leiden. When the noise level is high (0.3), PCA plus Leiden only
   reaches an accuracy of 0.74. In comparison, our proposed method still
   has a high accuracy of around 0.95. With the ratio level set as 1.4 in
   panel B, the signal level becomes lower and the existing methods have
   even lower accuracy ranging from 0.5 to 0.6, while the proposed methods
   still stay at a reasonable accuracy level around 0.9. These demonstrate
   the robustness of the proposed methods and highlight the importance of
   applying partCNV or partCNVH instead of existing methods when the
   region of interest has amplifications.

Fig. 4.

   [80]Fig. 4
   [81]Open in a new tab

   Results of simulation Setting 2 with amplifications. Each simulation
   dataset contains a total of 3000 cells: 2500 normal cells and 500 with
   amplifications. (A) The accuracy of the methods compared when the ratio
   = 1.5 at different noise levels (0.1: low, 0.2: medium, 0.3: high). (B)
   The results when the ratio = 1.4 at different noise levels. All results
   are summarized over 100 Monte Carlo iterations.

Evaluation of PartCNV & PartCNVH for different prior information: cell
numbers

   Our current simulation design considers a total of 3000 cells. When we
   study a region of interest suggested by cytogenetic data, more cells
   generally provide more information, and thus identifying signals from
   fewer cells can be more challenging. To evaluate the proposed methods
   under this scenario, we generate simulation data 3 by fixing the cell
   number as 1300, where 1000 cells are normal and 300 are aneuploid
   cells. Figure S1 shows the simulation results with 1300 total cells and
   the region of interest (200 out of 600 genes) has a deletion in the
   aneuploid cells. Compared with the results in Figure [82]2, all the
   existing methods have worse performance for the same ratio and noise
   combinations. For example, both PCA plus Louvain and PCA plus Leiden
   have a high accuracy of around 0.90 when the ratio is 0.6 with a medium
   noise of 0.2 using 3000 total cells, while the accuracy decreases to
   around 0.8 using 1300 cells. The variation of the classification
   results also increases. Although our proposed methods have slightly
   decreased performance when the ratio is 0.6 and the noise is 0.3, their
   overall performance remains similar in other scenarios (ratio = 0.5 at
   all noise levels, ratio = 0.6 with low and medium noise levels).

   Similar patterns can be observed in amplification settings by comparing
   the results in Figure S2 versus the results in Figure [83]4.
   Surprisingly, PCA plus Louvain or Leiden has even worse median accuracy
   than the hierarchical clustering, even though they all have quite low
   classification accuracy. These results suggest that our proposed
   methods tend to have more robust performance even with fewer cell
   numbers in the analyzed dataset, while the existing methods have
   decreased accuracy and more varied results, especially when the noise
   level is high.

Evaluation of PartCNV & PartCNVH for different prior information: proportions
of aneuploid cells

   As a methodology advantage, partCNV and partCNVH are able to
   incorporate the prior knowledge of an estimated proportion of aneuploid
   cells. If the prior is misspecified, we seek to understand the impact
   on the results. We generate simulation data 4 with the same data
   generation procedure but the prior information is specified as correct
   (0.17 for total cell number 3000 and 0.23 for total cell number 1300)
   and incorrect (0.5), and we examine the results of our proposed
   methods.

   Figures [84]5 and S3 illustrate the accuracy for the total cell numbers
   3000 and 1300. First, it is clear that the correct prior knowledge
   improves the classification accuracy than a non-informative prior of
   0.5. This improvement is small when the ratio is 0.5 or 1.5, but it can
   be substantial when the signal is harder to detect (ratio = 0.6 or 1.4)
   and the noise level is high. For example, when the ratio is 1.4 and the
   noise is 0.3, the improvement of the accuracy for both partCNV and
   partCNVH using a correct prior can be about
   [MATH: <mrow><mn>10</mn><mo>%</mo></mrow> :MATH]
   compared to using the incorrect prior. Second, both Figures 5 and S3
   demonstrate the robustness of the proposed methods against incorrect
   prior information, especially in panels A and C where the ratios are
   0.5 and 1.5, respectively. In these experiments, we choose 0.5 as the
   incorrect prior knowledge, which is far from the true proportion 0.17
   and illustrate a worst scenario that the prior is completely
   non-informative. In reality, when a closer prior such as 0.20 or 0.15
   is used, the impact would be much smaller. Even in the worst scenario,
   with an amplification ratio 1.4 and a high noise level 0.3, our method
   with an incorrect prior still reaches a median accuracy above 0.75,
   which is better than the existing methods under the same scenario.
   These results highlight the advantage of the proposed methods in
   accurately identifying aneuploid cells.

Fig. 5.

   [85]Fig. 5
   [86]Open in a new tab

   Simulation results of evaluating the impact of different prior
   information on the classification accuracy of the proposed methods. The
   dark and light blue bars correspond to the results with and without
   correct specification of the prior information (true prior: 0.17).
   Panel (A) and (B) are the simulation results based on the first
   simulation setting with ratios = 0.5, 0.6, respectively. Panel (C) and
   (D) are based on the second simulation setting with ratios = 1.5, 1.4,
   respectively. All results are summarized over 100 Monte Carlo
   iterations.

Comparison with genome-wide CNV detection methods

    Lastly, we compare the proposed methods with two widely used
   genome-wide CNV detection methods, inferCNV, copyKAT as well as the
   regional versions of inferCNV and copyKAT. For the regional versions,
   we applied hierarchical clustering on the normalized expression matrix
   by inferCNV/copyKAT for the regions of interest only. One example of
   simulation data 4 is visualized in Figure [87]6A. As inferCNV requires
   the input of normal cells, we use an additional 100 normal cells as the
   reference for inferCNV. Both inferCNV and copyKAT generally are much
   more computationally intensive. We summarize the accuracy of 20 Monte
   Carlo simulations. It can be observed that neither inferCNV/copyKat nor
   the regional versions of these methods is able to accurately infer
   aneuploid/diploid cell status (Figure [88]6B-D), since the regional
   aneuploid signal is too weak compared to genome-wide copy number
   alterations that inferCNV and copyKAT are designed to detect. It is
   also interesting that when we re-cluster the cells based on the
   normalized expression from the sub-regions of chromosome 20, the
   performance of copyKAT increased but not inferCNV. These findings
   highlight the need for region-specific detection tools with the
   considerations of cell type mixtures to distinguishing between
   aneuploid and diploid cells.

Fig. 6.

   [89]Fig. 6
   [90]Open in a new tab

   Simulation results to compare the proposed method versus existing
   whole-genome based methods, InferCNV and CopyKAT. (A) Heatmap of the
   simulated expression values for genes in region Chr20(q11.1-q13.1),
   where the rows are the cells and columns are the genes. Logarithmic of
   the expression values plus one are used for visualization. Rows are
   labeled by the true aneuploid/diploid status and the inferred status by
   partCNV and partCNVH. (B) CopyKAT output of the aneuploid/diploid
   prediction versus the true cell status. The heatmap includes all the
   chromosomes. (C) Copy number results using InferCNV. (D) Boxplot of the
   aneuploid/diploid inference of the methods averaged over 20 Monte Carlo
   simulations. Regional InferCNV and Regional CopyKat are the
   hierarchical clustering results based on the normalized expression of
   the region of interest from InferCNV and CopyKAT.

Real data application

   We demonstrate the usage of the proposed methods on three real data
   applications. For each application, the scRNA-seq data from one patient
   were collected by the 10X genomics platform and the cytogenetic data
   were collected from patients’ medical records. All three patients have
   a subset of the cells with regional copy number variations, and the
   rest are normal cells. The three applications have different complexity
   levels. The first subject (patient 1) is the most straightforward one,
   as a very long region (the whole chromosome Y) was reported as lost in
   a subset of the cells according to the patient’s cytogenetic data. The
   second subject (patient 2) has a subset of cells with partial
   chromosome 20 lost. The third subject (patient 3) has a complicated
   situation as this patient has cells with partial deletions, as well as
   cells with partial amplifications.

Patient 1: MDS with loss of chromosome Y

   We obtain the scRNA-seq data of a bone marrow sample from an MDS
   patient treated at MD Anderson Cancer Center. The data, after alignment
   and quality control, contains a total of 33,538 genes and 655 cells.
   For this patient, the bone marrow sample has been specifically sorted
   for CD34+ cells to enrich hematopoietic stem and progenitor cells
   (HSPCs). As a result, the cell number is smaller than regular scRNA-seq
   experiments. Based on the clinically obtained cytogenetic data, this
   patient has around
   [MATH: <mrow><mn>35</mn><mo>%</mo></mrow> :MATH]
   of cells with the loss of chromosome Y and our goal is to identify
   these aneuploid cells. We first apply copyKAT to the whole
   transcriptome data from this sample to infer copy number variations. As
   shown in Figure S4, copyKAT clusters the cells based on the inferred
   CNV statuses across the whole genome, but it could not take regional
   data or the cytogenetic information into consideration when identifying
   the neoplastic cells.

   To specifically identify cells with the chromosome Y loss, we apply the
   proposed methods and the five existing methods (PCA plus Louvain, PCA
   plus Leiden, K-means clustering, hierarchical clustering, and CopyKAT)
   on this dataset. The input data are the normalized counts from the
   genes located on chromosome Y. In this application, the HMM step from
   partCNVH selects the whole set of genes so we only present the results
   from partCNV. Since the CNV in this dataset encompasses the entire
   chromosome Y, we expect some of the existing methods also work well for
   this analysis. We find that partCNV, PCA plus Louvain, PCA plus Leiden,
   and K-means clustering all have proportions of aneuploid cells close to
   35% (40.9%, 39.5%, and 39.5%, respectively) (Figure [91]7 A-E).
   Hierarchical clustering identifies a much smaller number of aneuploid
   cells (9.31%). From visualizing the pairwise ARI values of these
   results, we find that partCNV, PCA plus Louvain/Leiden have very
   similar results, while K-means clustering and hierarchical clustering
   have very different results (Figure [92]7 F). As the UMAP coordinates
   in Figure [93]7(A-E) are obtained using the whole transcriptome data,
   the fact that copyKAT-identified aneuploid cells cluster together
   suggests that copyKAT captures the whole transcriptome pattern instead
   of chromosome Y specific changes.

Fig. 7.

   [94]Fig. 7
   [95]Open in a new tab

   Results of applying different methods to the scRNA-seq data using the
   bone marrow CD34 positive cells from patient 1. Cytogenetic information
   shows that
   [MATH: <mrow><mo>∼</mo><mn>35</mn><mo>%</mo></mrow> :MATH]
   of the cells from the sample have chromosome Y loss. (A-E) The cell
   classification based on the inference of chromosome Y loss using
   different methods. Panel F shows the heatmap of the ARI values from
   comparing the classification results using different methods. (G) The
   expression of chromosome Y genes for cells that are labeled as normal
   in partCNV (partCNV:0) and aneuploid in CopyKAT (CopyKat:1), and three
   similar groups. (H,I) The expression of chromosome Y genes comparing
   partCNV versus k-means clustering and comparing partCNV and
   Hierarchical clustering.

   We further examine the average gene expression of chromosome Y genes
   among the aneuploid/normal cells identified by partCNV, copyKAT,
   K-means, and hierarchical clustering (Figure [96]7 G-I). It is apparent
   that partCNV-labeled aneuploid cells have much lower expression than
   the cells labeled as aneuploid by other methods but normal per partCNV,
   confirming that partCNV correctly identifies the cells with deletion on
   chromosome Y.

   In summary, these results suggest that partCNV and the PCA plus Louvain
   or Leiden clustering have identified the cells with the chromosome Y
   loss. In contrast, K-means clustering, hierarchical clustering, and
   copyKAT failed to do so.

Patient 2: MDS with partial deletion of chromosome 20

   We obtain the scRNA-seq data from the bone marrow sample for a
   different MDS patient. The data were also generated by the 10X genomics
   scRNA-seq technology. This sample was sequenced directly without the
   cell sorting step, and thus both HSPC and immune cells can be
   potentially identified. After alignment, preprocessing, and quality
   control, a total of 24,519 genes and 3,643 cells are kept for the
   analysis. Based on the cytogenetic data, about 20% of cells in the
   sample have deletions in chromosome 20 at regions q11.1 to q13.1, which
   is about 24.2 Mb long. We also apply CopyKAT to this data and present
   the heatmap result in Figure S5. Although cytogenetics reported
   deletions in chromosome 20, the log copy number ratio heatmap does not
   have an obvious deletion pattern in the suggested regions. Based on the
   whole genome copy number inference, copyKAT only reports about 500
   aneuploid cells (
   [MATH: <mo>∼</mo> :MATH]
   5.5%).

   We analyze the whole-transcriptome data using Seurat^[97]2 through
   identifying highly variable genes, extracting top principal components
   (PCs) based on these genes, and we perform UMAP and clustering analysis
   (Figure [98]8). UMAP is used for dimension reduction and unsupervised
   clustering is performed with the default Louvain clustering using the
   top PCs. A total of 10 clusters are identified, and the cluster
   specific markers are used for annotating the cell type labels based on
   biological knowledge by our MDS biologist. We also apply the proposed
   methods and existing methods targeting the region on chromosome 20 with
   known chromosomal deletions. Figure [99]8 A-E shows cell type labels
   and the aneuploid/diploid inference result using partCNVH, PCA plus
   Louvain, K-means clustering, and hierarchical clustering. The results
   for partCNV and PCA plus Leiden are presented in Figure S6. We find
   that the proposed methods have the closest proportion of aneuploid
   cells to the cytogenetics reported proportion; the other methods all
   have much lower or higher proportions. The major difference between our
   proposed method and PCA plus Louvain is in the cycling RBC progenitors
   (Figure [100]8 B, C, and F). Our method reports high proportions in all
   three MDS-related cell groups (i.e., HSPCs, RBC progenitors, and
   cycling RBC progenitors), while PCA plus Louvain only reports aneuploid
   cells in the former two clusters. Previous literature found impaired
   erythroid-proliferating capacities to be a prominent characteristic in
   patients with MDS^[101]30,[102]31. Both RBC progenitors and cycling RBC
   progenitors are major cell types involved in the
   erythroid-proliferating function, and thus it makes sense to identify
   neoplastic cells in both cell types.

Fig. 8.

   [103]Fig. 8
   [104]Open in a new tab

   Application results of the proposed method and existing methods using
   the scRNA-seq data from patient 2. (A) The cell type annotation based
   on marker genes’ expression and biological knowledge. RBC progenitor:
   red blood cell progenitors. HSPC: hematopoietic stem and progenitor
   cells. NK: natural killer. (B-E) The inferred aneuploid cells with
   deletion chr20(q11.1q13.1) using different methods (B, partCNV; C,
   PCA+Louvian; D, K Means clustering; E, Hierarchical clustering). The
   proportions in the title bracket are the proportions of cells that are
   inferred as cells with this specific deletion. (F) The comparisons of
   results by partCNVH and PCA plus Louvain versus cell type labels. (G)
   Gene set enrichment analysis results for the genes that are
   differential expressed between the cells with and without deletion
   chr(20)(q11.1q13.1). Gene sets are defined using the Reactome Pathway
   Database.

   To understand the differences between the identified locally aneuploid
   cells and the normal cells, we conduct differential expression analysis
   for each cluster to compare the aneuploid versus diploid cells
   identified by our method^[105]32. A total of 177 cluster-specific
   differentially expressed genes were identified using a cutoff of 0.05
   for adjusted p values. We also perform over-representation analysis
   using GO Biological Process database^[106]33,[107]34 and identify
   several functional categories over-represented by differential
   expression signals, including neutrophil degranulation, a few immune
   system related terms, and interleukin 4 and interleukin 13 signals
   (Figure [108]8 G). Many of these terms have been reported in previous
   literature to be related with MDS. For example, neutrophil
   degranulation and migration has been reported to be associated with MDS
   compared with normal controls^[109]35. The important role of the immune
   system and innate immune signaling in MDS has also been reported in
   multiple publications^[110]36–[111]38. The over-representation results
   using the Reactome pathway and Hallmark database are presented in
   Figure S7^[112]39,[113]40. In the Reactome pathway enrichment results,
   we also find several immune response-related and lymphocyte activation
   related terms. In Hallmark analysis, allograft rejection, complement,
   interferon gamma response, and TNFA signaling via NFKB are the top
   findings, which also have been associated with MDS
   pathogenesis^[114]41–[115]43. There are also some terms that have not
   been reported in previous MDS studies, such as generation of second
   messenger molecules, hemostasis, defense response, and cell activation,
   which could be promising targets for future research.

Patient 3: AML with partial gain and partial deletion of chromosome 8

   Lastly, we study the scRNA-seq data from an AML patient with
   complicated chromosomal variations^[116]44. Specifically, this patient
   has amplifications of the whole chromosome 8 in 25% of the cells,
   deletion of chromosome 8 at region q21.2 to q24.3 in 40% of the cells,
   and normal karyotype in the rest of the cells. We are interested in
   identifying which cells contain the chromosome 8 gain and which have
   del(8)(q21.2q24.3). The scRNA-seq data were generated using the 10X
   genomics technology platform. After preprocessing, the data contain a
   total of 20,521 genes from 4294 cells. Since there are overlaps between
   the deletion and amplification regions, we apply partCNVH through a
   two-step approach. We first focus on the region of chromosome 8 before
   q21.2 where about 25% of the cells have amplification; the rest of the
   cells are normal in the area. After the cells with chromosome 8 gain
   are detected, we apply partCNVH again to the rest of the cells, which
   contain del(8)(q21.2q24.3) in around 53.3% (
   [MATH: <mrow><mo>=</mo><mn>40</mn><mo>%</mo><mo
   stretchy="false">/</mo><mo
   stretchy="false">(</mo><mn>1</mn><mo>-</mo><mn>25</mn><mo>%</mo><mo
   stretchy="false">)</mo></mrow> :MATH]
   ) of the cells. In the two steps,
   [MATH: <mrow><mn>25</mn><mo>%</mo></mrow> :MATH]
   and
   [MATH: <mrow><mn>53.3</mn><mo>%</mo></mrow> :MATH]
   are used as the prior knowledge of the aneuploid cell proportions in
   partCNVH.

   We find that the majority of the inferred cells with chromosome 8
   amplification are from the erythroid cells, while the cells with
   del(8)(q21.1q24.3) are mostly AML cells (Figure [117]9 A-B). The
   proportions of cells with chromosome 8 amplification and
   del(8)(q21.1.q24.3) are about 13% and 60%, which are close to the prior
   knowledge from cytogenetics. We visualize the normalized expression for
   the region of interest on chromosome 8 in Figure S8A. We also present
   the results by inferCNV and copyKAT in Figure S8B and Figure S9,
   respectively. We observe that the patterns for chromosome 8 varies
   between the three cell groups in Figure S8A. However, whether the
   pattern shows gain or loss is not easy to identify due to data
   sparsity. Neither CopyKAT nor InferCNV can be applied to regional data,
   and thus they couldn’t reproduce the patterns we observed in Figure
   S8A.

Fig. 9.

   [118]Fig. 9
   [119]Open in a new tab

   Results of applying the proposed method partCNVH to the scRNA-seq data
   from an AML patient with
   [MATH: <mrow><mo>∼</mo><mn>25</mn><mo>%</mo></mrow> :MATH]
   cells having amplification of chromosome 8 and
   [MATH: <mrow><mo>∼</mo><mn>40</mn><mo>%</mo></mrow> :MATH]
   cells having chromosome 8 deletion at region q21.2-q24.3. (A) The true
   cell type labels annotated by a clinician-scientist. (B) The inferred
   cells with amplification of 8 (“+8”) and the deletion of chromosome 8
   at q21.2-q24.3 (“del(8)(q21.2q24.3)”). (C, D) The Reactome pathway
   enrichment analysis results using the DEGs by comparing the AML cells
   with del(8)(q21.2q24.3) versus diploid cells, and by comparing the
   erythroid cells with +8 versus diploid cells, respectively.

   To understand the different molecular mechanisms related to chromosome
   changes, we perform differential analysis for comparing the AML cells
   with del(8)(q21.2q24.3) versus diploid AML cells and obtain 266
   differentially expressed genes (DEGs). Similarly, we compare the
   erythroid cells with a gain of chromosome 8 versus diploid erythroid
   cells and obtain 426 DEGs. Gene set enrichment analysis identify a few
   shared significant terms between AML and erythroid cells (Figure
   [120]9C-D), including multiple immune system related terms, neutrophil
   degranulation, and cellular responses to stimuli. Some unique terms in
   AML are hemostasis, platelet activation signaling and aggregation, and
   arachidonic acid metabolism. Hemostatic and thrombotic complications
   are prevalent symptoms in AML patients and hemostasis has been studied
   before for AML pathogenesis-related mechanisms^[121]45. The term
   platelet activation signaling and aggregation is also consistent with
   the previous literature that the platelet defects and other hemorrhagic
   symptoms are widely observed in AML patients^[122]46. The arachidonic
   acid metabolism is a process highlighted in a few cancer research
   publications, but the evidence for their involvement in AML is still
   accumulating^[123]47,[124]48. Overall, our results are consistent with
   literature and provide some novel disease-related biological processes
   for future research.

Discussion

   We introduce partCNV/partCNVH, a statistical framework that
   distinguishes neoplastic cells with copy number alterations from normal
   diploid cells based on regional chromosomal deletions or
   amplifications. Unlike existing methods, our statistical framework can
   incorporate prior knowledge from cytogenetic data that includes both
   chromosomal locations of aberrations and the observed proportion. As
   demonstrated in our simulation study, this prior information can
   effectively improve the classification accuracy when the signal is
   weak. Our framework also includes a feature selection step using the
   hidden Markov model, which is able to filter the genes when part(s) of
   the region are diploid. This step further improves the signals of
   chromosomal changes and results in higher accuracy to detect neoplastic
   cells. We have illustrated the benefits of partCNV and partCNVH through
   extensive simulation studies and in-depth analysis of three scRNA-seq
   datasets from MDS or AML patients.

   Cytogenetic information is a key component of the scoring system of
   risk assessment, treatment selection, and outcome prediction for
   patients with hematologic malignancies^[125]49,[126]50. In this work,
   we exemplify the use of the proposed methods in patients with MDS and
   AML. These methods can also be applied to other hematologic
   malignancies or even other cancer types if similar problems are
   encountered. As cytogenetic analysis is a mature technology and is
   widely used in clinical settings, the cytogenetic data should be fairly
   accessible from clinical collaborators who provide patient samples.
   Such cytogenetic data-guided analysis can be a useful tool for
   identifying subgroups of cells with the chromosomal changes of
   interest.

   Implemented in an EM algorithm, our proposed methods have favorable
   computational performance. For a simulation dataset with 3000 cells and
   600 genes, it takes 12 seconds and 21.5 seconds for partCNV and
   partCNVH to complete the analysis, respectively. This computation cost
   is similar to existing methods PCA plus Louvain (
   [MATH: <mo>∼</mo> :MATH]
   6.8 seconds) and PCA plus Leiden (
   [MATH: <mo>∼</mo> :MATH]
   30.7 seconds). In comparison, both inferCNV and CopyKAT take more than
   10 minutes, sometimes more than an hour, to process a scRNA-seq dataset
   with a few thousand cells. Additionally, our proposed methods scale
   almost linearly to large datasets. If we increase the cell number from
   3000 to 6000, partCNV takes about 26.5 seconds and partCNVH 41.5
   seconds to complete the computation. Since cytogenetic data are
   generally unique for each subject, we expect that our methods are
   applied person by person in real applications. Thus, it is reasonable
   to assume a few thousand cells in the dataset, for which our methods
   can complete the computation within one minute.

   For future work, the methods can be further extended to incorporate
   additional biological knowledge, such as the marker or mutation
   information mentioned in Fan et al^[127]9. Our method has the potential
   to be applied on multiple samples in parallel or even to borrow
   information across different samples from the same subject, such
   extension is not trivial and needs further evaluations. Moreover, due
   to the complexity of sequencing depth, gene expression variations,
   number of genes impacted by the aneuploid event, we have not evaluated
   the power of the proposed methods or the minimum required size for CNV
   events. These should be carefully evaluated in future works.

   Previous work also developed machine learning models to predict the
   neoplastic/non-neoplastic status of the cells by splitting the
   annotated data and training a random forest model^[128]16. Such models
   usually rely on the training dataset and may not generalize well to
   other studies. With the accumulation of annotated single cell data, it
   is also possible to harness the power of deep learning algorithms to
   further improve existing models and achieve more accurate predictions.
   Our current methods assume a fixed prior based on the cytogenetic data.
   It is possible that different proportions of aneuploid cells have
   different confidence levels. The current method can be further extended
   to incorporate such confidence into the model to improve accuracy. When
   scDNA-seq data is available, it may be another prior knowledge to
   replace the cytogenetic information of clinical data as the input to
   PartCNV and PartCNHV. The method can be further extended to incorporate
   regions where the scDNA-seq data shows enriched hierarchical aneuploid
   CNVs with high resolution.

Methods

Details of partCNV

   We aim to identify the cells with the known chromosomal deletion or
   amplification from the scRNA-seq data with incorporation of the prior
   cytogenetic knowledge. Assume a total of N cells were sequenced by
   scRNA-seq and G genes fall in the region with deletion or
   amplification. The count matrix is denoted by
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mo>=</mo><mo
   stretchy="false">(</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo
   stretchy="false">)</mo></mrow> :MATH]
   , which is a
   [MATH: <mrow><mi>G</mi><mo>×</mo><mi>N</mi></mrow> :MATH]
   matrix with rows being the genes and columns being the samples. Without
   loss of generality, we assume the genes in
   [MATH: <mrow><mi mathvariant="bold-italic">Y</mi></mrow> :MATH]
   are ordered by their locations on the chromosome. As the observed data
   contain the mixture of cells with and without chromosomal changes,
   denote the underlying status of the cell i by
   [MATH: <msub><mi>c</mi><mi>i</mi></msub> :MATH]
   . Assume the prior proportion of the aneuploid cells is
   [MATH: <msub><mi>q</mi><mn>0</mn></msub> :MATH]
   for the region of interest. In our motivating problem,
   [MATH: <msub><mi>q</mi><mn>0</mn></msub> :MATH]
   is calculated based on the number of metaphases observing the
   chromosomal changes divided by the total number of metaphases from an
   cytogenetics test. As shown in some recent literature regarding the
   distribution of the scRNA-seq count^[129]51,[130]52, the scRNA-seq data
   may not be zero-inflated and the excessive zeros are due to low
   expression level of each single cell. Additionally, since the region of
   interest generally contains a limited number of genes, the estimated
   dispersion parameters are not accurate enough if we use a negative
   binomial distribution. Thus, we assume the expression count of gene i
   follows a Poisson distribution with mean
   [MATH: <msub><mi>θ</mi><mrow><mi>g</mi><mn>1</mn></mrow></msub> :MATH]
   if the cell is aneuploid or mean
   [MATH: <msub><mi>θ</mi><mrow><mi>g</mi><mn>0</mn></mrow></msub> :MATH]
   if diploid, i.e.,
   [MATH: <mrow><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo
   stretchy="false">|</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>1</mn></mrow
   ></msub><mo>,</mo><msub><mi>c</mi><mi>i</mi></msub><mo>=</mo><mn>1</mn>
   <mo
   stretchy="false">)</mo></mrow><mo>=</mo><mfrac><mrow><msubsup><mi>θ</mi
   ><mrow><mi>g</mi><mn>1</mn></mrow><mrow><mi
   mathvariant="italic">gi</mi></mrow></msubsup><mo>exp</mo><mrow><mo
   stretchy="false">(</mo><mo>-</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>1<
   /mn></mrow></msub><mo
   stretchy="false">)</mo></mrow></mrow><mrow><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo>!</mo></mrow></mfrac><msp
   ace width="3.33333pt"></mspace><mspace
   width="3.33333pt"></mspace><mtext>and</mtext><mspace
   width="3.33333pt"></mspace><mspace
   width="3.33333pt"></mspace><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo
   stretchy="false">|</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>0</mn></mrow
   ></msub><mo>,</mo><msub><mi>c</mi><mi>i</mi></msub><mo>=</mo><mn>0</mn>
   <mo
   stretchy="false">)</mo></mrow><mo>=</mo><mfrac><mrow><msubsup><mi>θ</mi
   ><mrow><mi>g</mi><mn>0</mn></mrow><mrow><mi
   mathvariant="italic">gi</mi></mrow></msubsup><mo>exp</mo><mrow><mo
   stretchy="false">(</mo><mo>-</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>0<
   /mn></mrow></msub><mo
   stretchy="false">)</mo></mrow></mrow><mrow><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo>!</mo></mrow></mfrac><mo>
   .</mo></mrow> :MATH]

   We assume the cell status variable
   [MATH: <msub><mi>c</mi><mi>i</mi></msub> :MATH]
   follows a Bernoulli distribution
   [MATH: <mrow><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>c</mi><mi>i</mi></msub><mo
   stretchy="false">|</mo><msub><mi>q</mi><mi>i</mi></msub><mo
   stretchy="false">)</mo></mrow><mo>=</mo><msub><mi>q</mi><mi>i</mi></msu
   b><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo>-</mo><msub><mi>q</mi><mi>i</mi></
   msub><mo stretchy="false">)</mo></mrow></mrow> :MATH]
   where
   [MATH: <msub><mi>q</mi><mi>i</mi></msub> :MATH]
   denotes the probability of cell i having the chromosomal changes at the
   region of interest, i.e.,
   [MATH:
   <mrow><msub><mi>q</mi><mi>i</mi></msub><mo>=</mo><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>c</mi><mi>i</mi></msub><mo>=</mo><mn>1
   </mn><mo stretchy="false">)</mo></mrow></mrow> :MATH]
   . The prior knowledge of the aneuploid cell proportion is best
   described through a beta distribution, which we approximate through a
   normal distribution. Though cytogenetic information is obtained based
   on 20 metaphases, the involved cells can number in the hundreds of
   thousands, and thus a normal distribution can adequately approximate
   the underlying beta distribution. We assume
   [MATH: <msub><mi>q</mi><mi>i</mi></msub> :MATH]
   follows a prior Normal distribution with mean
   [MATH: <msub><mi>q</mi><mn>0</mn></msub> :MATH]
   and variance
   [MATH: <msup><mi>λ</mi><mn>2</mn></msup> :MATH]
   :
   [MATH: <mrow><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>q</mi><mi>i</mi></msub><mo
   stretchy="false">|</mo><msub><mi>q</mi><mn>0</mn></msub><mo>,</mo><mi>λ
   </mi><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mfrac><mn>1</mn><mrow><mi>λ</m
   i><msqrt><mrow><mn>2</mn><mi>π</mi></mrow></msqrt></mrow></mfrac><mo>ex
   p</mo><mfenced close=")" open="("><mo>-</mo><mfrac><msup><mrow><mo
   stretchy="false">(</mo><msub><mi>q</mi><mi>i</mi></msub><mo>-</mo><msub
   ><mi>q</mi><mn>0</mn></msub><mo
   stretchy="false">)</mo></mrow><mn>2</mn></msup><mrow><mn>2</mn><msup><m
   i>λ</mi><mn>2</mn></msup></mrow></mfrac></mfenced><mo>.</mo></mrow>
   :MATH]

   The variance
   [MATH: <msup><mi>λ</mi><mn>2</mn></msup> :MATH]
   represents the confidence about the prior information. Smaller variance
   indicates stronger confidence in the prior knowledge. However, in our
   experiments, we found that the actual value of
   [MATH: <mi>λ</mi> :MATH]
   has minimal impact on the classification results as long as the value
   is within a decent range (e.g.,
   [MATH: <mi>λ</mi> :MATH]
   between 0.01 and 1). Throughout our experiments, we use
   [MATH: <mrow><mi>λ</mi><mo>=</mo><mn>0.1</mn></mrow> :MATH]
   . Together, the full likelihood of the problem can be written as
   [MATH: <mrow><mtable><mtr><mtd
   columnalign="right"><mrow><mi>L</mi><mrow><mo
   stretchy="false">(</mo><mrow><msub><mi
   mathvariant="bold-italic">θ</mi><mn
   mathvariant="bold">1</mn></msub></mrow><mo>,</mo><mrow><msub><mi
   mathvariant="bold-italic">θ</mi><mn
   mathvariant="bold">0</mn></msub></mrow><mo>,</mo><mrow><mi
   mathvariant="bold-italic">c</mi></mrow><mo>,</mo><mrow><mi
   mathvariant="bold-italic">q</mi></mrow><mo
   stretchy="false">|</mo><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mo>,</mo><msub><mi>q</mi><mn>0<
   /mn></msub><mo>,</mo><mi>λ</mi><mo
   stretchy="false">)</mo></mrow><mo>=</mo><munder><mo>∏</mo><mi>i</mi></m
   under><mrow><mo maxsize="1.2em" minsize="1.2em"
   stretchy="true">[</mo></mrow><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo
   stretchy="false">|</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>1</mn></mrow
   ></msub><mo stretchy="false">)</mo></mrow><msup><mrow><mo
   maxsize="1.2em" minsize="1.2em" stretchy="true">]</mo></mrow><mrow><mn
   mathvariant="double-struck">1</mn><mo
   stretchy="false">(</mo><msub><mi>c</mi><mi>i</mi></msub><mo>=</mo><mn>1
   </mn><mo stretchy="false">)</mo></mrow></msup><mrow><mo maxsize="1.2em"
   minsize="1.2em" stretchy="true">[</mo></mrow><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo
   stretchy="false">|</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>0</mn></mrow
   ></msub><mo stretchy="false">)</mo></mrow><msup><mrow><mo
   maxsize="1.2em" minsize="1.2em" stretchy="true">]</mo></mrow><mrow><mn
   mathvariant="double-struck">1</mn><mo
   stretchy="false">(</mo><msub><mi>c</mi><mi>i</mi></msub><mo>=</mo><mn>0
   </mn><mo stretchy="false">)</mo></mrow></msup><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>c</mi><mi>i</mi></msub><mo
   stretchy="false">|</mo><msub><mi>q</mi><mi>i</mi></msub><mo
   stretchy="false">)</mo></mrow><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>q</mi><mi>i</mi></msub><mo
   stretchy="false">|</mo><msub><mi>q</mi><mn>0</mn></msub><mo>,</mo><mi>λ
   </mi><mo
   stretchy="false">)</mo></mrow></mrow></mtd></mtr></mtable></mrow>
   :MATH]
   1

   and the detailed log-likelihood is
   [MATH: <mrow><mtable><mtr><mtd columnalign="right"><mrow><mi>l</mi><mo
   stretchy="false">(</mo><mrow><msub><mi
   mathvariant="bold-italic">θ</mi><mn
   mathvariant="bold">1</mn></msub></mrow><mo>,</mo><mrow><msub><mi
   mathvariant="bold-italic">θ</mi><mn
   mathvariant="bold">0</mn></msub></mrow><mo>,</mo><mrow><mi
   mathvariant="bold-italic">c</mi></mrow><mo>,</mo><mrow><mi
   mathvariant="bold-italic">q</mi></mrow><mo
   stretchy="false">|</mo><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mo>,</mo><msub><mi>q</mi><mn>0<
   /mn></msub><mo>,</mo><mi>λ</mi><mo
   stretchy="false">)</mo></mrow></mtd><mtd
   columnalign="left"><mrow><mo>=</mo><munder><mo>∑</mo><mi>i</mi></munder
   ><mn mathvariant="double-struck">1</mn><mrow><mo
   stretchy="false">(</mo><msub><mi>c</mi><mi>i</mi></msub><mo>=</mo><mn>1
   </mn><mo stretchy="false">)</mo></mrow><mrow><mo
   stretchy="false">{</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo>log</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>1</mn></mrow
   ></msub><mo
   stretchy="false">)</mo></mrow><mo>-</mo><msub><mi>θ</mi><mrow><mi>g</mi
   ><mn>1</mn></mrow></msub><mo
   stretchy="false">}</mo></mrow><mo>+</mo><munder><mo>∑</mo><mi>i</mi></m
   under><mn mathvariant="double-struck">1</mn><mrow><mo
   stretchy="false">(</mo><msub><mi>c</mi><mi>i</mi></msub><mo>=</mo><mn>0
   </mn><mo stretchy="false">)</mo></mrow><mrow><mo
   stretchy="false">{</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo>log</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>0</mn></mrow
   ></msub><mo
   stretchy="false">)</mo></mrow><mo>-</mo><msub><mi>θ</mi><mrow><mi>g</mi
   ><mn>0</mn></mrow></msub><mo
   stretchy="false">}</mo></mrow></mrow></mtd></mtr><mtr><mtd
   columnalign="right"><mrow></mrow></mtd><mtd
   columnalign="left"><mrow><mspace width="3.33333pt"></mspace><mspace
   width="3.33333pt"></mspace><mspace
   width="3.33333pt"></mspace><mo>+</mo><munder><mo>∑</mo><mi>i</mi></mund
   er><msub><mi>c</mi><mi>i</mi></msub><mo>log</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>q</mi><mi>i</mi></msub><mo
   stretchy="false">)</mo></mrow><mo>+</mo><munder><mo>∑</mo><mi>i</mi></m
   under><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo>-</mo><msub><mi>c</mi><mi>i</mi></
   msub><mo stretchy="false">)</mo></mrow><mo>log</mo><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo>-</mo><msub><mi>q</mi><mi>i</mi></
   msub><mo
   stretchy="false">)</mo></mrow><mo>-</mo><munder><mo>∑</mo><mi>i</mi></m
   under><mfrac><msup><mrow><mo
   stretchy="false">(</mo><msub><mi>q</mi><mi>i</mi></msub><mo>-</mo><msub
   ><mi>q</mi><mn>0</mn></msub><mo
   stretchy="false">)</mo></mrow><mn>2</mn></msup><mrow><mn>2</mn><msup><m
   i>λ</mi><mn>2</mn></msup></mrow></mfrac><mo>.</mo></mrow></mtd></mtr></
   mtable></mrow> :MATH]

   Directly solving the likelihood ([131]1) may not be feasible, and thus
   we use the EM algorithm by treating cell status
   [MATH: <mrow><mo
   stretchy="false">{</mo><msub><mi>c</mi><mi>i</mi></msub><mo
   stretchy="false">}</mo></mrow> :MATH]
   as the missing variables. The objective function of the EM algorithm is
   [MATH: <mrow><mtable><mtr><mtd columnalign="right"><mrow><mi>Q</mi><mo
   stretchy="false">(</mo><mrow><msub><mi
   mathvariant="bold-italic">θ</mi><mn
   mathvariant="bold">1</mn></msub></mrow><mo>,</mo><mrow><msub><mi
   mathvariant="bold-italic">θ</mi><mn
   mathvariant="bold">0</mn></msub></mrow><mo>,</mo><mrow><mi
   mathvariant="bold-italic">c</mi></mrow><mo>,</mo><mrow><mi
   mathvariant="bold-italic">q</mi></mrow><mo
   stretchy="false">)</mo></mrow></mtd><mtd
   columnalign="left"><mrow><mo>=</mo><munder><mo>∑</mo><mi>i</mi></munder
   ><msub><mi>p</mi><mi>i</mi></msub><mrow><mo
   stretchy="false">{</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo>log</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>1</mn></mrow
   ></msub><mo
   stretchy="false">)</mo></mrow><mo>-</mo><msub><mi>θ</mi><mrow><mi>g</mi
   ><mn>1</mn></mrow></msub><mo
   stretchy="false">}</mo></mrow><mo>+</mo><munder><mo>∑</mo><mi>i</mi></m
   under><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo>-</mo><msub><mi>p</mi><mi>i</mi></
   msub><mo stretchy="false">)</mo></mrow><mrow><mo
   stretchy="false">{</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo>log</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>0</mn></mrow
   ></msub><mo
   stretchy="false">)</mo></mrow><mo>-</mo><msub><mi>θ</mi><mrow><mi>g</mi
   ><mn>0</mn></mrow></msub><mo
   stretchy="false">}</mo></mrow><mo>+</mo><munder><mo>∑</mo><mi>i</mi></m
   under><msub><mi>c</mi><mi>i</mi></msub><mo>log</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>q</mi><mi>i</mi></msub><mo
   stretchy="false">)</mo></mrow></mrow></mtd></mtr><mtr><mtd
   columnalign="right"><mrow></mrow></mtd><mtd
   columnalign="left"><mrow><mspace width="3.33333pt"></mspace><mspace
   width="3.33333pt"></mspace><mspace
   width="3.33333pt"></mspace><mo>+</mo><munder><mo>∑</mo><mi>i</mi></mund
   er><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo>-</mo><msub><mi>c</mi><mi>i</mi></
   msub><mo stretchy="false">)</mo></mrow><mo>log</mo><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo>-</mo><msub><mi>q</mi><mi>i</mi></
   msub><mo
   stretchy="false">)</mo></mrow><mo>-</mo><munder><mo>∑</mo><mi>i</mi></m
   under><mfrac><msup><mrow><mo
   stretchy="false">(</mo><msub><mi>q</mi><mi>i</mi></msub><mo>-</mo><msub
   ><mi>q</mi><mn>0</mn></msub><mo
   stretchy="false">)</mo></mrow><mn>2</mn></msup><mrow><mn>2</mn><msup><m
   i>λ</mi><mn>2</mn></msup></mrow></mfrac><mo>.</mo></mrow></mtd></mtr></
   mtable></mrow> :MATH]
   2

   In the M step of the t-th iteration, we solve for
   [MATH: <msubsup><mi>θ</mi><mrow><mi>g</mi><mn>0</mn></mrow><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   ,
   [MATH: <msubsup><mi>θ</mi><mrow><mi>g</mi><mn>1</mn></mrow><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   , and
   [MATH: <msubsup><mi>q</mi><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   as follows:
   [MATH: <mrow><mtable><mtr><mtd
   columnalign="right"><msubsup><mi>θ</mi><mrow><mi>g</mi><mn>1</mn></mrow
   ><mrow><mo stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mtd><mtd
   columnalign="left"><mrow><mo>=</mo><mfrac><mrow><msub><mo>∑</mo><mi>i</
   mi></msub><msubsup><mi>q</mi><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo>-</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msubsup><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub></mrow><mrow><msub><mo>∑</mo>
   <mi>i</mi></msub><msubsup><mi>q</mi><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo>-</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msubsup></mrow></mfrac><mo>,</mo></mrow
   ></mtd></mtr><mtr><mtd
   columnalign="right"><mrow><mrow></mrow><msubsup><mi>θ</mi><mrow><mi>g</
   mi><mn>0</mn></mrow><mrow><mo stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow></mtd><mtd
   columnalign="left"><mrow><mo>=</mo><mfrac><mrow><msub><mo>∑</mo><mi>i</
   mi></msub><mfenced close=")"
   open="("><mn>1</mn><mo>-</mo><msubsup><mi>q</mi><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo>-</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msubsup></mfenced><msub><mi>y</mi><mrow
   ><mi
   mathvariant="italic">gi</mi></mrow></msub></mrow><mrow><mi>N</mi><mo>-<
   /mo><msub><mo>∑</mo><mi>i</mi></msub><msubsup><mi>q</mi><mi>i</mi><mrow
   ><mo stretchy="false">(</mo><mi>t</mi><mo>-</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msubsup></mrow></mfrac><mo>,</mo></mrow
   ></mtd></mtr><mtr><mtd
   columnalign="right"><mrow><mrow></mrow><msubsup><mi>q</mi><mi>i</mi><mr
   ow><mo stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow></mtd><mtd
   columnalign="left"><mrow><mo>=</mo><mo>arg</mo><munder><mo
   movablelimits="true">max</mo><mrow><msub><mi>q</mi><mi>i</mi></msub><ms
   pace width="3.33333pt"></mspace><mspace
   width="3.33333pt"></mspace><mspace
   width="3.33333pt"></mspace></mrow></munder><mrow><mo maxsize="2.470em"
   minsize="2.470em" stretchy="true">[</mo></mrow><msup><mfenced close=")"
   open="("><msubsup><mi>q</mi><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo>-</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msubsup></mfenced><mrow><msub><mo>∑</mo
   ><mi>i</mi></msub><msub><mi>c</mi><mi>i</mi></msub></mrow></msup><msup>
   <mfenced close=")"
   open="("><mn>1</mn><mo>-</mo><msubsup><mi>q</mi><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo>-</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msubsup></mfenced><mrow><mi>N</mi><mo>-
   </mo><msub><mo>∑</mo><mi>i</mi></msub><msub><mi>c</mi><mi>i</mi></msub>
   </mrow></msup><mo>exp</mo><mfenced close=")"
   open="("><mo>-</mo><mfrac><mrow><mi>n</mi><msup><mrow><mo
   stretchy="false">(</mo><msubsup><mi>q</mi><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo>-</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msubsup><mo>-</mo><msub><mi>q</mi><mn>0
   </mn></msub><mo
   stretchy="false">)</mo></mrow><mn>2</mn></msup></mrow><mrow><mn>2</mn><
   msup><mi>λ</mi><mn>2</mn></msup></mrow></mfrac></mfenced><mrow><mo
   maxsize="2.470em" minsize="2.470em"
   stretchy="true">]</mo></mrow><mo>.</mo></mrow></mtd></mtr></mtable></mr
   ow> :MATH]

   In the E step, we estimate the conditional expectation of the cell
   status by
   [MATH: <mrow><mtable><mtr><mtd
   columnalign="right"><mrow><msubsup><mi>p</mi><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>c</mi><mi>i</mi></msub><mo>=</mo><mn>1
   </mn><mo stretchy="false">|</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo
   stretchy="false">)</mo></mrow></mrow></mtd><mtd
   columnalign="left"><mrow><mo>=</mo><mfrac><mrow><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo
   stretchy="false">|</mo><msub><mi>c</mi><mi>i</mi></msub><mo>=</mo><mn>1
   </mn><mo stretchy="false">)</mo></mrow><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>c</mi><mi>i</mi></msub><mo>=</mo><mn>1
   </mn><mo stretchy="false">)</mo></mrow></mrow><mrow><mo>Pr</mo><mo
   stretchy="false">(</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo>,</mo><msub><mi>c</mi><mi
   >i</mi></msub><mo
   stretchy="false">)</mo></mrow></mfrac></mrow></mtd></mtr><mtr><mtd
   columnalign="right"><mrow></mrow></mtd><mtd
   columnalign="left"><mrow><mo>=</mo><mfrac><mrow><msub><mo>∏</mo><mi>g</
   mi></msub><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo
   stretchy="false">|</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>1</mn></mrow
   ></msub><mo
   stretchy="false">)</mo></mrow><mo>·</mo><msubsup><mi>q</mi><mi>i</mi><m
   row><mo stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup></mrow><mrow><msub><mo>∏</mo><m
   i>g</mi></msub><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo
   stretchy="false">|</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>0</mn></mrow
   ></msub><mo
   stretchy="false">)</mo></mrow><mo>·</mo><msubsup><mi>q</mi><mi>i</mi><m
   row><mo stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>+</mo><msub><mo>∏</mo><mi>g
   </mi></msub><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo
   stretchy="false">|</mo><msub><mi>θ</mi><mrow><mi>g</mi><mn>0</mn></mrow
   ></msub><mo stretchy="false">)</mo></mrow><mo>·</mo><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo>-</mo><msubsup><mi>q</mi><mi>i</mi
   ><mrow><mo stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo
   stretchy="false">)</mo></mrow></mrow></mfrac><mo>.</mo></mrow></mtd></m
   tr></mtable></mrow> :MATH]

   The algorithm repeats the M step and E step until convergence. The
   convergence criteria is defined as the absolute difference between
   [MATH: <mrow><msup><mrow><mi
   mathvariant="bold-italic">p</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msup><mo>=</mo><mrow><mo
   stretchy="false">(</mo><msubsup><mi>p</mi><mn>1</mn><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>,</mo><mo>⋯</mo><mo>,</mo><
   msubsup><mi>p</mi><mi>N</mi><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   and
   [MATH: <msup><mrow><mi mathvariant="bold-italic">p</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>t</mi><mo>-</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msup> :MATH]
   smaller than
   [MATH: <msup><mn>10</mn><mrow><mo>-</mo><mn>5</mn></mrow></msup> :MATH]
   . The core of partCNV is the described EM algorithm and the output is
   the inferred cell status. With the estimated probability
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold-italic">p</mi></mrow><mo
   stretchy="true">^</mo></mover> :MATH]
   , the cells with
   [MATH: <mrow><msub><mover accent="true"><mi>p</mi><mo
   stretchy="true">^</mo></mover><mi>i</mi></msub><mo>≥</mo><mn>0.5</mn></
   mrow> :MATH]
   are assigned as aneuploid cells and the rest as diploid cells.

Details of partCNVH

   Limited by the technology, cytogenetic data only provide crude
   information about the chromosomal deletion or amplification. When we
   include all the genes from the regions of interest, it is likely that
   not all of the genes have chromosomal changes in the aneuploid cells.
   It is helpful to select the genes that demonstrate chromosomal changing
   patterns. Motivated by this idea, we design partCNVH, the combination
   of partCNV and HMM for improving classification performance. Denote the
   underlying status for the observed G genes by
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">Z</mi></mrow><mo>=</mo><mo
   stretchy="false">(</mo><msub><mi>z</mi><mn>1</mn></msub><mo>,</mo><mo>⋯
   </mo><mo>,</mo><msub><mi>z</mi><mi>G</mi></msub><mo
   stretchy="false">)</mo></mrow> :MATH]
   . The first step of partCNVH is to apply partCNV on the scRNA-seq data
   from the region of interest. Denote the obtained cell status from
   partCNV by
   [MATH: <mrow><mo stretchy="false">{</mo><msub><mover
   accent="true"><mi>c</mi><mo
   stretchy="true">^</mo></mover><mi>i</mi></msub><mo
   stretchy="false">}</mo></mrow> :MATH]
   . Then for each gene g, we compute the mean expression of this gene for
   the two groups, i.e.,
   [MATH: <mrow><msubsup><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><munder><mo>∑</mo><mr
   ow><mi>j</mi><mo>∈</mo><mo
   stretchy="false">{</mo><mi>i</mi><mo>:</mo><msub><mi>c</mi><mi>i</mi></
   msub><mo>=</mo><mn>1</mn><mo
   stretchy="false">}</mo></mrow></munder><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mspace
   width="3.33333pt"></mspace><mspace
   width="3.33333pt"></mspace><mtext>and</mtext><mspace
   width="3.33333pt"></mspace><mspace
   width="3.33333pt"></mspace><msubsup><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mo
   stretchy="false">(</mo><mn>0</mn><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><munder><mo>∑</mo><mr
   ow><mi>j</mi><mo>∈</mo><mo
   stretchy="false">{</mo><mi>i</mi><mo>:</mo><msub><mi>c</mi><mi>i</mi></
   msub><mo>=</mo><mn>0</mn><mo
   stretchy="false">}</mo></mrow></munder><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo>.</mo></mrow> :MATH]

   Based on
   [MATH: <msubsup><mover accent="true"><mrow><mi>y</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   and
   [MATH: <msubsup><mover accent="true"><mrow><mi>y</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mo
   stretchy="false">(</mo><mn>0</mn><mo
   stretchy="false">)</mo></mrow></msubsup> :MATH]
   , we obtain the mean expression ratio for all the genes by
   [MATH: <mrow><msub><mi>r</mi><mi>g</mi></msub><mo>=</mo><msubsup><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mo
   stretchy="false">(</mo><mn>0</mn><mo
   stretchy="false">)</mo></mrow></msubsup><mo
   stretchy="false">/</mo><msubsup><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msubsup></mrow> :MATH]
   if the region has deletion and
   [MATH: <mrow><msub><mi>r</mi><mi>g</mi></msub><mo>=</mo><msubsup><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mo
   stretchy="false">(</mo><mn>1</mn><mo
   stretchy="false">)</mo></mrow></msubsup><mo
   stretchy="false">/</mo><msubsup><mover
   accent="true"><mrow><mi>y</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mo
   stretchy="false">(</mo><mn>0</mn><mo
   stretchy="false">)</mo></mrow></msubsup></mrow> :MATH]
   if the region has amplification. As shown in Figure [132]3, the signals
   from the mean expression ratio are weak, and thus we apply a rolling
   average on the mean expression ratios to strengthen the signals. Denote
   the window size for the rolling average by K, and the rolling average
   at gene g becomes
   [MATH: <mrow><msubsup><mover
   accent="true"><mrow><mi>r</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mi
   mathvariant="italic">rolling</mi></mrow></msubsup><mo>=</mo><mfrac><mn>
   1</mn><mi>K</mi></mfrac><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo
   ><mi>g</mi><mo>-</mo><mo stretchy="false">[</mo><mi>K</mi><mo
   stretchy="false">/</mo><mn>2</mn><mo
   stretchy="false">]</mo></mrow><mrow><mi>g</mi><mo>+</mo><mo
   stretchy="false">[</mo><mi>K</mi><mo
   stretchy="false">/</mo><mn>2</mn><mo
   stretchy="false">]</mo></mrow></munderover><msub><mi>r</mi><mi>i</mi></
   msub><mo>.</mo></mrow> :MATH]

   In R, we implement the rolling average computation by function
   frollmean from the data.table package. We then build HMM using the
   rolling average of the genes. Denote the underlying state of gene g by
   [MATH: <msub><mi>z</mi><mi>g</mi></msub> :MATH]
   , where
   [MATH:
   <mrow><msub><mi>z</mi><mi>g</mi></msub><mo>=</mo><mn>1</mn></mrow>
   :MATH]
   is gene g with deletion or amplification in the aneuploid cells, and
   [MATH:
   <mrow><msub><mi>z</mi><mi>g</mi></msub><mo>=</mo><mn>0</mn></mrow>
   :MATH]
   otherwise. HMM aims to infer the hidden status
   [MATH: <mrow><mrow><mi
   mathvariant="bold-italic">Z</mi></mrow><mo>=</mo><mo
   stretchy="false">{</mo><msub><mi>z</mi><mi>g</mi></msub><mo
   stretchy="false">}</mo></mrow> :MATH]
   through the observed sequence
   [MATH: <mrow><msup><mover accent="true"><mrow><mrow><mi
   mathvariant="bold-italic">R</mi></mrow></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi
   mathvariant="italic">rolling</mi></mrow></msup><mo>=</mo><mrow><mo
   stretchy="false">(</mo><msubsup><mover
   accent="true"><mrow><mi>r</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mi
   mathvariant="italic">rolling</mi></mrow></msubsup><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   by solving the likelihood
   [MATH: <mrow><mo>Pr</mo><mrow><mo stretchy="false">(</mo><msup><mover
   accent="true"><mrow><mrow><mi
   mathvariant="bold-italic">R</mi></mrow></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi
   mathvariant="italic">rolling</mi></mrow></msup><mo>,</mo><mrow><mi
   mathvariant="bold-italic">Z</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><msup><mover accent="true"><mrow><mrow><mi
   mathvariant="bold-italic">R</mi></mrow></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi
   mathvariant="italic">rolling</mi></mrow></msup><mo
   stretchy="false">|</mo><mrow><mi
   mathvariant="bold-italic">Z</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>×</mo><mo>Pr</mo><mrow><mo
   stretchy="false">(</mo><mrow><mi
   mathvariant="bold-italic">Z</mi></mrow><mo
   stretchy="false">)</mo></mrow><mo>=</mo><munderover><mo>∏</mo><mrow><mi
   >g</mi><mo>=</mo><mn>1</mn></mrow><mi>G</mi></munderover><mo>Pr</mo><mr
   ow><mo stretchy="false">(</mo><msubsup><mover
   accent="true"><mrow><mi>r</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mi
   mathvariant="italic">rolling</mi></mrow></msubsup><mo
   stretchy="false">|</mo><msub><mi>z</mi><mi>g</mi></msub><mo
   stretchy="false">)</mo></mrow><mo>×</mo><munderover><mo>∏</mo><mrow><mi
   >g</mi><mo>=</mo><mn>1</mn></mrow><mi>G</mi></munderover><mo>Pr</mo><mr
   ow><mo stretchy="false">(</mo><msub><mi>z</mi><mi>g</mi></msub><mo
   stretchy="false">|</mo><msub><mi>z</mi><mrow><mi>g</mi><mo>-</mo><mn>1<
   /mn></mrow></msub><mo stretchy="false">)</mo></mrow><mo>.</mo></mrow>
   :MATH]

   We solve HMM using the depmix function from the depmixS4 package in R
   by specifying the initial transition matrix between the two states
   (“genes in aneuploid region” or “A”, “genes in diploid region” or “D”)
   as
   [MATH: <mrow><mfenced close=")"
   open="("><mrow><mtable><mtr><mtd><mrow><mi>A</mi><mo
   stretchy="false">→</mo><mi>A</mi></mrow></mtd><mtd><mrow><mi>A</mi><mo
   stretchy="false">→</mo><mi>D</mi></mrow></mtd></mtr><mtr><mtd><mrow><mr
   ow></mrow><mi>D</mi><mo
   stretchy="false">→</mo><mi>A</mi></mrow></mtd><mtd><mrow><mi>D</mi><mo
   stretchy="false">→</mo><mi>D</mi></mrow></mtd></mtr></mtable></mrow></m
   fenced><mo>=</mo><mfenced close=")"
   open="("><mrow><mtable><mtr><mtd><mrow><mn>0.9</mn></mrow></mtd><mtd><m
   row><mn>0.1</mn></mrow></mtd></mtr><mtr><mtd><mrow><mrow></mrow><mn>0.1
   </mn></mrow></mtd><mtd><mrow><mn>0.9</mn></mrow></mtd></mtr></mtable></
   mrow></mfenced><mo>.</mo></mrow> :MATH]

   The initial states of the genes are decided as
   [MATH:
   <mrow><msub><mi>z</mi><mi>g</mi></msub><mo>=</mo><mn>1</mn></mrow>
   :MATH]
   if
   [MATH: <mrow><msubsup><mover
   accent="true"><mrow><mi>r</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mi
   mathvariant="italic">rolling</mi></mrow></msubsup><mo>≥</mo><mi>m</mi><
   mi>e</mi><mi>d</mi><mi>i</mi><mi>a</mi><mi>n</mi><mrow><mo
   stretchy="false">{</mo><msubsup><mover
   accent="true"><mrow><mi>r</mi></mrow><mrow><mo
   stretchy="false">¯</mo></mrow></mover><mrow><mi>g</mi></mrow><mrow><mi
   mathvariant="italic">rolling</mi></mrow></msubsup><mo>,</mo><mi>g</mi><
   mo>=</mo><mn>1</mn><mo>,</mo><mo>⋯</mo><mo>,</mo><mi>G</mi><mo
   stretchy="false">}</mo></mrow></mrow> :MATH]
   , and
   [MATH:
   <mrow><msub><mi>z</mi><mi>g</mi></msub><mo>=</mo><mn>0</mn></mrow>
   :MATH]
   otherwise. After the hidden states are inferred by HMM, we identify the
   states corresponding to the “genes in diploid region” by comparing the
   mean expression ratio of the two states, and the state with larger mean
   expression ratio is labeled as state “D”. Denote the expression count
   matrix for the selected genes by
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mo
   stretchy="true">~</mo></mover> :MATH]
   . We apply the EM algorithm described in partCNV to
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold-italic">Y</mi></mrow><mo
   stretchy="true">~</mo></mover> :MATH]
   and infer the final cell states.

Simulation designs

   To mimic the real data analysis, we generate the simulation datasets
   based on existing scRNA-seq data from patients with triple negative
   breast cancer (TNBC). The data were downloaded from the Gene Expression
   Omnibus (GEO) with accession number [133]GSE148673. We compared the
   data characteristics between the TNBC dataset and the scRNA-seq data
   from our MDS patients. In Supplementary Figure S10, we presented the
   mean expression levels of genes and cells, as well as the dropout rate
   for the two datasets. We found that the data characteristics are
   similar between the two data sources. As a data processing step, genes
   with zero expression in all cells in the TNBC data are removed. We keep
   all of the normal cells based on the cell type annotation from the
   original study^[134]15. As our method focuses on the region with known
   deletion or amplification from the cytogenetic data and the region
   often covers tens or hundreds of genes, we generate the expression
   count for
   [MATH: <msub><mi>n</mi><mn>0</mn></msub> :MATH]
   normal cells and
   [MATH: <msub><mi>n</mi><mn>1</mn></msub> :MATH]
   aneuploid cells with a total of G genes. For each iteration, we
   randomly draw the expression of G genes from the normal cells of the
   TNBC dataset. We compute the mean expression of these G genes and
   denote it by
   [MATH: <mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">ξ</mi></mrow><mo
   stretchy="true">^</mo></mover><mo>=</mo><mrow><mo
   stretchy="false">(</mo><msub><mi>ξ</mi><mn>1</mn></msub><mo>,</mo><mo>⋯
   </mo><mo>,</mo><msub><mi>ξ</mi><mi>G</mi></msub><mo
   stretchy="false">)</mo></mrow></mrow> :MATH]
   . For normal cells, the expression is generated from Poisson
   [MATH: <mrow><mo stretchy="false">(</mo><mover
   accent="true"><msub><mi>ξ</mi><mi>g</mi></msub><mo
   stretchy="true">^</mo></mover><mo stretchy="false">)</mo></mrow> :MATH]
   for
   [MATH:
   <mrow><mi>g</mi><mo>=</mo><mn>1</mn><mo>,</mo><mo>⋯</mo><mo>,</mo><mi>G
   </mi></mrow> :MATH]
   . For aneuploid cells, assume
   [MATH: <msub><mi>G</mi><mn>1</mn></msub> :MATH]
   genes are located at the deleted or amplified regions and
   [MATH:
   <mrow><msub><mi>G</mi><mn>0</mn></msub><mo>=</mo><mi>G</mi><mo>-</mo><m
   sub><mi>G</mi><mn>1</mn></msub></mrow> :MATH]
   are from the normal regions.

   We assume half of the
   [MATH: <msub><mi>G</mi><mn>0</mn></msub> :MATH]
   genes are in the regions that are left-adjacent to a copy-number
   alteration and the other half are in regions right-adjacent that have
   normal expression in all cells. This partial chromosomal variation
   often happens in practice as cytogenetic data only provide a crude
   observation of the changed regions. Without loss of generality, we
   assume the beginning
   [MATH: <msub><mi>G</mi><mn>1</mn></msub> :MATH]
   genes are from the aneuploid region. For a gene g from this region for
   an aneuploid cell i, we generate the expression from
   [MATH: <mrow><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo>∼</mo><mtext>Poisson</mte
   xt><mrow><mo
   stretchy="false">{</mo><msub><mi>λ</mi><mi>g</mi></msub><mo>·</mo><mrow
   ><mo stretchy="false">(</mo><mi>r</mi><mo>+</mo><mi>η</mi><mo
   stretchy="false">)</mo></mrow><mo stretchy="false">}</mo></mrow><mspace
   width="3.33333pt"></mspace><mtext>and the noise</mtext><mspace
   width="0.333333em"></mspace><mspace
   width="3.33333pt"></mspace><mi>η</mi><mo>∼</mo><mtext>Uniform</mtext><m
   row><mo stretchy="false">(</mo><mn>0</mn><mo>,</mo><mi>τ</mi><mo
   stretchy="false">)</mo></mrow><mo>,</mo></mrow> :MATH]

   where r is the ratio controlling the impact level of chromosomal
   deletion or amplification on the expression and
   [MATH: <mi>η</mi> :MATH]
   is the noise, in the first setting with deletion in aneuploid cells. r
   takes value 0.5 or 0.6 in this setting (Simulation data 1). In the
   second setting, we consider amplification in the aneuploid cells. The
   expression is generated from
   [MATH: <mrow><msub><mi>y</mi><mrow><mi
   mathvariant="italic">gi</mi></mrow></msub><mo>∼</mo><mtext>Poisson</mte
   xt><mrow><mo
   stretchy="false">{</mo><msub><mi>λ</mi><mi>g</mi></msub><mo>·</mo><mrow
   ><mo stretchy="false">(</mo><mi>r</mi><mo>-</mo><mi>η</mi><mo
   stretchy="false">)</mo></mrow><mo stretchy="false">}</mo></mrow><mspace
   width="3.33333pt"></mspace><mtext>and the noise</mtext><mspace
   width="0.333333em"></mspace><mspace
   width="3.33333pt"></mspace><mi>η</mi><mo>∼</mo><mtext>Uniform</mtext><m
   row><mo stretchy="false">(</mo><mn>0</mn><mo>,</mo><mi>τ</mi><mo
   stretchy="false">)</mo></mrow><mo>,</mo></mrow> :MATH]

   and r takes value 1.5 or 1.4. (Simulation data 2) The noise parameter
   [MATH: <mi>η</mi> :MATH]
   controls the heterogeneity of the impacts among all the interested
   genes, mimicking the fact that gene expressions are heterogeneous when
   the genes are located in deleted or amplified regions. Through our
   experiments, we specify
   [MATH:
   <mrow><mi>η</mi><mo>=</mo><mn>0.1</mn><mo>,</mo><mn>0.2</mn></mrow>
   :MATH]
   , and 0.3 for low, medium, and high noise levels, respectively.

   To understand the impact of different sample sizes, we consider the
   combination of
   [MATH:
   <mrow><msub><mi>n</mi><mn>1</mn></msub><mo>=</mo><mn>500</mn></mrow>
   :MATH]
   and
   [MATH:
   <mrow><msub><mi>n</mi><mn>0</mn></msub><mo>=</mo><mn>2500</mn></mrow>
   :MATH]
   in one scenario and
   [MATH:
   <mrow><msub><mi>n</mi><mn>1</mn></msub><mo>=</mo><mn>300</mn></mrow>
   :MATH]
   and
   [MATH:
   <mrow><msub><mi>n</mi><mn>0</mn></msub><mo>=</mo><mn>1000</mn></mrow>
   :MATH]
   in the other (Simulation data 3). We also evaluate the impact of prior
   specification in all the scenarios. In the first scenario, the true
   prior proportion of the aneuploid cell is
   [MATH: <mrow><mn>17</mn><mo>%</mo></mrow> :MATH]
   and we evaluate the method with both
   [MATH: <mrow><mn>17</mn><mo>%</mo></mrow> :MATH]
   and
   [MATH: <mrow><mn>50</mn><mo>%</mo></mrow> :MATH]
   . In the second scenario, the true proportion is
   [MATH: <mrow><mn>23</mn><mo>%</mo></mrow> :MATH]
   and we also compare the results with both
   [MATH: <mrow><mn>23</mn><mo>%</mo></mrow> :MATH]
   and
   [MATH: <mrow><mn>50</mn><mo>%</mo></mrow> :MATH]
   (Simulation data 4). To compare the proposed methods versus existing
   ones that only perform on whole genome data, we randomly sample 2000
   normal cells from an existing scRNA-seq dataset (Simulation data 5). We
   then replace the expression of the genes located at the region
   chromosome 20 q11.1 to q13.1 using the simulated data as described
   above to mimic situations that 400 out of 2000 (20%) cells are locally
   aneuploid. The gene expressions in the region of interest are generated
   in the similar way as Simulation data 1. We then apply inferCNV and
   copyKAT using the suggested arguments as suggested by the original
   methods. All the simulation results are summarized over 100 Monte Carlo
   datasets.

Real data analysis

   The data for the AML patient were downloaded from European
   Genome-phenome Archive (EGA) with accession EGAD00001007672^[135]44.
   The data from sample 7A were used for the analysis due to the available
   cytogenetic information. The raw data for the two MDS patients are
   currently not publicly available due to confidentiality regulations.

Data preprocessing

   The raw sequencing data were preprocessed (demultiplexed cellular
   barcodes, read alignment, and generation of gene count matrix) using
   Cell Ranger Single Cell Software Suite (version 3.0,
   https://support.10xgenomics.com/single-cell-gene-expression/software/pi
   pelines/latest/what-is-cell-ranger) provided by 10X Genomics using
   Human Genome GRCh38. Genes were detected in
   [MATH: <mrow><mo><</mo><mn>0.1</mn><mo>%</mo></mrow> :MATH]
   of total sequenced cells and cells where
   [MATH: <mrow><mo><</mo><mn>200</mn></mrow> :MATH]
   genes had nonzero counts were filtered out and not included in the
   analysis. Low quality cells where
   [MATH: <mrow><mo>></mo><mn>20</mn><mo>%</mo></mrow> :MATH]
   of the counts were derived from the mitochondrial genome were also
   discarded. The doublet cell status was inferred using
   DoubletFinder^[136]53 and only the singlet cells were kept for further
   analysis. Data were normalized using the total sequencing count per
   cell to adjust for differences in sequencing depth. The chromosome
   database associated with cytogenetic location is downloaded via the
   UCSC genome website.

Functional over-representation analysis

   In the real data analysis section, after the aneuploid/diploid status
   has been inferred by the proposed or existing method, we perform cell
   type specific differential analysis using “MAST”^[137]32 (available in
   Seurat package^[138]2) to compare the diploid versus aneuploid cells.
   The genes with an adjusted p value smaller than 0.05 are included as
   the DEGs. We then perform functional over-representation analysis using
   the MSigDB platform^[139]33
   (http://www.gsea-msigdb.org/gsea/msigdb/annotate.jsp) with the Reactome
   pathway^[140]39, GO Biological Process^[141]54, and Hallmark^[142]40
   databases. We present the top 10 pathways or biological process terms
   regardless of the significance level.

Implementation of existing methods

   All of the existing methods take the normalized expression counts for
   the genes located in the region of interest as input. K-means and
   hierarchical clustering are implemented using the functions kmeans and
   hclust from the stat package in R, respectively. The PCA plus
   Louvain^[143]55 and Leiden methods^[144]29 are implemented using the
   Seurat package in function FindClusters with arguments algorithm
   [MATH: <mrow><mo>=</mo><mn>1</mn></mrow> :MATH]
   and 4. Since the Seurat clustering function does not allow
   specification of cluster numbers, we design a loop with a precision
   parameter ranging from 0.001 to 0.5 with distance 0.005 until the
   algorithm identifies exactly two clusters. All the analyses and
   plotting are performed in R v4.0.3.

Supplementary Information

   [145]Supplementary Information.^ (20.7MB, pdf)

Acknowledgements