Abstract Single-cell RNA sequencing is a powerful tool to investigate the cellular makeup of tumor samples. However, due to the sparse data and the complex tumor microenvironment, it can be challenging to identify neoplastic cells that play important roles in tumor growth and disease progression. This is especially relevant for blood cancers, where neoplastic cells may be highly similar to normal cells. To address this challenge, we have developed partCNV and partCNVH, two methods for rapid and accurate detection of aneuploid cells with local copy number deletion or amplification. PartCNV uses an expectation-maximization (EM) algorithm with mixtures of Poisson distributions and incorporates cytogenetic information to guide the classification. PartCNVH further improves partCNV by integrating a hidden Markov model for feature selection. We have thoroughly evaluated the performance of partCNV and partCNVH through simulation studies and real data analysis using three scRNA-seq datasets from blood cancer patients. Our results show that partCNV and partCNVH have favorable accuracy and provide more interpretable results compared to existing methods. In the real data analysis, we have identified multiple biological processes involved in the oncogenesis of myelodysplastic syndromes and acute myeloid leukemia. Subject terms: Cancer, Computational biology and bioinformatics Introduction Single-cell RNA sequencing (scRNA-seq) has greatly improved our ability to understand the cellular composition of the tissues and organs of interest, identify phenotype-associated cell groups, and elucidate the mechanisms behind many biological processes^[34]1–[35]3. These advantages make scRNA-seq a powerful tool for studying a wide range of human diseases, including Alzheimer’s disease^[36]4, cardiovascular disease^[37]5, and cancer^[38]6. In cancer research, a crucial step of scRNA-seq data analysis is to delineate tumor cells or neoplastic cells from other cell types^[39]7,[40]8. The tumors of each patient have their unique tumoral and microenvironmental evolution, and thus the scRNA-seq data from cancer patients tend to be more heterogeneous. Such heterogeneity is an exciting opportunity for improving our understanding of cancer with scRNA-seq, but it also imposes computational challenges to dissect composing cell types^[41]9. Neoplastic cells are abnormal cells that are undergoing excessive and uncontrolled proliferation^[42]10. These cells, which may or may not be malignant, can be extracted experimentally through cell sorting, although this is not always possible due to a lack of suitable markers or the high cost and labor requirements associated with these experiments^[43]11. In fact, it may be more useful to study all cells from a sequencing experiment simultaneously in order to understand the characteristics of neoplastic cells within their surrounding microenvironment. Neoplastic cells in certain types of cancer often have distinct features compared to non-neoplastic cells, such as high expression of certain cell type markers or genes belonging to some oncogenic pathways^[44]12. However, identifying neoplastic cells based on these markers or pathways can be difficult due to inter-individual heterogeneity, technical artifacts, and noise from the tumor microenvironment^[45]13. Recently, computational methods have been developed to identify large-scale copy number variations (CNVs) by comparing the smoothed scRNA-seq data against an internal or external normal reference, such as inferCNV, HoneyBADGER, and copyKAT^[46]3,[47]14,[48]15. For example, inferCNV is a popular visualization method for identifying large-scale CNVs. It uses smoothed averages over gene windows and compares the expression magnitude to the average over a set of reference ‘normal’ cells. CopyKat is a recently developed tool serving a similar purpose^[49]15. CopyKAT uses an integrative Bayesian segmentation approach combining CNV inference and hierarchical clustering, which has been shown to achieve high accuracy in distinguishing cancer cells from normal cells in multiple cancer types. Both inferCNV and CopyKAT generally work well with tumor cells that demonstrate extensive chromosomal alterations, but they do not work well for cancer types that have fewer and shorter CNVs. This is often the case in hematologic cancers such as myelodysplastic syndromes and acute myeloid leukemia^[50]16. Moreover, they couldn’t incorporate additional clinical information for detecting specific CNVs. There are also methods that sought to integrate scRNA-seq data with bulk DNA sequencing (DNA-seq) or single-cell DNA-seq (scDNA-seq) data, such as CONGAS, clonealign, and CCNMF^[51]17–[52]19. These methods serve a different purpose: to cluster cells based on CNV or mutation information. In this paper, we propose to exploit cytogenetic information to improve the sensitivity and specificity for CNV identification. Cytogenetic data are routinely measured and recorded for patients with hematologic cancers^[53]20 and they provide useful information to identify CNVs^[54]21. In the case of myelodysplastic syndromes, cytogenetic features, along with other factors such as morphology, immunophenotype, and clinical features, are included in the World Health Organization (WHO)-classification-based Prognostic Score System (WPSS) for myelodysplastic syndromes^[55]22 and its revised version^[56]23. Similar risk scoring systems also exist for other types of hematologic malignancies^[57]24,[58]25. For example, Leukemia patients with certain cytogenetic features, such as deletion of chromosome 7 or 7q, deletion of 3q, or amplification of chromosome 8, have been shown to have a poor prognosis^[59]26. Cytogenetic data provide location information of each CNV and the proportion of cells with specific CNVs based on the analysis of 20 metaphases. For example, if a patient has cytogenetic data as “46,XY,del(20)(q11.1q13.1)[5]/46,XY[15],” this means that approximately [MATH: 25% :MATH] percent of cells (5 out of 20) have a deletion of chromosome 20 in the region q11.1 to q13.1, while the rest of the cells have normal chromosomal features. Cytogenetic data are typically cheaper and more readily available in clinical settings compared to DNA-seq or scDNA-seq experiments. While the proportion in cytogenetic data is a crude estimate of the aberrant cells, they can still be useful in classifying cell status and identifying cells with chromosomal abnormalities, which may be markers for neoplastic cells^[60]27. None of the existing computational methods is able to incorporate such cytogenetic information in the analysis of scRNA-seq data. Here, we introduce two methods, partCNV and partCNVH, for identifying cells with regional chromosomal abnormalities from scRNA-seq data by integrating cytogenetic information. Both methods are based on a statistical framework that models the count expression matrix of scRNA-seq data using a mixture of Poisson distributions while incorporating the cytogenetic information through prior specification. PartCNVH is built on partCNV and it further includes a hidden Markov model (HMM) to improve feature selection and clustering accuracy. It should be noted that our proposal is complementary to the existing methods such as copyKAT and inferCNV, as they focus on identifying large-scale CNVs while we detect smaller variations with the incorporation of external information. We implement our proposed methods in a computationally efficient expectation-maximization (EM) algorithm^[61]28 and evaluate their performance through extensive simulation studies. We then apply them to three scRNA-seq data sets from patients with hematologic malignancies and show that they can identify cells with chromosomal deletions or amplifications in specific regions suggested by the cytogenetic data. We also perform additional analysis to understand the changes in the pathways and biologic processes in the identified aneuploid cells. Compared to existing methods, partCNV and partCNVH provide more interpretable results and additional findings. With the widespread use of single-cell technology in hematologic cancer research and clinical care of cancer patients, our methods offer a useful solution for fully leveraging cytogenetic data to identify cells with specific chromosomal abnormalities. Method overview PartCNV is a statistical framework that uses a hierarchical Poisson mixture model to differentiate two mixture components corresponding to normal and aberrant cells. PartCNVH is an extension of partCNV with the addition of HMM when there is a sufficient number of genes that allow feature selection. Figure [62]1 provides a schematic overview of the proposed methods. Our methods start with the normalized expression counts from the region with a known chromosomal deletion or amplification and explicitly incorporate the prior knowledge from cytogenetic data through imposing a Bernoulli prior on the cell status (i.e., normal or aberrant). We develop an EM algorithm that treats cell status as the missing variable and efficiently solves the mixture model. The inferred cell status from this step is the output of partCNV. Fig. 1. [63]Fig. 1 [64]Open in a new tab Schematic of PartCNV and PartCNVH. With the input of normalized expression counts from scRNA-seq experiments and the cytogenetic information from the patient, we develop an EM algorithm with mixtures of two Poisson distributions to infer the aneuploid/diploid status for the regions of interest. We further include a hidden Markov model to improve feature selection and the classification accuracy. Taking the ouptut from partCNV, partCNVH further refines it by a HMM. Specifically, a group average is taken for the inferred two groups of cells and the rolling average of the ratios between the two groups is used to infer the hidden status of the regions by a HMM. There are two reasons that we adopt this combination of rolling average and HMM in partCNVH. First, as shown in our later results, the group mean and rolling average can effectively magnify the signal of the regional deletion or amplification on the expression level. This is especially important when the signal of copy number alternations is weak related to noise in gene expression measurement. Second, it is possible that only a subset of the regions of interest has copy number changes. HMM can identify regions that are more likely to contain the chromosomal changes, which in turn improves the performance of aneuploid/diploid cell classification. After this HMM-based feature selection step, partCNVH performs a second round of the EM algorithm using the Poisson mixture model and reports the inferred cell status. Simulations We design comprehensive simulation settings to evaluate the performance of partCNV and partCNVH. As comparisons, we also consider existing methods using dimension reduction by principal component analysis (PCA) followed by Louvain or Leiden clustering. Previous literature has reported that the Leiden algorithm can generate better connected communities through including an extra refinement step and run faster than Louvain^[65]29. Additionally, we include two widely used machine learning clustering algorithms, K-means clustering and hierarchical clustering. All the previous mentioned methods can be applied to detect locally aneuploid cells. Although our proposed method is not directly comparable to existing methods that classify cells based on whole-genome CNV inference, we still design a separate simulation study to compare the proposed methods versus the two large-scale CNV detection-based methods, inferCNV and copyKAT^[66]14,[67]15. We consider two settings where the first one studies aneuploid cells with deletions and the second one studies amplifications: redSimulation data 1 and 2. The mean expression for these genes is generated by taking a ratio of the normal expression. This ratio (or log fold change) is randomly drawn from a Uniform distribution with different base levels (0.5/0.6 for Setting 1 and 1.5/1.4 in Setting 2) and different noise levels. A larger noise level makes the expression from the aneuploid cells similar to the normal cells, and thus creates harder scenarios for the methods. The evaluation criteria is the accuracy of the classification results of the proposed method evaluated by the true cell status (i.e., being aneuploid or normal). More details of the simulation settings are provided in the Methods section. First consider simulation data 1, where 500 out of 3000 cells have deletions. Our proposed methods partCNV and partCNVH have the highest accuracy among all the methods in all scenarios (Figure [68]2A-B). Using the same normalized gene expression counts input as our methods, K-means and hierarchical clustering have the lowest accuracy ranging from 0.5 to 0.7. PCA plus Louvain and Leiden have higher accuracy than K-means and hierarchical clustering. When the signal is strong (ratio = 0.5 in panel A) and noise is small, PCA plus Louvain/Leiden also have similar high accuracy as the proposed methods. But with the increase of the noise level, the accuracy of PCA plus Louvain or Leiden decreases. The advantage of the proposed methods becomes more obvious when the ratio is 0.6 and the noise level is high. For example, the mean accuracy of PCA plus Leiden is around 0.75 while partCNVH can achieve a high accuracy of 0.9. This is understandable since the proposed methods specifically model the data through two components for normal and aneuploid cells, and they allow mixtures of regions with and without deletions. Second, partCNV and partCNVH have similarly good performance with accuracy higher than 0.9, and partCNVH generally has higher accuracy than partCNV. To better understand the role of the HMM step of partCNVH and the result of feature selection, we use one simulation data set as an example and visualize the mean gene expression across the region (Figure [69]3A), ratios of the mean expressions of the two groups inferred by partCNV (Figure [70]3B), the rolling average of the mean expression ratios (Figure [71]3C), and the inferred status from HMM (Figure [72]3D). It can be seen that the rolling average of mean expression ratios between the two groups can effectively magnify the signal, and a majority of the HMM selected genes are located in the region with deletion. Figure [73]3E shows that our proposed method partCNVH has greater accuracy in classifying cells than the other methods. Fig. 2. [74]Fig. 2 [75]Open in a new tab Results of simulation Setting 1 with deletions. The methods that are compared include K-means clustering (Kmeans), hierarchical clustering (HClust), dimension reduction using PCA plus Louvain clustering (PCA+Louv), and dimension reduction using PCA and Leiden clustering (PCA+Leid). Each simulation dataset contains a total of 3000 cells: 2500 normal cells and 500 with deletions. (A) The accuracy of these methods when the ratio of gene expression in a deletion region versus normal expression is 0.5 at different noise levels (0.1: low, 0.2: medium, 0.3: high). (B) The results of the setting with ratio = 0.6 at different noise levels. All results are summarized over 100 Monte Carlo iterations. Fig. 3. [76]Fig. 3 [77]Open in a new tab Illustrating the procedure of the feature selection step using HMM in a simulation data set. (A) The log-transformed mean gene expression levels for the genes located in regions without and with deletion. Each dot is a gene. (B) The log-transformed ratio of the mean expression levels for cells without versus with deletion inferred by partCNV. (C) The expression mean ratio for the two groups of cells by partCNV after applying the rolling average with a bandwidth of 50. (D) The latent states inferred by HMM based on the rolling average from panel (C). (E) The results of classification accuracy of partCNVH, PCA plus Leiden, hierarchical clustering, and K-means clustering. The red dots are the cells incorrectly classified and the blue dots are the correct ones. Evaluation of PartCNV & PartCNVH for different prior information: amplification regions Next we evaluate different methods in simulation data 2 with amplifications (Figure [78]4). Note that for cells with deletion, the expression change odds is 2 (from 1 to 0.5) while in cells with amplification the odds is 0.67 (from 1 to 1.5) or its inverse 1.5. Thus the signals from the amplified regions can be harder to detect than the first setting with deleted regions. In Figure [79]4(A), we observe the performance of all methods decrease, especially for PCA plus Louvain and Leiden. When the noise level is high (0.3), PCA plus Leiden only reaches an accuracy of 0.74. In comparison, our proposed method still has a high accuracy of around 0.95. With the ratio level set as 1.4 in panel B, the signal level becomes lower and the existing methods have even lower accuracy ranging from 0.5 to 0.6, while the proposed methods still stay at a reasonable accuracy level around 0.9. These demonstrate the robustness of the proposed methods and highlight the importance of applying partCNV or partCNVH instead of existing methods when the region of interest has amplifications. Fig. 4. [80]Fig. 4 [81]Open in a new tab Results of simulation Setting 2 with amplifications. Each simulation dataset contains a total of 3000 cells: 2500 normal cells and 500 with amplifications. (A) The accuracy of the methods compared when the ratio = 1.5 at different noise levels (0.1: low, 0.2: medium, 0.3: high). (B) The results when the ratio = 1.4 at different noise levels. All results are summarized over 100 Monte Carlo iterations. Evaluation of PartCNV & PartCNVH for different prior information: cell numbers Our current simulation design considers a total of 3000 cells. When we study a region of interest suggested by cytogenetic data, more cells generally provide more information, and thus identifying signals from fewer cells can be more challenging. To evaluate the proposed methods under this scenario, we generate simulation data 3 by fixing the cell number as 1300, where 1000 cells are normal and 300 are aneuploid cells. Figure S1 shows the simulation results with 1300 total cells and the region of interest (200 out of 600 genes) has a deletion in the aneuploid cells. Compared with the results in Figure [82]2, all the existing methods have worse performance for the same ratio and noise combinations. For example, both PCA plus Louvain and PCA plus Leiden have a high accuracy of around 0.90 when the ratio is 0.6 with a medium noise of 0.2 using 3000 total cells, while the accuracy decreases to around 0.8 using 1300 cells. The variation of the classification results also increases. Although our proposed methods have slightly decreased performance when the ratio is 0.6 and the noise is 0.3, their overall performance remains similar in other scenarios (ratio = 0.5 at all noise levels, ratio = 0.6 with low and medium noise levels). Similar patterns can be observed in amplification settings by comparing the results in Figure S2 versus the results in Figure [83]4. Surprisingly, PCA plus Louvain or Leiden has even worse median accuracy than the hierarchical clustering, even though they all have quite low classification accuracy. These results suggest that our proposed methods tend to have more robust performance even with fewer cell numbers in the analyzed dataset, while the existing methods have decreased accuracy and more varied results, especially when the noise level is high. Evaluation of PartCNV & PartCNVH for different prior information: proportions of aneuploid cells As a methodology advantage, partCNV and partCNVH are able to incorporate the prior knowledge of an estimated proportion of aneuploid cells. If the prior is misspecified, we seek to understand the impact on the results. We generate simulation data 4 with the same data generation procedure but the prior information is specified as correct (0.17 for total cell number 3000 and 0.23 for total cell number 1300) and incorrect (0.5), and we examine the results of our proposed methods. Figures [84]5 and S3 illustrate the accuracy for the total cell numbers 3000 and 1300. First, it is clear that the correct prior knowledge improves the classification accuracy than a non-informative prior of 0.5. This improvement is small when the ratio is 0.5 or 1.5, but it can be substantial when the signal is harder to detect (ratio = 0.6 or 1.4) and the noise level is high. For example, when the ratio is 1.4 and the noise is 0.3, the improvement of the accuracy for both partCNV and partCNVH using a correct prior can be about [MATH: 10% :MATH] compared to using the incorrect prior. Second, both Figures 5 and S3 demonstrate the robustness of the proposed methods against incorrect prior information, especially in panels A and C where the ratios are 0.5 and 1.5, respectively. In these experiments, we choose 0.5 as the incorrect prior knowledge, which is far from the true proportion 0.17 and illustrate a worst scenario that the prior is completely non-informative. In reality, when a closer prior such as 0.20 or 0.15 is used, the impact would be much smaller. Even in the worst scenario, with an amplification ratio 1.4 and a high noise level 0.3, our method with an incorrect prior still reaches a median accuracy above 0.75, which is better than the existing methods under the same scenario. These results highlight the advantage of the proposed methods in accurately identifying aneuploid cells. Fig. 5. [85]Fig. 5 [86]Open in a new tab Simulation results of evaluating the impact of different prior information on the classification accuracy of the proposed methods. The dark and light blue bars correspond to the results with and without correct specification of the prior information (true prior: 0.17). Panel (A) and (B) are the simulation results based on the first simulation setting with ratios = 0.5, 0.6, respectively. Panel (C) and (D) are based on the second simulation setting with ratios = 1.5, 1.4, respectively. All results are summarized over 100 Monte Carlo iterations. Comparison with genome-wide CNV detection methods Lastly, we compare the proposed methods with two widely used genome-wide CNV detection methods, inferCNV, copyKAT as well as the regional versions of inferCNV and copyKAT. For the regional versions, we applied hierarchical clustering on the normalized expression matrix by inferCNV/copyKAT for the regions of interest only. One example of simulation data 4 is visualized in Figure [87]6A. As inferCNV requires the input of normal cells, we use an additional 100 normal cells as the reference for inferCNV. Both inferCNV and copyKAT generally are much more computationally intensive. We summarize the accuracy of 20 Monte Carlo simulations. It can be observed that neither inferCNV/copyKat nor the regional versions of these methods is able to accurately infer aneuploid/diploid cell status (Figure [88]6B-D), since the regional aneuploid signal is too weak compared to genome-wide copy number alterations that inferCNV and copyKAT are designed to detect. It is also interesting that when we re-cluster the cells based on the normalized expression from the sub-regions of chromosome 20, the performance of copyKAT increased but not inferCNV. These findings highlight the need for region-specific detection tools with the considerations of cell type mixtures to distinguishing between aneuploid and diploid cells. Fig. 6. [89]Fig. 6 [90]Open in a new tab Simulation results to compare the proposed method versus existing whole-genome based methods, InferCNV and CopyKAT. (A) Heatmap of the simulated expression values for genes in region Chr20(q11.1-q13.1), where the rows are the cells and columns are the genes. Logarithmic of the expression values plus one are used for visualization. Rows are labeled by the true aneuploid/diploid status and the inferred status by partCNV and partCNVH. (B) CopyKAT output of the aneuploid/diploid prediction versus the true cell status. The heatmap includes all the chromosomes. (C) Copy number results using InferCNV. (D) Boxplot of the aneuploid/diploid inference of the methods averaged over 20 Monte Carlo simulations. Regional InferCNV and Regional CopyKat are the hierarchical clustering results based on the normalized expression of the region of interest from InferCNV and CopyKAT. Real data application We demonstrate the usage of the proposed methods on three real data applications. For each application, the scRNA-seq data from one patient were collected by the 10X genomics platform and the cytogenetic data were collected from patients’ medical records. All three patients have a subset of the cells with regional copy number variations, and the rest are normal cells. The three applications have different complexity levels. The first subject (patient 1) is the most straightforward one, as a very long region (the whole chromosome Y) was reported as lost in a subset of the cells according to the patient’s cytogenetic data. The second subject (patient 2) has a subset of cells with partial chromosome 20 lost. The third subject (patient 3) has a complicated situation as this patient has cells with partial deletions, as well as cells with partial amplifications. Patient 1: MDS with loss of chromosome Y We obtain the scRNA-seq data of a bone marrow sample from an MDS patient treated at MD Anderson Cancer Center. The data, after alignment and quality control, contains a total of 33,538 genes and 655 cells. For this patient, the bone marrow sample has been specifically sorted for CD34+ cells to enrich hematopoietic stem and progenitor cells (HSPCs). As a result, the cell number is smaller than regular scRNA-seq experiments. Based on the clinically obtained cytogenetic data, this patient has around [MATH: 35% :MATH] of cells with the loss of chromosome Y and our goal is to identify these aneuploid cells. We first apply copyKAT to the whole transcriptome data from this sample to infer copy number variations. As shown in Figure S4, copyKAT clusters the cells based on the inferred CNV statuses across the whole genome, but it could not take regional data or the cytogenetic information into consideration when identifying the neoplastic cells. To specifically identify cells with the chromosome Y loss, we apply the proposed methods and the five existing methods (PCA plus Louvain, PCA plus Leiden, K-means clustering, hierarchical clustering, and CopyKAT) on this dataset. The input data are the normalized counts from the genes located on chromosome Y. In this application, the HMM step from partCNVH selects the whole set of genes so we only present the results from partCNV. Since the CNV in this dataset encompasses the entire chromosome Y, we expect some of the existing methods also work well for this analysis. We find that partCNV, PCA plus Louvain, PCA plus Leiden, and K-means clustering all have proportions of aneuploid cells close to 35% (40.9%, 39.5%, and 39.5%, respectively) (Figure [91]7 A-E). Hierarchical clustering identifies a much smaller number of aneuploid cells (9.31%). From visualizing the pairwise ARI values of these results, we find that partCNV, PCA plus Louvain/Leiden have very similar results, while K-means clustering and hierarchical clustering have very different results (Figure [92]7 F). As the UMAP coordinates in Figure [93]7(A-E) are obtained using the whole transcriptome data, the fact that copyKAT-identified aneuploid cells cluster together suggests that copyKAT captures the whole transcriptome pattern instead of chromosome Y specific changes. Fig. 7. [94]Fig. 7 [95]Open in a new tab Results of applying different methods to the scRNA-seq data using the bone marrow CD34 positive cells from patient 1. Cytogenetic information shows that [MATH: 35% :MATH] of the cells from the sample have chromosome Y loss. (A-E) The cell classification based on the inference of chromosome Y loss using different methods. Panel F shows the heatmap of the ARI values from comparing the classification results using different methods. (G) The expression of chromosome Y genes for cells that are labeled as normal in partCNV (partCNV:0) and aneuploid in CopyKAT (CopyKat:1), and three similar groups. (H,I) The expression of chromosome Y genes comparing partCNV versus k-means clustering and comparing partCNV and Hierarchical clustering. We further examine the average gene expression of chromosome Y genes among the aneuploid/normal cells identified by partCNV, copyKAT, K-means, and hierarchical clustering (Figure [96]7 G-I). It is apparent that partCNV-labeled aneuploid cells have much lower expression than the cells labeled as aneuploid by other methods but normal per partCNV, confirming that partCNV correctly identifies the cells with deletion on chromosome Y. In summary, these results suggest that partCNV and the PCA plus Louvain or Leiden clustering have identified the cells with the chromosome Y loss. In contrast, K-means clustering, hierarchical clustering, and copyKAT failed to do so. Patient 2: MDS with partial deletion of chromosome 20 We obtain the scRNA-seq data from the bone marrow sample for a different MDS patient. The data were also generated by the 10X genomics scRNA-seq technology. This sample was sequenced directly without the cell sorting step, and thus both HSPC and immune cells can be potentially identified. After alignment, preprocessing, and quality control, a total of 24,519 genes and 3,643 cells are kept for the analysis. Based on the cytogenetic data, about 20% of cells in the sample have deletions in chromosome 20 at regions q11.1 to q13.1, which is about 24.2 Mb long. We also apply CopyKAT to this data and present the heatmap result in Figure S5. Although cytogenetics reported deletions in chromosome 20, the log copy number ratio heatmap does not have an obvious deletion pattern in the suggested regions. Based on the whole genome copy number inference, copyKAT only reports about 500 aneuploid cells ( [MATH: :MATH] 5.5%). We analyze the whole-transcriptome data using Seurat^[97]2 through identifying highly variable genes, extracting top principal components (PCs) based on these genes, and we perform UMAP and clustering analysis (Figure [98]8). UMAP is used for dimension reduction and unsupervised clustering is performed with the default Louvain clustering using the top PCs. A total of 10 clusters are identified, and the cluster specific markers are used for annotating the cell type labels based on biological knowledge by our MDS biologist. We also apply the proposed methods and existing methods targeting the region on chromosome 20 with known chromosomal deletions. Figure [99]8 A-E shows cell type labels and the aneuploid/diploid inference result using partCNVH, PCA plus Louvain, K-means clustering, and hierarchical clustering. The results for partCNV and PCA plus Leiden are presented in Figure S6. We find that the proposed methods have the closest proportion of aneuploid cells to the cytogenetics reported proportion; the other methods all have much lower or higher proportions. The major difference between our proposed method and PCA plus Louvain is in the cycling RBC progenitors (Figure [100]8 B, C, and F). Our method reports high proportions in all three MDS-related cell groups (i.e., HSPCs, RBC progenitors, and cycling RBC progenitors), while PCA plus Louvain only reports aneuploid cells in the former two clusters. Previous literature found impaired erythroid-proliferating capacities to be a prominent characteristic in patients with MDS^[101]30,[102]31. Both RBC progenitors and cycling RBC progenitors are major cell types involved in the erythroid-proliferating function, and thus it makes sense to identify neoplastic cells in both cell types. Fig. 8. [103]Fig. 8 [104]Open in a new tab Application results of the proposed method and existing methods using the scRNA-seq data from patient 2. (A) The cell type annotation based on marker genes’ expression and biological knowledge. RBC progenitor: red blood cell progenitors. HSPC: hematopoietic stem and progenitor cells. NK: natural killer. (B-E) The inferred aneuploid cells with deletion chr20(q11.1q13.1) using different methods (B, partCNV; C, PCA+Louvian; D, K Means clustering; E, Hierarchical clustering). The proportions in the title bracket are the proportions of cells that are inferred as cells with this specific deletion. (F) The comparisons of results by partCNVH and PCA plus Louvain versus cell type labels. (G) Gene set enrichment analysis results for the genes that are differential expressed between the cells with and without deletion chr(20)(q11.1q13.1). Gene sets are defined using the Reactome Pathway Database. To understand the differences between the identified locally aneuploid cells and the normal cells, we conduct differential expression analysis for each cluster to compare the aneuploid versus diploid cells identified by our method^[105]32. A total of 177 cluster-specific differentially expressed genes were identified using a cutoff of 0.05 for adjusted p values. We also perform over-representation analysis using GO Biological Process database^[106]33,[107]34 and identify several functional categories over-represented by differential expression signals, including neutrophil degranulation, a few immune system related terms, and interleukin 4 and interleukin 13 signals (Figure [108]8 G). Many of these terms have been reported in previous literature to be related with MDS. For example, neutrophil degranulation and migration has been reported to be associated with MDS compared with normal controls^[109]35. The important role of the immune system and innate immune signaling in MDS has also been reported in multiple publications^[110]36–[111]38. The over-representation results using the Reactome pathway and Hallmark database are presented in Figure S7^[112]39,[113]40. In the Reactome pathway enrichment results, we also find several immune response-related and lymphocyte activation related terms. In Hallmark analysis, allograft rejection, complement, interferon gamma response, and TNFA signaling via NFKB are the top findings, which also have been associated with MDS pathogenesis^[114]41–[115]43. There are also some terms that have not been reported in previous MDS studies, such as generation of second messenger molecules, hemostasis, defense response, and cell activation, which could be promising targets for future research. Patient 3: AML with partial gain and partial deletion of chromosome 8 Lastly, we study the scRNA-seq data from an AML patient with complicated chromosomal variations^[116]44. Specifically, this patient has amplifications of the whole chromosome 8 in 25% of the cells, deletion of chromosome 8 at region q21.2 to q24.3 in 40% of the cells, and normal karyotype in the rest of the cells. We are interested in identifying which cells contain the chromosome 8 gain and which have del(8)(q21.2q24.3). The scRNA-seq data were generated using the 10X genomics technology platform. After preprocessing, the data contain a total of 20,521 genes from 4294 cells. Since there are overlaps between the deletion and amplification regions, we apply partCNVH through a two-step approach. We first focus on the region of chromosome 8 before q21.2 where about 25% of the cells have amplification; the rest of the cells are normal in the area. After the cells with chromosome 8 gain are detected, we apply partCNVH again to the rest of the cells, which contain del(8)(q21.2q24.3) in around 53.3% ( [MATH: =40%/(1-25%) :MATH] ) of the cells. In the two steps, [MATH: 25% :MATH] and [MATH: 53.3% :MATH] are used as the prior knowledge of the aneuploid cell proportions in partCNVH. We find that the majority of the inferred cells with chromosome 8 amplification are from the erythroid cells, while the cells with del(8)(q21.1q24.3) are mostly AML cells (Figure [117]9 A-B). The proportions of cells with chromosome 8 amplification and del(8)(q21.1.q24.3) are about 13% and 60%, which are close to the prior knowledge from cytogenetics. We visualize the normalized expression for the region of interest on chromosome 8 in Figure S8A. We also present the results by inferCNV and copyKAT in Figure S8B and Figure S9, respectively. We observe that the patterns for chromosome 8 varies between the three cell groups in Figure S8A. However, whether the pattern shows gain or loss is not easy to identify due to data sparsity. Neither CopyKAT nor InferCNV can be applied to regional data, and thus they couldn’t reproduce the patterns we observed in Figure S8A. Fig. 9. [118]Fig. 9 [119]Open in a new tab Results of applying the proposed method partCNVH to the scRNA-seq data from an AML patient with [MATH: 25% :MATH] cells having amplification of chromosome 8 and [MATH: 40% :MATH] cells having chromosome 8 deletion at region q21.2-q24.3. (A) The true cell type labels annotated by a clinician-scientist. (B) The inferred cells with amplification of 8 (“+8”) and the deletion of chromosome 8 at q21.2-q24.3 (“del(8)(q21.2q24.3)”). (C, D) The Reactome pathway enrichment analysis results using the DEGs by comparing the AML cells with del(8)(q21.2q24.3) versus diploid cells, and by comparing the erythroid cells with +8 versus diploid cells, respectively. To understand the different molecular mechanisms related to chromosome changes, we perform differential analysis for comparing the AML cells with del(8)(q21.2q24.3) versus diploid AML cells and obtain 266 differentially expressed genes (DEGs). Similarly, we compare the erythroid cells with a gain of chromosome 8 versus diploid erythroid cells and obtain 426 DEGs. Gene set enrichment analysis identify a few shared significant terms between AML and erythroid cells (Figure [120]9C-D), including multiple immune system related terms, neutrophil degranulation, and cellular responses to stimuli. Some unique terms in AML are hemostasis, platelet activation signaling and aggregation, and arachidonic acid metabolism. Hemostatic and thrombotic complications are prevalent symptoms in AML patients and hemostasis has been studied before for AML pathogenesis-related mechanisms^[121]45. The term platelet activation signaling and aggregation is also consistent with the previous literature that the platelet defects and other hemorrhagic symptoms are widely observed in AML patients^[122]46. The arachidonic acid metabolism is a process highlighted in a few cancer research publications, but the evidence for their involvement in AML is still accumulating^[123]47,[124]48. Overall, our results are consistent with literature and provide some novel disease-related biological processes for future research. Discussion We introduce partCNV/partCNVH, a statistical framework that distinguishes neoplastic cells with copy number alterations from normal diploid cells based on regional chromosomal deletions or amplifications. Unlike existing methods, our statistical framework can incorporate prior knowledge from cytogenetic data that includes both chromosomal locations of aberrations and the observed proportion. As demonstrated in our simulation study, this prior information can effectively improve the classification accuracy when the signal is weak. Our framework also includes a feature selection step using the hidden Markov model, which is able to filter the genes when part(s) of the region are diploid. This step further improves the signals of chromosomal changes and results in higher accuracy to detect neoplastic cells. We have illustrated the benefits of partCNV and partCNVH through extensive simulation studies and in-depth analysis of three scRNA-seq datasets from MDS or AML patients. Cytogenetic information is a key component of the scoring system of risk assessment, treatment selection, and outcome prediction for patients with hematologic malignancies^[125]49,[126]50. In this work, we exemplify the use of the proposed methods in patients with MDS and AML. These methods can also be applied to other hematologic malignancies or even other cancer types if similar problems are encountered. As cytogenetic analysis is a mature technology and is widely used in clinical settings, the cytogenetic data should be fairly accessible from clinical collaborators who provide patient samples. Such cytogenetic data-guided analysis can be a useful tool for identifying subgroups of cells with the chromosomal changes of interest. Implemented in an EM algorithm, our proposed methods have favorable computational performance. For a simulation dataset with 3000 cells and 600 genes, it takes 12 seconds and 21.5 seconds for partCNV and partCNVH to complete the analysis, respectively. This computation cost is similar to existing methods PCA plus Louvain ( [MATH: :MATH] 6.8 seconds) and PCA plus Leiden ( [MATH: :MATH] 30.7 seconds). In comparison, both inferCNV and CopyKAT take more than 10 minutes, sometimes more than an hour, to process a scRNA-seq dataset with a few thousand cells. Additionally, our proposed methods scale almost linearly to large datasets. If we increase the cell number from 3000 to 6000, partCNV takes about 26.5 seconds and partCNVH 41.5 seconds to complete the computation. Since cytogenetic data are generally unique for each subject, we expect that our methods are applied person by person in real applications. Thus, it is reasonable to assume a few thousand cells in the dataset, for which our methods can complete the computation within one minute. For future work, the methods can be further extended to incorporate additional biological knowledge, such as the marker or mutation information mentioned in Fan et al^[127]9. Our method has the potential to be applied on multiple samples in parallel or even to borrow information across different samples from the same subject, such extension is not trivial and needs further evaluations. Moreover, due to the complexity of sequencing depth, gene expression variations, number of genes impacted by the aneuploid event, we have not evaluated the power of the proposed methods or the minimum required size for CNV events. These should be carefully evaluated in future works. Previous work also developed machine learning models to predict the neoplastic/non-neoplastic status of the cells by splitting the annotated data and training a random forest model^[128]16. Such models usually rely on the training dataset and may not generalize well to other studies. With the accumulation of annotated single cell data, it is also possible to harness the power of deep learning algorithms to further improve existing models and achieve more accurate predictions. Our current methods assume a fixed prior based on the cytogenetic data. It is possible that different proportions of aneuploid cells have different confidence levels. The current method can be further extended to incorporate such confidence into the model to improve accuracy. When scDNA-seq data is available, it may be another prior knowledge to replace the cytogenetic information of clinical data as the input to PartCNV and PartCNHV. The method can be further extended to incorporate regions where the scDNA-seq data shows enriched hierarchical aneuploid CNVs with high resolution. Methods Details of partCNV We aim to identify the cells with the known chromosomal deletion or amplification from the scRNA-seq data with incorporation of the prior cytogenetic knowledge. Assume a total of N cells were sequenced by scRNA-seq and G genes fall in the region with deletion or amplification. The count matrix is denoted by [MATH: Y=(ygi) :MATH] , which is a [MATH: G×N :MATH] matrix with rows being the genes and columns being the samples. Without loss of generality, we assume the genes in [MATH: Y :MATH] are ordered by their locations on the chromosome. As the observed data contain the mixture of cells with and without chromosomal changes, denote the underlying status of the cell i by [MATH: ci :MATH] . Assume the prior proportion of the aneuploid cells is [MATH: q0 :MATH] for the region of interest. In our motivating problem, [MATH: q0 :MATH] is calculated based on the number of metaphases observing the chromosomal changes divided by the total number of metaphases from an cytogenetics test. As shown in some recent literature regarding the distribution of the scRNA-seq count^[129]51,[130]52, the scRNA-seq data may not be zero-inflated and the excessive zeros are due to low expression level of each single cell. Additionally, since the region of interest generally contains a limited number of genes, the estimated dispersion parameters are not accurate enough if we use a negative binomial distribution. Thus, we assume the expression count of gene i follows a Poisson distribution with mean [MATH: θg1 :MATH] if the cell is aneuploid or mean [MATH: θg0 :MATH] if diploid, i.e., [MATH: Pr(ygi|θg1,ci=1 )=θg1giexp(-θg1< /mn>)ygi!andPr(ygi|θg0,ci=0 )=θg0giexp(-θg0< /mn>)ygi! . :MATH] We assume the cell status variable [MATH: ci :MATH] follows a Bernoulli distribution [MATH: Pr(ci|qi)=qi(1-qi) :MATH] where [MATH: qi :MATH] denotes the probability of cell i having the chromosomal changes at the region of interest, i.e., [MATH: qi=Pr(ci=1 ) :MATH] . The prior knowledge of the aneuploid cell proportion is best described through a beta distribution, which we approximate through a normal distribution. Though cytogenetic information is obtained based on 20 metaphases, the involved cells can number in the hundreds of thousands, and thus a normal distribution can adequately approximate the underlying beta distribution. We assume [MATH: qi :MATH] follows a prior Normal distribution with mean [MATH: q0 :MATH] and variance [MATH: λ2 :MATH] : [MATH: Pr(qi|q0,λ )=1λ2πex p-(qi-q0)22λ2. :MATH] The variance [MATH: λ2 :MATH] represents the confidence about the prior information. Smaller variance indicates stronger confidence in the prior knowledge. However, in our experiments, we found that the actual value of [MATH: λ :MATH] has minimal impact on the classification results as long as the value is within a decent range (e.g., [MATH: λ :MATH] between 0.01 and 1). Throughout our experiments, we use [MATH: λ=0.1 :MATH] . Together, the full likelihood of the problem can be written as [MATH: L(θ1,θ0,c,q|Y,q0< /mn>,λ)=i[Pr(ygi|θg1)]1(ci=1 )[Pr(ygi|θg0)]1(ci=0 )Pr(ci|qi)Pr(qi|q0,λ ) :MATH] 1 and the detailed log-likelihood is [MATH: l(θ1,θ0,c,q|Y,q0< /mn>,λ)=i1(ci=1 ){ygilog(θg1)-θg1}+i1(ci=0 ){ygilog(θg0)-θg0}+icilog(qi)+i(1-ci)log(1-qi)-i(qi-q0)22λ2. :MATH] Directly solving the likelihood ([131]1) may not be feasible, and thus we use the EM algorithm by treating cell status [MATH: {ci} :MATH] as the missing variables. The objective function of the EM algorithm is [MATH: Q(θ1,θ0,c,q)=ipi{ygilog(θg1)-θg1}+i(1-pi){ygilog(θg0)-θg0}+icilog(qi)+i(1-ci)log(1-qi)-i(qi-q0)22λ2. :MATH] 2 In the M step of the t-th iteration, we solve for [MATH: θg0(t) :MATH] , [MATH: θg1(t) :MATH] , and [MATH: qi(t) :MATH] as follows: [MATH: θg1(t)=iqi(t-1)ygi iqi(t-1),θg0(t)=i1-qi(t-1)ygiN-< /mo>iqi(t-1),qi(t)=argmaxqi[qi(t-1)ici 1-qi(t-1)N- ici exp-n(qi(t-1)-q0 )22< msup>λ2]. :MATH] In the E step, we estimate the conditional expectation of the cell status by [MATH: pi(t)=Pr(ci=1 |ygi)=Pr(ygi|ci=1 )Pr(ci=1 )Pr(ygi,ci)=gPr(ygi|θg1)·qi(t)gPr(ygi|θg0)·qi(t)+g Pr(ygi|θg0)·(1-qi(t)). :MATH] The algorithm repeats the M step and E step until convergence. The convergence criteria is defined as the absolute difference between [MATH: p(t)=(p1(t),,< msubsup>pN(t)) :MATH] and [MATH: p(t-1) :MATH] smaller than [MATH: 10-5 :MATH] . The core of partCNV is the described EM algorithm and the output is the inferred cell status. With the estimated probability [MATH: p^ :MATH] , the cells with [MATH: p^i0.5 :MATH] are assigned as aneuploid cells and the rest as diploid cells. Details of partCNVH Limited by the technology, cytogenetic data only provide crude information about the chromosomal deletion or amplification. When we include all the genes from the regions of interest, it is likely that not all of the genes have chromosomal changes in the aneuploid cells. It is helpful to select the genes that demonstrate chromosomal changing patterns. Motivated by this idea, we design partCNVH, the combination of partCNV and HMM for improving classification performance. Denote the underlying status for the observed G genes by [MATH: Z=(z1,,zG) :MATH] . The first step of partCNVH is to apply partCNV on the scRNA-seq data from the region of interest. Denote the obtained cell status from partCNV by [MATH: {c^i} :MATH] . Then for each gene g, we compute the mean expression of this gene for the two groups, i.e., [MATH: y¯g(1)=j{i:ci=1}ygiandy¯g(0)=j{i:ci=0}ygi. :MATH] Based on [MATH: y¯g(1) :MATH] and [MATH: y¯g(0) :MATH] , we obtain the mean expression ratio for all the genes by [MATH: rg=y¯g(0)/y¯g(1) :MATH] if the region has deletion and [MATH: rg=y¯g(1)/y¯g(0) :MATH] if the region has amplification. As shown in Figure [132]3, the signals from the mean expression ratio are weak, and thus we apply a rolling average on the mean expression ratios to strengthen the signals. Denote the window size for the rolling average by K, and the rolling average at gene g becomes [MATH: r¯grolling= 1Ki=g-[K/2]g+[K/2]ri. :MATH] In R, we implement the rolling average computation by function frollmean from the data.table package. We then build HMM using the rolling average of the genes. Denote the underlying state of gene g by [MATH: zg :MATH] , where [MATH: zg=1 :MATH] is gene g with deletion or amplification in the aneuploid cells, and [MATH: zg=0 :MATH] otherwise. HMM aims to infer the hidden status [MATH: Z={zg} :MATH] through the observed sequence [MATH: R¯rolling=(r¯grolling) :MATH] by solving the likelihood [MATH: Pr(R¯rolling,Z)=Pr(R¯rolling|Z)×Pr(Z)=g=1GPr(r¯grolling|zg)×g=1GPr(zg|zg-1< /mn>). :MATH] We solve HMM using the depmix function from the depmixS4 package in R by specifying the initial transition matrix between the two states (“genes in aneuploid region” or “A”, “genes in diploid region” or “D”) as [MATH: AAADDADD=0.90.10.1 0.9. :MATH] The initial states of the genes are decided as [MATH: zg=1 :MATH] if [MATH: r¯grollingm< mi>edian{r¯grolling,g< mo>=1,,G} :MATH] , and [MATH: zg=0 :MATH] otherwise. After the hidden states are inferred by HMM, we identify the states corresponding to the “genes in diploid region” by comparing the mean expression ratio of the two states, and the state with larger mean expression ratio is labeled as state “D”. Denote the expression count matrix for the selected genes by [MATH: Y~ :MATH] . We apply the EM algorithm described in partCNV to [MATH: Y~ :MATH] and infer the final cell states. Simulation designs To mimic the real data analysis, we generate the simulation datasets based on existing scRNA-seq data from patients with triple negative breast cancer (TNBC). The data were downloaded from the Gene Expression Omnibus (GEO) with accession number [133]GSE148673. We compared the data characteristics between the TNBC dataset and the scRNA-seq data from our MDS patients. In Supplementary Figure S10, we presented the mean expression levels of genes and cells, as well as the dropout rate for the two datasets. We found that the data characteristics are similar between the two data sources. As a data processing step, genes with zero expression in all cells in the TNBC data are removed. We keep all of the normal cells based on the cell type annotation from the original study^[134]15. As our method focuses on the region with known deletion or amplification from the cytogenetic data and the region often covers tens or hundreds of genes, we generate the expression count for [MATH: n0 :MATH] normal cells and [MATH: n1 :MATH] aneuploid cells with a total of G genes. For each iteration, we randomly draw the expression of G genes from the normal cells of the TNBC dataset. We compute the mean expression of these G genes and denote it by [MATH: ξ^=(ξ1,,ξG) :MATH] . For normal cells, the expression is generated from Poisson [MATH: (ξg^) :MATH] for [MATH: g=1,,G :MATH] . For aneuploid cells, assume [MATH: G1 :MATH] genes are located at the deleted or amplified regions and [MATH: G0=G-G1 :MATH] are from the normal regions. We assume half of the [MATH: G0 :MATH] genes are in the regions that are left-adjacent to a copy-number alteration and the other half are in regions right-adjacent that have normal expression in all cells. This partial chromosomal variation often happens in practice as cytogenetic data only provide a crude observation of the changed regions. Without loss of generality, we assume the beginning [MATH: G1 :MATH] genes are from the aneuploid region. For a gene g from this region for an aneuploid cell i, we generate the expression from [MATH: ygiPoisson{λg·(r+η)}and the noiseηUniform(0,τ), :MATH] where r is the ratio controlling the impact level of chromosomal deletion or amplification on the expression and [MATH: η :MATH] is the noise, in the first setting with deletion in aneuploid cells. r takes value 0.5 or 0.6 in this setting (Simulation data 1). In the second setting, we consider amplification in the aneuploid cells. The expression is generated from [MATH: ygiPoisson{λg·(r-η)}and the noiseηUniform(0,τ), :MATH] and r takes value 1.5 or 1.4. (Simulation data 2) The noise parameter [MATH: η :MATH] controls the heterogeneity of the impacts among all the interested genes, mimicking the fact that gene expressions are heterogeneous when the genes are located in deleted or amplified regions. Through our experiments, we specify [MATH: η=0.1,0.2 :MATH] , and 0.3 for low, medium, and high noise levels, respectively. To understand the impact of different sample sizes, we consider the combination of [MATH: n1=500 :MATH] and [MATH: n0=2500 :MATH] in one scenario and [MATH: n1=300 :MATH] and [MATH: n0=1000 :MATH] in the other (Simulation data 3). We also evaluate the impact of prior specification in all the scenarios. In the first scenario, the true prior proportion of the aneuploid cell is [MATH: 17% :MATH] and we evaluate the method with both [MATH: 17% :MATH] and [MATH: 50% :MATH] . In the second scenario, the true proportion is [MATH: 23% :MATH] and we also compare the results with both [MATH: 23% :MATH] and [MATH: 50% :MATH] (Simulation data 4). To compare the proposed methods versus existing ones that only perform on whole genome data, we randomly sample 2000 normal cells from an existing scRNA-seq dataset (Simulation data 5). We then replace the expression of the genes located at the region chromosome 20 q11.1 to q13.1 using the simulated data as described above to mimic situations that 400 out of 2000 (20%) cells are locally aneuploid. The gene expressions in the region of interest are generated in the similar way as Simulation data 1. We then apply inferCNV and copyKAT using the suggested arguments as suggested by the original methods. All the simulation results are summarized over 100 Monte Carlo datasets. Real data analysis The data for the AML patient were downloaded from European Genome-phenome Archive (EGA) with accession EGAD00001007672^[135]44. The data from sample 7A were used for the analysis due to the available cytogenetic information. The raw data for the two MDS patients are currently not publicly available due to confidentiality regulations. Data preprocessing The raw sequencing data were preprocessed (demultiplexed cellular barcodes, read alignment, and generation of gene count matrix) using Cell Ranger Single Cell Software Suite (version 3.0, https://support.10xgenomics.com/single-cell-gene-expression/software/pi pelines/latest/what-is-cell-ranger) provided by 10X Genomics using Human Genome GRCh38. Genes were detected in [MATH: <0.1% :MATH] of total sequenced cells and cells where [MATH: <200 :MATH] genes had nonzero counts were filtered out and not included in the analysis. Low quality cells where [MATH: >20% :MATH] of the counts were derived from the mitochondrial genome were also discarded. The doublet cell status was inferred using DoubletFinder^[136]53 and only the singlet cells were kept for further analysis. Data were normalized using the total sequencing count per cell to adjust for differences in sequencing depth. The chromosome database associated with cytogenetic location is downloaded via the UCSC genome website. Functional over-representation analysis In the real data analysis section, after the aneuploid/diploid status has been inferred by the proposed or existing method, we perform cell type specific differential analysis using “MAST”^[137]32 (available in Seurat package^[138]2) to compare the diploid versus aneuploid cells. The genes with an adjusted p value smaller than 0.05 are included as the DEGs. We then perform functional over-representation analysis using the MSigDB platform^[139]33 (http://www.gsea-msigdb.org/gsea/msigdb/annotate.jsp) with the Reactome pathway^[140]39, GO Biological Process^[141]54, and Hallmark^[142]40 databases. We present the top 10 pathways or biological process terms regardless of the significance level. Implementation of existing methods All of the existing methods take the normalized expression counts for the genes located in the region of interest as input. K-means and hierarchical clustering are implemented using the functions kmeans and hclust from the stat package in R, respectively. The PCA plus Louvain^[143]55 and Leiden methods^[144]29 are implemented using the Seurat package in function FindClusters with arguments algorithm [MATH: =1 :MATH] and 4. Since the Seurat clustering function does not allow specification of cluster numbers, we design a loop with a precision parameter ranging from 0.001 to 0.5 with distance 0.005 until the algorithm identifies exactly two clusters. All the analyses and plotting are performed in R v4.0.3. Supplementary Information [145]Supplementary Information.^ (20.7MB, pdf) Acknowledgements