Abstract Copy number variations (CNVs), the gain or loss of genomic regions, are associated with disease, especially cancer. Single cell technologies offer new possibilities to capture within-sample heterogeneity of CNVs and identify subclones relevant for tumor progression and treatment outcome. Several computational tools have been developed to identify CNVs from scRNA-seq data. However, an independent benchmarking of them is lacking. Here, we evaluate six popular methods in their ability to correctly identify ground truth CNVs, euploid cells and subclonal structures in 21 scRNA-seq datasets. We discover dataset-specific factors influencing the performance, including dataset size, the number and type of CNVs in the sample and the choice of the reference dataset. Methods which include allelic information perform more robustly for large droplet-based datasets, but require higher runtime. Furthermore, the methods differ in their additional functionalities. We offer a benchmarking pipeline to identify the optimal method for new datasets, and improve methods’ performance. Subject terms: Genome informatics, Cancer genomics, Bioinformatics, RNA sequencing __________________________________________________________________ Several computational tools have now been developed to identify copy number variations (CNVs) from scRNA-seq data. Here authors benchmark these methods, showing that performance is affected by dataset quality, CNV type and reference dataset, with methods including allelic information being more robust in large datasets. Introduction Copy number variations (CNVs) describe the gain or loss of genomic regions, from small sequences up to complete chromosomes. These genomic alterations lead to aneuploidy and are associated with different diseases and cancer types^[36]1. Specific CNVs are hallmarks for the classification of tumors, and are related to tumor progression and treatment outcomes^[37]2–[38]4. However, the direct functional consequences of CNVs are not yet fully understood^[39]4. Tumors are very heterogeneous, with different tumor cells having distinct molecular phenotypes. This also applies to CNVs, which can differ substantially between cellular subclones among samples and within the same sample, emphasizing the importance of cell-specific analyses^[40]5–[41]7. Single-cell whole-genome sequencing is considered the gold-standard technique to obtain per-cell CNV profiles^[42]8, as changes in DNA copy number should lead to observable changes in the read sequencing depth. However, the technology is not frequently used in the laboratory compared to other single-cell technologies. Instead, computational methods have been developed to infer the CNV profiles from single-cell RNA-seq data^[43]5,[44]9–[45]14 and single-cell assay for transposase-accessible chromatin with sequencing (scATAC-seq) data^[46]15,[47]16. These approaches have the advantage that, apart from the copy number gains and losses, the information about the cellular state can also be obtained from the same measurement (gene expression or open chromatin). For scATAC-seq, the read-out is relatively similar to whole-genome sequencing, as the genome is also sequenced, and therefore, the read coverage provides information about ploidy. However, for scRNA-seq, the inference of CNVs is challenging, as the expression level of genes is highly affected by regulatory mechanisms, and therefore, it provides only indirect information about the CNV state. Nevertheless, the general assumption of all computational methods that infer CNVs from scRNA-seq data is that genes in gained regions show higher expression, and in lost regions lower expression, compared to genes in diploid regions. This requires all methods to have sophisticated data normalization strategies, using generally reference diploid samples, often in combination with denoising approaches, before the different CNV inference strategies can be applied. Because of the wealth of scRNA-seq data available, the correct identification of CNVs from this data modality is crucial to studying the role of CNVs in cancer and other aneuploid tissues. scRNA-seq CNV callers are currently used in many applications, e.g., ref. ^[48]17–[49]20. However, there is no independent validation that shows whether scRNA-seq CNV calling methods can correctly identify CNVs, and which of the CNV callers works the best. In this work, we benchmark six popular CNV callers for scRNA-seq data using 21 different datasets. We include datasets generated with different technologies, droplet-based and plate-based, and from different organisms, such as humans and mice. We evaluate the general CNV prediction performance for each method, comparing its results to a ground truth provided by an orthogonal CNV measurement (either (single cell) whole-genome sequencing ((sc)WGS) or whole exome sequencing (WES)), using correlation, area under the curve (AUC) values and F1 scores. We also assess the prediction of CNVs on a diploid sample, the correctness of the inferred clonal structure, the impact of the selected reference dataset on the performance, and the runtime and memory requirements. In addition, we evaluate the automatic identification of cancer cells for the methods that allow it. Our evaluation is publicly available with a reproducible Snakemake pipeline ([50]https://github.com/colomemaria/benchmark_scrnaseq_cnv_callers)^[51 ]21. This enables the direct testing of new datasets to determine optimal CNV calling strategies, and it facilitates comparisons between methods to improve the performance of newly developed computational tools. Results scRNA-seq CNV calling benchmarking We included in our benchmarking study six CNV calling methods that were developed specifically for scRNA-seq data (Table [52]1). The methods can be broadly classified into two categories: one class that uses only the expression levels per gene, consisting of InferCNV^[53]5, copyKat^[54]12, SCEVAN^[55]13 and CONICSmat^[56]9; and a second class that combines the expression values with minor allele frequency (AF) information, consisting of CaSpER^[57]11 and Numbat^[58]14. CaSpER and Numbat use AFs per SNP called directly from the scRNA-seq reads, and both models implement a Hidden Markov Model (HMM) to call CNVs. Also, InferCNV identifies CNVs using an HMM, but based on expression levels only. copyKat and SCEVAN both apply a segmentation approach, while CONICSmat estimates the CNVs based on a Mixture Model. All methods were run as recommended in the respective tutorials or based on default parameters. Table 1. CNV calling methods from scRNA-seq Method (tested version) Model Input Output resolution Explicit reference optional Cancer cell identification InferCNV (v1.10.0) HMM & Bayesian MM Expression Gene and subclone No No CONICSmat (v0.0.0.1) Mixture model Expression Chromosome arm and cell Yes No CaSpER (v0.2.0) Expression HMM & BAF signal shift Expression & Genotypes Segment and cell No No copyKat (v1.1.0) integrative Bayesian segmentation Expression Gene and cell Yes Yes Numbat (v1.4.0) haplotyping AFs & combined HMM Expression & Genotypes Gene and subclone (Yes) Yes SCEVAN (v1.0.1) segmentation with a variational region growing algorithm Expression Segment and subclone Yes Yes [59]Open in a new tab HMM hidden Markov model, MM mixture mode, (B)AF (B-)allele frequency. The output of the CNV prediction depends on the method (Table [60]1). Half of the methods report the results per cell (CONICSmat, copyKat and CaSpER), while InferCNV, SCEVAN and Numbat group cells into subclones with the same CNV profile. Also, the resolution differs, with CONICSmat reporting the results only per chromosome arm, and all other methods either per gene or per segment consisting of multiple genes. Several of the methods have two possible outputs: a discrete CNV prediction and a normalized expression score; in these cases, both outputs were included separately in the evaluation and were abbreviated as “(CNV)” and “(Expr)”, respectively. More details can be found in the Methods. We tested all scRNA-seq CNV callers on 21 different single cell RNA-seq datasets (Fig. [61]1), comprising 13 human cancer cell lines (nine gastric cell lines, two colorectal adenocarcinoma lines (COLO320, HCT116), one breast cancer line (MCF7) and one melanoma cell line (A375)), six human primary tumor samples (three acute lymphoblastic leukemia samples (iAMP21, ALL1, ALL2), two basal cell carcinoma (BCC) samples and one multiple myeloma (MM) sample), one mouse primary tumor sample and one human diploid dataset (peripheral blood mononuclear cells (PBMCs)) (Supplementary Data [62]1). Seventeen datasets were measured with droplet-based technologies, and the four others with a plate-based technology. Fig. 1. Overview of the benchmarking workflow. [63]Fig. 1 [64]Open in a new tab Input data in blue, evaluation results in pink. Different metrics were included for benchmarking, most of them based on the comparison with a ground truth for the CNVs. We obtained this ground truth from either (sc)WGS or WES data (Supplementary Data [65]1). The different datasets showed large variation in CNV distribution, with CNVs covering between 7% and 93% of the total genome, and more gained regions than lost regions in the majority of the datasets (Supplementary Fig. [66]1). Since the scRNA-seq methods are only able to predict the CNV status for genomic regions comprising genes, while the WGS ground truth covers (nearly) the complete genome, we could only compare modalities in gene regions. As in most cases, the ground truth was not measured in the same set of cells as the scRNA-seq, and was, in some cases, obtained from bulk measurements, we combined the per-cell results from the scRNA-seq methods to an average CNV profile, called pseudobulk, before the comparison. For the plate-based datasets, where scRNA-seq and scWGS were measured in the same cells, a cell-by-cell comparison was also performed. We applied threshold-independent evaluation metrics using correlation and AUC scores. For the AUC scores, predictions were evaluated separately for gain versus all and loss versus all, resulting in two scores. Not the complete range of thresholds is biologically meaningful for classifying regions as gains or losses, as every method defines a baseline score. For this reason, we chose to additionally calculate a partial AUC^[67]22,[68]23, with a maximal sensitivity defined by the baseline score so that only thresholds up to the baseline score were evaluated for losses and only scores higher than the baseline score were evaluated for gains (see Methods, Supplementary Fig. [69]2). Of note, partial AUC values below 0.5 indicate that most thresholds are outside the biologically meaningful value range (see Methods). With the same systematic, we evaluated the optimal gain and loss thresholds based on a multi-class F1 score (Supplementary Fig. [70]3), testing again only biologically meaningful gain and loss thresholds. These thresholds were then used to obtain sensitivity and specificity values for gains and losses. Every scRNA-seq method requires a set of euploid reference cells to normalize the expression of the analyzed cells. For the primary tissue samples, the common assumption is that the measured tissues are a mixture of tumor and normal cells, of which the latter can be used as a reference. Some methods rely on cell type annotations provided by the user to specify the reference, while other methods provide two options: user-provided cell type annotations or automatic detection of normal cells. To ensure reproducibility between methods, cells were annotated manually into tumor and healthy cells per sample, and the same healthy cells were used as reference for all methods unless specified otherwise. We applied the published cell type annotation for the BCC and ALL datasets and performed manual annotation based on Louvain clustering and known marker genes for the MM, the iAMP21 and the mouse datasets (see Methods). For the cancer cell lines there exist no directly matched reference cells and therefore we chose, for each dataset, a matched external reference dataset with healthy cells from the same or at least very similar cell types (Supplementary Data [71]1). Since the choice of the reference euploid dataset used for normalization may affect the final CNV calling results, we tested the impact of different references on