Abstract

   Copy number variations (CNVs), the gain or loss of genomic regions, are
   associated with disease, especially cancer. Single cell technologies
   offer new possibilities to capture within-sample heterogeneity of CNVs
   and identify subclones relevant for tumor progression and treatment
   outcome. Several computational tools have been developed to identify
   CNVs from scRNA-seq data. However, an independent benchmarking of them
   is lacking. Here, we evaluate six popular methods in their ability to
   correctly identify ground truth CNVs, euploid cells and subclonal
   structures in 21 scRNA-seq datasets. We discover dataset-specific
   factors influencing the performance, including dataset size, the number
   and type of CNVs in the sample and the choice of the reference dataset.
   Methods which include allelic information perform more robustly for
   large droplet-based datasets, but require higher runtime. Furthermore,
   the methods differ in their additional functionalities. We offer a
   benchmarking pipeline to identify the optimal method for new datasets,
   and improve methods’ performance.

   Subject terms: Genome informatics, Cancer genomics, Bioinformatics, RNA
   sequencing
     __________________________________________________________________

   Several computational tools have now been developed to identify copy
   number variations (CNVs) from scRNA-seq data. Here authors benchmark
   these methods, showing that performance is affected by dataset quality,
   CNV type and reference dataset, with methods including allelic
   information being more robust in large datasets.

Introduction

   Copy number variations (CNVs) describe the gain or loss of genomic
   regions, from small sequences up to complete chromosomes. These genomic
   alterations lead to aneuploidy and are associated with different
   diseases and cancer types^[36]1. Specific CNVs are hallmarks for the
   classification of tumors, and are related to tumor progression and
   treatment outcomes^[37]2–[38]4. However, the direct functional
   consequences of CNVs are not yet fully understood^[39]4. Tumors are
   very heterogeneous, with different tumor cells having distinct
   molecular phenotypes. This also applies to CNVs, which can differ
   substantially between cellular subclones among samples and within the
   same sample, emphasizing the importance of cell-specific
   analyses^[40]5–[41]7. Single-cell whole-genome sequencing is considered
   the gold-standard technique to obtain per-cell CNV profiles^[42]8, as
   changes in DNA copy number should lead to observable changes in the
   read sequencing depth. However, the technology is not frequently used
   in the laboratory compared to other single-cell technologies.

   Instead, computational methods have been developed to infer the CNV
   profiles from single-cell RNA-seq data^[43]5,[44]9–[45]14 and
   single-cell assay for transposase-accessible chromatin with sequencing
   (scATAC-seq) data^[46]15,[47]16. These approaches have the advantage
   that, apart from the copy number gains and losses, the information
   about the cellular state can also be obtained from the same measurement
   (gene expression or open chromatin). For scATAC-seq, the read-out is
   relatively similar to whole-genome sequencing, as the genome is also
   sequenced, and therefore, the read coverage provides information about
   ploidy. However, for scRNA-seq, the inference of CNVs is challenging,
   as the expression level of genes is highly affected by regulatory
   mechanisms, and therefore, it provides only indirect information about
   the CNV state. Nevertheless, the general assumption of all
   computational methods that infer CNVs from scRNA-seq data is that genes
   in gained regions show higher expression, and in lost regions lower
   expression, compared to genes in diploid regions. This requires all
   methods to have sophisticated data normalization strategies, using
   generally reference diploid samples, often in combination with
   denoising approaches, before the different CNV inference strategies can
   be applied.

   Because of the wealth of scRNA-seq data available, the correct
   identification of CNVs from this data modality is crucial to studying
   the role of CNVs in cancer and other aneuploid tissues. scRNA-seq CNV
   callers are currently used in many applications, e.g., ref.
   ^[48]17–[49]20. However, there is no independent validation that shows
   whether scRNA-seq CNV calling methods can correctly identify CNVs, and
   which of the CNV callers works the best.

   In this work, we benchmark six popular CNV callers for scRNA-seq data
   using 21 different datasets. We include datasets generated with
   different technologies, droplet-based and plate-based, and from
   different organisms, such as humans and mice. We evaluate the general
   CNV prediction performance for each method, comparing its results to a
   ground truth provided by an orthogonal CNV measurement (either (single
   cell) whole-genome sequencing ((sc)WGS) or whole exome sequencing
   (WES)), using correlation, area under the curve (AUC) values and F1
   scores. We also assess the prediction of CNVs on a diploid sample, the
   correctness of the inferred clonal structure, the impact of the
   selected reference dataset on the performance, and the runtime and
   memory requirements. In addition, we evaluate the automatic
   identification of cancer cells for the methods that allow it. Our
   evaluation is publicly available with a reproducible Snakemake pipeline
   ([50]https://github.com/colomemaria/benchmark_scrnaseq_cnv_callers)^[51
   ]21. This enables the direct testing of new datasets to determine
   optimal CNV calling strategies, and it facilitates comparisons between
   methods to improve the performance of newly developed computational
   tools.

Results

scRNA-seq CNV calling benchmarking

   We included in our benchmarking study six CNV calling methods that were
   developed specifically for scRNA-seq data (Table [52]1). The methods
   can be broadly classified into two categories: one class that uses only
   the expression levels per gene, consisting of InferCNV^[53]5,
   copyKat^[54]12, SCEVAN^[55]13 and CONICSmat^[56]9; and a second class
   that combines the expression values with minor allele frequency (AF)
   information, consisting of CaSpER^[57]11 and Numbat^[58]14. CaSpER and
   Numbat use AFs per SNP called directly from the scRNA-seq reads, and
   both models implement a Hidden Markov Model (HMM) to call CNVs. Also,
   InferCNV identifies CNVs using an HMM, but based on expression levels
   only. copyKat and SCEVAN both apply a segmentation approach, while
   CONICSmat estimates the CNVs based on a Mixture Model. All methods were
   run as recommended in the respective tutorials or based on default
   parameters.

Table 1.

   CNV calling methods from scRNA-seq
   Method (tested version) Model Input Output resolution Explicit
   reference optional Cancer cell identification
   InferCNV (v1.10.0) HMM & Bayesian MM Expression Gene and subclone No No
   CONICSmat (v0.0.0.1) Mixture model Expression Chromosome arm and cell
   Yes No
   CaSpER (v0.2.0) Expression HMM & BAF signal shift Expression &
   Genotypes Segment and cell No No
   copyKat (v1.1.0) integrative Bayesian segmentation Expression Gene and
   cell Yes Yes
   Numbat (v1.4.0) haplotyping AFs & combined HMM Expression & Genotypes
   Gene and subclone (Yes) Yes
   SCEVAN (v1.0.1) segmentation with a variational region growing
   algorithm Expression Segment and subclone Yes Yes
   [59]Open in a new tab

   HMM hidden Markov model, MM mixture mode, (B)AF (B-)allele frequency.

   The output of the CNV prediction depends on the method (Table [60]1).
   Half of the methods report the results per cell (CONICSmat, copyKat and
   CaSpER), while InferCNV, SCEVAN and Numbat group cells into subclones
   with the same CNV profile. Also, the resolution differs, with CONICSmat
   reporting the results only per chromosome arm, and all other methods
   either per gene or per segment consisting of multiple genes. Several of
   the methods have two possible outputs: a discrete CNV prediction and a
   normalized expression score; in these cases, both outputs were included
   separately in the evaluation and were abbreviated as “(CNV)” and
   “(Expr)”, respectively. More details can be found in the Methods.

   We tested all scRNA-seq CNV callers on 21 different single cell RNA-seq
   datasets (Fig. [61]1), comprising 13 human cancer cell lines (nine
   gastric cell lines, two colorectal adenocarcinoma lines (COLO320,
   HCT116), one breast cancer line (MCF7) and one melanoma cell line
   (A375)), six human primary tumor samples (three acute lymphoblastic
   leukemia samples (iAMP21, ALL1, ALL2), two basal cell carcinoma (BCC)
   samples and one multiple myeloma (MM) sample), one mouse primary tumor
   sample and one human diploid dataset (peripheral blood mononuclear
   cells (PBMCs)) (Supplementary Data [62]1). Seventeen datasets were
   measured with droplet-based technologies, and the four others with a
   plate-based technology.

Fig. 1. Overview of the benchmarking workflow.

   [63]Fig. 1
   [64]Open in a new tab

   Input data in blue, evaluation results in pink.

   Different metrics were included for benchmarking, most of them based on
   the comparison with a ground truth for the CNVs. We obtained this
   ground truth from either (sc)WGS or WES data (Supplementary
   Data [65]1). The different datasets showed large variation in CNV
   distribution, with CNVs covering between 7% and 93% of the total
   genome, and more gained regions than lost regions in the majority of
   the datasets (Supplementary Fig. [66]1). Since the scRNA-seq methods
   are only able to predict the CNV status for genomic regions comprising
   genes, while the WGS ground truth covers (nearly) the complete genome,
   we could only compare modalities in gene regions. As in most cases, the
   ground truth was not measured in the same set of cells as the
   scRNA-seq, and was, in some cases, obtained from bulk measurements, we
   combined the per-cell results from the scRNA-seq methods to an average
   CNV profile, called pseudobulk, before the comparison. For the
   plate-based datasets, where scRNA-seq and scWGS were measured in the
   same cells, a cell-by-cell comparison was also performed.

   We applied threshold-independent evaluation metrics using correlation
   and AUC scores. For the AUC scores, predictions were evaluated
   separately for gain versus all and loss versus all, resulting in two
   scores. Not the complete range of thresholds is biologically meaningful
   for classifying regions as gains or losses, as every method defines a
   baseline score. For this reason, we chose to additionally calculate a
   partial AUC^[67]22,[68]23, with a maximal sensitivity defined by the
   baseline score so that only thresholds up to the baseline score were
   evaluated for losses and only scores higher than the baseline score
   were evaluated for gains (see Methods, Supplementary Fig. [69]2). Of
   note, partial AUC values below 0.5 indicate that most thresholds are
   outside the biologically meaningful value range (see Methods).

   With the same systematic, we evaluated the optimal gain and loss
   thresholds based on a multi-class F1 score (Supplementary Fig. [70]3),
   testing again only biologically meaningful gain and loss thresholds.
   These thresholds were then used to obtain sensitivity and specificity
   values for gains and losses.

   Every scRNA-seq method requires a set of euploid reference cells to
   normalize the expression of the analyzed cells. For the primary tissue
   samples, the common assumption is that the measured tissues are a
   mixture of tumor and normal cells, of which the latter can be used as a
   reference. Some methods rely on cell type annotations provided by the
   user to specify the reference, while other methods provide two options:
   user-provided cell type annotations or automatic detection of normal
   cells. To ensure reproducibility between methods, cells were annotated
   manually into tumor and healthy cells per sample, and the same healthy
   cells were used as reference for all methods unless specified
   otherwise. We applied the published cell type annotation for the BCC
   and ALL datasets and performed manual annotation based on Louvain
   clustering and known marker genes for the MM, the iAMP21 and the mouse
   datasets (see Methods).

   For the cancer cell lines there exist no directly matched reference
   cells and therefore we chose, for each dataset, a matched external
   reference dataset with healthy cells from the same or at least very
   similar cell types (Supplementary Data [71]1). Since the choice of the
   reference euploid dataset used for normalization may affect the final
   CNV calling results, we tested the impact of different references on