Abstract

Despite its popularity, characterization of subpopulations with transcript abundance is subject to a significant amount of noise. We propose to use effective and expressed nucleotide variations (eeSNVs) from scRNA-seq as alternative features for tumor subpopulation identification. We develop a linear modeling framework, SSrGE, to link eeSNVs to gene expression. In all the datasets tested, eeSNVs achieve better accuracies than gene expression for identifying subpopulations. Previously validated cancer-relevant genes are also highly ranked, confirming the significance of the method. Moreover, SSrGE is capable of analyzing coupled DNA-seq and RNA-seq data from the same single cells, demonstrating its value in integrating multi-omics single-cell techniques. In summary, SNV features from scRNA-seq data have merits for both subpopulation identification and linkage of genotype-phenotype relationships.

Identification of cell subpopulations using transcript abundance is noisy. Here, the authors developed a linear modeling framework, SSrGE, which utilizes effective and expressed nucleotide variations from single-cell RNA-seq to identify tumor subpopulations.

Introduction

Characterization of phenotypic diversity is a key challenge in the emerging field of single-cell RNA-sequencing (scRNA-seq). In scRNA-seq data, patterns of gene expression (GE) are conventionally used as features to explore the heterogeneity among single cells^1–3. However, GE features are subject to a significant amount of noise^4. For example, GE can be affected by batch effects, where results obtained from two different experimental runs may differ substantially^5, even when the input materials are identical. Additionally, the expression of particular genes varies with the cell cycle^6, increasing the heterogeneity observed in single cells^7.
To cope with these sources of variation, normalization of GE is usually a mandatory step before downstream functional analysis^7. Even with these procedures, other sources of bias remain, e.g., those dependent on read depth, cell capture efficiency, and experimental protocols.

Single-nucleotide variations (SNVs) are genetic alterations of a single base occurring in specific cells as compared to the population background. SNVs may manifest their effects on gene expression through cis and/or trans effects^8,9. The disruption of genetic stability, e.g., an increasing number of new SNVs, is known to be linked with cancer evolution^10,11. A cell may become the precursor of a subpopulation (clone) upon gaining a set of SNVs. Considerable heterogeneity exists not only between tumors but also within the same tumor^12,13. Therefore, investigating the patterns of SNVs provides a means to understand tumor heterogeneity. In single cells, SNVs are conventionally obtained from single-cell exome-sequencing and whole-genome sequencing approaches^14. The resulting SNVs can then be used to infer cancer cell subpopulations^15,16.

In this study, we propose to obtain useful SNV-based genetic information from scRNA-seq data, in addition to the GE information. Rather than being mere by-products of scRNA-seq, SNVs not only have the potential to improve the accuracy of identifying subpopulations compared to GE, but also offer unique opportunities to study the genetic events (genotype) associated with gene expression (phenotype)^17,18. Moreover, when coupled DNA- and RNA-based single-cell sequencing techniques become mature, the computational methodology proposed in this report can be adopted for them as well^19.

Here we first built a computational pipeline to identify SNVs directly from scRNA-seq raw reads.
We then constructed a linear modeling framework to obtain filtered, effective, and expressed SNVs (eeSNVs) associated with gene expression profiles. In all the datasets tested, these eeSNVs achieve higher accuracy in retrieving cell subpopulation identities than gene expression (GE) features. Moreover, when combined with cell entities into bipartite graphs, they yield improved visual representations of the cell subpopulations. We ranked eeSNVs and genes according to their overall significance in the linear models and discovered that several top-ranked genes (e.g., HLA genes) appear commonly in all cancer scRNA-seq data. In summary, we emphasize that extracting SNVs from scRNA-seq data can successfully identify subpopulation complexity and highlight genotype–phenotype relationships.

Results

SNV calling from scRNA-seq data

We implemented a pipeline to identify SNVs directly from FASTQ files of scRNA-seq data, following the GATK guidelines for SNV calling (Supplementary Figure 1). We applied this pipeline to five scRNA-seq cancer datasets (Kim^20, Ting^21, Miyamoto^22, Patel^23, and Chung^24; see Methods), and tested the efficiency of SNV features in retrieving single-cell groups of interest. These datasets vary in tissue type, origin (mouse or human), read length, and mappability (Table 1). They all have pre-defined cell types (subclasses), providing useful references for assessing the performance of a variety