Abstract

   Despite its popularity, characterization of subpopulations with
   transcript abundance is subject to a significant amount of noise. We
   propose to use effective and expressed nucleotide variations (eeSNVs)
   from scRNA-seq as alternative features for tumor subpopulation
   identification. We develop a linear modeling framework, SSrGE, to identify
   eeSNVs associated with gene expression. In all the datasets tested,
   eeSNVs achieve better accuracies than gene expression for identifying
   subpopulations. Previously validated cancer-relevant genes are also
   highly ranked, confirming the significance of the method. Moreover,
   SSrGE is capable of analyzing coupled DNA-seq and RNA-seq data from the
   same single cells, demonstrating its value in integrating multi-omics
   single cell techniques. In summary, SNV features from scRNA-seq data
   have merits for both identifying subpopulations and linking genotype
   to phenotype.
     __________________________________________________________________

   Identification of cell subpopulations using transcript abundance is
   noisy. Here, the authors developed a linear modeling framework, SSrGE,
   which utilizes effective and expressed nucleotide variations from
   single-cell RNA-seq to identify tumor subpopulations.

Introduction

   Characterization of phenotypic diversity is a key challenge in the
   emerging field of single-cell RNA-sequencing (scRNA-seq). In scRNA-seq
   data, patterns of gene expression (GE) are conventionally used as
   features to explore the heterogeneity among single cells^[30]1–[31]3.
   However, GE features are subject to a significant amount of
   noise^[32]4. For example, GE might be affected by batch effects, where
   results obtained from two different runs of experiments may present
   substantial variations^[33]5, even when the input materials are
   identical. Additionally, the expression of particular genes varies with
   cell cycle^[34]6, increasing the heterogeneity observed in single
   cells^[35]7. To cope with these sources of variations, normalization of
   GE is usually a mandatory step before downstream functional
   analysis^[36]7. Even with these procedures, other sources of bias
   remain, such as those arising from read depth, cell capture efficiency,
   and experimental protocols.

   Single-nucleotide variations (SNVs) are genetic alterations of one
   single base occurring in specific cells as compared to the population
   background. SNVs may affect gene expression through cis and/or trans
   effects^[37]8,[38]9. The disruption of genetic stability, e.g., an
   increasing number of new SNVs, is known to be linked
   with cancer evolution^[39]10,[40]11. A cell may become the precursor of
   a subpopulation (clone) upon gaining a set of SNVs. Considerable
   heterogeneity exists not only between tumors but also within the same
   tumor^[41]12,[42]13. Therefore, investigating the patterns of SNVs
   provides a means to understand tumor heterogeneity.

   In single cells, SNVs are conventionally obtained from single-cell
   exome-sequencing and whole-genome sequencing approaches^[43]14. The
   resulting SNVs can then be used to infer cancer cell
   subpopulations^[44]15,[45]16. In this study, we propose to obtain
   useful SNV-based genetic information from scRNA-seq data, in addition
   to the GE information. Rather than being mere by-products of
   scRNA-seq, these SNVs not only have the potential to improve the accuracy
   of identifying subpopulations compared to GE, but also offer unique
   opportunities to study the genetic events (genotype) associated with
   gene expression (phenotype)^[46]17,[47]18. Moreover, when the coupled
   DNA- and RNA-based single-cell sequencing techniques become mature, the
   computational methodology proposed in this report can be adopted as
   well^[48]19.

   Here we first built a computational pipeline to identify SNVs from
   scRNA-seq raw reads directly. We then constructed a linear modeling
   framework to obtain filtered, effective, and expressed SNVs (eeSNVs)
   associated with gene expression profiles. In all the datasets tested,
   these eeSNVs show better accuracies at retrieving cell subpopulation
   identities, compared to those from gene expression (GE). Moreover, when
   combined with cell entities into bipartite graphs, they demonstrate
   improved visual representation of the cell subpopulations. We ranked
   eeSNVs and genes according to their overall significance in the linear
   models and discovered that several top-ranked genes (e.g., HLA genes)
   appear commonly in all cancer scRNA-seq data. In summary, we emphasize
   that extracting SNVs from scRNA-seq data can successfully identify
   subpopulation complexity and highlight genotype–phenotype
   relationships.
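
   To make the idea of selecting SNVs through a linear model concrete,
   the following is a minimal sketch rather than the SSrGE implementation.
   It assumes an L1-penalized (LASSO-style) regression of one gene's
   expression on binary SNV calls, fitted by coordinate descent; the
   text above specifies only a "linear modeling framework", so the
   penalty, the function names (fit_sparse_linear, select_eesnvs), the
   alpha value, and the toy data are all illustrative assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_sparse_linear(snv_matrix, expression, alpha=0.1, n_iter=500):
    """Fit expression ~ intercept + SNVs with an L1 penalty on the weights.

    snv_matrix: one row per cell, each row a list of 0/1 SNV calls.
    expression: expression of a single gene, one value per cell.
    Minimizes (1/(2n)) * sum(residual^2) + alpha * sum(|weights|).
    Returns (intercept, weights). Hypothetical helper, not the SSrGE API.
    """
    n_cells = len(expression)
    n_snvs = len(snv_matrix[0])
    weights = [0.0] * n_snvs
    intercept = sum(expression) / n_cells
    # residual_i = y_i - intercept - sum_j x_ij * w_j
    residual = [y - intercept for y in expression]
    for _ in range(n_iter):
        for j in range(n_snvs):
            column = [row[j] for row in snv_matrix]
            norm = sum(x * x for x in column) / n_cells
            if norm == 0.0:
                continue  # SNV absent from every cell: weight stays 0
            # Covariance of SNV j with the residual, adding back the
            # contribution SNV j currently makes to the fit.
            rho = sum(x * (r + x * weights[j])
                      for x, r in zip(column, residual)) / n_cells
            new_weight = soft_threshold(rho, alpha) / norm
            delta = new_weight - weights[j]
            if delta != 0.0:
                residual = [r - x * delta for x, r in zip(column, residual)]
                weights[j] = new_weight
        # The intercept is not penalized: re-center the residuals.
        shift = sum(residual) / n_cells
        intercept += shift
        residual = [r - shift for r in residual]
    return intercept, weights


def select_eesnvs(weights, tol=1e-8):
    """Return indices of SNVs whose fitted weight is non-zero."""
    return [j for j, w in enumerate(weights) if abs(w) > tol]


# Toy data (assumed): 8 cells, 3 candidate SNVs, one gene.
# SNV 0 tracks expression, SNV 1 is unrelated, SNV 2 is never observed.
snv_matrix = [
    [1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 1, 0],
    [0, 0, 0], [0, 1, 0], [0, 0, 0], [0, 1, 0],
]
expression = [5.0, 5.1, 4.9, 5.0, 1.0, 0.9, 1.1, 1.0]

intercept, weights = fit_sparse_linear(snv_matrix, expression, alpha=0.1)
selected = select_eesnvs(weights)
print("retained SNV indices:", selected)
```

   In this sketch the penalty strength alpha controls how many SNVs
   survive: a larger value drives more weights to exactly zero, so only
   SNVs with a strong association with expression are retained.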

Results

SNV calling from scRNA-seq data

   We implemented a pipeline to identify SNVs directly from FASTQ files of
   scRNA-seq data, following the SNV guideline of GATK (Supplementary
   Figure [49]1). We applied this pipeline to five scRNA-seq cancer
   datasets (Kim^[50]20, Ting^[51]21, Miyamoto^[52]22, Patel^[53]23, and
   Chung^[54]24; see Methods), and tested the efficiency of SNV features in
   retrieving single cell groups of interest. These datasets vary in
   tissue types, origins (mouse or human), read lengths and mappability
   (Table [55]1). They all have pre-defined cell types (subclasses),
   providing useful references for assessing the performance of a variety