Abstract

   RNA splicing shapes the gene regulatory programs that underlie various
   physiological and disease processes. Here, we present the SCASL
   (single-cell clustering based on alternative splicing landscapes)
   method for interrogating the heterogeneity of RNA splicing with
   single-cell RNA-seq data. SCASL resolves the issue of biased and sparse
   data coverage on single-cell RNA splicing and provides a new scheme for
   classifications of cell identities. With previously published datasets
   as examples, SCASL identifies new cell clusters indicating potentially
   precancerous and early-tumor stages in triple-negative breast cancer,
   illustrates cell lineages of embryonic liver development, and provides
   fine clusters of highly heterogeneous tumor-associated CD4 and CD8 T
   cells with functional and physiological relevance. Most of these
   findings are not readily available via conventional cell clustering
   based on single-cell gene expression data. Our study shows the
   potential of SCASL in revealing the intrinsic RNA splicing
   heterogeneity and generating biological insights into the dynamic and
   functional cell landscapes in complex tissues.

   Subject terms: Bioinformatics, Data mining, RNA splicing,
   Transcriptomics, Transcriptomics
     __________________________________________________________________

   RNA splicing serves as a critical layer of gene expression regulation.
   Here, authors introduce SCASL for investigating the heterogeneity of
   RNA splicing landscapes at single-cell resolution, offering a novel
   scheme for classifying cell identities with physiological relevance.

Introduction

   Variations in RNA splicing serve as a major cause of extensive
   transcriptome complexity in mammalian species^[30]1–[31]3. Alternative
   splicing (AS) is closely related to various cellular activities such as
   cell lineage differentiation, cell proliferation, and
   apoptosis^[32]4–[33]8. Shifted landscapes of RNA splicing are primary
   outcomes as well as key driving factors of various physiological and
   pathological processes^[34]9,[35]10. For example, complexity in RNA
   splicing is a main characteristic of tumor heterogeneity related to
   metastasis, drug sensitivity, and relapse, thereby serving as a
   prognostic marker for multiple cancer types, including breast
   cancer^[36]11, glioblastoma^[37]12, non-small cell lung cancer^[38]13,
   and ovarian cancer^[39]14.

   Transcriptome profiling at single-cell resolution has become a common
   practice for dissection of cell heterogeneity in complex
   tissues^[40]15,[41]16. Single-cell transcriptome profiling based on the
   SMART-seq method generates RNA-seq reads with coverage of full-length
   transcripts, rather than just the 3’ regions, as do most droplet-based
   methods such as Drop-seq and 10x Genomics. Therefore, the RNA splicing
   profiles are readily available from these datasets. However, the vast
   majority of previous studies employing single-cell RNA-seq (scRNA-seq)
   have solely relied on the overall expression levels of genes for
   classifications of cell subpopulations^[42]17. The RNA splicing
   landscapes, which represent a critical layer of transcriptome
   regulation, have not been considered in most of the previous studies.

   Unlike single-cell gene expression profiles, AS profiles represent the
   relative abundance of different transcript isoforms, thereby sheltering
   the intrinsic fluctuations of single-cell gene expression levels.
   Importantly, even though bulk RNA-seq data of heterogeneous tissues or
   cells commonly shows mixtures of isoforms, individual cells are
   frequently dominated by specific isoforms for many genes^[43]9,
   indicating strong cell specificity of AS events. Finally, in many
   cases, AS results in different protein products, rather than just
   changes in protein abundance, therefore potentially leading to more
   dramatic shifts in biological pathways and processes. These features
   make single-cell AS profiles promising indexes for the classification
   of single cells with potentially high physiological relevance.

   The existing approaches for quantifying alternative splicing (AS)
   levels were based on either PSI (percent splice-in) or junction reads.
   Most of these AS analysis tools primarily focus on identifying
   differential splicing events. For instance, BRIE^[44]18 (for
   single-cell data) and Expedition^[45]19 (for single-cell data) employ
   Bayesian models to estimate PSI for differential splicing analysis,
   while rMATS^[46]20 (for bulk data) employs a linear mixed-effects
   model. Psix^[47]21 quantifies AS using PSI values and identifies
   splicing events associated with cell state through autocorrelation
   models. On the other hand, Leafcutter^[48]22 identifies junction reads
   for definitions of AS events and then adopts a generalized linear model
   to quantify the AS events. FRASER^[49]23 uses a different strategy for
   identifying and quantifying the AS events from junction reads.
   MARVEL^[50]24 relies on PSI defined by rMATS and junctions reads to
   define single-cell AS and then integrates gene expression data to
   analyze the effects of differential splicing between cell groups.
   Additionally, newer methods have emerged that define AS levels at the
   gene level rather than for specific AS events, such as SpliZ^[51]25.
   Nonetheless, these tools typically require known cell labels for
   differential analysis, and methods specifically designed for
   unsupervised clustering based on single-cell splicing landscapes are
   urgently needed.

   In the present study, we developed SCASL (single-cell clustering based
   on alternative splicing landscapes), a new strategy of cell clustering
   based on single-cell RNA splicing landscapes. Specifically, SCASL
   employs a strategy similar to that of LeafCutter^[52]22 and
   FRASER^[53]23 to identify AS events from the junction reads in
   single-cell SMART-seq data. This method does not rely on predefined
   annotation of transcriptomes, thereby recovering both known and novel
   AS events. AS probabilities are then inferred from proportions of the
   junction reads assigned to the same 5’ AS or 3’ AS events. The profiles
   of AS probabilities are then used for imputations of the missing
   values, followed by spectral clustering of the cells.

   We tested the performance of SCASL on defining cell subpopulations with
   single-cell RNA-seq data of embryonic tissue development, tumor cells,
   and tumor-associated immune cells. Cell lineage differentiation is
   extremely important in physiological and disease development. However,
   owing to the intrinsic gene expression fluctuations in the progressive
   process of lineage differentiation, it has been challenging to clearly
   define cells at key transitional stages simply based on gene expression
   profiles. In addition, highly heterogeneous cells in complex tissues,
   such as tumor cells and tumor-associated immune cells, usually show
   very diverse gene expression profiles, making it difficult to clearly
   and consistently define boundaries between cell clusters.

   When applied to datasets of triple negative breast cancer, SCASL
   defined clear cell subtypes indicating precancerous transformation of
   epithelial cells and early-stage tumor cells. With embryonic liver
   data, SCASL recovered a series of transitional stages during
   developments of the hepatocyte and cholangiocyte lineage lineages.
   Finally, highly diverse tumor associated T cells were also classified
   into subpopulations with distinct molecular and cellular
   characteristics related to tumor infiltration, activation, and
   exhaustion. In contrast, the methods commonly used with single-cell
   gene expression data did not recapitulate such cell groups with
   physiological relevance. Therefore, the heterogeneous landscapes of RNA
   splicing provide unique information for precise definitions of
   physiologically relevant cell subtypes, which is valuable for
   understanding the fine tuning that occurs during complex development
   and disease contexts. We propose that SCASL can be used as an efficient
   method for mining such information.

Results

The SCASL pipeline for cell clustering based on single-cell RNA splicing
landscapes

   The SCASL pipeline consists of three major steps, i.e., (i)
   establishment of a single-cell RNA AS probability matrix, (ii)
   imputation of the missing values in the matrix, and (iii) spectral
   clustering of the single cells based on the patched AS probability
   matrix (details in the Methods section). In brief, first, de novo
   identification of RNA splicing events is performed by mapping
   single-cell RNA-seq reads, which are usually obtained from SMART-seq
   techniques, to the reference genome. This procedure does not rely on
   prior annotation of exons, thereby capturing complex AS events in a
   more comprehensive manner. Multiple splicing schemes taking place with
   a common 5’ or 3’ splice site are then clustered together, and the
   split reads are counted to estimate the probabilities of these
   different 3’ AS or 5’ AS events (Fig. [54]1). This procedure is
   performed for each single cell, generating an AS probability matrix
   representing the transcriptome-wide landscape of splicing at
   single-cell resolution.

Fig. 1. Schematic overview of the SCASL pipeline.

   [55]Fig. 1
   [56]Open in a new tab

   SCASL takes raw scRNA-seq data as input and generates classifications
   of cell subpopulations. The pipeline is composed of 3 major steps:
   establishment of an AS probability matrix from the input data,
   imputation of the missing values in the matrix, and spectral clustering
   of the single cells.

   Due to the relatively low sequencing depth and limited coverage of
   single-cell RNA-seq data, the probabilities of many AS events cannot be
   reliably estimated for all the single cells in a dataset. Therefore,
   the AS probability matrix greatly suffers from frequent missing values
   (marked as NA). SCASL introduces a strategy of iterative weighted KNN
   for imputation of these missing values (Fig. [57]1). In brief, SCASL
   determines k nearest neighbors for each cell based on the Euclidean
   distances between the AS probabilities of the cells. The missing values
   of each cell are then inferred by taking the weighted averages of the
   neighboring cells. The new AS probability matrix is then used to update
   the missing values inferred from the previous round. This process is
   conducted iteratively until convergence. We chose KNN because it
   significantly outperformed other imputation methods, such as mean value
   imputation and two model-dependent methods, MAGIC^[58]26 and
   scImpute^[59]27 (data not shown).

   The imputation strategy was tested with forced dropouts of different
   proportions of the non-NA values in 3 datasets. The imputed AS
   probabilities were highly accurate compared to the true values
   (Fig. [60]S1A). Even if 90% of the AS probability values were
   dropped-out, the majority of these data points were still correctly
   recovered. This suggests that transcriptome-wide splicing information
   is highly redundant for identifying cells with similar splicing
   landscapes, thereby allowing precise imputation of missing AS
   probabilities.

   Despite the high fidelity of the imputation process, potential errors
   of the AS probabilities inferred for the originally missing values
   still cannot be completely ruled out. In addition, the sparsity pattern
   of the AS matrix, which is remotely related to gene expression, should
   also be informative for the cellular physiological status, as suggested
   by previous studies^[61]3,[62]28. Therefore, new binary vectors marking
   the positions of the imputed values are concatenated with the vectors
   of AS probabilities, giving rise to a new and expanded AS probability
   matrix.

   SCASL employs spectral clustering, a graph-based clustering
   algorithm^[63]29, with the modified AS probability matrix above. The
   elbow method is implemented to help determine the optimal number of
   clusters^[64]30. To showcase its performance, SCASL was applied to 6
   sets of single-cell transcriptome data generated by the SMART-seq
   method (Table [65]1). The resulting cell clusters were significantly
   different than those obtained simply from the RNA expression profiles
   with the canonical Seurat^[66]31 pipeline (Fig. [67]S1B), which will be
   discussed in the following sections.

Table 1.

   Data resources of single-cell RNA-seq
   Cell source Species Method Cell number AS events AS genes Reference
   TNBC Homo sapiens Smart-seq2 422 6111 2178 ^[68]33
   TNBC Homo sapiens Smart-seq2 443 3565 2248 ^[69]34
   Hepatoblast Mus musculus Smart-seq2 486 3204 1572 ^[70]47
   HCC Immune cells Homo sapiens Smart-seq2 2349 10257 3879 ^[71]61
   Brain Mus musculus Smart-seq2 1734 3058 1615 ^[72]55
   GBM Immune cells Homo sapiens Smart-seq2 1218 8235 2795 ^[73]98
   Brain Mus musculus 10X 2442 267 170 ^[74]97
   [75]Open in a new tab

   We then tested the robustness of SCASL to data sparsity and lack of
   coverage, i.e., tolerance to forced dropouts of AS probability values
   or randomly removed genes. As shown in Fig. [76]S1C, even in the cases
   when 50% of the AS probability values were dropped out, SCASL still
   largely reproduced the cell clustering landscapes obtained from the
   original data. Randomly removing large proportions of the AS sites or
   genes from the original data also did not compromise the cell
   clustering patterns (Fig. [77]S1D). These results indicate the
   redundancy of information within the splicing landscapes for defining
   cell clusters by SCASL.

   Finally, to test the stability of the clustering results of SCASL, we
   performed random sub-sampling of the cells (100%, 75%, or 50%), which
   were then used for cell clustering based on their splicing or gene
   expression profiles (Figs. [78]S2A, B). For the group of 100%
   sub-sampling, clustering procedures were simply repeated with the full
   datasets, and the results provide an assessment of robustness when
   different random seeds were used by the computer. As shown in
   Fig. [79]S2C, for different ratios of sub-sampling, the cell clusters
   defined by SCASL based on single-cell AS profiles are much more robust
   than the ones defined by Seurat based on the gene expression profiles.
   When repeated with the full datasets, SCASL also demonstrated greater
   stability compared to the expression profile-based method.

SCASL reveals heterogeneous tumor cell clusters, providing new insights into
tumor progression

   Due to extensive intra-tumor heterogeneity of single-cell gene
   expression profiles, it has been technically challenging to define
   cancer cell clusters with clear physiological relevance^[80]32. With a
   dataset of 422 tumor cells (259 primary tumor cells and 163
   micrometastatic cells) from one patient with triple-negative breast
   cancer (TNBC)^[81]33, SCASL defined 3 clusters (Fig. [82]2A), which
   were largely inconsistent with the cell clusters defined based on the
   single-cell gene expression data (Figs. [83]2B, [84]S3A). Specifically,
   one of the newly defined clusters based on AS, C0, was dominated by
   micrometastatic cells (Figs. [85]2A, [86]S3B). The genes upregulated in
   C0 served as strong prognostic markers for patients with
   micrometastatic breast cancer with worse survival rates (Fig. [87]S3C).
   The other two cell clusters, C1 and C2, were mainly composed of primary
   tumor cells, with small proportions of micrometastatic cells
   (Figs. [88]2A, [89]S3B).

Fig. 2. TNBC tumor cell heterogeneity characterized by SCASL.

   [90]Fig. 2
   [91]Open in a new tab

   A, B UMAP plot showing clustering of 422 TNBC tumor cells by SCASL
   based on the AS landscapes (A) or clustering by Seurat based on the
   gene expression profiles with 3500 variable features. The cells are
   color labeled by the clusters defined by SCASL. The principal component
   analysis (PCA) is performed with a dimensionality reduction number of
   20 for both methods. Source data are provided as a Source Data file. C
   Pseudotime analysis by CytoTRACE for the three cell clusters (the
   numbers of cells in C2, C0, and C1 are: 115, 94, and 196 respectively).
   Wilcoxon rank-sum test were used to evaluate the statistical
   significance between different groups (two-sided test, using Bonferroni
   correction to adjust for multiple comparisons). The X-axis is sorted
   from small to large according to the CytoTRACE value. The p-values of
   the differences between C2 and C0, and C0 and C1 are 3.9e-15 and 0.023
   respectively. Data are presented as median values +/- SEM. Each box
   shows the median and interquartile range (IQR 25th–75th percentiles).
   Source data are provided as a Source Data file. D Gene Ontology (GO)
   functional enrichment of the genes downregulated in C2 cells compared
   to the other tumor cells (Wilcoxon test, two-sided test,
   p-value < 0.01). E, F Spearman correlations between the splicing
   profiles (E) or gene expression profiles (F) of TNBC tumor cells and
   normal breast epithelial cells. G Differential splicing profiles of C1
   (X-axis) or C0 (Y-axis) compared to C2. Dot sizes represent the mean
   differences in the AS probabilities, and the p-values of differential
   splicing calculated by Fisher’s exact test are shown in grayscale
   (two-sided test). H Kyoto Encyclopedia of Genes and Genomes (KEGG)
   pathway enrichment analysis of the differentially expressed genes in C1
   (X-axis) or C0 (Y-axis) compared to C2, represented by adjusted
   p-values obtained from GSEA.

   Among the three clusters of tumor cells, those in C2 were generally
   more differentiated than the others (Figs. [92]2C, [93]S4A). Note that
   the cell clusters defined by gene expression profiles did not show such
   significant differences in their differentiation stages (Fig. [94]S4B).
   Furthermore, compared to the other tumor cells, the cells in C2 were
   characterized by lower expression of genes involved in tumor-promoting
   processes, such as G2/M phase transition, nuclear division, DNA
   replication, response to hypoxia, and Wnt signaling (Fig. [95]2D).
   Therefore, we suspect that this cluster indicates a tumor cell subtype
   with less malignancy and potentially represents an early intermediate
   stage during TNBC cell development. We then compared these three
   clusters of tumor cells with normal breast epithelial cells from
   another published study of TNBC^[96]34. Interestingly, despite being
   doubtlessly classified as tumor cells, only the C2 cluster showed a
   great extent of similarity with normal epithelial cells based on AS
   profiles (Fig. [97]2E). This suggests that C2 cells may indeed be an
   intermediate stage during the malignant transformation of normal
   epithelial cells into cancerous cells in TNBC. Interestingly, such
   similarity between C2 and normal epithelial cells cannot be seen from
   the gene expression profiles (Fig. [98]2F), suggesting that the AS
   profiles indeed brought in additional and critical information to
   define cell identities. In addition, none of the tumor cell clusters
   defined in the original study based on the gene expression profiles
   showed such a pattern of similarity and thereby revealed such
   early-stage tumor cells (Figs. [99]S4C, D).

   Furthermore, the analysis of single nucleotide variations (SNVs) with
   Monovar^[100]35 showed that the cells in C1 bear significantly more
   SNVs than the cells in C2 (Fig. [101]S4E). This is consistent to our
   hypothesis that the cluster of C2 represents an early stage of tumor
   cells, whereas C1 represents a more malignant stage. By contrast, the
   cluster of C0, which mainly consists of the cells from micro-metastatic
   sites, exhibited slightly elevated rates of mutations, indicating that
   micro-metastasis takes places during early stage of tumor development.
   This is consistent to the previous notes^[102]36,[103]37. It should be
   noted that due to lack of non-cancerous cells from the same patient,
   these SNVs were identified by alignment of the RNA-seq reads with the
   human genome reference. They are not all somatic mutations.
   Nevertheless, the general trend of SNV numbers in the three clusters
   supports the proposed cellular patterns of tumor development revealed
   by SCASL based on the AS landscapes.

   Next, by comparing the AS landscapes of C1 and C0 to that of C2, we
   found that the micrometastatic cell cluster C0 and the primary tumor
   cell cluster C1 represented two branched trajectories of tumor
   development, rather than a sequential lineage (Fig. [104]2G). This
   observation is in line with the current understanding of the
   development of tumor micrometastasis at early stages of breast
   cancer^[105]36,[106]37. Indeed, compared to C2, C1 and C0 were
   characterized by different patterns of gene expression profile shifts
   (Fig. [107]S4F). As a result, in addition to a number of biological
   processes related to cancer development that were perturbed in both
   clusters of C1 and C0, many other processes and pathways were
   upregulated or downregulated only in C1 or C0 (Fig. [108]2H).
   Specifically, key metabolic processes, such as the TCA cycle, amino
   acid metabolism, and steroid biosynthesis, were upregulated only in C1
   (Fig. [109]2H), whereas the cells in C2 were characterized by
   repression of cell-cell/ECM interactions, ferroptosis, and signaling
   pathways such as the HIF-1 and PI3k-Akt signaling. Together, the
   results above indicate that the landscapes of RNA splicing indeed
   reveal additional insights into heterogeneous tumor cell populations
   with high physiological relevance.

SCASL identifies precancerous epithelial cells in TNBC based on AS profiles

   With another dataset of TNBC^[110]34, which consisted of 382 primary
   tumor cells and 61 normal epithelial cells from 6 patients, SCASL first
   correctly identified malignant and other normal cell types, including
   lymphocytes, macrophages, epithelial and stromal cells (Fig. [111]S5A).
   Further clustering of the tumor cells and normal epithelial cells
   identified two clusters of nonmalignant epithelial cells, C5 and C3,
   the latter of which appeared to be closer to the malignant cells
   (Fig. [112]3A). Although the clustering result based on gene expression
   data can also classify the tumor and normal cells into different
   clusters, it did not capture the pattern of C3 being positioned
   in-between the normal cell cluster C5 and other tumor cell clusters
   (Fig. [113]S5B). Indeed, compared to C5, C3 showed much higher
   similarity to other tumor cell clusters based on the single-cell AS
   profiles (Fig. [114]3B), a phenomenon not reproduced by the gene
   expression data of the cells (Fig. [115]3B). Pseudotime analysis
   further showed that the epithelial cells in C3 were generally less
   differentiated than the other normal epithelial cells in C5 but more
   differentiated than the cells in the malignant cell clusters
   (Figs. [116]3C, [117]S5C).

Fig. 3. SCASL identifies precancerous epithelial cells in TNBC based on AS
landscapes.

   [118]Fig. 3
   [119]Open in a new tab

   A TSNE plot showing clustering of 443 normal epithelial cells and
   primary TNBC tumor cells based on AS landscapes. The principal
   component analysis (PCA) is performed with a dimensionality reduction
   number of 20. Source data are provided as a Source Data file. B
   Heatmaps showing Spearman correlations between all normal and tumor
   cells based on AS (left) or normalized gene expression (right)
   profiles. C Pseudotime analysis by CytoTRACE for the clusters defined
   by SCASL (the numbers of cells are 14, 70, 73, 57, 86, 29, and 114,
   from left to right). Data are presented as median values +/- SEM. Each
   box shows the median and interquartile range (IQR 25th–75th
   percentiles). Source data are provided as a Source Data file. D The
   heatmap on the left shows the AS probabilities of the differential
   splicing events in C3 compared to C5 (p-value < 1e-5, hypergeometric
   distribution test were used in enrichment analysis, one-sided test).
   The dot plot on the right shows the pathways enriched by the genes
   bearing the differential spliced events, and some examples are
   provided. E Volcano plot showing the differentially expressed genes
   (t-test, two-sided test, adjusted p-value < 0.01) in C3 vs. C5.
   Differences in the median gene expression levels between the cells of
   C3 and C5 are shown on the X-axis, and the statistical significance
   values of the differentially expressed genes (adjusted p-values) are
   shown on the Y-axis. The sizes of the dots represent the proportions of
   the cells with expression. Biological functions and processes enriched
   in the differentially expressed genes are listed to the right, and some
   representative genes are provided. Source data are provided as a Source
   Data file.

   The signature genes of both C5 and C3 included well-known negative
   markers of epithelial-mesenchymal transition (EMT), such as
   PRG4^[120]38, EFEMP1^[121]39 and IFI16^[122]40,[123]41. Interestingly,
   the expression levels of these negative EMT markers were much lower in
   C3 than in C5 and were further decreased in the tumor cell clusters
   (Fig. [124]S5D). On the other hand, signature genes known to promote
   tumorigenesis, such as SMARCE1^[125]42, SPINT2^[126]43 and
   CPNE1^[127]44, were upregulated in C3 compared to C5 and, as expected,
   further upregulated in the tumor cell clusters (Fig. [128]S5D). These
   data above suggest that C3 represents a potentially intermediate stage
   during malignant development of TNBC cells from normal epithelial
   cells. Indeed, as shown in Fig. [129]S5E, the cells in C3 bear
   significantly more SNVs than the cells in C5. As expected, the tumor
   cells in C2 and C0 cells from the same patient bear more mutations than
   C3. Moreover, we used inferCNV^[130]45 to infer the DNA copy number
   changes for each single-cell, which are summarized as an cumulative CNV
   score. As shown in Fig. [131]S5F, the cells in C3 bear intermediate
   levels of DNA copy number changes, which are higher than the normal
   cells in C5 but lower than the tumor cells in C2 and C0, a pattern
   highly consistent to that of the SNVs.

   The normal epithelial cells in C3 and C5 were obtained mostly from
   patient PT089 (Fig. [132]S5G), indicating that the potential
   differences between these two clusters of normal epithelial cells were
   not due to heterogeneity between patients. The RNA splicing landscapes
   in C3 and C5 showed significant differences for the genes involved in
   cancer-related processes such as cell adhesion, apoptosis, ERBB
   signaling, and cadherin binding (Fig. [133]3D), suggesting that the AS
   landscape shift in the C3 normal epithelial cells contributed to
   tumorigenic perturbations of cellular processes.

   Finally, we compared the gene expression profiles between C3 and C5
   (Fig. [134]3E). In general, the genes downregulated in C3 were
   significantly enriched in classical antitumoral pathways, such as those
   related to ferroptosis, P53, TNF signaling, and cellular senescence,
   while the upregulated genes were related to pathways promoting cell
   localization, survival, and proliferation, such as the negative
   regulation of apoptosis, ECM-receptor interaction, PI3K-Akt, Hippo, and
   HIF-1 signaling pathways. Some of these characteristics could be
   conveniently attributed to AS profile shifts. For example, the ERBB
   signaling pathway, which was perturbed by AS in C3 compared to C5
   (Fig. [135]3D), promotes breast cancer tumorigenesis by activating the
   PI3K/Akt pathway^[136]46, which was indeed upregulated in C3
   (Fig. [137]3E).

   Taken together, the results above strongly suggest that although they
   were classified as noncancerous normal epithelial cells, the C3 cells
   already contained perturbations of some key cancer-related processes,
   potentially due to shifts in the AS landscape. Therefore, the RNA
   splicing landscape, as characterized by SCASL, can identify potential
   precancerous transitional cellular stages, thereby shedding light on
   the transformation of normal cells into tumor cells, which has been
   overlooked by mining of only single-cell gene expression data.

SCASL reveals cell development lineages and potential crosstalk

   In addition to the studies above of tumor cells, we also investigated
   the splicing landscapes of cells in normal development processes. With
   a scRNA-seq dataset of embryonic hepatoblasts^[138]47, SCASL
   recapitulated two cell lineages in the development of hepatocytes and
   cholangiocytes from hepatoblasts (Fig. [139]4A), which have been well
   described by previous studies^[140]48,[141]49. Specifically, the orders
   of the cell clusters were as follows: C2-C10-C3-C1-C8-C5 for hepatocyte
   development and C0-C9-C6 for cholangiocyte development (Figs. [142]4A,
   [143]S6A). Such orders of the cell clusters in these two lineages were
   confirmed by the embryonic development time (Fig. [144]S6B) and the
   pseudotime inferred by Cytotrace (Fig. [145]4B) or Monocle
   (Fig. [146]S6C). When comparing the gene expression profiles, the two
   series of hepatocyte and cholangiocyte clusters showed accumulative
   enrichment of the biological processes essential for the physiological
   functions of hepatocytes and cholangiocytes, respectively, along the
   proposed lineages (Fig. [147]4C, D).

Fig. 4. SCASL identifies developmental lineages of embryonic hepatoblasts.

   [148]Fig. 4
   [149]Open in a new tab

   A TSNE plot showing clustering of 486 embryonic liver cells by SCASL
   based on AS profiles. The principal component analysis (PCA) is
   performed with a dimensionality reduction number of 20. The arrows in
   the figure represent the approximate sequence of embryonic days
   corresponding to each cluster. Source data are provided as a Source
   Data file. B Pseudotime analysis by CytoTRACE for the
   hepatoblast/hepatocyte and cholangiocyte clusters. Seven
   hepatoblast/hepatocyte clusters contained a total of 349 cells (the
   numbers of cells are 31, 11, 30, 52, 140, 39, and 56, from left to
   right), and four cholangiocyte clusters contained a total of 137 cells
   (the number of cells are 57, 36, 13, and 21, from left to right). Data
   are presented as median values +/- SEM. Each box shows the median and
   interquartile range (IQR 25th–75th percentiles). Source data are
   provided as a Source Data file. C, D Biological function enrichment
   analysis of the differentially expressed genes (Wilcoxon test,
   two-sided test, p-value < 1e-5) in each cluster of the hepatocyte
   lineage compared to C2 (C) or in each cluster of the cholangiocyte
   lineage compared to C0. Red indicates upregulation, and blue indicates
   downregulation. E, F AS profiles of the top differential splicing
   events based on pairwise comparisons between the adjacent clusters
   along the hepatocyte (E) and cholangiocyte (F) lineages. G Biological
   function enrichment analysis of the differentially expressed genes for
   each cluster compared to all the other cells (Wilcoxon test, two-sided
   test, p-value < 1e-5). Functional processes related to hepatocytes,
   cholangiocytes, and stemness were selected and displayed in the dot
   plots.

   The similarity matrix among the splicing profiles of these cell
   clusters was well in line with the proposed lineages above
   (Fig. [150]S6D). Next, we identified the signature AS sites along the
   two lineages by comparing each cluster to its nearest precursor in the
   lineage. Together, these sites showed a clear accumulative shift of the
   splicing landscapes along the proposed multistage differentiation
   lineage from hepatoblasts to mature hepatocytes and cholangiocytes
   (Fig. [151]4E, F). Note that once the orders of the cell clusters in
   each lineage were shuffled, the same analysis procedure would not
   produce such patterns of accumulative shifting of the splicing
   landscapes (Fig. [152]S6E, F).

   More interestingly, the single-cell RNA splicing landscape revealed a
   special cluster, C4, which was composed of annotated cholangiocytes but
   positioned between the two major developmental lineages for
   cholangiocytes and hepatocytes (Fig. [153]4A). Compared to the poorly
   differentiated cholangiocyte progenitor cell cluster C0, C4 was more
   differentiated, as shown by their pseudotimes (Fig. [154]4B).
   Unexpectedly, although the cells in C4 highly expressed genes enriched
   in cholangiocyte functions, enrichment of genes in hepatocyte functions
   such as fatty acid metabolism, steroid metabolism, and alcohol
   metabolism was also found in C4 (Fig. [155]4G). Indeed, although the
   dynamics and molecular mechanisms remain to be revealed,
   transdifferentiation between cholangiocytes and hepatocytes in both
   directions has been well described in previous studies^[156]50–[157]54.
   Therefore, the fact that C4 possesses the signatures of both
   cholangiocyte and hepatocyte functions leads to a speculation that this
   cell cluster represents a potential crosstalk bridging the transition
   from cholangiocytes to hepatocytes, or vice versa. However, more direct
   evidence is need to draw a firm conclusion.

   Finally, it worth nothing that the two major lineages of hepatocytes
   and cholangiocytes can be roughly classified based on gene expression
   data (Fig. [158]S7A), which has been reported in the original
   study^[159]47. However, such a cluster distribution pattern does not
   reflect the time-dependent differentiation trajectory (Fig. [160]S7B).
   The potential intermediate cell cluster between the lineages of
   hepatocytes and cholangiocytes is not apparent (Fig. [161]S7C).

   As another example, based on the RNA splicing landscapes of mouse brain
   cells^[162]55, SCASL revealed the well-acknowledged differentiation
   trajectory from oligodendrocyte precursor cells to oligodendrocytes and
   astrocytes^[163]56 (Fig. [164]S8A). In addition, microglia and
   endothelial cells formed two small clusters very close to each other
   (Fig. [165]S8A). Microglia are derived from blood
   monocytes/macrophages^[166]57,[167]58, and studies have shown that
   monocytes can also differentiate into endothelial progenitor cells
   (EPCs), which migrate and give rise to endothelial cells (ECs) in
   response to proangiogenic stimuli^[168]59,[169]60. Some of the ECs in
   this cluster still expressed the EPC marker CD133 (Fig. [170]S8B).
   Therefore, the closeness between microglia and ECs may reflect their
   common ancestors. Such informative patterns were not observed in the
   clustering results simply based on the single-cell gene expression
   profiles (Fig. [171]S8B). In summary, by interrogating single-cell
   splicing landscapes, SCASL can reveal the transitional stages of cell
   lineages, even for small subpopulations of low-frequency cells.

Fine clustering of tumor-associated T cells based on RNA splicing
heterogeneity

   We then used SCASL to delineate heterogeneous tumor-associated
   lymphocytes based on their RNA splicing landscapes. First, immune cells
   from different tissues of 6 hepatocellular carcinoma patients^[172]61
   were precisely clustered into different major types (Fig. [173]S9A).
   Next, we looked further into the clusters of T cells (Fig. [174]5A).
   The two major types of CD4+ and CD8 + T cells were identified based on
   either the AS (Figs. [175]5A, [176]S9B) or gene expression profiles
   (Fig. [177]S9C). The finer clusters of T cells classified by SCASL
   based on the RNA splicing landscapes are largely different than those
   classified by the conventional Seurat pipeline simply based on gene
   expression data (Figs. [178]S9D, [179]S10).

Fig. 5. Clusters of tumor-associated T cells defined by SCASL.

   [180]Fig. 5
   [181]Open in a new tab

   A UMAP plot showing clustering of 2349 T cells by SCASL based on AS
   profiles. The principal component analysis (PCA) is performed with a
   dimensionality reduction number of 30. Source data are provided as a
   Source Data file. Source data are provided as a Source Data file. B The
   heatmap on the left represents the distribution preferences of the