Abstract RNA splicing shapes the gene regulatory programs that underlie various physiological and disease processes. Here, we present the SCASL (single-cell clustering based on alternative splicing landscapes) method for interrogating the heterogeneity of RNA splicing with single-cell RNA-seq data. SCASL resolves the issue of biased and sparse data coverage on single-cell RNA splicing and provides a new scheme for classifications of cell identities. With previously published datasets as examples, SCASL identifies new cell clusters indicating potentially precancerous and early-tumor stages in triple-negative breast cancer, illustrates cell lineages of embryonic liver development, and provides fine clusters of highly heterogeneous tumor-associated CD4 and CD8 T cells with functional and physiological relevance. Most of these findings are not readily available via conventional cell clustering based on single-cell gene expression data. Our study shows the potential of SCASL in revealing the intrinsic RNA splicing heterogeneity and generating biological insights into the dynamic and functional cell landscapes in complex tissues. Subject terms: Bioinformatics, Data mining, RNA splicing, Transcriptomics, Transcriptomics __________________________________________________________________ RNA splicing serves as a critical layer of gene expression regulation. Here, authors introduce SCASL for investigating the heterogeneity of RNA splicing landscapes at single-cell resolution, offering a novel scheme for classifying cell identities with physiological relevance. Introduction Variations in RNA splicing serve as a major cause of extensive transcriptome complexity in mammalian species^[30]1–[31]3. Alternative splicing (AS) is closely related to various cellular activities such as cell lineage differentiation, cell proliferation, and apoptosis^[32]4–[33]8. Shifted landscapes of RNA splicing are primary outcomes as well as key driving factors of various physiological and pathological processes^[34]9,[35]10. For example, complexity in RNA splicing is a main characteristic of tumor heterogeneity related to metastasis, drug sensitivity, and relapse, thereby serving as a prognostic marker for multiple cancer types, including breast cancer^[36]11, glioblastoma^[37]12, non-small cell lung cancer^[38]13, and ovarian cancer^[39]14. Transcriptome profiling at single-cell resolution has become a common practice for dissection of cell heterogeneity in complex tissues^[40]15,[41]16. Single-cell transcriptome profiling based on the SMART-seq method generates RNA-seq reads with coverage of full-length transcripts, rather than just the 3’ regions, as do most droplet-based methods such as Drop-seq and 10x Genomics. Therefore, the RNA splicing profiles are readily available from these datasets. However, the vast majority of previous studies employing single-cell RNA-seq (scRNA-seq) have solely relied on the overall expression levels of genes for classifications of cell subpopulations^[42]17. The RNA splicing landscapes, which represent a critical layer of transcriptome regulation, have not been considered in most of the previous studies. Unlike single-cell gene expression profiles, AS profiles represent the relative abundance of different transcript isoforms, thereby sheltering the intrinsic fluctuations of single-cell gene expression levels. Importantly, even though bulk RNA-seq data of heterogeneous tissues or cells commonly shows mixtures of isoforms, individual cells are frequently dominated by specific isoforms for many genes^[43]9, indicating strong cell specificity of AS events. Finally, in many cases, AS results in different protein products, rather than just changes in protein abundance, therefore potentially leading to more dramatic shifts in biological pathways and processes. These features make single-cell AS profiles promising indexes for the classification of single cells with potentially high physiological relevance. The existing approaches for quantifying alternative splicing (AS) levels were based on either PSI (percent splice-in) or junction reads. Most of these AS analysis tools primarily focus on identifying differential splicing events. For instance, BRIE^[44]18 (for single-cell data) and Expedition^[45]19 (for single-cell data) employ Bayesian models to estimate PSI for differential splicing analysis, while rMATS^[46]20 (for bulk data) employs a linear mixed-effects model. Psix^[47]21 quantifies AS using PSI values and identifies splicing events associated with cell state through autocorrelation models. On the other hand, Leafcutter^[48]22 identifies junction reads for definitions of AS events and then adopts a generalized linear model to quantify the AS events. FRASER^[49]23 uses a different strategy for identifying and quantifying the AS events from junction reads. MARVEL^[50]24 relies on PSI defined by rMATS and junctions reads to define single-cell AS and then integrates gene expression data to analyze the effects of differential splicing between cell groups. Additionally, newer methods have emerged that define AS levels at the gene level rather than for specific AS events, such as SpliZ^[51]25. Nonetheless, these tools typically require known cell labels for differential analysis, and methods specifically designed for unsupervised clustering based on single-cell splicing landscapes are urgently needed. In the present study, we developed SCASL (single-cell clustering based on alternative splicing landscapes), a new strategy of cell clustering based on single-cell RNA splicing landscapes. Specifically, SCASL employs a strategy similar to that of LeafCutter^[52]22 and FRASER^[53]23 to identify AS events from the junction reads in single-cell SMART-seq data. This method does not rely on predefined annotation of transcriptomes, thereby recovering both known and novel AS events. AS probabilities are then inferred from proportions of the junction reads assigned to the same 5’ AS or 3’ AS events. The profiles of AS probabilities are then used for imputations of the missing values, followed by spectral clustering of the cells. We tested the performance of SCASL on defining cell subpopulations with single-cell RNA-seq data of embryonic tissue development, tumor cells, and tumor-associated immune cells. Cell lineage differentiation is extremely important in physiological and disease development. However, owing to the intrinsic gene expression fluctuations in the progressive process of lineage differentiation, it has been challenging to clearly define cells at key transitional stages simply based on gene expression profiles. In addition, highly heterogeneous cells in complex tissues, such as tumor cells and tumor-associated immune cells, usually show very diverse gene expression profiles, making it difficult to clearly and consistently define boundaries between cell clusters. When applied to datasets of triple negative breast cancer, SCASL defined clear cell subtypes indicating precancerous transformation of epithelial cells and early-stage tumor cells. With embryonic liver data, SCASL recovered a series of transitional stages during developments of the hepatocyte and cholangiocyte lineage lineages. Finally, highly diverse tumor associated T cells were also classified into subpopulations with distinct molecular and cellular characteristics related to tumor infiltration, activation, and exhaustion. In contrast, the methods commonly used with single-cell gene expression data did not recapitulate such cell groups with physiological relevance. Therefore, the heterogeneous landscapes of RNA splicing provide unique information for precise definitions of physiologically relevant cell subtypes, which is valuable for understanding the fine tuning that occurs during complex development and disease contexts. We propose that SCASL can be used as an efficient method for mining such information. Results The SCASL pipeline for cell clustering based on single-cell RNA splicing landscapes The SCASL pipeline consists of three major steps, i.e., (i) establishment of a single-cell RNA AS probability matrix, (ii) imputation of the missing values in the matrix, and (iii) spectral clustering of the single cells based on the patched AS probability matrix (details in the Methods section). In brief, first, de novo identification of RNA splicing events is performed by mapping single-cell RNA-seq reads, which are usually obtained from SMART-seq techniques, to the reference genome. This procedure does not rely on prior annotation of exons, thereby capturing complex AS events in a more comprehensive manner. Multiple splicing schemes taking place with a common 5’ or 3’ splice site are then clustered together, and the split reads are counted to estimate the probabilities of these different 3’ AS or 5’ AS events (Fig. [54]1). This procedure is performed for each single cell, generating an AS probability matrix representing the transcriptome-wide landscape of splicing at single-cell resolution. Fig. 1. Schematic overview of the SCASL pipeline. [55]Fig. 1 [56]Open in a new tab SCASL takes raw scRNA-seq data as input and generates classifications of cell subpopulations. The pipeline is composed of 3 major steps: establishment of an AS probability matrix from the input data, imputation of the missing values in the matrix, and spectral clustering of the single cells. Due to the relatively low sequencing depth and limited coverage of single-cell RNA-seq data, the probabilities of many AS events cannot be reliably estimated for all the single cells in a dataset. Therefore, the AS probability matrix greatly suffers from frequent missing values (marked as NA). SCASL introduces a strategy of iterative weighted KNN for imputation of these missing values (Fig. [57]1). In brief, SCASL determines k nearest neighbors for each cell based on the Euclidean distances between the AS probabilities of the cells. The missing values of each cell are then inferred by taking the weighted averages of the neighboring cells. The new AS probability matrix is then used to update the missing values inferred from the previous round. This process is conducted iteratively until convergence. We chose KNN because it significantly outperformed other imputation methods, such as mean value imputation and two model-dependent methods, MAGIC^[58]26 and scImpute^[59]27 (data not shown). The imputation strategy was tested with forced dropouts of different proportions of the non-NA values in 3 datasets. The imputed AS probabilities were highly accurate compared to the true values (Fig. [60]S1A). Even if 90% of the AS probability values were dropped-out, the majority of these data points were still correctly recovered. This suggests that transcriptome-wide splicing information is highly redundant for identifying cells with similar splicing landscapes, thereby allowing precise imputation of missing AS probabilities. Despite the high fidelity of the imputation process, potential errors of the AS probabilities inferred for the originally missing values still cannot be completely ruled out. In addition, the sparsity pattern of the AS matrix, which is remotely related to gene expression, should also be informative for the cellular physiological status, as suggested by previous studies^[61]3,[62]28. Therefore, new binary vectors marking the positions of the imputed values are concatenated with the vectors of AS probabilities, giving rise to a new and expanded AS probability matrix. SCASL employs spectral clustering, a graph-based clustering algorithm^[63]29, with the modified AS probability matrix above. The elbow method is implemented to help determine the optimal number of clusters^[64]30. To showcase its performance, SCASL was applied to 6 sets of single-cell transcriptome data generated by the SMART-seq method (Table [65]1). The resulting cell clusters were significantly different than those obtained simply from the RNA expression profiles with the canonical Seurat^[66]31 pipeline (Fig. [67]S1B), which will be discussed in the following sections. Table 1. Data resources of single-cell RNA-seq Cell source Species Method Cell number AS events AS genes Reference TNBC Homo sapiens Smart-seq2 422 6111 2178 ^[68]33 TNBC Homo sapiens Smart-seq2 443 3565 2248 ^[69]34 Hepatoblast Mus musculus Smart-seq2 486 3204 1572 ^[70]47 HCC Immune cells Homo sapiens Smart-seq2 2349 10257 3879 ^[71]61 Brain Mus musculus Smart-seq2 1734 3058 1615 ^[72]55 GBM Immune cells Homo sapiens Smart-seq2 1218 8235 2795 ^[73]98 Brain Mus musculus 10X 2442 267 170 ^[74]97 [75]Open in a new tab We then tested the robustness of SCASL to data sparsity and lack of coverage, i.e., tolerance to forced dropouts of AS probability values or randomly removed genes. As shown in Fig. [76]S1C, even in the cases when 50% of the AS probability values were dropped out, SCASL still largely reproduced the cell clustering landscapes obtained from the original data. Randomly removing large proportions of the AS sites or genes from the original data also did not compromise the cell clustering patterns (Fig. [77]S1D). These results indicate the redundancy of information within the splicing landscapes for defining cell clusters by SCASL. Finally, to test the stability of the clustering results of SCASL, we performed random sub-sampling of the cells (100%, 75%, or 50%), which were then used for cell clustering based on their splicing or gene expression profiles (Figs. [78]S2A, B). For the group of 100% sub-sampling, clustering procedures were simply repeated with the full datasets, and the results provide an assessment of robustness when different random seeds were used by the computer. As shown in Fig. [79]S2C, for different ratios of sub-sampling, the cell clusters defined by SCASL based on single-cell AS profiles are much more robust than the ones defined by Seurat based on the gene expression profiles. When repeated with the full datasets, SCASL also demonstrated greater stability compared to the expression profile-based method. SCASL reveals heterogeneous tumor cell clusters, providing new insights into tumor progression Due to extensive intra-tumor heterogeneity of single-cell gene expression profiles, it has been technically challenging to define cancer cell clusters with clear physiological relevance^[80]32. With a dataset of 422 tumor cells (259 primary tumor cells and 163 micrometastatic cells) from one patient with triple-negative breast cancer (TNBC)^[81]33, SCASL defined 3 clusters (Fig. [82]2A), which were largely inconsistent with the cell clusters defined based on the single-cell gene expression data (Figs. [83]2B, [84]S3A). Specifically, one of the newly defined clusters based on AS, C0, was dominated by micrometastatic cells (Figs. [85]2A, [86]S3B). The genes upregulated in C0 served as strong prognostic markers for patients with micrometastatic breast cancer with worse survival rates (Fig. [87]S3C). The other two cell clusters, C1 and C2, were mainly composed of primary tumor cells, with small proportions of micrometastatic cells (Figs. [88]2A, [89]S3B). Fig. 2. TNBC tumor cell heterogeneity characterized by SCASL. [90]Fig. 2 [91]Open in a new tab A, B UMAP plot showing clustering of 422 TNBC tumor cells by SCASL based on the AS landscapes (A) or clustering by Seurat based on the gene expression profiles with 3500 variable features. The cells are color labeled by the clusters defined by SCASL. The principal component analysis (PCA) is performed with a dimensionality reduction number of 20 for both methods. Source data are provided as a Source Data file. C Pseudotime analysis by CytoTRACE for the three cell clusters (the numbers of cells in C2, C0, and C1 are: 115, 94, and 196 respectively). Wilcoxon rank-sum test were used to evaluate the statistical significance between different groups (two-sided test, using Bonferroni correction to adjust for multiple comparisons). The X-axis is sorted from small to large according to the CytoTRACE value. The p-values of the differences between C2 and C0, and C0 and C1 are 3.9e-15 and 0.023 respectively. Data are presented as median values +/- SEM. Each box shows the median and interquartile range (IQR 25th–75th percentiles). Source data are provided as a Source Data file. D Gene Ontology (GO) functional enrichment of the genes downregulated in C2 cells compared to the other tumor cells (Wilcoxon test, two-sided test, p-value < 0.01). E, F Spearman correlations between the splicing profiles (E) or gene expression profiles (F) of TNBC tumor cells and normal breast epithelial cells. G Differential splicing profiles of C1 (X-axis) or C0 (Y-axis) compared to C2. Dot sizes represent the mean differences in the AS probabilities, and the p-values of differential splicing calculated by Fisher’s exact test are shown in grayscale (two-sided test). H Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis of the differentially expressed genes in C1 (X-axis) or C0 (Y-axis) compared to C2, represented by adjusted p-values obtained from GSEA. Among the three clusters of tumor cells, those in C2 were generally more differentiated than the others (Figs. [92]2C, [93]S4A). Note that the cell clusters defined by gene expression profiles did not show such significant differences in their differentiation stages (Fig. [94]S4B). Furthermore, compared to the other tumor cells, the cells in C2 were characterized by lower expression of genes involved in tumor-promoting processes, such as G2/M phase transition, nuclear division, DNA replication, response to hypoxia, and Wnt signaling (Fig. [95]2D). Therefore, we suspect that this cluster indicates a tumor cell subtype with less malignancy and potentially represents an early intermediate stage during TNBC cell development. We then compared these three clusters of tumor cells with normal breast epithelial cells from another published study of TNBC^[96]34. Interestingly, despite being doubtlessly classified as tumor cells, only the C2 cluster showed a great extent of similarity with normal epithelial cells based on AS profiles (Fig. [97]2E). This suggests that C2 cells may indeed be an intermediate stage during the malignant transformation of normal epithelial cells into cancerous cells in TNBC. Interestingly, such similarity between C2 and normal epithelial cells cannot be seen from the gene expression profiles (Fig. [98]2F), suggesting that the AS profiles indeed brought in additional and critical information to define cell identities. In addition, none of the tumor cell clusters defined in the original study based on the gene expression profiles showed such a pattern of similarity and thereby revealed such early-stage tumor cells (Figs. [99]S4C, D). Furthermore, the analysis of single nucleotide variations (SNVs) with Monovar^[100]35 showed that the cells in C1 bear significantly more SNVs than the cells in C2 (Fig. [101]S4E). This is consistent to our hypothesis that the cluster of C2 represents an early stage of tumor cells, whereas C1 represents a more malignant stage. By contrast, the cluster of C0, which mainly consists of the cells from micro-metastatic sites, exhibited slightly elevated rates of mutations, indicating that micro-metastasis takes places during early stage of tumor development. This is consistent to the previous notes^[102]36,[103]37. It should be noted that due to lack of non-cancerous cells from the same patient, these SNVs were identified by alignment of the RNA-seq reads with the human genome reference. They are not all somatic mutations. Nevertheless, the general trend of SNV numbers in the three clusters supports the proposed cellular patterns of tumor development revealed by SCASL based on the AS landscapes. Next, by comparing the AS landscapes of C1 and C0 to that of C2, we found that the micrometastatic cell cluster C0 and the primary tumor cell cluster C1 represented two branched trajectories of tumor development, rather than a sequential lineage (Fig. [104]2G). This observation is in line with the current understanding of the development of tumor micrometastasis at early stages of breast cancer^[105]36,[106]37. Indeed, compared to C2, C1 and C0 were characterized by different patterns of gene expression profile shifts (Fig. [107]S4F). As a result, in addition to a number of biological processes related to cancer development that were perturbed in both clusters of C1 and C0, many other processes and pathways were upregulated or downregulated only in C1 or C0 (Fig. [108]2H). Specifically, key metabolic processes, such as the TCA cycle, amino acid metabolism, and steroid biosynthesis, were upregulated only in C1 (Fig. [109]2H), whereas the cells in C2 were characterized by repression of cell-cell/ECM interactions, ferroptosis, and signaling pathways such as the HIF-1 and PI3k-Akt signaling. Together, the results above indicate that the landscapes of RNA splicing indeed reveal additional insights into heterogeneous tumor cell populations with high physiological relevance. SCASL identifies precancerous epithelial cells in TNBC based on AS profiles With another dataset of TNBC^[110]34, which consisted of 382 primary tumor cells and 61 normal epithelial cells from 6 patients, SCASL first correctly identified malignant and other normal cell types, including lymphocytes, macrophages, epithelial and stromal cells (Fig. [111]S5A). Further clustering of the tumor cells and normal epithelial cells identified two clusters of nonmalignant epithelial cells, C5 and C3, the latter of which appeared to be closer to the malignant cells (Fig. [112]3A). Although the clustering result based on gene expression data can also classify the tumor and normal cells into different clusters, it did not capture the pattern of C3 being positioned in-between the normal cell cluster C5 and other tumor cell clusters (Fig. [113]S5B). Indeed, compared to C5, C3 showed much higher similarity to other tumor cell clusters based on the single-cell AS profiles (Fig. [114]3B), a phenomenon not reproduced by the gene expression data of the cells (Fig. [115]3B). Pseudotime analysis further showed that the epithelial cells in C3 were generally less differentiated than the other normal epithelial cells in C5 but more differentiated than the cells in the malignant cell clusters (Figs. [116]3C, [117]S5C). Fig. 3. SCASL identifies precancerous epithelial cells in TNBC based on AS landscapes. [118]Fig. 3 [119]Open in a new tab A TSNE plot showing clustering of 443 normal epithelial cells and primary TNBC tumor cells based on AS landscapes. The principal component analysis (PCA) is performed with a dimensionality reduction number of 20. Source data are provided as a Source Data file. B Heatmaps showing Spearman correlations between all normal and tumor cells based on AS (left) or normalized gene expression (right) profiles. C Pseudotime analysis by CytoTRACE for the clusters defined by SCASL (the numbers of cells are 14, 70, 73, 57, 86, 29, and 114, from left to right). Data are presented as median values +/- SEM. Each box shows the median and interquartile range (IQR 25th–75th percentiles). Source data are provided as a Source Data file. D The heatmap on the left shows the AS probabilities of the differential splicing events in C3 compared to C5 (p-value < 1e-5, hypergeometric distribution test were used in enrichment analysis, one-sided test). The dot plot on the right shows the pathways enriched by the genes bearing the differential spliced events, and some examples are provided. E Volcano plot showing the differentially expressed genes (t-test, two-sided test, adjusted p-value < 0.01) in C3 vs. C5. Differences in the median gene expression levels between the cells of C3 and C5 are shown on the X-axis, and the statistical significance values of the differentially expressed genes (adjusted p-values) are shown on the Y-axis. The sizes of the dots represent the proportions of the cells with expression. Biological functions and processes enriched in the differentially expressed genes are listed to the right, and some representative genes are provided. Source data are provided as a Source Data file. The signature genes of both C5 and C3 included well-known negative markers of epithelial-mesenchymal transition (EMT), such as PRG4^[120]38, EFEMP1^[121]39 and IFI16^[122]40,[123]41. Interestingly, the expression levels of these negative EMT markers were much lower in C3 than in C5 and were further decreased in the tumor cell clusters (Fig. [124]S5D). On the other hand, signature genes known to promote tumorigenesis, such as SMARCE1^[125]42, SPINT2^[126]43 and CPNE1^[127]44, were upregulated in C3 compared to C5 and, as expected, further upregulated in the tumor cell clusters (Fig. [128]S5D). These data above suggest that C3 represents a potentially intermediate stage during malignant development of TNBC cells from normal epithelial cells. Indeed, as shown in Fig. [129]S5E, the cells in C3 bear significantly more SNVs than the cells in C5. As expected, the tumor cells in C2 and C0 cells from the same patient bear more mutations than C3. Moreover, we used inferCNV^[130]45 to infer the DNA copy number changes for each single-cell, which are summarized as an cumulative CNV score. As shown in Fig. [131]S5F, the cells in C3 bear intermediate levels of DNA copy number changes, which are higher than the normal cells in C5 but lower than the tumor cells in C2 and C0, a pattern highly consistent to that of the SNVs. The normal epithelial cells in C3 and C5 were obtained mostly from patient PT089 (Fig. [132]S5G), indicating that the potential differences between these two clusters of normal epithelial cells were not due to heterogeneity between patients. The RNA splicing landscapes in C3 and C5 showed significant differences for the genes involved in cancer-related processes such as cell adhesion, apoptosis, ERBB signaling, and cadherin binding (Fig. [133]3D), suggesting that the AS landscape shift in the C3 normal epithelial cells contributed to tumorigenic perturbations of cellular processes. Finally, we compared the gene expression profiles between C3 and C5 (Fig. [134]3E). In general, the genes downregulated in C3 were significantly enriched in classical antitumoral pathways, such as those related to ferroptosis, P53, TNF signaling, and cellular senescence, while the upregulated genes were related to pathways promoting cell localization, survival, and proliferation, such as the negative regulation of apoptosis, ECM-receptor interaction, PI3K-Akt, Hippo, and HIF-1 signaling pathways. Some of these characteristics could be conveniently attributed to AS profile shifts. For example, the ERBB signaling pathway, which was perturbed by AS in C3 compared to C5 (Fig. [135]3D), promotes breast cancer tumorigenesis by activating the PI3K/Akt pathway^[136]46, which was indeed upregulated in C3 (Fig. [137]3E). Taken together, the results above strongly suggest that although they were classified as noncancerous normal epithelial cells, the C3 cells already contained perturbations of some key cancer-related processes, potentially due to shifts in the AS landscape. Therefore, the RNA splicing landscape, as characterized by SCASL, can identify potential precancerous transitional cellular stages, thereby shedding light on the transformation of normal cells into tumor cells, which has been overlooked by mining of only single-cell gene expression data. SCASL reveals cell development lineages and potential crosstalk In addition to the studies above of tumor cells, we also investigated the splicing landscapes of cells in normal development processes. With a scRNA-seq dataset of embryonic hepatoblasts^[138]47, SCASL recapitulated two cell lineages in the development of hepatocytes and cholangiocytes from hepatoblasts (Fig. [139]4A), which have been well described by previous studies^[140]48,[141]49. Specifically, the orders of the cell clusters were as follows: C2-C10-C3-C1-C8-C5 for hepatocyte development and C0-C9-C6 for cholangiocyte development (Figs. [142]4A, [143]S6A). Such orders of the cell clusters in these two lineages were confirmed by the embryonic development time (Fig. [144]S6B) and the pseudotime inferred by Cytotrace (Fig. [145]4B) or Monocle (Fig. [146]S6C). When comparing the gene expression profiles, the two series of hepatocyte and cholangiocyte clusters showed accumulative enrichment of the biological processes essential for the physiological functions of hepatocytes and cholangiocytes, respectively, along the proposed lineages (Fig. [147]4C, D). Fig. 4. SCASL identifies developmental lineages of embryonic hepatoblasts. [148]Fig. 4 [149]Open in a new tab A TSNE plot showing clustering of 486 embryonic liver cells by SCASL based on AS profiles. The principal component analysis (PCA) is performed with a dimensionality reduction number of 20. The arrows in the figure represent the approximate sequence of embryonic days corresponding to each cluster. Source data are provided as a Source Data file. B Pseudotime analysis by CytoTRACE for the hepatoblast/hepatocyte and cholangiocyte clusters. Seven hepatoblast/hepatocyte clusters contained a total of 349 cells (the numbers of cells are 31, 11, 30, 52, 140, 39, and 56, from left to right), and four cholangiocyte clusters contained a total of 137 cells (the number of cells are 57, 36, 13, and 21, from left to right). Data are presented as median values +/- SEM. Each box shows the median and interquartile range (IQR 25th–75th percentiles). Source data are provided as a Source Data file. C, D Biological function enrichment analysis of the differentially expressed genes (Wilcoxon test, two-sided test, p-value < 1e-5) in each cluster of the hepatocyte lineage compared to C2 (C) or in each cluster of the cholangiocyte lineage compared to C0. Red indicates upregulation, and blue indicates downregulation. E, F AS profiles of the top differential splicing events based on pairwise comparisons between the adjacent clusters along the hepatocyte (E) and cholangiocyte (F) lineages. G Biological function enrichment analysis of the differentially expressed genes for each cluster compared to all the other cells (Wilcoxon test, two-sided test, p-value < 1e-5). Functional processes related to hepatocytes, cholangiocytes, and stemness were selected and displayed in the dot plots. The similarity matrix among the splicing profiles of these cell clusters was well in line with the proposed lineages above (Fig. [150]S6D). Next, we identified the signature AS sites along the two lineages by comparing each cluster to its nearest precursor in the lineage. Together, these sites showed a clear accumulative shift of the splicing landscapes along the proposed multistage differentiation lineage from hepatoblasts to mature hepatocytes and cholangiocytes (Fig. [151]4E, F). Note that once the orders of the cell clusters in each lineage were shuffled, the same analysis procedure would not produce such patterns of accumulative shifting of the splicing landscapes (Fig. [152]S6E, F). More interestingly, the single-cell RNA splicing landscape revealed a special cluster, C4, which was composed of annotated cholangiocytes but positioned between the two major developmental lineages for cholangiocytes and hepatocytes (Fig. [153]4A). Compared to the poorly differentiated cholangiocyte progenitor cell cluster C0, C4 was more differentiated, as shown by their pseudotimes (Fig. [154]4B). Unexpectedly, although the cells in C4 highly expressed genes enriched in cholangiocyte functions, enrichment of genes in hepatocyte functions such as fatty acid metabolism, steroid metabolism, and alcohol metabolism was also found in C4 (Fig. [155]4G). Indeed, although the dynamics and molecular mechanisms remain to be revealed, transdifferentiation between cholangiocytes and hepatocytes in both directions has been well described in previous studies^[156]50–[157]54. Therefore, the fact that C4 possesses the signatures of both cholangiocyte and hepatocyte functions leads to a speculation that this cell cluster represents a potential crosstalk bridging the transition from cholangiocytes to hepatocytes, or vice versa. However, more direct evidence is need to draw a firm conclusion. Finally, it worth nothing that the two major lineages of hepatocytes and cholangiocytes can be roughly classified based on gene expression data (Fig. [158]S7A), which has been reported in the original study^[159]47. However, such a cluster distribution pattern does not reflect the time-dependent differentiation trajectory (Fig. [160]S7B). The potential intermediate cell cluster between the lineages of hepatocytes and cholangiocytes is not apparent (Fig. [161]S7C). As another example, based on the RNA splicing landscapes of mouse brain cells^[162]55, SCASL revealed the well-acknowledged differentiation trajectory from oligodendrocyte precursor cells to oligodendrocytes and astrocytes^[163]56 (Fig. [164]S8A). In addition, microglia and endothelial cells formed two small clusters very close to each other (Fig. [165]S8A). Microglia are derived from blood monocytes/macrophages^[166]57,[167]58, and studies have shown that monocytes can also differentiate into endothelial progenitor cells (EPCs), which migrate and give rise to endothelial cells (ECs) in response to proangiogenic stimuli^[168]59,[169]60. Some of the ECs in this cluster still expressed the EPC marker CD133 (Fig. [170]S8B). Therefore, the closeness between microglia and ECs may reflect their common ancestors. Such informative patterns were not observed in the clustering results simply based on the single-cell gene expression profiles (Fig. [171]S8B). In summary, by interrogating single-cell splicing landscapes, SCASL can reveal the transitional stages of cell lineages, even for small subpopulations of low-frequency cells. Fine clustering of tumor-associated T cells based on RNA splicing heterogeneity We then used SCASL to delineate heterogeneous tumor-associated lymphocytes based on their RNA splicing landscapes. First, immune cells from different tissues of 6 hepatocellular carcinoma patients^[172]61 were precisely clustered into different major types (Fig. [173]S9A). Next, we looked further into the clusters of T cells (Fig. [174]5A). The two major types of CD4+ and CD8 + T cells were identified based on either the AS (Figs. [175]5A, [176]S9B) or gene expression profiles (Fig. [177]S9C). The finer clusters of T cells classified by SCASL based on the RNA splicing landscapes are largely different than those classified by the conventional Seurat pipeline simply based on gene expression data (Figs. [178]S9D, [179]S10). Fig. 5. Clusters of tumor-associated T cells defined by SCASL. [180]Fig. 5 [181]Open in a new tab A UMAP plot showing clustering of 2349 T cells by SCASL based on AS profiles. The principal component analysis (PCA) is performed with a dimensionality reduction number of 30. Source data are provided as a Source Data file. Source data are provided as a Source Data file. B The heatmap on the left represents the distribution preferences of the