Abstract Most clinical diagnostic and genomic research setups focus almost exclusively on coding regions and essential splice sites, thereby overlooking other non-coding variants. As a result, intronic variants that can promote mis-splicing events across a range of diseases, including cancer, are yet to be systematically investigated. Such investigations would require both genomic and transcriptomic data, but there currently exist very few datasets that satisfy these requirements. We address this by developing a single-nucleus full-length RNA-sequencing approach that allows for the detection of potentially pathogenic intronic variants. We exemplify the potency of our approach by applying pancreatic cancer tumor and tumor-derived specimens and linking intronic variants to splicing dysregulation. We specifically find that prominent intron retention and pseudo-exon activation events are shared by the tumors and affect genes encoding key transcriptional regulators. Our work paves the way for the assessment and exploitation of intronic mutations as powerful prognostic markers and potential therapeutic targets in cancer. Graphical Abstract Graphical Abstract. [42]Graphical Abstract [43]Open in a new tab Introduction Introns in human genes can be extraordinarily large (up to 1 Mbp, with ∼3400 being longer than 50 kb and ∼1200 longer than 100 kb), and account for half of the non-coding human genome. As mutations in introns do not directly affect protein-coding sequences, they are usually overlooked ([44]1–3). As a result, little attention is paid to the importance of intronic-located splicing regulatory elements that control the fidelity of pre-mRNA splicing and transcription timing. This is surprising given that pathogenic variants cause abnormal splicing changes, typically by damaging existing splicing motifs or creating novel splicing motifs and may comprise 15–60% of all human disease variants ([45]1–3). Recent studies have reported significant numbers of intronic variants and deletions in protein-coding genes that are associated with under- or overexpression of the affected genes or of distant genes interacting in 3D, thus influencing their regulation in normal or pathogenic conditions ([46]4–6). It is also well established that introns contribute to the control of gene expression by including regulatory regions and non-coding (yet functional) genes or even directly by their extensive length ([47]6). A substantial number of pathogenic variants located deep within introns (i.e. >100 bp from an exon-intron boundary) were recently reported, suggesting that sequence analysis of full introns may help to identify causal mutations for many undiagnosed clinical cases ([48]4–9). Moreover, direct associations between intronic mutations and certain diseases have also been reported, albeit sporadically ([49]7–9). These results agree with the findings we present here. For instance, we identified genes that are simultaneously overexpressed in basal-like cells from pancreatic cancer (PDAC) tumors and rank as the most mutated transcripts, particularly when we consider intronic variants. In this regard, PDAC-associated mutations were reported to synergize in tumorigenesis by globally altering the splicing program of cell ([50]10). Moreover, splicing factors were recently shown to either promote the early events in pancreatic tumorigenesis and resistance to chemotherapy or to limit the metastatic potential of PDAC cells ([51]11,[52]12). A very recent publication has identified a splicing signature specific to basal-like cells, distinguishing PDAC subtypes as accurate survival predictors when considered in the overall population of PDAC patients, as well as within homogeneous subtype cohorts, indicating their efficacy as biomarkers ([53]13). However, there exist many limitations when investigating intronic mis-splicing variants, the main ones being the lack of approaches capable of simultaneously interrogating genomic and transcriptional information, and the lack of guidelines designed for assessing intronic variants and their contribution to abnormal splicing changes in disease. We address these limitations by introducing a novel pipeline that utilizes full-length single-nuclei and bulk RNA sequencing strategy for the ‘deep’ characterization of genetic variability within introns and of their effects on splicing and gene expression in PDAC. Indeed, cancer is one of those diseases, where alternative splicing is the basis for the identification of novel diagnostics, and therapeutic strategies for therapy (e.g. antisense oligonucleotides or small-molecule modulators of spliceosome ([54]9,[55]14,[56]15)). This new approach therefore holds promise for both the elucidation of fundamental biological principles connected to splicing regulation, and the identification of therapeutic targets in human disease. Materials and methods Patients’ sample information Utilization and characterization of human PDAC data and samples within the CRU5002 has been approved by the ethical review board of the UMG (11/5/17). Informed consent was obtained from all subjects involved in the study. Sequencing studies and the generation of organoids and PDX models have been performed using tumor tissue from CRU5002 PDAC patients with progressed disease upon histological PDAC confirmation. Generation of PDX models For the generation of PDX models, bulk tumor tissue was subcutaneously transplanted into SHO-prkdc^scidHr^hrmice. Engrafted subcutaneous tumors were passaged in mice for three generations prior to snRNA-sequencing. Generation of organoids models Tumor tissue was minced and digested in Dulbecco's modified Eagle's medium containing 5 mg/ml of collagenase XI (Sigma-Aldrich, C9407), DNAse final concentration 10 μg/ml (SIGMA D5025-150KU) and Y-27632 final concentration 10.5 μM (Adooq Bioscience, 129830-38-2) and incubated at 37°C for 45 min. The material was further embedded in Matrigel (Corning, New York, USA; Cat#356231) and cultured in human pancreatic cancer complete medium (Wnt3a, R-spondin1, Gastrin, hEGF, A 83-01, hFGF-10, mNoggin, Primocin, N-acetylcystein, Nicotinamide, B27 supplement and Y-27632). For passage, the Matrigel-containing organoid was digested by TrypLE™ Express (Thermo Fisher, 12605-028) with DNAse and Y-27632 as described above for 15 min. The sample was centrifuged at 500 × g for 5 min, and the precipitated cells were embedded in GFR Matrigel and cultured in human pancreas organoid complete feeding medium. Nuclei extraction from primary PDAC and PDAC models Nuclei isolation of PDAC was performed according to the ‘Nuclei Isolation from Cell Suspensions & Tissues for Single Cell RNA sequencing’, Document Number [57]CG000124 Rev E, 10× Genomics, (30 June 2021). For organoids, minor modifications were performed. The cells were spun down at 500 × g, lysis took place for 15 minutes and only two washing steps were performed. Following the second wash, the nuclei pellet was resuspended in 1 ml (about 0.03 oz) wash buffer in preparation for the ICELL8 protocol. Full-length single-cell RNA-seq using ICELL8 The Takara ICELL8 5184 nano-well chip was used with the full-length SMART-Seq ICELL8 Reagent Kit. Nuclei suspensions were fluorescent-labelled with Hoechst 33342 for 15 min prior to their dispensing into the Takara ICELL8 5184 nano-well chips. Cell Select Software (Takara Bio) was used to visualize and select wells containing single nuclei. Nine 5184 nano-wells chips were used for all samples and 11 084 nuclei were processed for data analysis. Specifically, after quality control, 3416 nuclei were used for primary PDAC, 3037 for PDX and 4631 for organoids respectively. cDNA synthesis and library preparation were done according to description in previous study ([58]16). Libraries were sequenced on the HiSeq 4000 (Illumina) to obtain on average ∼ 0.3 Mio reads per nuclei (SE; 50 bp). Bulk-RNA-Seq from primary PDAC samples Total RNA was extracted from FFPE tumor patient samples using the ReliaPrep™ FFPE Total RNA Miniprep System (Promega). RNA Integrity was determined using the Fragment Analyzer. Because of low RNA integrity (sizing from 50 to 140 bp), we performed a modified TruSeq Stranded Total RNA Library Prep Human/Mouse/Rat (Cat. No. 20020596) starting with 200 ng of total RNA. The modifications include (a) ignoring fragmentation step, (b) ligation optimization by adjusting adapters concentration during library preparation, (c) increasing PCR cycles (15 in total) and eliminating primer dimers prior to sequencing (Agencourt AMPure XP magnetic beads, Beckman Coulter). Primary PDAC were sequenced on the NovaSeq6000; S4 flow cell PE 300 cycles generating a data set of 50–400 Mio reads per sample. Bulk-RNA-Seq from organoids and PDXs RNA libraries were prepared starting with 300 ng of total RNA using a non-stranded mRNA Seq (TruSeq RNA Library Preparation Cat. RS-122-2001) from Illumina according to the manufacturer's recommendations. Libraries were sequenced on the Illumina HiSeq 4000 (SE; 1 × 50 bp; 30–35 Mio reads/sample). Whole genome sequencing WGS data from two primary PDAC samples TM56 and TM27 were sequenced at ∼40× coverage on the Illumina NovaSeq 6000 sequencer following the protocol provided by the supplier. Libraries were performed using the PCR Free DNA library preparation from Illumina Cat. No.: 20041794). Alignment, variant calling, and benchmarking were performed using Illumina DRAGEN Germline pipeline 4.2.4. Pre-processing of single-nuclei RNA-seq data Raw sequencing files were processed as described in ([59]16). Briefly, Cogent NGS analysis pipeline (CogentAP) from Takara Bio (v1.0) was applied for de-multiplexing and creating the gene expression matrices from each FASTQ file. Reads were aligned against the human genome GRCh38 v107 ([60]https://www.ensembl.org/Homo_sapiens/Info/Index). Quality control of the data has been done by using CogentDS QC as outlined in ([61]16). QC considered the amount of usable unique reads per nucleus (over 65%), the number of reads generated per nucleus (over 250K), and the median mitochondrial content of the PDAC tumors of 18% (with 26% IQR-3). Intron regions of genes were also included in further evaluation. Generated gene matrices were used as input for the SingleCellExperiment R package (v3.0) ([62]17) to generate SingleCellExperiment objects for the subsequent downstream analysis. Identification of disease subtypes and cell types To identify the tumor subtypes and cell types of which the single cells are composed, two different methods were employed. First, we used the marker-based method AUCell ([63]18) to identify the PDAC subtypes classical and basal-like tumor. This analysis was based on established marker genes ([64]19). AUCell ranks the genes in each cell by decreasing expression value, and marks cells according to their most expressed marker genes. Secondly, we performed cell type annotation for more refined subpopulations to address the heterogeneity of the tumor and its matched models. We utilized the reference data set as provided by the SeuratData R package (panc8.SeuratData) ([65]20) and the prediction function as implemented in the R package SingleR ([66]21). Downstream analyses performing the UMAP algorithm were done as implemented in CogentDS (v1.0) for dimensionality reduction and data visualization. To determine which genes were differentially expressed between tumor subtypes and cell types in a particular patient, Wilcoxon Rank Sum and Signed Rank Test was used, together with P values adjustment with Benjamini–Hochberg method. Pseudo-variant calling in snRNA and bulk RNA-seq data Bam files resulting from CogentDS were used as input for the pseudo bulk variant calling using GATK best practices pipeline for RNAseq variant calling ([67]22). Consistent with the recommendations of GATK, duplicates were removed with Picard MarkDuplicates, and read groups were added with Picard AddOrReplaceReadGroups. Subsequently, Cigar reads were split into exon segment and hard-clip any sequences overhanging into the intronic regions with SplitNCigarReads. Variant calling was performed by HaplotypeCaller and all variants were then hard filtered by the following criteria: FS > 30 and QD < 2. Patient specific vcf files were intersected or merged using bcftools-isec to collect shared or unique mutations from the samples. Resulting variants were converted into maf (mutation annotation file) using Ensembl VEP ([68]23) annotation tool for the identification of the intronic, intergenic, and splice junction mutations (Figure [69]3A). Finally, mutations were grouped into three categories: SS, Proximal, and Deep by using Ensembl GTF file version 107. Total number of SS and Prox mutations were subtracted from total number of mutations in maf files to calculate the number of deep intronic variants. Negative values indicate that the deep intronic mutation affects multiple transcripts of the gene. Pre-Ranked enrichment analysis was performed by using mutation lists from three locational groups. Mutation numbers from SS and Proximal regions were normalized by using total intron number of each gene. Deep mutation numbers were normalized by per unit length (kb). Normalized values are ranked, and overall survival was calculated using Kaplan–Meier analysis based on TCGA-PAAD cohort. The result is shown as Kaplan–Meier plot with P value from log rank test generated by the cBioPortal ([70]24). Figure 3. [71]Figure 3. [72]Open in a new tab Selection and classification of genes exhibiting highest numbers of intronic variants. (A) Mutation percentages of all discovered mutations based on the variant locations on the left (SS, Prox, Deep) log[2] converted counts of each group represented in y-axis, base-pair distances for SS and PROX regions and locational percentages for intronic region in x-axis. (B) Protein network representation with nodes colored according to the location of identified mutations. (C) Performance of deep learning-based methods for mis-splicing detection: Validation of DeepVarSplice with SpliceAI and Pangolin. (D) Pathway Enrichment Analysis of SS and Prox Intronic Mutations (top) and GO Enrichment Analysis Deep Intronic Mutations (bottom) using the g:Profiler online tool. Integration of the SpliceAI and Pangolin scoring To perform scoring on the discovered variant list, hard filtered and intersected VCF files were used. Scores of each variant aggregated on their genes and maximum scores from both algorithms have been taken. Genes that scored with higher than 0.1 one of the tools have been selected as possibly mis-splicing related variants as it stated in their reference manual. SpiceAI v1.3 and Pangolin have been used in our analysis with default parameters with GRCh38 reference genome. PubMed annotation of identified genes from different approaches The role of specific genes as tumor suppressors and their relevance to pancreatic cancer have been analyzed using PubMed data through Entrez Python API (BioPython v-1.83) ([73]25). Automated PubMed queries have been done, identifying articles that associate each gene with ‘tumor suppressor’ and ‘pancreatic cancer’ in their titles or abstracts related to the gene. The outcomes are then aggregated into a DataFrame and visualized in a heatmap to elucidate patterns of gene involvement in tumor suppression and pancreatic cancer. Cell specific variant calling To perform single cell specific variant calling, bam files were separated into chip- and sample-specific bam files. Separated bam files were used for the variant calling pipeline, which was applied to individual bam file as before, without the base quality recalibration step (BQSR) to be able to get more variants possible per cell. Quality filtering was applied with the same quality filters as mentioned in pseudo-bulk RNA variant calling in the methods section. VCF files for each barcode were used to collect barcode-specific mutations and visualize as a mutation distribution plot using CogentDS UMAP functions. Enrichment analysis of non-canonical (Deep) and canonical (SS-Prox) intronic variants Genes with high Intronic mutation numbers (Table [74]1) were forwarded to over representation analysis (ORA) to identify enriched pathways and GO terms using the gProfiler online tool ([75]26). Canonical and non-canonical intronic mutations were separately analysed setting the P-value threshold to 0.05 under multiple correction method g:SCS. Table 1. Classification of high intronic-mutated genes based on variant location Gene_ID Number_ of_Total Number_ of_SS Number_ of_Prox Number_ of_Deep RBFOX1 208 0 0 208 CSMD1 125 0 0 125 WWOX 80 0 0 80 BAGE2 75 0 0 75 SMYD3 64 0 0 64 MTUS2 61 0 0 61 CDH13 58 0 0 58 LRP1B 57 0 0 57 MAGI2 57 0 0 57 PTPRN2 52 0 0 52 DLG2 52 0 0 52 EXOC4 50 0 1 49 JMJD1C 48 0 0 48 LINC02055 48 0 0 48 NRG1 47 0 0 47 SLC30A10 47 0 0 47 SNX29 47 0 1 46 FHIT 47 0 0 47 NRXN3 45 0 0 45 HHAT 45 0 0 45 ALK 44 0 0 44 PRKN 43 0 0 43 NOS1AP 43 0 0 43 NAV2 44 0 1 43 CNTNAP2 42 0 0 42 KMT2C 41 0 0 41 THSD4 41 0 0 41 MAPK4 6 0 0 6 GNAQ 15 0 0 15 BCAS3 31 0 1 30 OAS1 4 11 0 -7 SUPT5H 3 1 0 2 ASAH1 3 1 1 1 POU6F2 14 1 0 13 ACTG1 5 2 4 -1 MDM2 4 1 0 3 ASH2L 3 1 0 2 EPHB6 1 5 3 -7 C12orf43 1 1 1 -1 RHNO1 2 1 0 0 HNRNPH1 1 0 10 -9 MGAT5B 10 0 7 3 CARD8 1 0 9 -8 MACF1 23 0 5 18 COBLL1 11 0 6 5 MFF 6 0 5 1 MPHOSPH9 22 0 5 17 FOXRED1 1 0 5 -4 TBPL1 1 0 3 -2 POLR2L 2 0 1 1 ELOB 1 0 2 -1 TM4SF5 5 0 4 1 RPL26L1 4 0 5 -1 SNRNP35 1 0 1 0 INTS10 6 0 3 3 RPL27 3 0 2 1 DDR2 18 0 2 16 ADAMTS12 12 0 1 11 BCAT1 8 0 3 5 [76]Open in a new tab Classification of highly intronic mutated genes based on the variant location: donor and acceptor splice sites (SS, 1–2 bp away from the nearest exon-intron junction); proximal site (3–20 bp away from the nearest exon-intron junction), and deep intronic (>20 bp away from the nearest exon-intron junction). Total number of intronic mutations based on genes described in [77]Supplemental Table S2. Identification of intron retention events from RNA-Seq data Intron retention analysis was performed using IRFinder 1.3.0 ([78]27). IRFinder calculates IR-ratios to measure IR level reflecting the proportion of intron retaining transcripts. To compare PDAC samples to normal tissue, we downloaded fastq files from nine PDAC and nine healthy human pancreas samples from NBCI GEO database (accession ID: [79]GSE211398). These were pre-processed using the same methods as described in Bulk-RNA-Seq from primary PDAC. Tumor samples from [80]GSE211398 dataset were then analyzed using the IRFinder pipeline to identify IR events. Intronic depth for all samples normalized by CPM due to the library size variations between the public data and CRU samples. Due to the different number of samples between the two dataset, weighted means of IR scores were calculated to balance the variable sample sizes by using 0.1 threshold among 80% of all samples. Normalized intron depths were also used as further filtering to reduce the number of identified possible false negative discoveries. Results Strategy for the identification of intronic variants affecting splicing in PDAC One of the most critical post-transcriptional mechanisms reprogramming transcriptional output and proteomic diversity in cancer cells is the loss of splicing precision when removing introns from pre-mRNAs ([81]28). Consequently, many mis-spliced variants are instead targeted for nuclear degradation or for nonsense-mediated mRNA decay (NMD) and thus, only few annotated alternative isoforms correspond to the precursors of the proteins mapped by large-scale proteomics studies ([82]29). To investigate this mechanism, we applied a new pipeline using a full-length single nuclei RNA-seq (snRNA-seq) approach to three primary PDAC tumors (TM16, TM27 and TM56) and matched tumor-derived preclinical PDAC models (i.e. organoids and subcutaneous patient-derived Xenograft, PDXs). We then sought to determine pathogenic intronic variants causing abnormal splicing in these patient samples (Figure [83]1A). All samples were part of a unique patient cohort recruited for the Clinical Research Unit 5002 ([84]https://gccc.umg.eu/en/cru-5002/). Figure 1. [85]Figure 1. [86]Open in a new tab Strategy for the identification of intronic variants related to splicing in PDAC. (A) snRNA-seq using the ICELL8 platform and the SMART full-length chemistry allowing for full transcript coverage was applied to PDAC primary tumors (TM16, TM27 and TM56) and matched organoids and PDX lines. A total of nine 5184-nanowell chips were performed (three for primary PDAC, three for organoids and three for PDX respectively and visualized in the UMAP. Finally, data from all nine nanowell chips were merged. Bulk RNA sequencing was performed additionally for each of the samples. (B) DeepVarSplice variant calling pipeline combining snRNA-seq and bulk RNA-seq data determining intronic variant positions in the genome and intron retention events. 1) Gold standard RNA-seq variant calling pipeline pre-processing; 2) sample-level variant calling; 3) combination strategies for variant discovery (Shared-Combined); 4) Further filtering for RNA-editing variant removal; and 5) determination of shared intron retention events with weighted IR-ratio calculation between public and own PDAC samples. For our snRNA-Seq experiments, we used the ICELL8 platform previously established in our group connecting genotype to phenotype in individual cells ([87]16). In contrast to the chemistry used by droplet-based platforms (i.e. in 3′-end approaches), ICELL8 is based on the SMART full-length chemistry allowing for the full read coverage of transcripts. Notably, when single nuclei are processed with this platform and chemistry, a strong enrichment in pre-mRNA is observed, including comprehensive coverage of introns and exons along these pre-mRNAs (Figure [88]1A and [89]Supplemental Figure S1). Furthermore, utilization of full-length chemistry allowed strong detection of intergenic and intronic sequences, as well as of non-coding RNAs, especially long intergenic non-coding RNAs (lincRNAs) ([90]16,[91]30) as demonstrated in Figure [92]2E. Figure 2. [93]Figure 2. [94]Open in a new tab Identification of intronic variants using snRNA-seq data. (A) UMAP plot for three primary PDAC tumors and corresponding model's organoids and patient-derived Xenograft, PDXs (TM16, TM27 and TM56). (B) UMAP plot showing clustering by patients (primary PDAC and models) for TM16, TM27 and TM56. (C) Pie charts of the three PDACs (tumor, organoids and PDX) showing percentage of exonic and intronic mutations (above) and bar charts represent number of top intronic mutated genes (below). (D) Ti/Tv ratio of the variant analysis of primary PDAC and their corresponding model's organoids and patient-derived Xenograft, PDXs (TM16, TM27 and TM56). (E) Shared intronic variants and their classifications among all primary PDAC: TM16, TM27 and TM 56. Number of totals discovered and quality filtered intronic mutations from each sample (left), distribution of variant types among shared (78K) intronic mutations (middle), pie charts showing the distribution and location of the intronic variants in intronic, intergenic and coding regions, and a pie chart showing the intronic variants detected on coding and non-coding transcripts (right). For the identification of intronic mis-splicing variants, we developed a variant-calling pipeline called ‘DeepVarSplice’ that combines (i) snRNA-seq data determining the variant's position in the genome with (ii) bulk RNA-seq data mainly capturing mRNAs and thus, charting splicing events in our samples (Figure [95]1B). The pipeline begins with a pseudo-bulk snRNA-seq variant calling analysis using the per-sample GATK method (Figures [96]1B ([97]1–4)), followed by an intersection of the individual findings to identify exact mutations present in all samples. The variants are then classified according to their genomic position considering intergenic, intronic and exonic regions using the maf file (Mutation Annotation File). By filtering the intronic variants, the genes were normalized and ranked according to intronic mutation load. The intronic variants are then forwarded to two parallel branches of the pipeline. The first branch is set up for the investigation of these variants on the single cell level. Therefore, barcode-variant matrices containing the number of mutations for each gene (columns) and cell (rows) have been created to allow further transcriptional analysis related to the variants. Simultaneously, the second branch of the pipeline connects genes that exhibit intronic mis-splicing to pathogenic splicing events through the investigation of partial or total intron retention (IR) or pseudo-exon activation (PEA) using the IRFinder algorithm (Figure [98]1B ([99]5)). Finally, the intron mutated genes from the snRNA data were merged with the intron retained genes from bulk RNA-Seq data. As an outcome, we highlight genes showing mis-splicing related intronic variants that contribute to malignancy and thus proposed as potential therapeutic targets in pancreatic cancer. To ensure the reliability of the variants detected by DeepVarSplice, we conducted comprehensive whole-genome sequencing (WGS) on two primary PDAC samples, namely TM27 and TM56. Our analyses encompassed gold-standard variant discovery, rigorous filtering, and benchmarking, utilizing both WGS and snRNA-Seq data, as depicted in [100]Supplemental Figure S2. Out of the 153 000 shared variants detected by DeepVarSplice (snRNA-Seq data) in the two primary PDAC samples, 127 000 (83%) were confirmed using stringent criteria for variant calling in the WGS dataset. Both methods exhibited higher precision, with a Ti/Tv ratio of 2.3 for the snRNA-Seq data and 2.01 for the WGS data, thereby affirming the trustworthiness of the DeepVarSplice method. Identification of PDAC intronic variants using DeepVarSplice on snRNA-seq data UMAP visualization of single nuclei clustering from the primary PDAC tumors, organoids and PDXs (Figure [101]2A) was strongly influenced by the patient of origin of each sample. To evaluate the consistency in transcriptomics and intronic mis-splicing variants, we merged data from all PDAC models of each patient and regenerated separate UMAP plots for TM16, TM27 and TM56 (Figure [102]2B). Then, by applying the DeepVarSplice pipeline (Figure [103]1B) to this snRNA-seq data, we observed that > 95% of the mutations we identified were in non-coding sequences (i.e. in intronic or intergenic regions) and only ∼5% were found in protein-coding sequences. The most mutated genes at the level of introns included RBFOX1, CSMD1, WWOX, CNTNAP2 and LRP1B and were notably shared among PDAC tumors and models (Figure [104]2C). To identify bona fide intronic variants, we determined the Ti/Tv ratio in each primary tumor-, organoid- or PDX-derived dataset. Interestingly, primary tumor data showed higher precision and a more realistic Ti/Tv ratio (2.18) in comparison to organoid- and PDX-derived data (Figure [105]2D). We therefore decided to focus on the three tumor-derived datasets for all ensuing analyses, which we complemented with bulk RNA-Seq data generated for 24 primary PDAC tumors from the CRU5002 cohort, as well as with public normal and tumor pancreas tissues data (accession ID: [106]GSE211398). In the end, variants shared between all three-tumor snRNA-seq datasets amounted to ∼78 000 non-coding variants with 41 834 in introns, 25 449 in intergenic regions (IGR), 4390 in 3′ and 3882 5′ gene flanks, 610 in 3′ UTRs and 79 in 5′ UTRs (Figure [107]2E). All shared variants detected in the primary PDAC with QUAL > 30 is listed in [108]Supplemental Table S1. Notably, 90% of these variants (69 488) were reported in the dbSNP_RS database based on WGS data remarking the validity of the variant calling performed with the snRNA Seq data. Moreover, the remaining 7926 intronic variants identified as novel ones underscore the potential of snRNA-Seq datasets to detect both validated intronic variants and novel ones, demonstrating the added value of our approach. Of these, 98.82% (n = 76 499) qualified as SNPs, 0.66% (n = 510) as deletions, and 0.52% (n = 404) as insertions. Regarding the genomic annotation, most (n = 36 412) are mapped in protein-coding genes or lncRNAs (n = 10 926) and very few (n = 121) in miRNAs (Figure [109]2E and [110]Supplemental Table S1). Classification and validation of intronic mis-splicing variants based on location Next, to classify intronic variants we first performed a ranking based on the number of variants detected per gene normalized by the length and number of introns in each gene (top 20 visualized in [111]Supplemental Figure S3). Then, to link intronic variants to mis-splicing, we stratified mutations located 1–2 and 3–20 nt away from the nearest exon-intron junction, which we classified as donor/acceptor sites (SS) and branchpoint-proximal regions (BPs), polypyrimidine tracts (PPTs). All other variants >20 nt are categorized as ‘deep intronic’^5. (see Table [112]1 and [113]Supplemental Table S2). Based on these indexes, many genes showed a high number of deep intronic variants, and some of these were reported as tumor suppressors expressed in specific tissues or only in tumor cells, e.g. LRP1B, CSMD1, WWOX, FHIT, MTUS2, MAPK4 and MAP3K14 ([114]31–38) ([115]Supplemental Table S2). The most prominent of these was RBFOX1 and showed 208 deep intronic variants. The RBFOX family of RNA-binding proteins is well known to regulate alternative splicing (AS) ([116]39,[117]40). Recently, RBFOX2 was reported to modulate a metastatic AS signature in pancreatic cancer ([118]41). We next examined the fraction of intronic variants over all variants at each position (Figure [119]3A). 99.6% of them were located deep in introns. Variants located in the proximal region are 0.32% of the total, with just 0.02% near SSs. By conducting functional network analysis on the key genes identified as mutated in the splice site (SS), proximal, and deep intronic regions outlined in Table [120]1, we discovered connections among genes implicated in mRNA capping (HNRNPH1, INTS10, TBPL1, POLR2L, SNRNP35) ([121]42), as well as in the regulation of transcription via histone methylation and H3K4-specific histone methyltransferase activity (SMYD3, KMT2C, JMJD1C, ASH2L), as illustrated in Figure [122]3B. Pathway enrichment analysis carried out on genes mutated in the SS and proximal regions revealed enrichment in RNA polymerase II transcription elongation (REAC:R-HSA-75955) and the TP53 regulates transcription DNA repair genes pathway (REAC:R-HSA-6796648) as depicted in Figure [123]3D. Additionally, gene ontology analysis conducted exclusively on genes mutated deep within introns demonstrated enrichment in histone H3K4 trimethyltransferase activity (GO:0140999) and pathways associated with adenoid cystic carcinoma (WP:WP3651), also shown in Figure [124]3D. Finally, visualization of the distribution of intronic variants in top mutated genes WWOX, SMYD3, JMJD1C and NAV ([125]Supplemental Figure S1) exemplifies read coverage from snRNA-seq and bulk RNA-seq in primary PDAC tumors and how these can be superimposed to evaluate splicing events. Notably, DeepVarSplice identified 1132 genes showing a substantial density of intronic variants that could potentially be linked to abnormal splicing. These genes were selected based on the modified Z-score method. This method is particularly apt for the normalized variant numbers dataset, which exhibits a non-normal distribution. The modified Z-score uses the median and the Median Absolute Deviation (MAD), providing robustness against skewed distributions (Figure [126]3A). To validate our findings, we also employed recently developed deep learning based tools, specifically (i) Pangolin ([127]43) and (ii) SpliceAI ([128]44) designed for detecting potential intronic variants causing pathogenic splicing (Figure [129]3C). Out of the 78000 variants originally identified ([130]Supplemental Table S1), 295 variants affecting 107 genes were scored as potentially pathogenic by at least one of these tools (Figure [131]3C). Through a direct comparison involving Pangolin, SpliceAI, and DeepVarSplice, we identified 34 genes depicted as tumor suppressors in pancreatic cancer in ([132]Supplemental Figure S4). One explanation for the heightened sensitivity of variant detection using our approach is its ability to access variants deep within introns from the intronic regions of pre-mRNAs present in sn-RNA data. In fact, a limitation of SpliceAI and Pangolin is their optimization primarily for variants located within 50 bp on the splice site defined as SS and proximal intronic regions, affecting canonical splicing. Furthermore, it's important to acknowledge that both Pangolin and SpliceAI were developed predominantly using bulk-RNA sequencing methods, leading to potential gaps in the dataset due to the absence of intronic sequences. This limitation highlights the advantage of our approach in capturing a more comprehensive spectrum of intronic variants. Transcriptional regulation relates to mis-splicing in basal-like tumor cells We performed transcriptional subtyping of single tumor nuclei into the more aggressive/drug-resistant ‘basal-like’ (BL) or the better prognosis-associated ‘classical’ subtype (CLA) using a ranking markers method described previously on snRNA-seq data ([133]18). The distributions of each subtype in each tumor showed BL cells highly represented in primary tumors, and partially in matched PDXs. In contrast, CL cells predominated in our organoid models (Figure [134]4A). It is important to note that the resolution of our analysis of these patient-matched models PDX and organoids undergo quite some selection that can change a substantial part of their transcriptional profiles. Specifically, tissue samples prepared for organoid generation are only small parts of the whole tumor. The higher heterogeneity of PDAC primary tumors questions the reliability of substituting small pieces for whole tumor tissues especially replicating the complexity of the patient-specific environment, e.g. tumor stroma and tumor types possess distinct immune components and different cell quantities affecting the cell composition in the early stage of tumoroid cultures. Figure 4. [135]Figure 4. [136]Open in a new tab Tumor subtype and marker identification at the single-nucleus level. (A) UMAP plot of single cells from the three patients TM16, TM27 and TM 56 (colored by tumor model (left) and by inferred tumor subtype on CL and BL tumor pancreatic cells (right). (B) UMAP plot showing the gene expression of the top transcriptional markers comparing CL vs BL cells in primary PDACs, organoids and PDXs. (C) UMAP plot showing mutated cells (intronic variants) for CNTNAP2, CSMD1 and RBFOX1 in primary PDACs. Next, we performed differential expression analysis (DEG) between BL and CL tumor cells, where we identified a few hundred markers for both subtypes with absolute log[2]FC > 0.5 and P[adj] < 0.05 ([137]Supplemental Table S3) and visualized the most prominent ones in UMAP plots (Figure [138]4B). Tumor heterogeneity was assessed by evaluating cell types using snRNA-seq from all models at hand. As expected, primary tumors exhibited the highest heterogeneity as regards cell type composition while organoids mostly contained ductal-like cells (Figures [139]5A-[140]C). Strikingly, RBFOX1, CSMD1 and CNTNAP2, our topmost intronic mutated genes (Figures [141]1B and [142]4C) appear to be simultaneously the genes most upregulated in BL cells (Figure [143]5A). At the same time, the markers found in CL cells have already been described as biomarkers for pancreatic cancer: MALAT1, NEAT1, CEACAM6 or MUC1 ([144]45–48). For the lncRNA NEAT1, two novel mutations are reported here for the first time ([145]Supplemental Table S1). Taken together, CL cells and models are characterized by less abundance and relevance of intronic variants according to the output of our pipeline, thus suggesting that mis-splicing mechanisms are linked to the aggressiveness of BL tumors (Figure [146]5D). Figure 5. [147]Figure 5. [148]Open in a new tab Identification of markers for classical-like (CL) and basal-like (BL) tumor cells. (A) Heatmap showing gene expression of the top 15 BL and CL markers overexpressed in all PDACs: ‘T’ primary PDACs, ‘O’ organoids and ‘P’ PDXs. Venn diagrams illustrate the overlap of marker genes between tumor PDAC patient samples. (B) Dot plot showing expression of the markers in each of the pancreas cell subtypes. Circle size is proportional to the percentage of cells in each cell type expressing the marker and circle color represents the average marker gene expression in the cell type. (C) UMAP plots by patient showing clusters annotated to specific cell subtypes. The violin plots below represent prominent CL tumor pancreas markers across cell subtypes. (D) Table showing the number of intronic mutations of the top CL and BL marker genes. Contribution of intronic mis-splicing variants to pathogenic splicing and poor PDAC prognosis In general, mis-splicing affects the regulation of genes (up- or downregulation) or generates new isoforms in normal and tumor tissues ([149]49–51). However, our focus here is on the identification of potentially pathogenic splicing events in PDAC. It is important to note that the definition of pathogenicity in this study is based on comparisons to human references (via SNP calling) or to non-tumor