Abstract

   Most clinical diagnostic and genomic research setups focus almost
   exclusively on coding regions and essential splice sites, thereby
   overlooking other non-coding variants. As a result, intronic variants
   that can promote mis-splicing events across a range of diseases,
   including cancer, are yet to be systematically investigated. Such
   investigations would require both genomic and transcriptomic data, but
   there currently exist very few datasets that satisfy these
   requirements. We address this by developing a single-nucleus
   full-length RNA-sequencing approach that allows for the detection of
   potentially pathogenic intronic variants. We exemplify the potency of
   our approach by applying pancreatic cancer tumor and tumor-derived
   specimens and linking intronic variants to splicing dysregulation. We
   specifically find that prominent intron retention and pseudo-exon
   activation events are shared by the tumors and affect genes encoding
   key transcriptional regulators. Our work paves the way for the
   assessment and exploitation of intronic mutations as powerful
   prognostic markers and potential therapeutic targets in cancer.

Graphical Abstract

Graphical Abstract.

   [42]Graphical Abstract
   [43]Open in a new tab

Introduction

   Introns in human genes can be extraordinarily large (up to 1 Mbp, with
   ∼3400 being longer than 50 kb and ∼1200 longer than 100 kb), and
   account for half of the non-coding human genome. As mutations in
   introns do not directly affect protein-coding sequences, they are
   usually overlooked ([44]1–3). As a result, little attention is paid to
   the importance of intronic-located splicing regulatory elements that
   control the fidelity of pre-mRNA splicing and transcription timing.
   This is surprising given that pathogenic variants cause abnormal
   splicing changes, typically by damaging existing splicing motifs or
   creating novel splicing motifs and may comprise 15–60% of all human
   disease variants ([45]1–3).

   Recent studies have reported significant numbers of intronic variants
   and deletions in protein-coding genes that are associated with under-
   or overexpression of the affected genes or of distant genes interacting
   in 3D, thus influencing their regulation in normal or pathogenic
   conditions ([46]4–6). It is also well established that introns
   contribute to the control of gene expression by including regulatory
   regions and non-coding (yet functional) genes or even directly by their
   extensive length ([47]6). A substantial number of pathogenic variants
   located deep within introns (i.e. >100 bp from an exon-intron boundary)
   were recently reported, suggesting that sequence analysis of full
   introns may help to identify causal mutations for many undiagnosed
   clinical cases ([48]4–9). Moreover, direct associations between
   intronic mutations and certain diseases have also been reported, albeit
   sporadically ([49]7–9). These results agree with the findings we
   present here. For instance, we identified genes that are simultaneously
   overexpressed in basal-like cells from pancreatic cancer (PDAC) tumors
   and rank as the most mutated transcripts, particularly when we consider
   intronic variants. In this regard, PDAC-associated mutations were
   reported to synergize in tumorigenesis by globally altering the
   splicing program of cell ([50]10). Moreover, splicing factors were
   recently shown to either promote the early events in pancreatic
   tumorigenesis and resistance to chemotherapy or to limit the metastatic
   potential of PDAC cells ([51]11,[52]12). A very recent publication has
   identified a splicing signature specific to basal-like cells,
   distinguishing PDAC subtypes as accurate survival predictors when
   considered in the overall population of PDAC patients, as well as
   within homogeneous subtype cohorts, indicating their efficacy as
   biomarkers ([53]13).

   However, there exist many limitations when investigating intronic
   mis-splicing variants, the main ones being the lack of approaches
   capable of simultaneously interrogating genomic and transcriptional
   information, and the lack of guidelines designed for assessing intronic
   variants and their contribution to abnormal splicing changes in
   disease. We address these limitations by introducing a novel pipeline
   that utilizes full-length single-nuclei and bulk RNA sequencing
   strategy for the ‘deep’ characterization of genetic variability within
   introns and of their effects on splicing and gene expression in PDAC.
   Indeed, cancer is one of those diseases, where alternative splicing is
   the basis for the identification of novel diagnostics, and therapeutic
   strategies for therapy (e.g. antisense oligonucleotides or
   small-molecule modulators of spliceosome ([54]9,[55]14,[56]15)). This
   new approach therefore holds promise for both the elucidation of
   fundamental biological principles connected to splicing regulation, and
   the identification of therapeutic targets in human disease.

Materials and methods

Patients’ sample information

   Utilization and characterization of human PDAC data and samples within
   the CRU5002 has been approved by the ethical review board of the UMG
   (11/5/17). Informed consent was obtained from all subjects involved in
   the study. Sequencing studies and the generation of organoids and PDX
   models have been performed using tumor tissue from CRU5002 PDAC
   patients with progressed disease upon histological PDAC confirmation.

Generation of PDX models

   For the generation of PDX models, bulk tumor tissue was subcutaneously
   transplanted into SHO-prkdc^scidHr^hrmice. Engrafted subcutaneous
   tumors were passaged in mice for three generations prior to
   snRNA-sequencing.

Generation of organoids models

   Tumor tissue was minced and digested in Dulbecco's modified Eagle's
   medium containing 5 mg/ml of collagenase XI (Sigma-Aldrich, C9407),
   DNAse final concentration 10 μg/ml (SIGMA D5025-150KU) and Y-27632
   final concentration 10.5 μM (Adooq Bioscience, 129830-38-2) and
   incubated at 37°C for 45 min. The material was further embedded in
   Matrigel (Corning, New York, USA; Cat#356231) and cultured in human
   pancreatic cancer complete medium (Wnt3a, R-spondin1, Gastrin, hEGF, A
   83-01, hFGF-10, mNoggin, Primocin, N-acetylcystein, Nicotinamide, B27
   supplement and Y-27632). For passage, the Matrigel-containing organoid
   was digested by TrypLE™ Express (Thermo Fisher, 12605-028) with DNAse
   and Y-27632 as described above for 15 min. The sample was centrifuged
   at 500 × g for 5 min, and the precipitated cells were embedded in GFR
   Matrigel and cultured in human pancreas organoid complete feeding
   medium.

Nuclei extraction from primary PDAC and PDAC models

   Nuclei isolation of PDAC was performed according to the ‘Nuclei
   Isolation from Cell Suspensions & Tissues for Single Cell RNA
   sequencing’, Document Number [57]CG000124 Rev E, 10× Genomics, (30
   June 2021). For organoids, minor modifications were performed. The
   cells were spun down at 500 × g, lysis took place for 15 minutes and
   only two washing steps were performed. Following the second wash, the
   nuclei pellet was resuspended in 1 ml (about 0.03 oz) wash buffer in
   preparation for the ICELL8 protocol.

Full-length single-cell RNA-seq using ICELL8

   The Takara ICELL8 5184 nano-well chip was used with the full-length
   SMART-Seq ICELL8 Reagent Kit. Nuclei suspensions were
   fluorescent-labelled with Hoechst 33342 for 15 min prior to their
   dispensing into the Takara ICELL8 5184 nano-well chips. Cell Select
   Software (Takara Bio) was used to visualize and select wells containing
   single nuclei. Nine 5184 nano-wells chips were used for all samples and
   11 084 nuclei were processed for data analysis. Specifically, after
   quality control, 3416 nuclei were used for primary PDAC, 3037 for
   PDX and 4631 for organoids respectively. cDNA synthesis and library
   preparation were done according to description in previous study
   ([58]16). Libraries were sequenced on the HiSeq 4000 (Illumina) to
   obtain on average ∼ 0.3 Mio reads per nuclei (SE; 50 bp).

Bulk-RNA-Seq from primary PDAC samples

   Total RNA was extracted from FFPE tumor patient samples using the
   ReliaPrep™ FFPE Total RNA Miniprep System (Promega). RNA Integrity was
   determined using the Fragment Analyzer. Because of low RNA integrity
   (sizing from 50 to 140 bp), we performed a modified TruSeq Stranded
   Total RNA Library Prep Human/Mouse/Rat (Cat. No. 20020596) starting
   with 200 ng of total RNA. The modifications include (a) ignoring
   fragmentation step, (b) ligation optimization by adjusting adapters
   concentration during library preparation, (c) increasing PCR cycles (15
   in total) and eliminating primer dimers prior to sequencing (Agencourt
   AMPure XP magnetic beads, Beckman Coulter). Primary PDAC were sequenced
   on the NovaSeq6000; S4 flow cell PE 300 cycles generating a data set of
   50–400 Mio reads per sample.

Bulk-RNA-Seq from organoids and PDXs

   RNA libraries were prepared starting with 300 ng of total RNA using a
   non-stranded mRNA Seq (TruSeq RNA Library Preparation Cat. RS-122-2001)
   from Illumina according to the manufacturer's recommendations.
   Libraries were sequenced on the Illumina HiSeq 4000 (SE; 1 × 50 bp;
   30–35 Mio reads/sample).

Whole genome sequencing

   WGS data from two primary PDAC samples TM56 and TM27 were sequenced at
   ∼40× coverage on the Illumina NovaSeq 6000 sequencer following the
   protocol provided by the supplier. Libraries were performed using the
   PCR Free DNA library preparation from Illumina Cat. No.: 20041794).
   Alignment, variant calling, and benchmarking were performed using
   Illumina DRAGEN Germline pipeline 4.2.4.

Pre-processing of single-nuclei RNA-seq data

   Raw sequencing files were processed as described in ([59]16). Briefly,
   Cogent NGS analysis pipeline (CogentAP) from Takara Bio (v1.0) was
   applied for de-multiplexing and creating the gene expression matrices
   from each FASTQ file. Reads were aligned against the human genome
   GRCh38 v107 ([60]https://www.ensembl.org/Homo_sapiens/Info/Index).
   Quality control of the data has been done by using CogentDS QC as
   outlined in ([61]16). QC considered the amount of usable unique reads
   per nucleus (over 65%), the number of reads generated per nucleus (over
   250K), and the median mitochondrial content of the PDAC tumors of 18%
   (with 26% IQR-3). Intron regions of genes were also included in further
   evaluation. Generated gene matrices were used as input for the
   SingleCellExperiment R package (v3.0) ([62]17) to generate
   SingleCellExperiment objects for the subsequent downstream analysis.

Identification of disease subtypes and cell types

   To identify the tumor subtypes and cell types of which the single cells
   are composed, two different methods were employed. First, we used the
   marker-based method AUCell ([63]18) to identify the PDAC subtypes
   classical and basal-like tumor. This analysis was based on established
   marker genes ([64]19). AUCell ranks the genes in each cell by
   decreasing expression value, and marks cells according to their most
   expressed marker genes. Secondly, we performed cell type annotation for
   more refined subpopulations to address the heterogeneity of the tumor
   and its matched models. We utilized the reference data set as provided
   by the SeuratData R package (panc8.SeuratData) ([65]20) and the
   prediction function as implemented in the R package SingleR ([66]21).
   Downstream analyses performing the UMAP algorithm were done as
   implemented in CogentDS (v1.0) for dimensionality reduction and data
   visualization. To determine which genes were differentially expressed
   between tumor subtypes and cell types in a particular patient, Wilcoxon
   Rank Sum and Signed Rank Test was used, together with P values
   adjustment with Benjamini–Hochberg method.

Pseudo-variant calling in snRNA and bulk RNA-seq data

   Bam files resulting from CogentDS were used as input for the pseudo
   bulk variant calling using GATK best practices pipeline for RNAseq
   variant calling ([67]22). Consistent with the recommendations of GATK,
   duplicates were removed with Picard MarkDuplicates, and read groups
   were added with Picard AddOrReplaceReadGroups. Subsequently, Cigar
   reads were split into exon segment and hard-clip any sequences
   overhanging into the intronic regions with SplitNCigarReads. Variant
   calling was performed by HaplotypeCaller and all variants were then
   hard filtered by the following criteria: FS > 30 and QD < 2. Patient
   specific vcf files were intersected or merged using bcftools-isec to
   collect shared or unique mutations from the samples. Resulting variants
   were converted into maf (mutation annotation file) using Ensembl VEP
   ([68]23) annotation tool for the identification of the intronic,
   intergenic, and splice junction mutations (Figure [69]3A). Finally,
   mutations were grouped into three categories: SS, Proximal, and Deep by
   using Ensembl GTF file version 107. Total number of SS and Prox
   mutations were subtracted from total number of mutations in maf files
   to calculate the number of deep intronic variants. Negative values
   indicate that the deep intronic mutation affects multiple transcripts
   of the gene. Pre-Ranked enrichment analysis was performed by using
   mutation lists from three locational groups. Mutation numbers from SS
   and Proximal regions were normalized by using total intron number of
   each gene. Deep mutation numbers were normalized by per unit length
   (kb). Normalized values are ranked, and overall survival was calculated
   using Kaplan–Meier analysis based on TCGA-PAAD cohort. The result is
   shown as Kaplan–Meier plot with P value from log rank test generated by
   the cBioPortal ([70]24).

Figure 3.

   [71]Figure 3.
   [72]Open in a new tab

   Selection and classification of genes exhibiting highest numbers of
   intronic variants. (A) Mutation percentages of all discovered mutations
   based on the variant locations on the left (SS, Prox, Deep) log[2]
   converted counts of each group represented in y-axis, base-pair
   distances for SS and PROX regions and locational percentages for
   intronic region in x-axis. (B) Protein network representation with
   nodes colored according to the location of identified mutations. (C)
   Performance of deep learning-based methods for mis-splicing detection:
   Validation of DeepVarSplice with SpliceAI and Pangolin. (D) Pathway
   Enrichment Analysis of SS and Prox Intronic Mutations (top) and GO
   Enrichment Analysis Deep Intronic Mutations (bottom) using the
   g:Profiler online tool.

Integration of the SpliceAI and Pangolin scoring

   To perform scoring on the discovered variant list, hard filtered and
   intersected VCF files were used. Scores of each variant aggregated on
   their genes and maximum scores from both algorithms have been taken.
   Genes that scored with higher than 0.1 one of the tools have been
   selected as possibly mis-splicing related variants as it stated in
   their reference manual. SpiceAI v1.3 and Pangolin have been used in our
   analysis with default parameters with GRCh38 reference genome.

PubMed annotation of identified genes from different approaches

   The role of specific genes as tumor suppressors and their relevance to
   pancreatic cancer have been analyzed using PubMed data through Entrez
   Python API (BioPython v-1.83) ([73]25). Automated PubMed queries have
   been done, identifying articles that associate each gene with ‘tumor
   suppressor’ and ‘pancreatic cancer’ in their titles or abstracts
   related to the gene. The outcomes are then aggregated into a DataFrame
   and visualized in a heatmap to elucidate patterns of gene involvement
   in tumor suppression and pancreatic cancer.

Cell specific variant calling

   To perform single cell specific variant calling, bam files were
   separated into chip- and sample-specific bam files. Separated bam files
   were used for the variant calling pipeline, which was applied to
   individual bam file as before, without the base quality recalibration
   step (BQSR) to be able to get more variants possible per cell. Quality
   filtering was applied with the same quality filters as mentioned in
   pseudo-bulk RNA variant calling in the methods section. VCF files for
   each barcode were used to collect barcode-specific mutations and
   visualize as a mutation distribution plot using CogentDS UMAP
   functions.

Enrichment analysis of non-canonical (Deep) and canonical (SS-Prox) intronic
variants

   Genes with high Intronic mutation numbers (Table [74]1) were forwarded
   to over representation analysis (ORA) to identify enriched pathways and
   GO terms using the gProfiler online tool ([75]26). Canonical and
   non-canonical intronic mutations were separately analysed setting the
   P-value threshold to 0.05 under multiple correction method g:SCS.

Table 1.

   Classification of high intronic-mutated genes based on variant location
    Gene_ID  Number_ of_Total Number_ of_SS Number_ of_Prox Number_ of_Deep
   RBFOX1    208              0             0               208
   CSMD1     125              0             0               125
   WWOX      80               0             0               80
   BAGE2     75               0             0               75
   SMYD3     64               0             0               64
   MTUS2     61               0             0               61
   CDH13     58               0             0               58
   LRP1B     57               0             0               57
   MAGI2     57               0             0               57
   PTPRN2    52               0             0               52
   DLG2      52               0             0               52
   EXOC4     50               0             1               49
   JMJD1C    48               0             0               48
   LINC02055 48               0             0               48
   NRG1      47               0             0               47
   SLC30A10  47               0             0               47
   SNX29     47               0             1               46
   FHIT      47               0             0               47
   NRXN3     45               0             0               45
   HHAT      45               0             0               45
   ALK       44               0             0               44
   PRKN      43               0             0               43
   NOS1AP    43               0             0               43
   NAV2      44               0             1               43
   CNTNAP2   42               0             0               42
   KMT2C     41               0             0               41
   THSD4     41               0             0               41
   MAPK4     6                0             0               6
   GNAQ      15               0             0               15
   BCAS3     31               0             1               30
   OAS1      4                11            0               -7
   SUPT5H    3                1             0               2
   ASAH1     3                1             1               1
   POU6F2    14               1             0               13
   ACTG1     5                2             4               -1
   MDM2      4                1             0               3
   ASH2L     3                1             0               2
   EPHB6     1                5             3               -7
   C12orf43  1                1             1               -1
   RHNO1     2                1             0               0
   HNRNPH1   1                0             10              -9
   MGAT5B    10               0             7               3
   CARD8     1                0             9               -8
   MACF1     23               0             5               18
   COBLL1    11               0             6               5
   MFF       6                0             5               1
   MPHOSPH9  22               0             5               17
   FOXRED1   1                0             5               -4
   TBPL1     1                0             3               -2
   POLR2L    2                0             1               1
   ELOB      1                0             2               -1
   TM4SF5    5                0             4               1
   RPL26L1   4                0             5               -1
   SNRNP35   1                0             1               0
   INTS10    6                0             3               3
   RPL27     3                0             2               1
   DDR2      18               0             2               16
   ADAMTS12  12               0             1               11
   BCAT1     8                0             3               5
   [76]Open in a new tab

   Classification of highly intronic mutated genes based on the variant
   location: donor and acceptor splice sites (SS, 1–2 bp away from the
   nearest exon-intron junction); proximal site (3–20 bp away from the
   nearest exon-intron junction), and deep intronic (>20 bp away from the
   nearest exon-intron junction). Total number of intronic mutations based
   on genes described in [77]Supplemental Table S2.

Identification of intron retention events from RNA-Seq data

   Intron retention analysis was performed using IRFinder 1.3.0 ([78]27).
   IRFinder calculates IR-ratios to measure IR level reflecting the
   proportion of intron retaining transcripts. To compare PDAC samples to
   normal tissue, we downloaded fastq files from nine PDAC and nine
   healthy human pancreas samples from NBCI GEO database (accession ID:
   [79]GSE211398). These were pre-processed using the same methods as
   described in Bulk-RNA-Seq from primary PDAC. Tumor samples from
   [80]GSE211398 dataset were then analyzed using the IRFinder pipeline to
   identify IR events. Intronic depth for all samples normalized by CPM
   due to the library size variations between the public data and CRU
   samples. Due to the different number of samples between the two
   dataset, weighted means of IR scores were calculated to balance the
   variable sample sizes by using 0.1 threshold among 80% of all samples.
   Normalized intron depths were also used as further filtering to reduce
   the number of identified possible false negative discoveries.

Results

Strategy for the identification of intronic variants affecting splicing in
PDAC

   One of the most critical post-transcriptional mechanisms reprogramming
   transcriptional output and proteomic diversity in cancer cells is the
   loss of splicing precision when removing introns from pre-mRNAs
   ([81]28). Consequently, many mis-spliced variants are instead targeted
   for nuclear degradation or for nonsense-mediated mRNA decay (NMD) and
   thus, only few annotated alternative isoforms correspond to the
   precursors of the proteins mapped by large-scale proteomics studies
   ([82]29).

   To investigate this mechanism, we applied a new pipeline using a
   full-length single nuclei RNA-seq (snRNA-seq) approach to three primary
   PDAC tumors (TM16, TM27 and TM56) and matched tumor-derived preclinical
   PDAC models (i.e. organoids and subcutaneous patient-derived Xenograft,
   PDXs). We then sought to determine pathogenic intronic variants causing
   abnormal splicing in these patient samples (Figure [83]1A). All samples
   were part of a unique patient cohort recruited for the Clinical
   Research Unit 5002 ([84]https://gccc.umg.eu/en/cru-5002/).

Figure 1.

   [85]Figure 1.
   [86]Open in a new tab

   Strategy for the identification of intronic variants related to
   splicing in PDAC. (A) snRNA-seq using the ICELL8 platform and the SMART
   full-length chemistry allowing for full transcript coverage was applied
   to PDAC primary tumors (TM16, TM27 and TM56) and matched organoids and
   PDX lines. A total of nine 5184-nanowell chips were performed (three
   for primary PDAC, three for organoids and three for PDX respectively
   and visualized in the UMAP. Finally, data from all nine nanowell chips
   were merged. Bulk RNA sequencing was performed additionally for each of
   the samples. (B) DeepVarSplice variant calling pipeline combining
   snRNA-seq and bulk RNA-seq data determining intronic variant positions
   in the genome and intron retention events. 1) Gold standard RNA-seq
   variant calling pipeline pre-processing; 2) sample-level variant
   calling; 3) combination strategies for variant discovery
   (Shared-Combined); 4) Further filtering for RNA-editing variant
   removal; and 5) determination of shared intron retention events with
   weighted IR-ratio calculation between public and own PDAC samples.

   For our snRNA-Seq experiments, we used the ICELL8 platform previously
   established in our group connecting genotype to phenotype in individual
   cells ([87]16). In contrast to the chemistry used by droplet-based
   platforms (i.e. in 3′-end approaches), ICELL8 is based on the SMART
   full-length chemistry allowing for the full read coverage of
   transcripts. Notably, when single nuclei are processed with this
   platform and chemistry, a strong enrichment in pre-mRNA is observed,
   including comprehensive coverage of introns and exons along these
   pre-mRNAs (Figure [88]1A and [89]Supplemental Figure S1). Furthermore,
   utilization of full-length chemistry allowed strong detection of
   intergenic and intronic sequences, as well as of non-coding RNAs,
   especially long intergenic non-coding RNAs (lincRNAs) ([90]16,[91]30)
   as demonstrated in Figure [92]2E.

Figure 2.

   [93]Figure 2.
   [94]Open in a new tab

   Identification of intronic variants using snRNA-seq data. (A) UMAP plot
   for three primary PDAC tumors and corresponding model's organoids and
   patient-derived Xenograft, PDXs (TM16, TM27 and TM56). (B) UMAP plot
   showing clustering by patients (primary PDAC and models) for TM16, TM27
   and TM56. (C) Pie charts of the three PDACs (tumor, organoids and PDX)
   showing percentage of exonic and intronic mutations (above) and bar
   charts represent number of top intronic mutated genes (below). (D)
   Ti/Tv ratio of the variant analysis of primary PDAC and their
   corresponding model's organoids and patient-derived Xenograft, PDXs
   (TM16, TM27 and TM56). (E) Shared intronic variants and their
   classifications among all primary PDAC: TM16, TM27 and TM 56. Number of
   totals discovered and quality filtered intronic mutations from each
   sample (left), distribution of variant types among shared (78K)
   intronic mutations (middle), pie charts showing the distribution and
   location of the intronic variants in intronic, intergenic and coding
   regions, and a pie chart showing the intronic variants detected on
   coding and non-coding transcripts (right).

   For the identification of intronic mis-splicing variants, we developed
   a variant-calling pipeline called ‘DeepVarSplice’ that combines (i)
   snRNA-seq data determining the variant's position in the genome with
   (ii) bulk RNA-seq data mainly capturing mRNAs and thus, charting
   splicing events in our samples (Figure [95]1B).

   The pipeline begins with a pseudo-bulk snRNA-seq variant calling
   analysis using the per-sample GATK method (Figures [96]1B ([97]1–4)),
   followed by an intersection of the individual findings to identify
   exact mutations present in all samples. The variants are then
   classified according to their genomic position considering intergenic,
   intronic and exonic regions using the maf file (Mutation Annotation
   File). By filtering the intronic variants, the genes were normalized
   and ranked according to intronic mutation load. The intronic variants
   are then forwarded to two parallel branches of the pipeline.

   The first branch is set up for the investigation of these variants on
   the single cell level. Therefore, barcode-variant matrices containing
   the number of mutations for each gene (columns) and cell (rows) have
   been created to allow further transcriptional analysis related to the
   variants.

   Simultaneously, the second branch of the pipeline connects genes that
   exhibit intronic mis-splicing to pathogenic splicing events through the
   investigation of partial or total intron retention (IR) or pseudo-exon
   activation (PEA) using the IRFinder algorithm (Figure [98]1B ([99]5)).
   Finally, the intron mutated genes from the snRNA data were merged with
   the intron retained genes from bulk RNA-Seq data. As an outcome, we
   highlight genes showing mis-splicing related intronic variants that
   contribute to malignancy and thus proposed as potential therapeutic
   targets in pancreatic cancer.

   To ensure the reliability of the variants detected by DeepVarSplice, we
   conducted comprehensive whole-genome sequencing (WGS) on two primary
   PDAC samples, namely TM27 and TM56. Our analyses encompassed
   gold-standard variant discovery, rigorous filtering, and benchmarking,
   utilizing both WGS and snRNA-Seq data, as depicted in [100]Supplemental
   Figure S2. Out of the 153 000 shared variants detected by DeepVarSplice
   (snRNA-Seq data) in the two primary PDAC samples, 127 000 (83%) were
   confirmed using stringent criteria for variant calling in the WGS
   dataset. Both methods exhibited higher precision, with a Ti/Tv ratio of
   2.3 for the snRNA-Seq data and 2.01 for the WGS data, thereby affirming
   the trustworthiness of the DeepVarSplice method.

Identification of PDAC intronic variants using DeepVarSplice on snRNA-seq
data

   UMAP visualization of single nuclei clustering from the primary PDAC
   tumors, organoids and PDXs (Figure [101]2A) was strongly influenced by
   the patient of origin of each sample. To evaluate the consistency in
   transcriptomics and intronic mis-splicing variants, we merged data from
   all PDAC models of each patient and regenerated separate UMAP plots for
   TM16, TM27 and TM56 (Figure [102]2B). Then, by applying the
   DeepVarSplice pipeline (Figure [103]1B) to this snRNA-seq data, we
   observed that > 95% of the mutations we identified were in non-coding
   sequences (i.e. in intronic or intergenic regions) and only ∼5% were
   found in protein-coding sequences. The most mutated genes at the level
   of introns included RBFOX1, CSMD1, WWOX, CNTNAP2 and LRP1B and were
   notably shared among PDAC tumors and models (Figure [104]2C).

   To identify bona fide intronic variants, we determined the Ti/Tv ratio
   in each primary tumor-, organoid- or PDX-derived dataset.
   Interestingly, primary tumor data showed higher precision and a more
   realistic Ti/Tv ratio (2.18) in comparison to organoid- and PDX-derived
   data (Figure [105]2D). We therefore decided to focus on the three
   tumor-derived datasets for all ensuing analyses, which we complemented
   with bulk RNA-Seq data generated for 24 primary PDAC tumors from the
   CRU5002 cohort, as well as with public normal and tumor pancreas
   tissues data (accession ID: [106]GSE211398).

   In the end, variants shared between all three-tumor snRNA-seq datasets
   amounted to ∼78 000 non-coding variants with 41 834 in introns, 25 449
   in intergenic regions (IGR), 4390 in 3′ and 3882 5′ gene flanks, 610 in
   3′ UTRs and 79 in 5′ UTRs (Figure [107]2E). All shared variants
   detected in the primary PDAC with QUAL > 30 is listed in
   [108]Supplemental Table S1. Notably, 90% of these variants (69 488)
   were reported in the dbSNP_RS database based on WGS data remarking the
   validity of the variant calling performed with the snRNA Seq data.
   Moreover, the remaining 7926 intronic variants identified as novel ones
   underscore the potential of snRNA-Seq datasets to detect both validated
   intronic variants and novel ones, demonstrating the added value of our
   approach.

   Of these, 98.82% (n = 76 499) qualified as SNPs, 0.66% (n = 510) as
   deletions, and 0.52% (n = 404) as insertions. Regarding the genomic
   annotation, most (n = 36 412) are mapped in protein-coding genes or
   lncRNAs (n = 10 926) and very few (n = 121) in miRNAs (Figure [109]2E
   and [110]Supplemental Table S1).

Classification and validation of intronic mis-splicing variants based on
location

   Next, to classify intronic variants we first performed a ranking based
   on the number of variants detected per gene normalized by the length
   and number of introns in each gene (top 20 visualized in
   [111]Supplemental Figure S3). Then, to link intronic variants to
   mis-splicing, we stratified mutations located 1–2 and 3–20 nt away from
   the nearest exon-intron junction, which we classified as donor/acceptor
   sites (SS) and branchpoint-proximal regions (BPs), polypyrimidine
   tracts (PPTs). All other variants >20 nt are categorized as ‘deep
   intronic’^5. (see Table [112]1 and [113]Supplemental Table S2). Based
   on these indexes, many genes showed a high number of deep intronic
   variants, and some of these were reported as tumor suppressors
   expressed in specific tissues or only in tumor cells, e.g. LRP1B,
   CSMD1, WWOX, FHIT, MTUS2, MAPK4 and MAP3K14 ([114]31–38)
   ([115]Supplemental Table S2). The most prominent of these was RBFOX1
   and showed 208 deep intronic variants. The RBFOX family of RNA-binding
   proteins is well known to regulate alternative splicing (AS)
   ([116]39,[117]40). Recently, RBFOX2 was reported to modulate a
   metastatic AS signature in pancreatic cancer ([118]41).

   We next examined the fraction of intronic variants over all variants at
   each position (Figure [119]3A). 99.6% of them were located deep in
   introns. Variants located in the proximal region are 0.32% of the
   total, with just 0.02% near SSs. By conducting functional network
   analysis on the key genes identified as mutated in the splice site
   (SS), proximal, and deep intronic regions outlined in Table [120]1, we
   discovered connections among genes implicated in mRNA capping (HNRNPH1,
   INTS10, TBPL1, POLR2L, SNRNP35) ([121]42), as well as in the regulation
   of transcription via histone methylation and H3K4-specific histone
   methyltransferase activity (SMYD3, KMT2C, JMJD1C, ASH2L), as
   illustrated in Figure [122]3B. Pathway enrichment analysis carried out
   on genes mutated in the SS and proximal regions revealed enrichment in
   RNA polymerase II transcription elongation (REAC:R-HSA-75955) and the
   TP53 regulates transcription DNA repair genes pathway
   (REAC:R-HSA-6796648) as depicted in Figure [123]3D. Additionally, gene
   ontology analysis conducted exclusively on genes mutated deep within
   introns demonstrated enrichment in histone H3K4 trimethyltransferase
   activity (GO:0140999) and pathways associated with adenoid cystic
   carcinoma (WP:WP3651), also shown in Figure [124]3D.

   Finally, visualization of the distribution of intronic variants in top
   mutated genes WWOX, SMYD3, JMJD1C and NAV ([125]Supplemental Figure S1)
   exemplifies read coverage from snRNA-seq and bulk RNA-seq in primary
   PDAC tumors and how these can be superimposed to evaluate splicing
   events.

   Notably, DeepVarSplice identified 1132 genes showing a substantial
   density of intronic variants that could potentially be linked to
   abnormal splicing. These genes were selected based on the modified
   Z-score method. This method is particularly apt for the normalized
   variant numbers dataset, which exhibits a non-normal distribution. The
   modified Z-score uses the median and the Median Absolute Deviation
   (MAD), providing robustness against skewed distributions (Figure
   [126]3A).

   To validate our findings, we also employed recently developed deep
   learning based tools, specifically (i) Pangolin ([127]43) and (ii)
   SpliceAI ([128]44) designed for detecting potential intronic variants
   causing pathogenic splicing (Figure [129]3C).

   Out of the 78000 variants originally identified ([130]Supplemental
   Table S1), 295 variants affecting 107 genes were scored as potentially
   pathogenic by at least one of these tools (Figure [131]3C). Through a
   direct comparison involving Pangolin, SpliceAI, and DeepVarSplice, we
   identified 34 genes depicted as tumor suppressors in pancreatic cancer
   in ([132]Supplemental Figure S4). One explanation for the heightened
   sensitivity of variant detection using our approach is its ability to
   access variants deep within introns from the intronic regions of
   pre-mRNAs present in sn-RNA data. In fact, a limitation of SpliceAI and
   Pangolin is their optimization primarily for variants located within 50
   bp on the splice site defined as SS and proximal intronic regions,
   affecting canonical splicing. Furthermore, it's important to
   acknowledge that both Pangolin and SpliceAI were developed
   predominantly using bulk-RNA sequencing methods, leading to potential
   gaps in the dataset due to the absence of intronic sequences. This
   limitation highlights the advantage of our approach in capturing a more
   comprehensive spectrum of intronic variants.

Transcriptional regulation relates to mis-splicing in basal-like tumor cells

   We performed transcriptional subtyping of single tumor nuclei into the
   more aggressive/drug-resistant ‘basal-like’ (BL) or the better
   prognosis-associated ‘classical’ subtype (CLA) using a ranking markers
   method described previously on snRNA-seq data ([133]18). The
   distributions of each subtype in each tumor showed BL cells highly
   represented in primary tumors, and partially in matched PDXs. In
   contrast, CL cells predominated in our organoid models (Figure
   [134]4A). It is important to note that the resolution of our analysis
   of these patient-matched models PDX and organoids undergo quite some
   selection that can change a substantial part of their transcriptional
   profiles. Specifically, tissue samples prepared for organoid generation
   are only small parts of the whole tumor. The higher heterogeneity of
   PDAC primary tumors questions the reliability of substituting small
   pieces for whole tumor tissues especially replicating the complexity of
   the patient-specific environment, e.g. tumor stroma and tumor types
   possess distinct immune components and different cell quantities
   affecting the cell composition in the early stage of tumoroid cultures.

Figure 4.

   [135]Figure 4.
   [136]Open in a new tab

   Tumor subtype and marker identification at the single-nucleus level.
   (A) UMAP plot of single cells from the three patients TM16, TM27 and TM
   56 (colored by tumor model (left) and by inferred tumor subtype on CL
   and BL tumor pancreatic cells (right). (B) UMAP plot showing the gene
   expression of the top transcriptional markers comparing CL vs BL cells
   in primary PDACs, organoids and PDXs. (C) UMAP plot showing mutated
   cells (intronic variants) for CNTNAP2, CSMD1 and RBFOX1 in primary
   PDACs.

   Next, we performed differential expression analysis (DEG) between BL
   and CL tumor cells, where we identified a few hundred markers for both
   subtypes with absolute log[2]FC > 0.5 and P[adj] < 0.05
   ([137]Supplemental Table S3) and visualized the most prominent ones in
   UMAP plots (Figure [138]4B).

   Tumor heterogeneity was assessed by evaluating cell types using
   snRNA-seq from all models at hand. As expected, primary tumors
   exhibited the highest heterogeneity as regards cell type composition
   while organoids mostly contained ductal-like cells (Figures
   [139]5A-[140]C). Strikingly, RBFOX1, CSMD1 and CNTNAP2, our topmost
   intronic mutated genes (Figures [141]1B and [142]4C) appear to be
   simultaneously the genes most upregulated in BL cells (Figure [143]5A).
   At the same time, the markers found in CL cells have already been
   described as biomarkers for pancreatic cancer: MALAT1, NEAT1, CEACAM6
   or MUC1 ([144]45–48). For the lncRNA NEAT1, two novel mutations are
   reported here for the first time ([145]Supplemental Table S1). Taken
   together, CL cells and models are characterized by less abundance and
   relevance of intronic variants according to the output of our pipeline,
   thus suggesting that mis-splicing mechanisms are linked to the
   aggressiveness of BL tumors (Figure [146]5D).

Figure 5.

   [147]Figure 5.
   [148]Open in a new tab

   Identification of markers for classical-like (CL) and basal-like (BL)
   tumor cells. (A) Heatmap showing gene expression of the top 15 BL and
   CL markers overexpressed in all PDACs: ‘T’ primary PDACs, ‘O’ organoids
   and ‘P’ PDXs. Venn diagrams illustrate the overlap of marker genes
   between tumor PDAC patient samples. (B) Dot plot showing expression of
   the markers in each of the pancreas cell subtypes. Circle size is
   proportional to the percentage of cells in each cell type expressing
   the marker and circle color represents the average marker gene
   expression in the cell type. (C) UMAP plots by patient showing clusters
   annotated to specific cell subtypes. The violin plots below represent
   prominent CL tumor pancreas markers across cell subtypes. (D) Table
   showing the number of intronic mutations of the top CL and BL marker
   genes.

Contribution of intronic mis-splicing variants to pathogenic splicing and
poor PDAC prognosis

   In general, mis-splicing affects the regulation of genes (up- or
   downregulation) or generates new isoforms in normal and tumor tissues
   ([149]49–51). However, our focus here is on the identification of
   potentially pathogenic splicing events in PDAC. It is important to note
   that the definition of pathogenicity in this study is based on
   comparisons to human references (via SNP calling) or to non-tumor