Graphical abstract

   graphic file with name fx1.jpg
   [41]Open in a new tab

Highlights

     * •
       A comprehensive identification of onco-exaptation events in bladder
       cancer cell lines
     * •
       Transposable elements from LINE1 family contribute most events in
       bladder cancer
     * •
       Multiomics analysis validated L1PA2-SYT1 as tumor specific and
       prognosis relevant
     __________________________________________________________________

   Cancer; Genomics; Molecular mechanism of gene regulation;
   Transcriptomics

Introduction

   Transposable element (TE) is a type of repetitive sequence in human
   genes that accounts for about 45% of the whole genome length.[42]^1
   Although TEs, once considered as “junk DNA,” have been discovered to
   exhibit regulatory functions, their precise roles remain largely
   unknown. TEs can move within the genome by a mechanism called
   transposition. Based on different transposition mechanisms, TE can be
   divided into class-I transposons and class-II transposons. Class-I
   transposons primarily transpose through RNA intermediate. Depending on
   whether their structure contains long terminal repeat (LTR) sequences,
   they can be divided into LTR elements and non-LTR elements. Class-I
   transposons can also be classified as autonomous retrotransposons or
   non-autonomous retrotransposons based on whether they have open reading
   frames (ORFs) that encode proteins required for retro-transposition.
   For example, in non-LTR elements, long interspersed element (LINE) is
   autonomous, while short interspersed element (SINE) is non-autonomous.
   Class-II transposons, known as DNA transposons (DNA), mainly transpose
   by a “cut-and-paste” mechanism. The proportion and composition of TE
   differ among different species. In humans, TE mainly includes four
   classes: DNA, LTR, SINE, and LINE, and each class can be further
   subdivided into different families and subfamilies.[43]^2

   Recent studies have revealed a close relationship between TE and the
   development of tumors. Epigenetic dysregulation has been found in many
   tumors, including changes in methylation levels and histone
   modifications, which are related to TE
   activation.[44]^3^,[45]^4^,[46]^5 TE activation can further induce
   microsatellite instability and increased genomic
   alterations.[47]^6^,[48]^7 It can also affect DNA repair mechanisms
   mediated by genes such as BRCA1/2 and APC.[49]^8^,[50]^9 The expression
   products of TE can induce in vivo immune response, indicating potential
   therapeutic targets of tumor.[51]^10^,[52]^11

   In addition, TE can promote tumor development by participating in
   transcriptional regulation. TE insertions can introduce new splicing
   sites,[53]^12 change ORFs,[54]^13 or affect post-transcriptional
   regulation.[55]^14 Furthermore, regulatory elements present in TE
   sequences can be reactivated in tumors to promote oncogene expression,
   a mechanism known as “onco-exaptation.”[56]^15 For example, LTR
   contains promoter and enhancer elements.[57]^16 The THE1B subfamily
   from the MaLR-ERVL family of LTR class, can splice into oncogene CSF1R,
   driving full-length CSF1R transcription in Hodgkin’s lymphoma.[58]^17
   This event is represented as “THE1B-CSF1R,” indicating the combination
   of the TE subfamily and the gene. The 5-end untranslated region
   (5′-UTR) of LINE1 contains two oppositely oriented promoter sequences,
   with the forward promoter used for its own transposition and the
   reverse promoter driving the expression of adjacent genes.[59]^18 For
   instance, the L1PA2 subfamily of LINE1 drives oncogene MET expression
   in bladder cancer and colorectal cancer.[60]^5^,[61]^19 SINE elements
   are known to recruit RNA polymerase (RNAP) III,[62]^20 but they can
   accumulate mutations during evolution to produce new transcription
   factor-binding sites that recruit RNAP II, such as AluJb-LIN28B
   identified in lung cancer cell lines.[63]^21

   Recently, a pan-cancer study across 15 cancer types identified
   onco-exaptation events comprehensively.[64]^21 It is impressive that
   more than 1 onco-exaptation event exist in 77% of bladder tumors,
   second only to squamous cell lung cancer (89%), suggesting the
   significance of onco-exaptation in bladder cancers. However, the
   biological significance of only a few events has been experimentally
   elucidated. L1PA2-MET is one such event that involves a novel exon 1
   located in the intron 2 of the MET gene, and spliced into the exon 3 of
   MET, encoding a LINE1-driven MET isoform. In a study that includes ten
   bladder cancer cell lines, L1PA2-MET expression was significantly
   negatively correlated with the methylation level of L1PA2,[65]^5 and
   this event has also been identified in colon cancer.[66]^3 However,
   most of the frequent events have not been studied, mainly due to the
   incomplete knowledge of onco-exaptation in bladder cancer cell lines,
   which limits the use of this fundamental biological tool. Although
   TE-derived novel transcripts, including onco-exaptation events, were
   identified in 675 cell lines recently, only six bladder cancer cell
   lines were analyzed.[67]^22^,[68]^23

   To better utilize cell lines to study onco-exaptation events, we
   collected 137 RNA-seq data for 44 bladder cancer cell lines in six
   datasets and identified 44 events. We found that the cell lines
   reflected frequent events in primary tumors, including L1PA2-SYT1,
   L1PA2-MET, and L1PA2-XCL1, whose alternative promoters were all
   provided by L1PA2. We verified the splicing site of L1PA2-SYT1 and
   found that its expression was associated with poor prognosis in
   patients in an external dataset, potentially regulated by DNA
   hypomethylation.

Results

Cell line data collection and identity confirmation

   We searched for untreated bladder cancer cell lines in the SRA and ENA
   databases. In 137 datasets, a total of 44 different named cell lines
   were identified ([69]STAR Methods). 18 cell lines were sequenced in two
   or more datasets, with RT4 being sequenced in five datasets, and J82,
   T24, TCCSUP, and 5637 being sequenced in four datasets. HT1197, HT1376,
   SCABER, SW780, and UMUC3 were profiled in three datasets
   ([70]Figure 1A). In addition, SVHUC1 is an immortalized non-malignant
   urothelial cell line. Considering that cell line contamination is a
   common issue, we confirmed the identity of the cell lines using two
   methods ([71]STAR Methods). One was principal component analysis based
   on gene expression profiles to determine phenotypic similarities
   between cell lines. We found that the similarities between cell lines
   exceeded the effects of different sequencing strategies and
   experimental batches ([72]Figures 1B and [73]S1A). Another method was
   adapted from the CeL-ID approach to detect curated somatic mutations
   among cell lines ([74]STAR Methods). The same cell lines profiled in
   different datasets were highly similar in terms of curated somatic
   mutation profile ([75]Figure 1C). Besides, the background mutation
   correlation between different cell lines was very low, whereas derived
   cell lines and parental clones showed high similarity, such as RT4 and
   RT4v6, 253J and 253JBV, and 1A6 and 5637[76]^24 ([77]Figure 1C). Based
   on these results, we conclude that the possibility of
   cross-contamination in the studied cell lines is low.

Figure 1.

   [78]Figure 1
   [79]Open in a new tab

   Overview of 44 bladder cancer cell lines from six publicly available
   dataset

   (A) Description of the dataset containing cell line data, where rows
   represent dataset identifiers, columns represent Cellosaurus cell line
   names ([80]STAR Methods), and values represent the number of biological
   replicates; (B) Principal component analysis based on the top 5000
   highly variable genes, where the x and y axis represent PC1 and PC2,
   respectively. The same cell lines are annotated by the same color, and
   the same datasets are annotated by the same shape. Ellipse was
   estimated for cell lines profiled in at least three datasets; (C)
   Heatmap showing the pairwise Pearson correlation of somatic mutations
   curated in COSMIC v97 database between cell lines, where the distance
   metric is represented by 1-Pearson correlation coefficient. The heatmap
   color represents the correlation coefficient, and the cell line color
   coding is the same as in [81]Figure 1B. See also [82]Figure S1A.

Identification of onco-exaptation events in bladder cancer cell lines

   Afterward, we used the TEPROF2 algorithm to identify onco-exaptation
   events in the de novo transcript assemblies. Considering the potential
   high noise in short-read sequencing data during transcript assembly,
   TEPROF2 applied multiple heuristic filters to reduce false positive
   events ([83]STAR Methods). For the identified transcripts that met the
   criteria, we retained events that had TPM >1 in at least one sample and
   accounted for at least 25% of the total expression of the corresponding
   gene. 44 events were identified and summarized in [84]Table S1.
   Overall, we identified at least 1 onco-exaptation event in each cell
   line. We then compared the number of events identified between low
   grade (grades 1 and 2) and high grade (grade 3 and 4), and between
   luminal and basal subtypes, but found no significant differences
   ([85]Figures S1B and S1C). Interestingly, we found that the number of
   events identified in the low genomic instability (GIN) group was
   significantly higher than that in the high GIN group ([86]Figure S1D).
   Of the 44 events, 9 TE sequences were from intergenic regions (20.45%)
   and 35 were from intronic regions (79.54%) ([87]Figure S1E). 2 TE
   sequences were DNA class, 16 were LINE, 15 were LTR, and 11 were SINE,
   with events significantly enriched in LINE and LTR sequences
   (enrichment score: DNA 0.503, LINE 1.195, LTR 2.156, SINE 0.509)
   ([88]Figure S1F). We presented 17 events defined in at least 5 samples
   ([89]Figure 2A). We found that in cell lines where L1PA2-SYT1,
   AluJb-SALL4, L1PA2-XCL1, L1PA3-TYRP1, MLT2B1-VAV2, and THE1D-WNT5A
   exist, the overall expression of the corresponding genes was
   significantly elevated compared to cell lines without the events,
   suggesting that these TE sequences promoted strong transcriptional
   activity and played a role in driving the expression of oncogenes
   ([90]Figure 2A). Although the expression of L1PA2-XCL1 correlated well
   with the involved L1PA2 subfamily (Pearson’s r = 0.49), overall, we
   found that the onco-exaptation events did not show a stronger
   correlation with the involved TE subfamily than with uninvolved TE
   subfamilies (p = 0.15) ([91]Figures S2A and S2B). For onco-exaptation
   events that exist in at least three cell lines in each gender, we found
   L1PA2-MET exhibited elevated expression in cell lines originating from
   female patients, suggesting a gender-related influence
   ([92]Figure S2C). Besides, no evident association was found between
   germline mutations within the promoter region of the novel
   transcription start site (TSS) and the occurrence of an onco-exaptation
   event ([93]Figure S3).

Figure 2.

   [94]Figure 2
   [95]Open in a new tab

   Identification of onco-exaptation events in bladder cancer cell lines

   (A) Description of 17 onco-exaptation events identified in five or more
   samples. From left to right: 1. schematic structure of the top 5
   high-frequency events, 2. event names, 3. frequency of the event in 137
   cell line samples, 4. oncogene expression levels between cell line
   groups with and without the corresponding event, and 5. the proportion
   of gene expression accounted for by the event in cell lines where it is
   present. The error bar indicated the interquartile range. interqu∗p <
   0.05, ∗∗p < 0.01, ∗∗∗p < 0.001, ∗∗∗p < 0.0001, Wilcoxon rank sum test;
   ∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001. (B) Frequency distribution bar
   plot of 25 events identified in both TCGA and cell line data, arranged
   from top to bottom based on the frequency of detection in cell lines,
   with the frequency indicated at the top of each bar. The x axis
   representing the fraction of the event; (C) Enrichment scores of three
   high-frequency events originating from L1PA2 in the TCGA BLCA bladder
   cancer cohort against the TCGA pan-cancer background. See also
   [96]Figures S2–S4.

   To verify the significance of those onco-exaptation events, we
   quantified their expression in 22 cases of paired bladder tumor and
   normal samples. Notably, the most prevalent events, including
   L1PA2-SYT1, AluJb-SALL4, L1PA2-MET, and L1PA2-XCL1, demonstrated
   upregulated expression within the tumor samples ([97]Figure S4).
   L1PA2-MET is also expressed in adjacent normal samples, which is
   consistent with the previous result that L1PA2-MET is activated in
   pre-malignant normal tissue, a phenomenon named “field defect.”[98]^5
   Altogether, we found L1PA2-SYT1 expression nearly absent in adjacent
   normal samples, exhibiting high tumor specificity.

Cell lines can reflect the onco-exaptation patterns of tumor tissues

   Based on the TE’s location and splicing target exon, 25 candidates were
   identified as consistent between TCGA-BLCA cohort[99]^21 and the cell
   lines, including 13 frequent events shown in [100]Figure 2A and 8
   events strictly defined as “tumor specific” in TCGA-BLCA cohort (not
   exist in adjacent normal samples and exist in at least 4 samples or
   10-fold enriched in tumor samples),[101]^21 although it has been proven
   that onco-exaptation events can occur in adjacent normal tissues under
   the influence of “field defect.”[102]^5 By comparing the frequency of
   the 25 shared events, it was found that L1PA2-SYT1, L1PA2-MET, and
   L1PA2-XCL1 had the highest frequency in both datasets ([103]Figure 2B).
   Interestingly, the three most frequent candidates all came from L1PA2
   of the LINE1 family, which has been shown to provide abundant
   transcription factor-binding sites in other cancer cell lines.[104]^25

   In addition, we found that L1PA2-SYT1, L1PA2-MET, and L1PA2-XCL1 are
   not only highly frequent events in bladder cancer but are also enriched
   in bladder cancer against the background of pan-cancer[105]^21
   ([106]Figure 2C). L1PA2-SYT1 is predicted to be an in-frame event,
   widely exist in various cancers (11.6%), and is found in 44% of bladder
   cancers.[107]^21 The SYT1 gene is located on chromosome 12q21.2 and is
   a type of synaptic vesicle membrane fusion protein containing two
   protein kinase C homology sequences. In vitro experiments have shown
   that SYT1 overexpression can promote proliferation, migration, and
   invasion of colon cancer cell lines.[108]^26 However, the function of
   the SYT1 gene in bladder cancer is still unclear. L1PA2-MET is
   predicted to be an in-frame event,[109]^21 and LINE1-driven MET
   transcript isoforms have been demonstrated to exist in bladder cancer
   cell lines (GenBank: [110]BF208095.1).[111]^5 The promoter activity of
   this LINE1 is activated by hypomethylation and chromatin remodeling,
   and L1PA2-MET has been observed to be highly expressed in both cancer
   and adjacent tissues, but lowly expressed in normal bladder
   epithelium,[112]^5 which has been demonstrated as a diagnostic
   biomarker for bladder cancer.[113]^27 Recently, the role of MET in
   bladder cancer has been systematically reviewed.[114]^28 In addition,
   in colorectal cancer, the expression of MET is driven by LINE1,
   promoting tumor metastasis.[115]^3 Although XCL1 is a protein-coding
   oncogene, L1PA2-XCL1 is predicted to be a noncoding transcript, thus
   may have a different function compared with its coding
   counterpart.[116]^21

Heterogeneous L1PA2-SYT1 expression across different cell lines

   Considering the prevalence of L1PA2-SYT1 in bladder cancer cell lines
   and the unclear expression and function in bladder cancer, we focused
   on L1PA2-SYT1 in our next analysis. We arranged the bladder cancer cell
   lines first by the expression level of L1PA2-SYT1 and then by the total
   TPM of all SYT1 isoforms ([117]Figure 3A). Typical expression patterns
   of SYT1 gene and L1PA2-SYT1 transcript in bladder cancer cell lines
   could be categorized into three groups: 1) almost all the expression of
   SYT1 gene was driven by L1PA2, such as UMUC6, RT4, and HT1376; 2) SYT1
   was highly expressed and not driven by L1PA2, such as 5637, 253JP, and
   KU1919; and 3) SYT1 was lowly expressed, such as J82, T24, and
   immortalized normal cell line SVHUC1. The raw alignment confirmed the
   reads across the splicing junction between L1PA2 and the known exon
   (ENSE00001399568.1) of SYT1 ([118]Figure 3B). Considering the
   repetitive nature of L1PA2, we designed four pairs of primers to
   validate the novel splice junction (SJ) of L1PA2-SYT1 in RT4 and HT1376
   ([119]Table S4A). Gel electrophoresis proved the target products are
   located at 200 bp ([120]Figure 3C). The sequence was further confirmed
   identical as the prediction by Sanger sequencing of the PCR product
   ([121]Figure 3D). Then, we selected J82 as a negative control and
   demonstrated that the SJ did not exist in J82 using both RT-PCR and
   qPCR ([122]Figures 3E and 3F).

Figure 3.

   [123]Figure 3
   [124]Open in a new tab

   Heterogeneous L1PA2-SYT1 expression across different cell lines

   (A) The L1PA2-SYT1 expression group is arranged by L1PA2-SYT1
   expression level, while the non-expression group is arranged by the
   overall expression level of the SYT1 gene. Data are represented as
   mean ± SEM.

   (B) Sashimi plot showing the number of reads supporting the presence of
   the L1PA2-SYT1 splicing site. RT4 and HT1376 were selected as cell
   lines expressing L1PA2-SYT1, while J82 was selected as a control cell
   line that does not express L1PA2-SYT1. GAPDH was selected as a positive
   control. The x axis represents genomic coordinates. The bar plot shows
   the reads coverage. The curve connects the acceptor and donor sites of
   the splicing event, with the numbers on the curve indicating the number
   of reads supporting the splicing event; (C) Gel electrophoresis results
   of RT-PCR products using primers specific to the L1PA2-SYT1 splicing
   site. DNA lengths are labeled on the right. The products in the red box
   were cut and used for Sanger sequencing; (D) Sanger sequencing results
   of the PCR product shown in [125]Figure 3C. The position of the
   primers, the predicted TSS, and the predicted SJ was shown; (E) Gel
   electrophoresis results of RT-PCR products using primers specific to
   the L1PA2-SYT1 splicing site. J82 as a negative control that does not
   express L1PA2-SYT1. DNA lengths are labeled on the right; (F) The
   relative expression of L1PA2-SYT1 in J82, HT1376, and RT4 using qPCR.
   GAPDH was used as reference; (G) The alignment of 5′RACE fragments
   captured by a gene-specific primer that target the exon3 of L1PA2-SYT1.
   See also [126]Figure S5.

   TEPROF2 predicts the TSS and the full length of onco-exaptation using
   short-read sequencing data. Considering the low coverage at the ends of
   transcripts and the uncertainty in transcript assembly associated with
   short-read sequencing, we validated the TSS and full length of
   L1PA2-SYT1 using 3^rd generation, long-read transcriptome sequencing
   (ONT-seq), which revealed that the novel SJ of L1PA2-SYT1 exists in
   HT1376 and RT4 but is absent in J82 ([127]Figure S5A). Compared with
   the prediction of TEPROF2, the full-length L1PA2-SYT1 from both HT1376
   and RT4 exhibited longer 5′ ends ([128]Figure S5B). However, TEPROF2
   accurately predicts all exons of L1PA2-SYT1 ([129]Figures S5C and S5D).
   Besides, using a gene-specific primer, we confirmed with 5′RACE
   sequencing that the TSS of L1PA2-SYT1 indeed originates from the L1PA2
   element (chr12:79039622-79045645) ([130]Figures 3G and [131]S5C).

The expression of L1PA2-SYT1 is associated with DNA hypomethylation

   The activity of LINE1 sequences is regulated by various epigenetic
   mechanisms in the host, including methylation and histone
   modifications. Dysregulation of methylation in tumors may activate
   LINE1 sequences that are normally suppressed by methylation, thereby
   exhibiting promoter activity. We searched for CpG islands located in
   the promoter region, i.e., 2000 bp upstream to 500 bp downstream of the
   TSS of L1PA2-SYT1 and found that CpG:_22 (chr12:79045269-79045493) met
   the standard. Considering that low methylation of CpG islands in
   promoters is related to transcriptional activity, we hypothesized that
   the low methylation of this CpG island accounts for the heterogeneous
   expression of L1PA2-SYT1 across bladder cancer cell lines. The paired
   RRBS data in the bladder cancer cell line RNA-seq dataset in the CCLE
   dataset (SRA: SRP186887) were used to analyze the relationship between
   the CpG:_22 methylation level and L1PA2-SYT1 expression in 25 cell
   lines. The CpG:_22 contains four measurable CpG sites, i.e.,
   chr12:79045341, chr12:79045345, chr12:79045353, and chr12:79045362.
   After filtering out 6 cell lines with a minimum read coverage <5, we
   performed hierarchical clustering and defined two methylation groups.
   We found the group with overall low methylation level showed high level
   of L1PA2-SYT1 expression ([132]Figure 4A). We then calculated the
   average methylation value of four CpG sites and found that the average
   CpG:_22 methylation level was significantly negatively correlated with
   the expression of this transcript (R = −0.73, p < 0.001)
   ([133]Figure 4B). In contrast, in cell lines that did not express
   L1PA2-SYT1 (n = 11), the expression level of the SYT1 gene was not
   related to the methylation level of this CpG island ([134]Figure 4C).
   These results suggested that demethylation of LINE1 is one of the
   mechanisms that specifically regulates the upregulation of L1PA2-SYT1
   expression.

Figure 4.

   [135]Figure 4
   [136]Open in a new tab

   The expression of L1PA2-SYT1 is associated with DNA hypomethylation

   (A) Heatmap showing the methylation levels of four measurable CpG sites
   within the L1PA2-SYT1 promoter region CpG_22 in 19 bladder cancer cell
   lines from the CCLE project, with L1PA2-SYT1 expression levels labeled
   at the top of the heatmap; (B) Pearson correlation between the average
   methylation level of CpG_22 and L1PA2-SYT1 expression levels across the
   19 cell lines in the CCLbE project; (C) Pearson correlation between the
   average methylation level of CpG_22 in 11 cell lines that do not
   express L1PA2-SYT1 and the total expression level of SYT1.

High expression of L1PA2-SYT1 is associated with poor prognosis of MIBC

   We further explored the function of L1PA2-SYT1 in bladder cancer by
   ectopically expressing the fusion protein in J82 cells. It is
   interesting to find that the anti-synaptotagmin-1 monoclonal antibody
   (mAb) detects two bands of about 50 kDa, which could be two L1PA2-SYT1
   protein isoforms ([137]Figures 5A, 5B, [138]S6A, and S6B). Besides,
   L1PA2-SYT1-overexpressed J82 exhibits higher invasion capability
   compared with the negative control ([139]Figure 5C).

Figure 5.

   [140]Figure 5
   [141]Open in a new tab

   High expression of L1PA2-SYT1 is associated with poor prognosis of MIBC

   (A) qPCR validates the overexpression of L1PA2-SYT1 in transfected J82
   cell line; (B) Western blot found L1PA2-SYT1 could be translated into
   two protein isoforms near 50 kDa; (C) Invasion assay suggests enhanced
   invasion capability with increased level of L1PA2-SYT1. The scale bar
   indicated 250μm; (D) Comparison of the overall expression levels of the
   SYT1 gene between the L1PA2-SYT1-present and L1PA2-SYT1-absent groups
   in the validation cohort (n = 52). The dot denotes the expression level
   of SYT1 for individual samples (Wilcoxon rank-sum test). Boxplot shows
   the fraction of SYT1 expression contributed by L1PA2-SYT1 expression in
   the L1PA2-SYT1-present group. The error bar indicated the interquartile
   range; (E) Comparison of OS between the L1PA2-SYT1 exist and L1PA2-SYT1
   not exist groups in the validation cohort (n = 52), Log-rank test; (F)
   Forest plot showing the multivariable Cox analysis including tumor
   grade, N stage, L1PA2-SYT1 expression, and TP53 mutation in the
   validation cohort (n = 52). The error bar indicated the 95% confidence
   interval; (G) Volcano plot showing the differential gene expression
   analysis between the L1PA2-SYT1-present and L1PA2-SYT1-absent groups in
   the validation cohort (n = 52), with differentially expressed genes
   (|log2FoldChange| >5 and FDR <1E-3) labeled; (H) Network plot of GO
   enrichment analysis for the upregulated genes in the L1PA2-SYT1-present
   group in [142]Figure 5D, with similar semantic terms labeled with the
   same color. Black circles indicating further manual summarization. See
   also [143]Figure S6.

   To investigate whether the high expression of L1PA2-SYT1 is associated
   with poor prognosis in cancer, we retrieved RNA-seq data
   (SRA: PRJNA891747) from 52 muscle-invasive bladder cancer patients with
   follow-up information as the validation group.[144]^29 Again, we that
   found patients having L1PA2-SYT1 showed higher level of SYT1
   expression, which validated the driving effect of L1PA2 on SYT1
   ([145]Figures 2A and [146]5D). Also, the existence of L1PA2-SYT1 is
   associated with poor overall survival of patients ([147]Figure 5E).
   After adjusting for TP53 mutation, tumor grade, and lymph node
   metastasis stage, we found that the expression level of L1PA2-SYT1
   remained an independent risk factor for overall survival of patients
   ([148]Figure 5F).

   Finally, we analyzed the differentially expressed genes between exist
   group and not exist group of L1PA2-SYT1 ([149]Figure 5G). The results
   of Gene Ontology enrichment analysis showed that the exist group of
   L1PA2-SYT1 enriched genes related to cell cycle, methylation, retinoid
   metabolism, sodium ion transport, and the ERBB signaling pathway
   ([150]Figure 5H).

Discussion

   Bladder cancer is a common tumor in urinary system, and many studies
   have suggested TEs are activated in bladder cancer and may be involved
   in the development and progression of bladder cancer. Interspersed
   repeat elements, including LINEs and SINEs, have been shown to undergo
   abnormal hypomethylation in bladder cancer.[151]^30 LINE1 is the most
   studied TE in bladder cancer, and studies have found that the abnormal
   hypomethylation of LINE1 in tumors may be activated by reactive oxygen
   metabolism pathways, which further promotes the loss of tumor
   suppressor genes (e.g., CDKN2A[152]^31) and the activation of oncogenes
   (e.g., MET[153]^5), promoting the progression of bladder cancer. MET
   was first discovered to be activated by an ectopic promoter derived
   from LINE1 in leukemia, promoting the development of tumors, and was
   subsequently confirmed to be driven by an antisense promoter derived
   from LINE1 in bladder cancer.[154]^5 In our research, we found that
   L1PA2-MET, other than L1PA2-XCL1 and L1PA2-SYT1, exhibited a
   gender-related expression pattern. In a recent study, it was observed
   that L1PA2 elements contained abundant transcription factor-binding
   motifs, including estrogen receptor 1, in the breast cancer cell line
   MCF7.[155]^32 Therefore, we postulate the existence of transcriptional
   regulatory heterogeneity among L1PA2-derived transcripts. These results
   suggest the potential value of transposons in the diagnosis and
   prognosis of bladder cancer. In fact, LINE1-MET has been included in a
   diagnostic panel for bladder cancer.[156]^27

   Some studies have identified onco-exaptation events in the TCGA-BLCA
   cohort,[157]^21 but the function of most events has not been fully
   studied. One important reason is that whether the corresponding events
   exist or how frequently they occur is poorly understood in bladder
   cancer cell lines. Therefore, a comprehensive identification of events
   in bladder cancer cell lines can help us better utilize them to
   understand the biological importance of onco-exaptation. In this study,
   we collected 137 RNA-seq data from 44 bladder cancer cell lines from
   six different datasets publicly available and identified
   onco-exaptation events. We found that the events identified in tumor
   tissues could be validated in cell lines, and L1PA2 contributed the
   most events, which may be associated with the global hypomethylation of
   L1PA2 in bladder cancer, which could be reflected in the results of
   pathway enrichment analysis, where methylation-related pathways
   exhibited alterations. Furthermore, differential expression analysis
   revealed that the existence of L1PA2-SYT1 correlated with the activity
   of the cell cycle and the ERBB pathway. Recent study has confirmed that
   primate-specific transposable elements, including L1PA2, contribute to
   a significant number of open chromatin regions in decidual stromal
   cells. These regions contribute to the binding motif of the
   progesterone receptor and mediate regulatory effects on pathways like
   the cell cycle and ERBB signaling.[158]^33 Furthermore, we also
   identified an onco-exaptation event related to SALL4 (AluJb-SALL4).
   This event was also identified in the pan-cancer study, although the
   TSS is predicted to be contributed by a different TE subfamily (MLT1J).
   The function of SALL4 in bladder cancer remains unexplored, making this
   event an intriguing candidate for further investigation.[159]^21

   SYT1 protein is a member of the synaptotagmin family and is a membrane
   transport protein consisting of an N terminus, a variable link domain,
   a single transmembrane domain, and two C2 domains (C2A, C2B), with C2
   domains serving as Ca^2+-binding sites.[160]^34 Overexpression of SYT1
   was found to promote the invasion and metastasis ability of colon
   cancer cell lines in vitro. We found that L1PA2-SYT1 was the most
   frequent event in bladder cancer cell lines and was associated with
   poor prognosis in patients, and in vitro overexpression of L1PA2-SYT1
   enhances the invasion capability of bladder cancer cell line,
   suggesting the potential tumor-driving role and clinical application
   potential of L1PA2-SYT1.

Limitations of the study

   In this study, we observed widespread onco-exaptation event in all
   bladder cancer cell lines, highlighting the significance of these
   events in promoting bladder cancers. Although initial exploration
   unraveled epigenetic mechanisms in regulating such events, further
   evidence is required to elucidate the direct upstream regulator.
   Besides, multiple interesting events remained unexplored, like the
   potential effect of L1PA2-XCL1 after losing coding potential, and the
   association between AluJb-SALL4 identified in this study and
   MLT1J-SALL4 identified in TCGA-BLCA project.

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Antibodies
     __________________________________________________________________

   Rabbit monoclonal to Synaptotagmin-1 Cell Signaling Technology
   Cat#14558; RRID:[161]AB_2798510
   Rabbit monoclonal to GAPDH Cell Signaling Technology Cat#2118;
   RRID:[162]AB_561053
   Mouse Anti-Rabbit IgG (Light-Chain Specific) (D4W3E) mAb (HRP
   Conjugate) Cell Signaling Technology Cat#93702; RRID:[163]AB_2800208
     __________________________________________________________________

   Bacterial and virus strains
     __________________________________________________________________

   DH5alpha Vazyme, Jiangsu, China Cat#C502
     __________________________________________________________________

   Chemicals, peptides, and recombinant proteins
     __________________________________________________________________

   Ultra GelRed (10,000 ×) Vazyme, Jiangsu, China Cat#GR501
   VeZol Reagent Vazyme, Jiangsu, China Cat#R411
   ReverTra Ace -a- High Efficient Reverse Transcription Kit Toyobo,
   Osaka, Japan Cat#FSK-101
   KOD -Plus- Ver.2 Toyobo, Osaka, Japan Cat#KOD-211
   Gel Extraction Kit Cwbio, Jiangsu, China Cat#CW2302
   HiScript-TS 5'/3' RACE Kit Vazyme, Jiangsu, China Cat#RA101
   FastPure Cell/Tissue Total RNA Isolation Kit V2 Vazyme, Jiangsu, China
   Cat#RC112
   HiScript III All-in-one RT SuperMix Perfect for qPCR Vazyme, Jiangsu,
   China Cat#R333
   Taq Pro Universal SYBR qPCR Master Mix Vazyme, Jiangsu, China Cat#Q712
   FastPure Cell/Tissue DNA Isolation Mini Kit Vazyme, Jiangsu, China
   Cat#DC102
   Ligation Sequencing Kit Oxford Nanopore Technologies Cat#SQK-LSK110
   The PCR Barcoding Expansion 1-96 Oxford Nanopore Technologies
   EXP-PCB096
   LongAmp® Taq DNA Polymerase New England BioLabs, Massachusetts, US
   Cat#M0323S
   AMPure XP Beads Beckman Coulter Life Sciences Cat#A63880
   pCE2 TA/blunt-Zero vector Vazyme, Jiangsu, China Cat#C601
   EcoRI New England BioLabs, Massachusetts, US Cat#R3101S
   BamHI New England BioLabs, Massachusetts, US Cat#R3136S
   ClonExpress II One Step Cloning Kit Vazyme, Jiangsu, China Cat#C112
   Ampicillin (100mg/ml,1000X) Beyotime, Shanghai, China Cat#ST008
   EndoFree Mini Plasmid Kit II Tiangen Biotech, Beijing, China Cat#DP118
   PEI 40K Transfection Reagent Servicebio, Hubei, China Cat#G1802
   Pierce BCA Protein Assay Kit Thermo Scientific Cat#23225
   10× ice-bath free fast transfer buffer Servicebio, Hubei, China
   Cat#G2154
   Clarity Western ECL Substrate Biorad Cat#1705061
     __________________________________________________________________

   Deposited data
     __________________________________________________________________

   SRP343250 Sequence Read Archive (SRA) SRA: SRP343250
   SRP095491 Sequence Read Archive (SRA) SRA: SRP095491
   SRP265359 Sequence Read Archive (SRA) SRA: SRP265359
   SRP343258 Sequence Read Archive (SRA) SRA: SRP343258
   SRP186687 Sequence Read Archive (SRA) SRA: SRP186687
   SRP103878 Sequence Read Archive (SRA) SRA: SRP103878
   PRJNA891747 Sequence Read Archive (SRA) SRA: PRJNA891747
   PRJNA552055 Sequence Read Archive (SRA) SRA: PRJNA552055
   ONT-seq data for J82, HT1376, and RT4 Genome Sequence Archive
   (GSA)-human; This paper GSA: HRA005343
   Original code This paper Zenodo:
   [164]https://doi.org/10.5281/zenodo.8417976
     __________________________________________________________________

   Experimental models: Cell lines
     __________________________________________________________________

   J82 the Cell Bank of the Type Culture Collection of the Chinese Academy
   of Sciences (Shanghai, China) Cat#TCHu218; RRID:CVCL_0359
   HT1376 the Cell Bank of the Type Culture Collection of the Chinese
   Academy of Sciences (Shanghai, China) RRID: CVCL_1292
   RT4 the Cell Bank of the Type Culture Collection of the Chinese Academy
   of Sciences (Shanghai, China) Cat#TCHu226; RRID: CVCL_0036
     __________________________________________________________________

   Oligonucleotides
     __________________________________________________________________

   All primers used in this paper See [165]Table S1 N/A
     __________________________________________________________________

   Software and algorithms
     __________________________________________________________________

   R version 4.1.2 R project [166]https://www.r-project.org/
   fastp version 0.23.2 Github [167]https://github.com/OpenGene/fastp
   STAR version 2.7.10b Github [168]https://github.com/alexdobin/STAR
   Bismark version 0.24.0 Github
   [169]https://github.com/FelixKrueger/Bismark
   Methylkit version 1.24.0 Bioconductor
   [170]https://bioconductor.org/packages/release/bioc/html/methylKit.html
   tximport version 1.26.1 Bioconductor
   [171]https://bioconductor.org/packages/release/bioc/html/tximport.html
   Varscan2 version 2.3.9 Github [172]https://github.com/dkoboldt/varscan
   vcftools version 0.1.16 Github
   [173]https://github.com/vcftools/vcftools
   StringTie version 2.2.1 Github
   [174]https://github.com/gpertea/stringtie
   TEPROF2 version 0.1 Github [175]https://github.com/twlab/TEProf2Paper
   ggsashimi version 1.1.5 Github
   [176]https://github.com/guigolab/ggsashimi
   sangeranalyseR version 1.8.0 Bioconductor
   [177]https://bioconductor.org/packages/release/bioc/html/sangeranalyseR
   .html
   ggmsa version 1.3.4 Bioconductor
   [178]https://bioconductor.org/packages/release/bioc/html/ggmsa.html
   Salmon version 1.10.0 Github [179]https://github.com/COMBINE-lab/salmon
   TEtranscript version 2.2.3 Github
   [180]https://github.com/mhammell-laboratory/TEtranscripts
   DESeq2 version 1.38.3 Bioconductor
   [181]https://bioconductor.org/packages/release/bioc/html/DESeq2.html
   minimap2 version 2.26(r1175) Github
   [182]https://github.com/lh3/minimap2
   survival version 3.5-0 CRAN
   [183]https://cran.r-project.org/web/packages/survival/
   Integrative Genomics Viewer (IGV) version 2.16.0 Broad Institute
   [184]https://software.broadinstitute.org/software/igv/
     __________________________________________________________________

   Other
     __________________________________________________________________

   The Genome Reference Consortium Human Build 38 patch release 13
   (GRCh38.p13) GENCODE project
   [185]https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_
   42/GRCh38.primary_assembly.genome.fa.gz
   Comprehensive gene annotation (Gencode V42) GENCODE project
   [186]https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_
   42/gencode.v42.annotation.gtf.gz
   the Catalogue Of Somatic Mutations In Cancer (COSMIC) v97 database the
   COSMIC database [187]https://cancer.sanger.ac.uk/cosmic
   dbSNP build 150 National Center for Biotechnology Information (NCBI)
   [188]https://www.ncbi.nlm.nih.gov/snp/
   repeatmasker track UCSC Genome Browser [189]https://genome.ucsc.edu/
   [190]Open in a new tab

Resource availability

Lead contact

   Further information and requests should be directed to the lead
   contact, Chuanliang Xu (chuanliang_xu@126.com).

Materials availability

   This study did not generate new unique reagents.

Data and code availability

   3^rd generation long-read transcriptome RNA-seq data have been
   deposited at GSA-Human and are publicly available as of the date of
   publication. Accession numbers are listed in the [191]key resources
   table. This paper analyzes existing, publicly available data. These
   accession numbers for the datasets are listed in the [192]key resources
   table.

   All original code has been deposited at Zenodo and is publicly
   available as of the date of publication. DOIs are listed in the
   [193]key resources table.

   Any additional information required to reanalyze the data reported in
   this paper is available from the [194]lead contact upon request.

Experimental model and subject details

Cell lines

   The RT4, HT1376, and J82 cell lines were purchased from the Cell Bank
   of the Type Culture Collection of the Chinese Academy of Sciences
   (Shanghai, China) in December 2022. HT1376 and J82 were cultured in MEM
   medium (Gibco, C11095500BT), and RT4 was cultured in RPMI-1640 medium
   (Gibco, C11875500BT). All culture media were supplemented with 10%
   fetal bovine serum (Gibco, 10099-141) and 1% penicillin/streptomycin
   (Gibco, 15140122) and cells were grown at 37°C with 5% CO2. RT4 and J82
   come from male patients, and J82 comes from a female patient. All cel
   lines have been authenticated through short tandem repeat (STR)
   profiling and have been tested to be free from mycoplasma
   contamination.

Method details

Genome build and genomic locations

   The Genome Reference Consortium Human Build 38 patch release 13
   (GRCh38.p13) of the human genome was used. Only the reference
   chromosomes were analyzed, while all genetic elements that mapped to
   scaffolds, assembly patches and alternate loci (haplotypes) were
   excluded from this study. All genomic locations were described using
   hg38 coordinates.

The collection and processing of datasets

   We searched for transcriptome sequencing (RNA-seq) data for bladder
   cancer cell lines in the SRA and ENA datasets using the query
   "((cell line AND "biomol rna"[Properties] AND "platform
   illumina"[Properties])) AND ((((bladder) OR urothelia) OR urothelium)
   AND "biomol rna"[Properties] AND "platform illumina"[Properties])".
   Cell lines were selected by reviewing the sample processing methods in
   the corresponding literature and datasets, and those treated with DMSO
   or empty vectors were excluded, leaving only untreated cell lines. Cell
   line names were standardized according to the Cellosaurus database
   ([195]https://www.cellosaurus.org/). TCCSUPB from the CCLE database was
   renamed TCCSUP, and BV from the CCLE database was renamed 253JBV. 253JP
   and 253J were collectively referred to as 253J. Whole genome sequencing
   data for bladder cancer cell lines were searched for in the CCLE
   project (SRP186687). Cell line annotations were referenced from Tahlita
   et al. 2018[196]^24 and Julie et al. 2015,[197]^35 which reviewed the
   grading, staging, and molecular subtypes of 127 bladder cancer cell
   lines and the genomic instability (GIN) of 40 bladder cancer cell
   lines, respectively. The 22 cases of paired bladder tumors and adjacent
   normal samples were downloaded from the SRA database through project ID
   PRJNA552055. The MIBC data was downloaded from the SRA database through
   project ID PRJNA891747, while the corresponding follow-up data for the
   patients were obtained from the original publication27. The publicly
   available raw fastq data were downloaded using Aspera and
   quality-controlled using fastp v0.23.228, with adapters and low-quality
   sequences removed. Detailed quality control metrics of all cell line
   RNA-seq datasets were provided in [198]Table S2. STAR v2.7.10b was used
   for 2-pass alignment based on the hg38 genome build and Gencode V42
   transcript annotation ([199]gencodegenes.org). For novel splice
   junction (SJ) events identified in all samples during 1-pass mapping
   (stored in the SJ.out file), canonical and intron motifs found having
   at least 15 unique alignments in 5 samples were retained. These novel
   junctions were then input along with the original SJ.out files in the
   2-pass mapping for each sample. The mRRBS methylation data of bladder
   cancer cell lines in the CCLE project were downloaded. Detailed quality
   control metrics of all cell line datasets were provided in
   [200]Table S3. After quality control, the mRRBS data were aligned using
   Bismark v0.24.029, and the methylation status of CpG sites were
   analyzed using the R Methylkit v1.24.0 package30, with the “mincov”
   parameter set to 5.

Mutation calling

   To exclude cell line contamination and identity confusion, the identity
   of the cell lines was identified using both transcriptome-based
   phenotype and mutation-based methods. The fast quantification was
   performed using tximport v1.26.131, and the top 5000 variable genes
   were used to plot the principal component analysis (PCA) to observe if
   the same cell lines exhibit cross-dataset similarity. The
   mutation-based method was adapted from Cel-ID.[201]^36 Varscan2
   v2.3.933 was used to identify mutations in RNA-seq data and WGS data.
   For mutation calling in RNA-seq data, vcftools34 was used to retain
   mutations with DP>10, FREQ>25, only locating in the exon regions, and
   have been recorded in COSMIC v97 database
   ([202]https://cancer.sanger.ac.uk/cosmic). Finally, the distance
   between the data was described using the 1-Pearson correlation
   coefficient of the allele fraction (AF) and the correlation heatmap was
   plotted. For germline mutation calling in WGS data, only germline
   mutations curated in dbSNP build 151 were called.

Identification and quantification of onco-exaptation events

   Onco-exaptation events were identified using TEPROF2
   ([203]https://doi.org/10.5281/zenodo.7670515). Detailed methodology has
   been described before23. In brief, StringTie v2.2.1 was used to perform
   de novo assembly on the STAR 2-pass alignment results35. Based on the
   GTF assembly, TEPROF2 searched for onco-exaptation events in assembled
   transcripts that originated from TE sequences (annotated in
   Repeatmasker as LTR, LINE, SINE, and DNA) and were spliced into known
   exons of transcripts annotated as “appris principal” in Gencode V42. To
   remove false positive TE-exaptation structures, multiple heuristic
   filters were applied to candidates, including: 1. First exon
   length < 99th percentile of known first exon length in Gencode V42, and
   the first exon contains a retained intron; 2. TE has at least 10 unique
   alignments, and at least one pair-end fragment crosses the TE-gene
   splicing site; 3. Transcript is present in two or more samples; 4.
   Suspected exonization of TE is excluded (more than 15% of reads end
   within TE rather than start within TE); The cutoff values for all
   filters in the software were kept as default. The class, family, and
   subfamily of all events, the genes spliced into by TE were recorded.
   StringTie ballgown module implemented within the TEPROF2 pipeline was
   used to quantify the TPM values of onco-exaptation events and the total
   expression of the corresponding genes. The upstream intron read count
   of splice targets was obtained using StringTie ballgown. If an event
   had sufficient expression (>1 TPM) and reads crossing the splicing site
   in a sample, it was identified as “exist” in the sample. The novel SJ
   from TE to splice target was visualized using ggsashimi v1.1.536. The
   distribution of onco-exaptation events compared to the relative
   presence of each class of TEs in the human genome was quantified with
   Fisher enrichment score using the formula (ad) / (bc), where 'a' is the
   number of onco-exaptation events originating from a specific TE class,
   'b' is the total number of onco-exaptation events, 'c' is the number of
   copies of the specific TE class in the entire genome, and 'd' is the
   number of copies of all TE classes in the entire genome.

Quantification of gene expression

   Gene expression was quantified with Salmon version 1.10.0. The library
   type was detected automatically by the software. Salmon quantifies and
   outputs TPM abundance of transcripts annotated in the Gencode V42. The
   raw count matrix at gene-level was generated by tximport using the TPM
   abundance and length information of each transcript. The raw count
   matrix at gene-level was used for principal component analysis (PCA)
   and differential expression analysis.

Quantification of TE expression

   TE expression at subfamily level was quantified with TEtranscript
   version 2.2.3 using the ‘TE count’ module. We used the default
   parameter expect for the ‘--stranded’ parameter, which was set
   according to whether the RNA-seq library was first-stranded or
   unstranded ([204]Table S2). TEtranscripts quantifies and outputs raw
   counts of genes annotated in the Gencode V42 and TE subfamilies
   annotated in Repeatmasker as LTR, LINE, SINE, and DNA simultaneously.
   The raw count matrix at gene- and TE-level were used for principal
   component analysis (PCA).

PCA

   The raw count matrix, generated by either Salmon or TEtranscripts, was
   normalized using the variance stabilizing transformation (vst)
   algorithm, and was adjusted for dataset-derived batch effect using the
   removeBatchEffect function from the limma package. The top5000 most
   variable features were selected based on the standard deviation of the
   normalized, batch-corrected expression matrix and used for calculating
   the principal components. The 1st and 2nd principal components were
   displayed on the scatter plot.

Cell culture

   The RT4, HT1376, and J82 cell lines were purchased from the Cell Bank
   of the Type Culture Collection of the Chinese Academy of Sciences
   (Shanghai, China) in December 2022. HT1376 and J82 were cultured in MEM
   medium (Gibco, C11095500BT), and RT4 was cultured in RPMI-1640 medium
   (Gibco, C11875500BT). All culture media were supplemented with 10%
   fetal bovine serum (Gibco, 10099-141) and 1% penicillin/streptomycin
   (Gibco, 15140122) and cells were grown at 37°C with 5% CO2. The cell
   lines were authenticated using STR profiling.

PCR and qPCR analysis

   Total RNA was isolated from RT4, HT1376, and J82 cells using RNA
   extraction reagent (R401, Vazyme, Jiangsu, China) and subjected to
   reverse transcription-polymerase chain reaction (RT-PCR) for
   complementary DNA (cDNA) synthesis (FSK-101, Toyobo, Osaka, Japan)
   according to manufacturer’s instructions. Polymerase chain reaction
   (PCR) (KOD-211, Toyobo, Osaka, Japan) was performed to amplify DNA
   sequence specific to L1PA2-SYT1 using four pairs of primers spanning
   L1PA2-SYT1 junction ([205]Table S4A). The PCR product was then
   subjected to electrophoresis on a 2% agarose gel containing 1× GelRed
   DNA Dye (GR501, Vazyme, Nanjing, China). DNA band was visualized and
   cut under ultraviolet light for DNA purification using a gel extraction
   kit (CW2302, Cwbio, Jiangsu, China). The purified DNA was
   Sanger-sequenced on the ABI 3730XL platform (Applied Biosystem,
   California, USA). The sequencing data were processed and visualized
   using the R packages sangeranalyseR v1.8.037 and ggmsa v 1.3.438,
   respectively. To quantify the expression of L1PA2-SYT1, total RNA
   isolated from J82, RT4, and HT1376 cell lines was subjected to reverse
   transcription for qPCR (R333, Vazyme, Jiangsu, China). The expression
   of L1PA2-SYT1 relative to GAPDH was detected with SYBR (Q712, Vazyme,
   Jiangsu, China) using the ΔΔCt method (QuantStudio 6Pro/6 Flex,
   ThermoFisher Scientific) ([206]Table S4B). Forward primers were
   designed for exon1 and reverse primers for exon2-10 to validate the
   full-length of predicted L1PA2-SYT1 by PCR ([207]Table S4C). To
   validate L1PA2-SYT1 being an alternative splicing event other than
   genomic fusion event, genomic RNA was also isolated from J82, RT4, and
   HT1376 cell lines (DC102, Vazyme, Jiangsu, China) and subjected to PCR
   electrophoresis (data not shown).

3rd generation long-read transcriptome sequencing

   1ug total RNA was prepared for cDNA libraries using cDNA-PCR Sequencing
   Kit (SQK-LSK110+EXP-PCB096) protocol provided by Oxford Nanopore
   Technologies (ONT). Briefly, the template switching activity of reverse
   transcriptase enrich for full-length cDNAs and add defined PCR adapters
   directly to both ends of the first-strand cDNA. And following cDNA PCR
   for 14 circles with LongAmp Tag (New England BioLabs, Massachusetts,
   US). The PCR products were then subjected to ONT adaptor ligation using
   T4 DNA ligase (New England BioLabs, Massachusetts, US). Agencourt XP
   beads was used for DNA purification according to ONT protocol. The
   final cDNA libraries were added to FLO-MIN109 flowcells and run on
   PromethION platform at Biomarker Technology Company (Beijing, China).
   The data has been deposited in Genome Sequence Archive (GSA)-Human with
   the accession number [208]HRA005343.

5’RACE analysis

   5’RACE gene-specific primer (GSP) was designed to target the 3rd exon
   of the predicted L1PA2-SYT1 ([209]Table S4D). The PCR product was
   amplified with 5’/3’ RACE reagent kit (Cat#RA101, Vazyme), enriched
   with electrophoresis, extracted with gel extraction kit (Cat#CW2302,
   Cwbio), connected with the pCE2 TA/blunt-Zero vector (Cat#C601,
   Vazyme), cloned into DH5α competent cell (Cat#C502, Vazyme), and
   Sanger- sequenced on ABI 3730XL platform using the M13R primer
   (5'-CAGGAAACAGCTATGAC-3'). The sequenced reads were mapped using
   minimap2 version 2.26-r1175 and visualized using Integrated Genomics
   Viewer (IGV).

Overexpression of L1PA2-SYT1 in J82 cell line

   The predicted L1PA2-SYT1 sequence was amplified and cloned into
   pLVX-IRES-Puro plasmids. Specifically, primers were designed according
   to the predicted L1PA2-SYT1 sequence. PCR (KOD-211, Toyobo, Osaka,
   Japan) reaction was performed using RT4 cell line cDNA as the template
   to synthesize the L1PA2-SYT1 sequence. Double digestion of
   pLVX-IRES-Puro plasmids was performed using restriction enzymes EcoRI
   and BamHI (R3101S & R3136S, New England BioLabs, Massachusetts, US)
   according to the manufacturer’s instructions. Predicted L1PA2-SYT1 PCR
   product and linearized vectors were subjected to electrophoresis on 1%
   agarose gel and purified in the same manner as the above-mentioned
   method. Linearized vector and the PCR products was subjected to
   recombination using recombinase (C112, Vazyme, Nanjing, China). The
   recombinant plasmid was transformed into DH5α competent E. coli cells
   (C502, Vazyme, Nanjing, China) and plated on LB plate containing
   100μg/ml ampicillin (ST008, Beyotime, Shanghai, China). Single colonies
   were picked from the plate using pipette tip and transferred into LB
   broth containing 100μg/ml ampicillin. Plasmid DNA was extracted from
   the expanded culture using extraction kit (DP118, Tiangen Biotech,
   Beijing, China) and Sanger-sequenced on ABI 3730XL platform. After
   validation of recombinant pLVX-IRES- L1PA2-SYT1-Puro plasmid, it was
   co-transfected into 293T cell line along with packaging plasmids pMD2.G
   and psPAX2 using PEI 40K reagent (G1802, Servicebio, Hubei, China) for
   lentivirus packaging. Medium containing lentiviral particles was
   harvested and transferred to J82 cell line for lentivirus infection.
   Puromycin-containing medium was added to J82 cell 48h after lentivirus
   infection for selection. Expression of L1PA2-SYT1 in J82 cell line was
   validated using PCR and Western blotting.

Invasion assays

   Invasion assays were conducted in Transwell chamber with Matrigel
   (Corning). Tumor cells containing 4×10E4 were seeded inside Transwell
   inserts with 200μl culture medium without FBS in triple. As a
   chemoattractant, 600μl culture media containing 10% FBS was placed in
   the lower chamber. After 24h, 4% formaldehyde fixed the lower surface
   of the filters, then the filters were stained with 0.1% crystal violet
   solution.

Western blot assay

   The cells were lysed in buffer containing 50 mM Tris-HCl (pH 7.4),
   150 mM NaCl, 1 mM EDTA-2Na2, 0.25% Sodium deoxycholate, 1% NP-40, and
   1X protease inhibitor. After centrifugation at 12,000rpm for 5min at
   4°C, the supernatants were recovered and total protein concentrations
   were measured using a bicinchoninic acid assay (Pierce BCA Protein
   Assay Kit, Thermo Scientific). For immunoblot analysis, extracted
   proteins were separated on 10% PAGE Gel (10% PAGE Gel Fast Preparation
   Kit, sEpizyme) under 60V for 30min and 110V for 60min and transferred
   onto PVDF membranes with 1× ice-bath free fast transfer buffer (10×
   ice-bath free fast transfer buffer, Servicebio) under 400mA for 30min.
   The membranes were blocked with 1x protein-free rapid blocking buffer
   (Servicebio) for 30min and incubated overnight at 4°C with a
   synaptotagmin-1 rabbit monoclonal antibody (mAb) (1:3000, cat#14588,
   Cell Signaling Technology) and an GAPDH rabbit mAb (1:5000, Cat#2118,
   Cell Signaling Technology) as a loading control. Primary antibodies
   were detected with horseradish peroxidase (HRP)-conjugated
   species-specific secondary antibody (1:3000, Cat#D4W3E, Cell Signaling
   Technology) and ECL Substrate (Clarity Western ECL Substrate,
   Cat#1705061, Biorad).

Survival analysis

   The survival analysis was performed with survival version 3.5-0. The
   separation of survival probability between the exists and not exists
   group was visualized using the Kaplan-Meier curve and tested by
   log-rank test. Multivariable Cox regression model was visualized using
   the forest plot and tested by Wald test.

Differential gene analysis and enrichment analysis

   Differential analysis was performed between the L1PA2-SYT1 exist group
   and not exists group. Gene-level counts were quantified using Salmon
   and differential analysis was performed using DESeq2. Genes curated in
   the Gencode V42 annotations were considered in differential expression
   analysis. The gne expression matrix was filtered to include genes with
   TPM > 1 in at least one sample. Other parameters of DESeq2 remain
   default. Genes with |log2FC| ≥ 1 and FDR < 0.05 were defined as
   differentially expressed genes. For the up-regulated genes in the exist
   group, GO enrichment analysis was performed using Metascape
   ([210]metascape.org/).

Quantification and statistical analysis

   The statistical details could be found in the figure legends. The
   significance level was defined as P < 0.05. R (version 4.1.2) was
   utilized for data processing and visualization.

Acknowledgments