Abstract Lung cancer is the leading cause of cancer-related death worldwide, and the complex molecular mechanisms underlying its development, particularly the role of alternative splicing (AS) in different subtypes, remain poorly understood. In this study, we performed RNA sequencing of 178 lung cancer patients and conducted a comprehensive analysis of the transcriptomic landscape with a focus on AS. We identified robust lung cancer- and subtype-specific AS biomarkers that were consistently effective in both tissue samples and cancer cell lines. Notably, several of these biomarkers also serve as critical regulators in lung cancer progression. Our regulatory network analysis, with a focus on RNA-binding proteins, revealed QKI and SR proteins as key splicing factors. Specifically, QKI was found to modulate the splicing of PLEKHA1 exon 15, a cancer-specific AS biomarker, while SRSF1 regulated the splicing of MKNK2 exon 14, a subtype-specific AS biomarker. Our study provides valuable insights into key AS events and their regulatory mechanisms in lung cancer, paving the way for potential therapeutic targets. Keywords: MT: Bioinformatics, lung cancer, alternative splicing, biomarker, RNA-binding proteins, regulation network Graphical abstract graphic file with name fx1.jpg [41]Open in a new tab __________________________________________________________________ Feng and colleagues generate a high-quality lung cancer transcriptome resource and identify cancer- and subtype-specific splicing biomarkers through integrative analysis. By modeling RNA-binding protein activity, they uncover critical regulators of alternative splicing with implications for cancer biology and prognosis. Introduction Lung cancer remains a significant global health challenge, representing the leading cause of cancer-related mortality worldwide.[42]^1^,[43]^2^,[44]^3 It is primarily classified into two main histological subtypes: non-small cell lung cancer (NSCLC) and small cell lung cancer. NSCLC accounts for approximately 85% of lung cancer cases and can be further divided into three major subtypes: lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), and large cell carcinoma (LUL).[45]^4^,[46]^5 Lineage plasticity has been implicated as a driver of resistance to targeted therapies in lung cancer, with histological transdifferentiation of LUAD to LUSC often resulting in lung adenosquamous cell carcinoma (LUAS).[47]^6^,[48]^7 Key risk factors for lung cancer include tobacco smoking, exposure to environmental pollutants, occupational hazards such as asbestos, and genetic predispositions.[49]^8 The prognosis for lung cancer patients is generally poor, with a 5-year survival rate of less than 20%, highlighting the urgent need for novel diagnostic and therapeutic strategies.[50]^9 Alternative splicing (AS) is a widespread and critical post-transcriptional regulatory mechanism that allows a single gene to produce multiple mRNA isoforms, generating diverse protein variants with distinct functions.[51]^10 Compared with gene expression alone, AS offers a more nuanced view of gene activity, which is essential in the complex context of cancer biology.[52]^11^,[53]^12 In humans, more than 95% of multi-exon genes undergo AS, playing a central role in regulating a variety of biological processes.[54]^13 AS events are typically classified into six major types: cassette exon (CASS), tandem CASS, mutually exclusive exons, intron retention, alternative 5′ splice site, and alternative 3′ splice site. These different splicing patterns generate distinct transcript isoforms from the same gene. It has been discovered that spliced genes can act as tumor suppressors or proto-oncogenes, related to the initiation, progression, metastasis, angiogenesis, and drug resistance of cancer, which are involved in several critical signaling pathways, immune responses, and tumor microenvironments.[55]^14^,[56]^15^,[57]^16^,[58]^17 Understanding the intricate relationship between AS and cancer biology is essential for developing novel therapeutic strategies and improving patient prognosis. The regulation of AS events is largely mediated by RNA-binding proteins (RBPs), which recognize specific RNA sequences and structures, influencing splice site selection and the stability of mRNA transcripts. Dysregulation of RBPs can lead to significant alterations in AS patterns, which have been implicated in various cancers, including lung cancer.[59]^18^,[60]^19^,[61]^20 Understanding the molecular mechanisms underlying cancer development and progression paves the way for targeted therapies, such as the modulation of RBP activity through antisense oligonucleotides or small molecules, which have shown promise in preclinical studies. Such interventions can potentially correct aberrant splicing patterns, restore normal gene function, and mitigate cancer-related symptoms.[62]^21^,[63]^22^,[64]^23 Research on AS in lung cancer has revealed its critical role in tumorigenesis, progression, and treatment resistance.[65]^24 AS influences various processes in lung cancer, including cell proliferation, apoptosis, metastasis, and drug resistance. Several AS events lead to the production of oncogenic isoforms, such as short-REST, GLDCV1, and DA-Raf,[66]^25^,[67]^26^,[68]^27 which promote tumorigenesis. Splicing regulators, including QKI, RBM10, PTBP1, and hnRNP proteins,[69]^28^,[70]^29^,[71]^30^,[72]^31 also modulate key genes involved in lung cancer progression. Potential therapeutic strategies targeting RBPs are emerging. For instance, inhibitors of splicing regulators like PHF5A have shown the ability to prevent the growth, proliferation, and pathogenesis of tumor cells.[73]^32 These findings underscore the importance of AS in lung cancer. However, critical gaps remain in our understanding of AS in lung cancer. Current research predominantly focuses on a limited range of histological subtypes, and studies integrating multiple subtypes often rely on data from different batches, introducing substantial batch effects that obscure meaningful biological insights. Additionally, there is a notable lack of exploration into cancer- and subtype-specific AS biomarkers, which hold considerable promise for clinical applications. Furthermore, most existing studies focus on individual RBPs in isolation; a comprehensive, integrative approach is needed to reveal global splicing regulatory networks and pinpoint therapeutically actionable splicing factors. Addressing these gaps is essential for advancing our knowledge and improving therapeutic strategies in lung cancer. In this study, we generated RNA sequencing (RNA-seq) data from paired tumor and adjacent normal tissue samples obtained from 178 lung cancer patients, representing the four distinct subtypes of NSCLC. Through comprehensive transcriptomic analysis, we identified the potential of AS events as both cancer-specific and subtype-specific biomarkers in lung cancer. Furthermore, we explored the regulatory networks that influence these AS biomarkers, with a particular focus on the role of RBPs in modulating differentially spliced (DS) events through an integrative analysis, which deepens our understanding of the regulator of AS biomarkers. Results Overview of transcriptomic profiling in lung cancer We collected both tumor and adjacent normal tissue samples from 178 lung cancer patients, including 82 LUAD patients, 66 LUSC patients, 23 LUAS patients, and 7 LUL patients ([74]Table S1). RNA was extracted from these samples and sequenced. After quality control, we utilized the Quantas pipeline to quantify gene expression and AS events. Quantas offers a highly curated and comprehensive annotation of exon-exon junctions by multiple data source integration,[75]^33 providing a systematic analysis on the transcriptomic landscape of lung cancer ([76]Figures 1A and 1B). Figure 1. [77]Figure 1 [78]Open in a new tab The analysis framework of this study (A) Sample collection and RNA-seq profiling. We collected lung cancer and adjacent samples from 178 patients, categorized into four subtypes: LUAD (82 patients), LUSC (66 patients), LUAS (23 patients), and LUL (lung LUL, 7 patients). RNA was extracted from patients’ tissues, followed by RNA-seq to generate RNA-seq reads. These reads underwent bioinformatic analysis to assess gene expression and AS. (B) Differential analysis and biomarker identification. Splicing events were compared between different groups to figure out cancer-specific and subtype-specific DS events. Important features were selected to separate different groups and then overlap with DS events to identify AS biomarkers. These biomarkers were validated across datasets, and was further assessed prognostic values. (C) Regulation of AS biomarkers by RBPs. We used an integrative strategy to infer the regulation of AS biomarkers by RBPs. We performed differential expression of RBPs, calculated the Pearson’s correlation of RBP expression and DS events’ PSI, marked RBP binding motifs on DS sequences, and analyzed the RNA-seq data of RBP-perturbance cell lines. We integrated these datasets to infer the RBPs regulating AS biomarkers. Our analysis revealed significant differences in gene expression and splicing events across the four lung cancer subtypes. Initially, we performed a tumor vs. adjacent normal comparison across all samples, identifying 1,929 differentially expressed genes (DEGs) (|log[2]FC| ≥ 1, false discovery rate [FDR] ≤ 0.05) ([79]Figure 2A and [80]Table S2) and 1,158 DS events from 867 genes ([81]Figure 2B and [82]Table S3). We also performed a more rigorous analysis by testing each subtype individually. This precise analysis revealed that LUSC exhibited the highest number of upregulated genes, while LUL showed the lowest (1,225 vs. 511) ([83]Figure 2C). Similarly, LUSC also had the highest number of downregulated genes, with LUAD having the fewest (1,816 vs. 732) ([84]Figure 2C). The finding that LUSC has the highest number of DEGs among lung cancer subtypes is consistent with previous studies in The Cancer Genome Atlas (TCGA).[85]^34^,[86]^35 Regarding DS events, LUSC again exhibited the largest number, while LUL had the least (1,333 vs. 489) ([87]Figure 2D). Notably, the overlap of DEGs and DS events between two subtypes accounted for at least 30% of the total genes in one of the subtypes ([88]Figure 2E). We identified 727 DEGs shared across all four subtypes, with 237 genes upregulated and 490 genes downregulated in tumor tissues. We also observed significant overlap of DS events across the subtypes, with a total of 127 dysregulated events involving 90 DS genes (DSGs). While six distinct types of AS were analyzed, we focused primarily on CASS events, as they were the most prevalent, representing nearly three times as many dysregulated events as any other splicing type ([89]Figure S1). Figure 2. [90]Figure 2 [91]Open in a new tab Transcriptomic landscape of lung cancer patients (A) Volcano plot showing DEGs between tumor and paired adjacent normal tissues analyzed with all samples, with upregulated genes highlight in red and downregulated genes highlight in blue. (B) Volcano plot showing DS events between tumor and paired adjacent normal tissues analyzed with all samples, with significant DS events highlighted. (C) Volcano plots showing DEGs between tumor and paired adjacent normal tissues in four subtypes, with shared DEGs among four subtypes highlighted. (D) Volcano plots depicting DS events between tumor and paired adjacent normal tissues in four subtypes, with shared dysregulated events highlighted. (E) From top to bottom, a summary of the count of upregulated genes, downregulated genes, dysregulated events, and DS genes is shown, with each square indicating the overlap count between two subtypes, and the diagonal squares representing the counts for each subtype individually. (F) Functional enrichment analysis of differentially expressed and DSGs. Heatmaps showing the FDR of the enriched GO terms, with darker colors indicating stronger significance. Shared GO terms in four subtypes or specific GO terms in only one subtype are shown in the heatmap. This observation suggests that analyzing the shared cancer-specific differential gene expression and splicing across subtypes provides valuable insights into the molecular commonalities in lung cancer. Additionally, the distinct patterns observed between subtypes also remind us to investigate subtype-specific alterations, which may provide a deeper understanding of the molecular diversity within lung cancer. To further explore the biological implications of these transcriptomic changes in lung cancer, we conducted Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways enrichment analysis on both DSGs and DEGs. We identified that upregulated genes across all four subtypes are enriched in terms related to cell division, cell cycle, and p53 signaling pathway. In contrast, downregulated genes shared across the subtypes are associated with cell migration, vascular smooth muscle, the ERK1 and ERK2 cascades, and tumor necrosis factor ([92]Figure 2F). Subtype-specific dysregulated genes exhibited enrichment in terms related to endothelial cell migration, the G1/S transition of the mitotic cell cycle, epithelial-to-mesenchymal transition, and T helper type 1 (Th1) and Th2 cell differentiation. Additionally, we found dysregulated genes enriched in splicing-related terms, including spliceosomal small nuclear ribonucleoprotein particle assembly, RNA binding, and SMN-Sm protein complex, which supported the rationale for focusing on AS. Analysis of DSGs revealed that focal adhesion, nucleoplasm, and protein binding are enriched across all subtypes, while terms associated with AP-1 adaptor complex, motor proteins, and centriole are enriched in individual subtypes. Interestingly, we observed that some terms are regulated cooperatively by DSGs and DEGs, like small GTPase binding, kinase activity and protein binding. These enriched terms are consistent with the established understanding that processes involved in cell signaling, immune response, apoptosis, tissue migration, and angiogenesis play active roles in cancer ([93]Figure 2F). Lung cancer-specific AS biomarkers To further examine the potential of AS events as lung cancer biomarkers, we focuses on the 127 AS events found to be significantly regulated across all four subtypes. Sixty-two of these events showed consistent upregulation, and 65 showed consistent downregulation ([94]Figure 3A). To verify the significance of these differences, we conducted a paired test on these events using statistical analysis in EventPointer.[95]^36 Interestingly, all of these DS events showed significant difference by the paired tests ([96]Figure 3B). We then performed GO enrichment analysis on the genes associated with these events to explore their functional relevance. The results showed significant enrichment in pathways such as protein binding, cell surface, and cell adhesion ([97]Figure 3C). Domain enrichment analysis using the chi-square test revealed a significant over-representation of Dus (p = 6.54 × 10^−16) in tumor-upregulated events, while vascular endothelial growth factor (VEGF)-C (p = 9.94 × 10^−98) and CENP-X (p = 4.47 × 10^−42) were preferentially skipped in tumor samples. Notably, Dus is related with pulmonary carcinogenesis,[98]^37 VEGF-C is implicated in angiogenesis,[99]^38 while CENP-X is involved in DNA repair[100]^39 ([101]Table S4). Figure 3. [102]Figure 3 [103]Open in a new tab Cancer-specific AS biomarkers in lung cancer (A) Splicing pattern of DS events in all samples, with events existing in more than 50% of all samples shown on the heatmap. (B) Bar plot showing the significance of 127 DS events using EventPointer in a paired test. The dashed horizontal line marks the threshold of statistical significance (–log[10](0.05) ≈ 1.3). (C) Statistically enriched GO terms of genes with DS events between tumor and adjacent normal samples. (D) Splicing pattern of 24 cancer-specific AS biomarkers in all samples. (E) The ROC curve of kNN prediction conducted for cross-validation in this study. The bottom right displays the AUC values of this experiment. (F) ROC curve of kNN prediction conducted in TCGA samples, with AUC value labeled in the bottom corner of the figure. (G) Pearson’s correlation of the mean PSI of cancer-specific biomarkers in the tumor samples of this study and CCLE samples. (H–J) Scatterplot showing the first two dimensions from MDS, applied separately to our samples, TCGA samples, and CCLE samples. (K) Multivariate Cox model analysis results of the 24 cancer-specific biomarkers in the TCGA LUAD cohort. The x axis and y axis indicate hazard ratios and the corresponding p values, respectively (both were log-transformed). (L) Coefficients of the three selected important biomarkers (PRMT2 exon 2, TMP1 exon 8 and COL6A3 exon 6) in the prognostic model in the TCGA LUAD cohort. (M) Survival analysis performance of the two groups distinguished by the prognostic model in the TCGA LUAD cohort. Statistical significance was evaluated using the log rank test. (N–P) The same analysis method applied to TCGA LUAD was repeated for TCGA LUSC. The results of the multivariate Cox model, the coefficients of the three biomarkers (COL6A3 exon 6, VEGFA exon 7, and PLEKHA1 exon 15) used to establish the prognostic model are shown, along with a survival analysis performance of the prognostic model in LUSC. Next, we applied the Boruta algorithm to identify predictive AS biomarkers while minimizing noise and redundancy. Boruta is a feature selection algorithm that identifies important variables by comparing them to random shadow features.[104]^40 By overlapping the candidates identified by Boruta and Quantas, we obtained a total of 24 AS events with stable and significant differences between the two groups ([105]Figure 3D and [106]Table S5). We propose that these 24 AS events may serve as cancer-specific biomarkers and conduct further investigation and validation. To evaluate the robustness of our biomarker selection strategy, we performed a 5-fold cross-validation in which the same selection procedure was applied to each training fold to identify candidate events, and the resulting features were used to classify the test fold using a k-nearest neighbor (kNN) classifier. The model achieved consistently high performance, with an receiver operating characteristic-area under the curve (ROC-AUC) of 0.95 (accuracy, 0.93; precision, 0.90; recall, 0.95; F1-score, 0.93) ([107]Figures 3E and [108]S2A). To validate the robustness of these AS biomarkers, we further performed kNN on 1,015 lung cancer and 108 normal samples from TCGA, achieving a high ROC-AUC of 0.9969 (accuracy, 0.98; precision, 0.995; recall, 0.98; F1-score, 0.99) ([109]Figures 3F and [110]S2B), confirming that our model did not suffer from overfitting and the effectiveness of these markers in distinguishing lung cancer samples from normal. Additionally, an analysis using 17 lung cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE) demonstrated a strong correlation (0.763, Pearson’s correlation coefficient) between mean percent spliced in (PSI) values in cell lines and tumor tissues of selected AS markers, confirming their consistency across cancer contexts ([111]Figure 3G). To further assess the generalizability of our findings and address potential overfitting concerns from the supervised kNN results, we conducted an unsupervised analysis using multi-dimensional scaling (MDS) of AS marker profiles. The results demonstrated a clear separation of tumor samples from normal across three datasets including our dataset, TCGA and CCLE dataset ([112]Figures 3H–3J and [113]S3). This robust separation reinforces that these AS events are consistent, intrinsic features in distinguishing between groups, thus further supporting their biomarker potential in cancer contexts. Among these potential biomarkers, several events have been previously reported to be differently spliced in tumors, and the genes associated with these events have been implicated as oncogenes or tumor suppressors. For example, COL6A3 exon 6 is a DS among multiple cancer types and enriched within the tumor stroma.[114]^41 ADD3 exon 15 is reported to be related to cell proliferation and migration and tends to be included in lung cancer.[115]^42 FBLN2, a tumor suppressor, whose ectopic expression results in decreased cell proliferation, migration, and invasion, is accompanied by inactivated cancer signaling pathways.[116]^43 VEGFA, an important factor in angiogenesis, contributes to tumor growth.[117]^44 We further evaluated the prognostic potential of these AS biomarkers using TCGA dataset. A Cox survival model was applied to assess the impact of all identified markers on the survival of LUAD and LUSC separately. Eight events showed significantly different survival patterns in LUAD, while five showed significant differences in LUSC ([118]Figures S4A and S4B). We then constructed multi-covariate Cox model considering all biomarkers ([119]Figure 3K for LUAD, [120]Figure 3N for LUSC), which revealed that the top three features may help us to build the best linear models ([121]Figures 3L and 3O). The prognostic model effectively stratified patients into two distinct risk groups for both LUAD and LUSC based on AS biomarkers ([122]Figures 3M and 3P). Among these prognostic AS biomarker genes, PRMT2 has been reported to be associated with prognosis in breast cancer and to promote occurrence and metastasis in renal cell carcinoma.[123]^45^,[124]^46 TPM1 has been reported to be related to apoptosis and metastasis and is considered a promising therapeutic target for ovarian cancer and renal cell carcinoma.[125]^47^,[126]^48 To ensure that this stratification was not merely reflecting well-known prognostic factors like age or sex, we applied the model to subgroups divided by age and sex ([127]Figure S5). Excitingly, the AS biomarkers maintained strong predictive power, effectively distinguishing patient risk groups within these subgroups, which further confirms the effectiveness of the model constructed by AS biomarkers in lung cancer. AS biomarkers for identifying LUAD and LUSC We further extended our analysis to identify subtype-specific AS biomarkers in lung cancer. In NSCLC, LUAD and LUSC account for the majority of cases, with LUAD comprising approximately 50% and LUSC approximately 30%[128]^49; therefore, we focused on the DS events between LUAD and LUSC. Initially, we identified LUAD-specific and LUSC-specific AS events, and surprisingly, events common to both LUAD and LUSC exhibited different regulation patterns, with 135 events upregulated in LUAD and 43 events upregulated in LUSC ([129]Figure 4A and [130]Table S6). To further assess the statistical significance of these subtype-specific AS alterations, we also applied EventPointer algorithm,[131]^36 revealing that most of these events demonstrate significant differences ([132]Figure 4B). To investigate the potential biological relevance of these subtype-specific AS events, we performed GO enrichment analysis. The results revealed that DSGs were strongly enriched in terms associated with protein binding, membrane, and cell surface, which were also found in that of cancer-specific events ([133]Figure 4C). Domain enrichment analysis using the chi-square test identified several significant domain enrichments. EpoR_lig-bind (p = 1.33 × 10^−24), Neurensin (p = 1.33 × 10^−24), and FHA_2 (p = 4.50 × 10^−14) were significantly enriched in LUAD-specific events, while TIR_3 (p = 6.81 × 10^−8) and Xlink (p = 5.83 × 10^−6) showed significance in LUSC-specific events. These domains are implicated in hematopoiesis,[134]^50 neuronal signaling,[135]^51 DNA repair,[136]^52 and immune system regulation[137]^53^,[138]^54 ([139]Table S7). For robust biomarker identification, we performed feature selection using the Boruta algorithm and identified eight AS events that consistently and significantly differentiate LUAD from LUSC ([140]Figure 4D and [141]Table S8). Figure 4. [142]Figure 4 [143]Open in a new tab Subtype-specific AS biomarkers in lung cancer (A) Subtype-specific splicing profiles in tumor subtypes, with events existing in more than 50% of LUAD and LUSC samples shown on the heatmap. (B) Bar plot of 178 DS events analyzed by EventPointer. The dashed line indicates the significance cutoff. (C) Statistically enriched GO terms of genes with subtype-specific DS events. (D) Splicing profiles of eight subtype-specific AS biomarkers in tumor samples. (E–G) ROC curves of KNN predictions conducted in this study, TCGA, and CCLE samples, respectively. (H–J) Scatterplots illustrating the first two MDS dimensions for FDU, TCGA, and CCLE samples. (K) Results of multivariate Cox regression analysis for the eight subtype-specific biomarkers in the TCGA LUAD cohort. The x axis and y axis display the log-transformed hazard ratios and p values, respectively. (L) Coefficients of the two selected important biomarkers (CC2D2A exon 5 and CLK4 exon 4) in the prognostic model in the TCGA LUAD cohort. (M) Survival analysis comparing the two groups stratified by the prognostic model with a log rank test in the TCGA LUAD cohort. (N–P) The same analysis method applied to TCGA LUAD was repeated in TCGA LUSC. The results of the multivariate Cox regression analysis, the prognostic model constructed by MKNK2 exon 1 and BCL2L12 exon 3, and the survival analysis performance of the prognostic model in the TCGA LUSC cohort are shown. We also performed a kNN classification to test the robust of biomarker selection method, which achieved an ROC-AUC of 0.93 (accuracy, 0.86; precision, 0.76; recall, 0.91; F1-score, 0.83) ([144]Figures 4E and [145]S6A). We also validated the robustness of these markers in independent datasets, achieving a higher AUC of 0.97 in TCGA (accuracy, 0.91; precision, 0.85; recall, 0.95; F1-score, 0.90) ([146]Figures 4F and [147]S6B) and 0.88 in CCLE (accuracy, 0.85; precision, 0.71; recall, 1.00; F1-score, 0.83) ([148]Figures 4G and [149]S6C). To further assess the generalizability of our findings, we used unsupervised MDS. The MDS plot clearly distinguished LUAD from LUSC along the first dimension, where the two subtypes were almost symmetrically distributed on opposite sides of the plot for dataset generated in this study, the TCGA dataset, and CCLE data ([150]Figures 4H–4J and [151]S7). This strong separation confirms the robustness and consistency of these AS biomarkers in distinguishing LUAD from LUSC. Among the identified subtype-specific biomarkers, several events have functional relevance in pulmonary disease. PDGFA exon 6 has been found to be skipped in emphysema and idiopathic pulmonary fibrosis, which is able to diffuse away and trigger other cells, and associates with hypoxic lung injury.[152]^55 AS in CD44 has been found in lung cancer and breast cancer, resulting in a mesenchymal isoform, which confers a malignant epithelial cell with metastatic capability.[153]^56^,[154]^57^,[155]^58 We further analyzed the prognostic potential of subtype-specific biomarkers. In a single-covariate survival model, only CLK4 exon 4 showed prognostic significance in LUAD ([156]Figure S8), which was expected, as individual events were generally insufficient to distinguish LUAD from LUSC. Recognizing that linear combinations of biomarkers were more effective predictors, we constructed a multivariate Cox model to identify the top AS events that were most predictive of LUAD and LUSC patient outcomes ([157]Figures 4K and 4N). The model constructed by top AS events stratified patients into high- and low-risk group, demonstrating prognostic strength in LUAD ([158]Figures 4L, 4M, 4O, and 4P), which was also validated across patient subgroups ([159]Figure S9). The ability of subtype-specific biomarkers to establish robust prognostic models in LUAD underscores their biological significance in lung cancer, beyond their ability to distinguish and predict subtypes. RBPs modulate DS events in lung cancer and its subtypes Given the pivotal role of RBPs in regulating AS, we examined the expression patterns of RBPs in lung cancer. Our analysis revealed that RBP expression profiles robustly differentiate tumor from normal tissue, as well as LUAD from LUSC samples, with these findings being consistent across both our dataset and TCGA data ([160]Figures 5A and [161]S10A). These results underscore the crucial role that RBPs play in lung cancer, sparking further interest in their involvement in regulating differential splicing events. Figure 5. [162]Figure 5 [163]Open in a new tab Regulation of RBPs in lung cancer (A) MDS analysis of RBP expression, presented as scatterplots, distinguishing tumor and normal samples by color. The top plot represents data from this study, while the bottom plot corresponds with data from TCGA. (B) Differential expression analysis of RBPs in two datasets. Genes with a FDR of <0.05, log[2]FC of ≥0.6, and RPKM of ≥1 are classified as upregulated, while genes with a FDR of <0.05, log[2]FC of ≤−0.6, and RPKM of ≥1 are classified as downregulated. Significant genes in the differential expression analysis were marked. (C) Heatmaps showing the count of dysregulated genes in tumor tissues across two datasets. The overlapping gene counts are indicated by the diagonal line. (D) Significantly enriched motifs in the DS exons and 200-bp flanking introns. The top shows the downregulated DS events in cancer, while the bottom presents the upregulated DS events. The background color indicates the positional relationship with the spliced sequences. Bolded RBPs are those that have been previously reported in the literature. The p values in the last column indicate the significance of the motifs in cancer-specific events, determined by a hypergeometric distribution test compared with all annotated exons. (E) DS events regulated by RBPs supported by ENCODE KD RNA-seq. We connected the RBPs that regulate at least seven common DS events between tumor and adjacent normal samples. The color of each bubble represents the number of events regulated by the corresponding RBP, while the size of the bubble indicates the significance of the RBP target counts based on a hypergeometric distribution test. This test assesses the number of RBP targets using all annotated events as the background and DS exons as the foreground. (F) Examples of RBPs regulating DS events. Expression of RBPs in different groups are shown in violin plot in the top, with FDR values indicating the significance of differential expression. Barplots aim to show the consistency of DS events regulation direction in lung cancer and ENCODE cell lines. An asterisk indicates a significant bias toward cancer-specific inclusion for exons activated or repressed by the respective RBP, compared with all DS exons between tumor and adjacent normal tissue samples (p < 0.05, Fisher’s exact test). Exons activated and repressed by the RBP with opposing regulatory directions are also compared with Fisher’s exact test, with the resulting p value shown at the top. To explore this, we conducted a differential expression analysis of RBPs between groups, defining differentially expressed RBPs as those with an FDR of ≤0.05 and a |log[2]FC| of ≥0.6 ([164]Table S9). Our analysis identified RBPs with differential expression, which has been previously reported to be important in lung cancer, such as QKI, HNRNPC, RBM5,[165]^28^,[166]^59^,[167]^60 and the RBP expression variation pattern is highly consistent with that in TCGA ([168]Figures 5B and [169]S10B). In the comparison between tumor and normal samples, we also performed paired testing using edgeR and identified 31 upregulated and 27 downregulated RBPs; in approximately two-thirds of them was found the same regulation pattern in TCGA ([170]Figure 5C). IGF2BP3 and KHDRBS2 ranked as the most significant dysregulated genes between tumor and normal tissue in two datasets. QKI, CELF2, CIRBP, ESRP1, and DKC1[171]^61^,[172]^62^,[173]^63^,[174]^64^,[175]^65 showed consistent changes in our data and TCGA and were thought to be active in lung cancer. In the LUAD vs. LUSC comparison, we observed 38 upregulated RBPs in LUAD and 16 in LUSC. In TCGA dataset, 17 were also identified as upregulated in LUAD, and 8 were similarly upregulated in LUSC. Noteworthy examples include the significant dysregulation of RBM47, CELF2, RBMS3, LARP6, and FXR1 ([176]Figures S10B and S10C). These dysregulated RBPs may regulate the splicing of DS events among lung cancer subtypes. Next, we performed correlation tests between RBP expression and the cancer- and subtype-specific DS events that we focused on when identifying biomarkers ([177]Figures S11A and S11B). A clear separation between the two groups was observed in the correlation heatmap, indicating that RBPs play a significant role in exon splicing in lung cancer. Specifically, 1,477 RBP-exon pairs showed strong correlations (|r| > 0.6, Pearson’s correlation coefficient), and 8,653 pairs showed moderate correlations (between 0.4 and 0.6, Pearson’s correlation coefficient) in cancer-specific DS events.[178]^66 In subtype-specific DS events, 478 RBP-exon pairs showed strong correlations (|r| > 0.6, Pearson’s correlation coefficient), and 5,454 pairs showed moderate correlations (between 0.4 and 0.6, Pearson’s correlation coefficient). We identified several RBPs with relatively stronger correlation pairs with cancer-specific exons, such as RBMS3 (51), SRSF5 (42), CELF2 (37), QKI (25), HNRNPC (18), and RBM5 (13), which has been discovered as oncogenes or tumor suppressor genes in lung cancer.[179]^28^,[180]^67^,[181]^68^,[182]^69^,[183]^70 Similarly, other RBPs showed a stronger correlation with subtype-specific exons, including SNRPD1 (24), SNRPF (20), PNO1 (18), SRSF5 (14), and RBMS3 (12). Given that RBPs bind to specific motifs in RNA sequences to regulate AS, we conducted a comprehensive motif enrichment analysis of the sequences of the DS exons and their neighboring intron sequences ([184]Table S10). We identified enriched 7mers in exon sequences and their 200-bp upstream and downstream regions. T-rich and A-rich motifs were significantly more abundant than other motifs, suggesting active involvement of the HNRNPC, CSTF2, TRA2A, PTBP1, and TIA1 in lung cancer splicing regulation. Among these RBPs, PTBP1 has been reported to play an important role in AS in lung cancer, with targets like MENA, FBXO5, and EXOC7, regulating carcinogenesis, age-related pathologies, and cell migration and invasion.[185]^71^,[186]^72^,[187]^73 To exclude the influence of common motifs, which are frequently found in the genome, we compared the motifs found in DS events with those from all annotated AS events. We focused on motifs with a significantly higher frequency in DS events compared with all annotated events. Motifs of PTBP1, QKI, RBFOX2, SRSF7, SRSF9, SFPQ, HNRNPU, HNRNPK, and HNRNPM are significantly enriched in the cancer-specific DS events, and these of RBFOX2, FAM120A, SRSF1, HNRNPK, HNRNPM, BUD13 and U2AF exhibit significance in subtype-specific DS events ([188]Figures 5D and [189]S10D). To further evaluate RBP activity in DS events, we analyzed RNA-seq data from ENCODE RBP-knockdown (KD) experiments.[190]^74 We found that KD of PUF60, SRSF1, U2AF, and HNRNPC impacted more than 40% of cancer-specific DS events ([191]Figure 5E). In subtype-specific DS events, U2AF, HNRNPC, PUF60, PTBP1, and SRSF1 demonstrated significant roles ([192]Figure S10E). However, some RBPs may be broadly active in numerous splicing events, such as U2AF, which is involved in many exon-skipping events. To account for these cases, we applied a hypergeometric distribution test to assess the target enrichment of each RBP in DS events. In cancer-specific DS events, QKI, TRA2A, CIRBP, HNRNPA1, SRSF5, CCAR1, PUM2, and HNRNPM showed significant enrichment ([193]Figure 5E and [194]Table S11). When compared with SFPointer results,[195]^75 which also identified them as critical RBPs (QKI: p = 1.42 × 10^−20; TRA2A: p = 3.03 × 10^−12; CIRBP: p = 4.52 × 10^−15; HNRNPA1: p = 4.15 × 10^−21; SRSF5: p = 1.70 × 10^−22; CCAR1: p = 8.95 × 10^−21; PUM2: p = 7.59 × 10^−25; HNRNPM: p = 5.50 × 10^−19). In subtype-specific DS events, PUM2, CIRBP, HNRNPA1, and RBM22 were among the top list of significant ones ([196]Figure S10E and [197]Table S11). These findings were also validated by SFPointer, where these RBPs were identified with significance (PUM2: p = 1.28 × 10^−27; CIRBP: p = 1.07 × 10^−22; HNRNPA1: p = 1.23 × 10^−27; RBM22: p = 1.43 × 10^−37). Considering the differences between tissue and cell line environments, we compared RBP expression and exon inclusion relationships in lung cancer with those observed in KD experiments to identify active RBPs with consistent relationships across contexts. In cancer-specific DS events, QKI, HNRNPM, and SRSF3 were notable, whereas in subtype-specific DS events, RBM22, RBM39, and CELF1 were significant ([198]Figures 5F and [199]S10F). Taking the RBP-exon correlation into account, we also identified several exons strongly suggested to be regulated by these active RBPs. For instance, QKI and RBM22 may promote the inclusion of PRMT2 exon 2 and SEC31A exon 27, respectively, both of which have been identified as biomarkers. In contrast, SRSF1 and FXR1 may repress the inclusion of FAM45A exon 2 and SCRIB exon 16 ([200]Figure S12). Notably, many RBPs that were upregulated in tumor tissues were also upregulated in LUSC compared with LUAD, and similar patterns were observed for downregulated RBPs, which indicates LUAD retains a more similar RBP expression signature with adjacent normal tissues compared with LUSC ([201]Figures 5B and [202]S10B). We confirmed this observation by pseudotime analysis with dysregulated RBPs. Indeed, LUAD tumor samples are close to adjacent normal samples compared with LUSC tumor samples ([203]Figure S13A and [204]Table S12). This regulation pattern aligns with the understanding that LUSC is often associated with lifestyle factors such as smoking, poor working conditions, and environmental exposures, which contribute to its poorer prognosis and limited treatment options.[205]^76^,[206]^77^,[207]^78 We also performed pseudotime analysis on the 178 DS events identified across subtypes. Remarkably, we found a similar pattern that LUAD samples followed a trajectory more closely resembling that of adjacent normal tissues, while LUSC events diverged further from adjacent normal ([208]Figure S13B and [209]Table S13). This suggests that LUAD may retain splicing regulation features more similar to adjacent normal tissue compared with LUSC, highlighting potentially distinct roles of RBPs in regulating AS events within lung cancer subtypes. The regulation of cancer-specific AS biomarkers by QKI Building on previous analyses, we found that QKI exhibited significantly differential expression between tumor and adjacent normal tissues, ranking among the RBPs with the strongest correlations with cancer-specific DS events. Moreover, the binding motif of QKI was enriched in spliced sequences, prompting us to focus on this RBP as a potential regulatory factor in cancer-specific biomarkers. To investigate its potential regulatory role, we first checked the correlations between QKI expression and 127 cancer-specific DS events ([210]Figure 6A). We found that 25 events showed strong correlation (|r| > 0.6, Pearson’s correlation coefficient), 70 events showed moderate correlation (between 0.4 and 0.6, Pearson’s correlation coefficient), and the remaining events showed weak correlation (|r| < 0.4, Pearson’s correlation coefficient). We further examined the QKI motif in exons and the intronic regions 200 bp upstream and downstream of cancer-specific DS events. We found that DS events with a high correlation with QKI expression also contained more QKI motifs. To further confirm this, we performed a gene set enrichment analysis (GSEA), which revealed an S-shaped enrichment score curve with two distinct peaks in the strongly correlated regions. This suggests that these strongly correlated exons are likely regulated by QKI ([211]Figure 6B; [212]Tables S14 and [213]S15). Figure 6. [214]Figure 6 [215]Open in a new tab Regulation of cancer-specific AS biomarkers by QKI (A) Correlation between QKI expression and 127 DS events, with four dashed lines to indicate strong (|r| > 0.6, Pearson’s correlation coefficient), moderate (between 0.4 and 0.6, Pearson’s correlation coefficient), and weak (|r| < 0.4, Pearson’s correlation coefficient) correlation groups. A summary table on the right showing the count of strong, moderate, and weak correlation pairs. (B) Frequency and enrichment of QKI motifs in the exon, upstream, and downstream intron of the 127 DS events. The events are ordered according to the sequence indicated in (A). (C) Impact of QKI KD or OE on 127 DS events in seven cell lines, along with GSEA of QKI-activated and QKI-suppressed events. The events are also in the same order as in (A). The seven cancer-specific biomarkers frequently perturbed in the cell line experiments are annotated on the right. (D) Pearson’s correlation of QKI expression with two previously reported events (ADD3 exon 15 and ESYT2 exon 14) and the focused PLEKHA1 exon 15. (E) QKI activation of PLEKHA1 exon 15 inclusion across six different cell lines. (F) QKI binds to the downstream of PLEKHA1 exon 15 and activate its inclusion, as shown by QKI eCLIP-seq data in HepG2 and KD RNA-seq data in PANC-1. The regulation direction in cell line is consistent with that observed in lung cancer. (G) Prognostic value of PLEKHA1 exon 15 splicing in TCGA LUAD and TCGA LUSC. We next investigated the impact of QKI modulation on 127 cancer-specific DS events by analyzing RNA-seq data from QKI KD and overexpression (OE) experiments in 7 different cell lines collected from public dataset ([216]Figure 6E and [217]Table S16). Interestingly, the DS exons positively correlated with QKI expression were enriched in exon targets activated by QKI and the ones with negative correlation were enriched in exon targets repressed by QKI. This observation was confirmed by GSEA of the enrichment QKI targets defined by the seven RNA-seq data in the cancer-specific DS exons ([218]Figure 6C). Notably, ADD3 exon 15 and ESYT2 exon 14,[219]^42^,[220]^62 identified as QKI target events in these experiments, have been previously reported as regulated by QKI. These findings underscore the crucial role of QKI in modulating splicing events in lung cancer and highlight its potential in regulating key cancer-specific pathways. Among the cancer-specific biomarkers, we focused on PLEKHA1 exon 15, which exhibited a strong positive correlation (0.737, Pearson’s correlation coefficient) between PSI and QKI expression, consistent with two previously reported events showing similarly strong correlations ([221]Figure 6D). This suggests that QKI activates the inclusion of this exon. In cell line experiments, the splicing regulation observed in six of seven cell lines was consistent with our tissue data, with five of these showing statistically significant changes ([222]Figure 6E). The only exception was K562, where low PLEKHA1 expression (reads per kilobase per million mapped reads [RPKM] < 1) prevented PSI detection. Further analysis identified four QKI motifs within the exon and its flanking 200-bp regions, three of which were located downstream. QKI enhanced crosslinking-immunoprecipitation and high-throughout sequencing (eCLIP-seq) data revealed several significant peaks surrounding the spliced exon, further confirming QKI’s ability to bind the spliced sequence ([223]Figure 6F). Taken together, the strong correlation between expression, in vitro splicing regulation, and binding evidence provides compelling support that QKI directly suppresses splicing of this exon. PLEKHA1 is frequently reported in studies on metabolic diseases, such as diabetes,[224]^79 and is thought to play a role in signal transduction pathways due to the presence of homologous domains within its family. In addition, PLEKHA1 has been implicated in various cancer-related functions, including the regulation of cell migration, invasion, and proliferation, and is also considered an important target for precision medicine in breast cancer.[225]^80^,[226]^81^,[227]^82 We further examined the PSI of PLEKHA1 exon 15 in the TCGA pan-cancer cohort and found a similar tumor-associated decrease across six different cancer types, including bladder urothelial carcinoma, breast-invasive carcinoma, cervical squamous cell carcinoma and endocervical adenocarcinoma, pancreatic adenocarcinoma, rectum adenocarcinoma (READ), and uterine corpus endometrial carcinoma ([228]Figures S14A and S14B). To assess the prognostic impact of this event, we found that, in LUSC, a higher PSI of PLEKHA1 exon 15 was associated with a significantly worse prognosis, while in LUAD, it was able to distinguish two groups in the early stages, although no substantial difference in overall survival was observed ([229]Figure 6G). We also investigated the prognostic value of PLEKHA1 exon 15 in a pan-cancer dataset, and found that it can distinguish patients into two risk groups in several cancer types, including colon adenocarcinoma (COAD), kidney renal clear cell carcinoma (KIRC), READ, and stomach adenocarcinoma (STAD) ([230]Figure S14C), highlighting PLEKHA1 as a survival-related functional biomarker. The regulation of subtype-specific AS biomarkers by SRSF1 Among the subtype-specific biomarkers identified, MKNK2 is involved in the well-established mitogen-activated protein kinase (MAPK) signaling pathway, a hallmark pathway in cancer and previously highlighted in our enrichment analysis ([231]Figure 2F). Notably, MKNK2 exon 14 has been implicated in cancer-related DS events in our previous analysis ([232]Table S3), the PSI of which was found to be higher in LUAD compared with LUSC, and was also significantly downregulated in tumor tissues. MKNK2 expression was also downregulated in tumors, suggesting that the isoforms containing exon 14 is also suppressed in tumor tissues ([233]Figures 7A and 7B). This dual observation of splicing and expression changes in MKNK2 exon 14 drew our attention to its potential role in regulation. Figure 7. [234]Figure 7 [235]Open in a new tab Regulation of subtype-specific AS biomarkers by SRSF1 (A) A schematic diagram showing the structure of two isoforms of MKNK2. The isoform containing exon 14 contains an early stop codon, leading to a truncated protein. (B) Violin plot showing the PSI of MKNK2 exon 14 across four subtypes in tumor and adjacent normal tissues. (C) Pearson’s correlation between SRSF1 expression and the PSI of MKNK2 exon 14. (D) Suppression of MKNK2 exon 14 splicing by SRSF1 in 10 cell lines. (E) SRSF1 eCLIP-seq peaks surrounding MKNK2 exon 14 (the left exon in the reference sequence), with the GGA motifs marked. SR proteins have been shown to play a role in lung cancer splicing regulation, as evidenced by previous RBP analyses and previous studies.[236]^83^,[237]^84 Using lung cancer data from the Clinical Proteomic Tumor Analysis Consortium, we observed that SR proteins are commonly upregulated in tumors, with elevated phosphorylation levels observed at most phosphorylation sites ([238]Figure S15A). As reported, both phosphorylation and partial dephosphorylation of SR proteins affect spliceosome assembly, thereby influencing splicing mechanisms.[239]^85 Several studies have explored the kinase-SR protein-spliced gene axis in cancer, making SR proteins a compelling target for further investigation.[240]^86^,[241]^87^,[242]^88 We further analyzed SR protein KD and OE experiments from various cell lines and found that SRSF1 plays a significant regulatory role in cancer-related DS events ([243]Figures S15B and S15C and [244]Table S17). Specifically, the PSI of MKNK2 exon 14 and SRSF1 expression levels were negatively correlated (r = −0.218, Pearson’s correlation coefficient) ([245]Figure 7C), suggesting splicing repression by SRSF1, which was further validated by RNA-seq data with SRSF1 perturbance ([246]Table S16). We observed that MKNK2 exon 14 has a higher inclusion ratio with SRSF1 KD cell lines and the opposite in OE data, showing significant effects in 9 of 10 cell lines ([247]Figure 7D). ENCODE eCLIP data further confirmed this regulatory relationship, as we identified SRSF1 binding peaks around MKNK2 exon 14 in both HepG2 and K562 cell line and we also found the GGA core motif of SRSF1 is also aligned well with eCLIP peaks ([248]Figure 7E). Our integrated analysis suggests that the splicing of MKNK2 exon 14 might be regulated by SRSF1 in lung cancer subtype-specific way. Discussion In this study, we generated a high-quality RNA-seq dataset comprising paired tumor and adjacent normal tissues from 178 lung cancer patients. These patients represent four distinct subtypes of lung cancer, including LUAD and LUSC, which constitute the majority of NSCLC cases. This comprehensive dataset provides an invaluable resource and enables a more precise exploration of the mechanisms underlying lung cancer. By focusing on AS in addition to gene expression, we aimed to shed light on an important but often underappreciated layer of gene regulation in lung cancer. By integrating transcriptomic profiling data, we identified cancer- and subtype-specific DSGs and DEGs across four subtypes of lung cancer. Functional enrichment analysis revealed that many of these genes are involved in classical cancer-related pathways or splicing-related pathways, and several pathways were co-regulated by both DSEs and DEGs, highlighting their potential synergistic roles in lung cancer progression and emphasizing the significance of exploring the interplay between splicing regulation and gene expression in tumor biology. Using machine learning algorithms, we identified 24 cancer-specific AS biomarkers and 8 subtype-specific AS biomarkers, all of which demonstrated robust statistical performance with additional validation using both the TCGA and CCLE datasets. Notably, many of the genes associated with these biomarkers, such as CD44, VEGFA, MKNK2, and CLK1, have been reported to play important roles in cell signaling, immunity, and angiogenesis in cancer. Biomarkers including COL6A3 exon 6, ADD3 exon 15, FBLN2 exon 11, CD44 exon 11, VEGFA exon 7, and PDGFA exon 6 have been previously reported, and several showed prognostic relevance in the TCGA dataset, underscoring their clinical utility. To elucidate the regulatory mechanisms underlying differential AS in lung cancer, we further investigated the role of RBPs. Our findings revealed a strong concordance between the RBP expression patterns observed in this study and those in TCGA datasets, validating the reliability of the RBP analysis and explaining the consistency of AS events observed in both our study and public datasets. Through an integrative analysis of RBP expression, RBP-exon correlations, and RBP binding motifs, we identified key RBPs, including QKI, RBM22, HNRNP proteins, and SR proteins, as active regulators of AS in lung cancer. Interestingly, LUAD exhibited RBP expression patterns more similar to adjacent normal tissues compared with LUSC, a phenomenon mirrored in their respective splicing landscapes. Among the RBPs, QKI emerged as a critical regulator of cancer-specific AS events. Specifically, QKI regulates the splicing of PLEKHA1 exon 15, a biomarker that exhibited prognostic significance not only in LUSC but also in other cancers such as COAD, KIRC, READ, and STAD. Patients with higher inclusion levels of PLEKHA1 exon 15 had with worse survival outcomes, suggesting that this splicing event may represent a conserved mechanism across multiple cancer types. These findings highlight the broader implications of QKI-mediated splicing in cancer biology and its potential as a therapeutic target. Additionally, we observed widespread phosphorylation and OE of SR proteins in lung cancer, which are also identified as active RBPs in lung cancer. Our findings demonstrate that SRSF1 regulates the splicing of MKNK2 exon 14, producing a longer isoform of higher expression in tumor tissues. MKNK2 is annotated to involve in MAPK signaling pathway in KEGG to accept the signal from ERK and then pass it to CREB, which is a key pathway regulating cell proliferation in cancer. The MKNK2 isoform without exon 14 is previously reported to lack the MAPK binding sites,[249]^89 and thus cannot contribute to the important signaling pathway in cancer.[250]^90 The MKNK2 long isoform has been reported to be associated with cell proliferation and tumor growth in various cancers.[251]^91^,[252]^92^,[253]^93 These findings suggest that dysregulation of SRSF1-mediated splicing may contribute directly to the oncogenic potential of lung cancer through altered MAPK signaling. While this study has advanced our understanding of AS regulation in lung cancer and identified promising biomarkers, there are limitations. While the identified AS biomarkers show potential clinical relevance, their practical application requires extensive evaluation in larger, independent cohorts. Future work could focus on leveraging the identified biomarkers and their associated RBPs to develop targeted therapies. Such efforts could pave the way for precision medicine strategies in lung cancer, aiming to modulate splicing events for improved patient outcomes. Material and methods Clinical human specimen collection All NSCLC tumor and adjacent normal tissue samples for this study were collected Shanghai Pulmonary Hospital from 2010 to 2012. Fresh tissue samples were collected within 1 h of surgery and were stored liquid nitrogen. Ethical approval for the study was obtained from ethics committee of School of Life Sciences at Fudan University (Approval Number: BE1804). Written informed consent was provided by all participating patients prior to inclusion in the study. RNA-seq library construction and sequencing Total RNA was extracted from the tissues using Trizol (Invitrogen), followed by purification using a RNeasy Mini Kit (Qiagen). All procedures were carried out under RNase-free conditions to avoid RNA degradation. RNA quality and quantity were assessed using a Bioanalyzer (Agilent) and Qubit fluorometer (Thermo Fisher Scientific), respectively. RNA-seq libraries were prepared using Illumina’s TruSeq RNA Library Prep Kit (Illumina) according to the manufacturer’s instructions. Briefly, 1 μg of total RNA was used for poly(A) mRNA selection, fragmentation, and cDNA synthesis. The resulting cDNA was end-repaired, adenylated, and ligated with Illumina-specific adapters. PCR amplification was then performed to enrich the adapter-ligated fragments. PCR products were subsequently cleaned and prepared for sequencing. The purified PCR amplicons were sequenced using Illumina HiSeq 2000. Sequencing was performed to generate paired-end reads. Public databases integration Gene expression and survival data from TCGA cohorts were utilized (UCSC Xena),[254]^94 along with RNA-seq data from 17 lung cancer cell lines from the CCLE database.[255]^95 For AS analysis in TCGA, we used the SpliceSeq TCGA database.[256]^96 To explore the functional role of specific RBPs, we incorporated ENCODE’s short hairpin RNA RNA-seq data and experiments targeting QKI and SR proteins in GEO datasets ([257]Table S16).[258]^28^,[259]^74^,[260]^97^,[261]^98^,[262]^99^,[263]^100^,[264] ^101^,[265]^102^,[266]^103^,[267]^104^,[268]^105^,[269]^106^,[270]^107^ ,[271]^108^,[272]^109^,[273]^110^,[274]^111^,[275]^112^,[276]^113^,[277 ]^114^,[278]^115^,[279]^116^,[280]^117^,[281]^118^,[282]^119^,[283]^120 ^,[284]^121^,[285]^122^,[286]^123^,[287]^124^,[288]^125 Additionally, we analyzed ENCODE’s eCLIP data to investigate RNA-RBP interactions.[289]^74 RNA-seq analysis RNA-seq quality control was performed using FastQC (v0.12.1) ([290]https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), low-quality reads were trimmed using Trim Galore (v0.6.10) ([291]https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/), and adaptors were cut with Cutadapt.[292]^126 RNA-seq alignment was performed with the hg19 reference genome. The reference genome used in our study was downloaded from the UCSC Genome Browser, with the accession number GCA_000001405.1. For expression and splicing quantification, all RNA-seq data are mapped with Olego (v1.1.9), which is specifically optimized for spliced read alignment in transcriptome analysis and aligned well with the Quantas pipeline.[293]^127 AS events were quantified with Quantas (v1.0.9),[294]^33 which offers a highly curated and comprehensive annotation of exon-exon junctions and AS events, including 42,762 CASSs, 20,618 alternative 3′ splice sites, 8,067 alternative 5′ splice sites, 5,799 intron retentions, 6,684 mutually exclusive exons, and 18,164 tandem CASSs. PSI (Ψ) was used to measure splicing levels, which was calculated as the fraction of junction reads that support the inclusion form of transcripts over the total number of junction reads with both inclusion and exclusion transcripts. Gene expression was also analyzed with Quantas and quantified as RPKM. eCLIP-seq data were analyzed with the CLIP Tool Kit (v1.1.3), following the recommended pipeline on the website.[295]^128 Differential expression and splicing analysis Unless otherwise specified, DS analysis was conducted using Fisher’s exact test by Quantas in this study. Splicing events were considered DS if they met the following criteria: coverage of ≥20, Benjamini FDR of ≤0.05, and |ΔΨ| ≥ 0.1. A DS event was identified as upregulated or downregulated with ΔΨ ≥ 0.1 or ΔΨ ≤ −0.1 representing inclusion increase or decrease in tumor samples. For paired tumor-adjacent normal comparisons, EventPointer (v3.16.0) was applied to account for matched sample design, an event was considered significant if the FDR was ≤0.05. Unless otherwise specified, for DE analysis, we used the edgeR package (v3.42.4) incorporated by Quantas pipeline ([296]https://zhanglab.c2b2.columbia.edu/index.php/Quantas_Documentatio n). Genes were considered significantly differentially expressed if they met the following criteria: absolute log[2]FC ≥ 1, the group average RPKM exceeded the global average RPKM, and a Benjamini-Hochberg-adjusted FDR of ≤0.05. For comparisons of RBP expression between tumor and adjacent normal tissues, edgeR was used in paired mode to account for matched sample design. RBPs with |log[2]FC| ≥ 0.6 and a FDR of ≤0.05 were defined as differentially expressed. Gene functional enrichment analysis Gene functional enrichment analysis was performed using the DAVID[297]^129 ([298]https://davidbioinformatics.nih.gov/) tool. DAVID was used to conduct GO enrichment analysis, as well as to identify significantly enriched pathways based on KEGG. A p value of ≤0.05 was considered statistically significant for enrichment analysis. In an overview analysis, we directly used the FDR values provided by DAVID, while in biomarker candidates analysis, we calculated FDR based on the enriched terms identified. Feature selection using the Boruta algorithm For AS biomarker selection, we used the Boruta algorithm (v8.0.0) in R to identify the most relevant splicing events. Boruta is a wrapper algorithm built around a random Forest classifier that performs iterative feature selection. In each iteration, it creates shadow features by shuffling the original features and compares the importance of real features to these randomized ones. Features that consistently show significantly greater importance than the best shadow feature are considered truly relevant, while less informative features are rejected. This process continues until all features are confidently classified as important or unimportant. We ran Boruta with default parameters to retain the most informative AS biomarkers for downstream analysis. AS biomarker validation kNN classification is a non-parametric method that classifies data points based on the majority vote of their k nearest neighbors, with distance typically measured using Euclidean distance. In this study, we applied R package caret (v7.0.1) to perform kNN classification, and k = 7 was used for cancer-specific biomarkers and k = 21 for subtype-specific biomarkers, which was determined based on optimizing model performance. For cross-validation within our own dataset, we divided the dataset into five folds. In each iteration, we used four folds for feature selection and model training, and tested the model on the remaining fold. For external validation using the TCGA or CCLE datasets, we first built the model using AS biomarkers on our own dataset and then tested the model on TCGA or CCLE ([299]Figure S2C). MDS is a dimensionality reduction technique used to evaluate the distribution and clustering of the data. We calculated the distance matrix and then applied classical MDS to AS markers matrix or RBP expression matrix; only the first two dimensions are shown in the scatterplots. Protein domain enrichment analysis Protein domain enrichment analysis was performed by overlapping the genomic coordinates of spliced exons with annotated protein domains from the Pfam database (v33.1). The occurrence change of each domain in DS exons was evaluated, considering the direction of PSI changes: if an exon is upregulated, the occurrence is incremented by one, while downregulation results in a decrement of one. The final occurrence value is based on the absolute value of the change. The occurrence change of each domain in DS exons was then compared with its occurrence in all annotated CASSs. Enrichment significance was assessed using the chi-squared test. Survival analysis Univariate survival analysis was performed by stratifying the samples into two groups based on the PSI value of each AS biomarker: a high-PSI group and a low-PSI group. For multivariate survival analysis, the top AS biomarkers were selected, and a Cox proportional hazards regression model was constructed. The samples were then classified into a high-risk and a low-risk group based on the model. Kaplan-Meier survival curves were generated to visualize survival differences between the groups, and the log rank test was applied to assess the statistical significance of the differences. Survival analyses were conducted using the R package survival (v3.7.0), while the R package survminer (v0.4.9) was utilized for curve plotting. Motif enrichment analysis We extracted the exon and 200-bp upstream and downstream sequences of the DS events from the hg19 reference genome. A motif search was performed based on RBP motifs annotated in the mCross database.[300]^130 The number of motifs identified in each sequence was counted and used for further analysis. Additionally, GSEA was conducted using the motif counts from all annotated AS events’ exons and their flanking regions as background. The enrichment was then evaluated based on the presence or absence of the identified motifs. RBP activity inference We analyzed ENCODE KD RNA-seq data to identify target CASS events for each RBP (coverage ≥20, Benjamini-Hochberg FDR of ≤0.05, and |ΔΨ| ≥ 0.1). For each RBP, cancer-specific or subtype-specific DS events were considered the foreground, while all identified target events were used as the background. Enrichment significance was assessed using the Fisher’s exact test. For validation, we applied SFPointer,[301]^75 using DS analysis between tumor and adjacent normal samples, or between LUAD and LUSC tumor samples, as input. Fisher’s exact test was used to infer splicing factor activity. Data visualization We used R packages such as ggplot2 (v3.5.1) for creating boxplots, bar plots, and scatterplots; pheatmap (v1.0.12) for generating heatmaps; and clusterprofile (v4.10.1) for the GSEA plots.[302]^131 BedGraph files for visualizing genome-wide data were generated by Quantas and visualized using Integrative Genomics Viewer (v2.19.4).[303]^132 Additionally, open-source images from BioRender ([304]https://www.biorender.com/) and BioIcon ([305]https://bioicons.com/) were used for creating schematic illustrations and figures. Data and code availability The RNA-seq data generated in this study have been deposited in the Genome Sequence Archive at the National Genomics Data Center/China National Center for Bioinformation under project number GSA: [306]PRJCA035402. The SRA IDs or download links of public RNA-seq and eCLIP data used in this work are listed in [307]Table S16. The analysis code of this study is available at [308]https://github.com/liuyilei8969/Lung-Cancer-AS. Acknowledgments