Abstract

Long non‐coding RNAs (lncRNAs) are receiving increasing attention as biomarkers for cancer diagnosis and therapy. Although there are many computational methods to identify cancer lncRNAs, they do not comprehensively integrate multi‐omics features for predictions or systematically evaluate the contribution of each omics to the multifaceted landscape of cancer lncRNAs. In this study, an algorithm, POCALI, is developed to identify cancer lncRNAs by integrating 44 omics features across six categories. The contribution of each omics category to identifying cancer lncRNAs is explored, as is, more specifically, how each feature contributes to a single prediction. POCALI is evaluated and benchmarked against existing methods. Finally, the cancer phenotype and genomics characteristics of the predicted novel cancer lncRNAs are validated. POCALI identifies secondary structure and gene expression‐related features as strong predictors of cancer lncRNAs, and epigenomic features as moderate predictors. POCALI performs better than other methods, especially in terms of sensitivity, and predicts more candidates. Novel POCALI‐predicted cancer lncRNAs have strong relationships with cancer phenotypes, similar to known cancer lncRNAs. Overall, this study facilitates the identification of previously undetected cancer lncRNAs and the comprehensive exploration of the multifaceted feature contributions to cancer lncRNA prediction.

Keywords: cancer lncRNA, computational biology, machine learning, model explanation, multi‐omics

Long non‐coding RNAs (lncRNAs) are receiving increasing attention as biomarkers for cancer diagnosis and therapy, highlighting the urgent need for computational methods to accelerate their comprehensive discovery.
Here, to better predict and provide functional insight into cancer lncRNAs, a novel interpretable machine‐learning method (POCALI) is developed by leveraging 44 multi‐omics features in six categories.

1. Introduction

Cancer is a complicated disease and a leading cause of death worldwide.[1] Understanding the mechanisms of cell transformation is a fundamental goal in cancer research. A significant step toward achieving this aim involves identifying all the genes capable of driving tumors.[2] Most studies of cancer gene discovery have focused on protein‐coding genes (PCGs) and mutation mechanisms.[3, 4] However, in recent years, evidence has shown that cancer can also develop in the absence of mutations, which has promoted the discovery of cancer genes through other mechanisms, such as epigenomics.[5–9] An increasing amount of research has also revealed the important role of non‐coding RNAs in cancer. These include long non‐coding RNAs (lncRNAs), which are longer than 200 nucleotides and have little potential for protein translation. lncRNAs are involved in a series of cellular and biological processes, including chromatin architecture and gene regulation.[10, 11] Their abnormal expression and mutations are closely associated with carcinogenesis, metastasis, and tumor stages.[12] Intergenic lncRNAs constitute a prominent type of lncRNA and are particularly useful for computational and experimental studies because they do not overlap with PCGs. Many powerful experimental technologies and computational tools have enabled extensive exploration of the role of lncRNAs in cancer.
For example, CRISPR‐mediated interference (CRISPRi)‐based genome‐scale screening has led to the identification of 499 lncRNA loci that modify cell growth.[13] LNA‐modified GapmeR antisense oligonucleotide (ASO) technology has been used to suppress the expression of 285 lncRNAs in human primary dermal fibroblasts and to assess cellular and molecular phenotypes separately.[14] Further, the Cancer LncRNA Census (CLC) and Lnc2Cancer initiatives have collected cancer lncRNAs through literature curation, providing a potential gold standard for predicting and evaluating cancer lncRNAs. Both CLC and Lnc2Cancer have been updated to version 3.[15, 16] Our previous work, CADTAD, identified core cancer driver lncRNAs by relating them to cancer driver PCGs.[17] In addition, ExInAtor has been used to identify cancer driver lncRNAs based on mutation characteristics.[18, 19]

As the amount of biological data increases, data‐driven AI approaches can help researchers conduct more effective research. Notably, some of the methods adopted for the discovery of cancer lncRNAs use machine learning for predictions.[20–22] Zhao et al.
identified 707 potential cancer‐related lncRNAs by developing a computational method based on a naïve Bayesian classifier that integrates genome, regulome, and transcriptome data.[20] CRlncRC, a random forest classifier that integrates genomic, expression, epigenomic, and network features, enabled the identification of 121 cancer‐related lncRNA candidates.[21] CRlncRC2, an improved version of CRlncRC, was developed by using the XGBoost framework, SMOTE‐based over‐sampling, and Laplacian Score‐based feature selection, leading to the identification of 439 cancer‐related lncRNA candidates.[22] However, Zhao's method involved only eight simple features as representatives of lncRNA characteristics, and because CRlncRC uses expression and epigenomic features from each individual tissue, its explanations lack general and representative features for identifying cancer‐related lncRNAs. CRlncRC2 has the same disadvantage, albeit with feature selection. Further, researchers usually identify cancer driver PCGs based on mutations, and the use of such data with ExInAtor and OncodriveFML has led to progress in identifying cancer driver lncRNAs.[18, 19, 23] However, ExInAtor has been observed to lack prediction sensitivity, and it has no evaluation system for exploring how mutations contribute to cancer lncRNA identification. Since epigenomic features can also be used to identify cancer genes,[6, 8, 9] we investigated their utility in identifying cancer lncRNAs and whether other features could further improve that identification.

To address these objectives, we developed a method called POCALI (Prediction and insight On CAncer LncRNAs by Integrating multi‐omics data with machine learning) based on LightGBM with EasyEnsemble, trained on known and well‐defined cancer lncRNAs (CalncRNAs) and neutral lncRNAs (NeulncRNAs) obtained by strict criteria.
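The EasyEnsemble idea mentioned here (train several learners, each on all positives plus an equally sized random draw of negatives, then average their scores) can be illustrated with a minimal sketch. This is not the POCALI implementation: the function names, the toy data, and the nearest‐centroid scorer standing in for a LightGBM learner are all assumptions made for illustration, and the sketch assumes the negative class is the larger one.

```python
import random
from statistics import mean


def balanced_subsets(positives, negatives, n_subsets, seed=0):
    """Yield (positives, sampled_negatives) pairs with equal class sizes.

    Assumes negatives outnumber positives (hypothetical setup).
    """
    rng = random.Random(seed)
    for _ in range(n_subsets):
        yield positives, rng.sample(negatives, len(positives))


def fit_centroid_scorer(pos, neg):
    """Toy stand-in for a gradient-boosting learner (NOT LightGBM).

    Returns a function scoring a sample in [0, 1]; values near 1 mean
    the sample lies closer to the positive-class centroid.
    """
    dim = len(pos[0])
    c_pos = [mean(x[i] for x in pos) for i in range(dim)]
    c_neg = [mean(x[i] for x in neg) for i in range(dim)]

    def dist2(x, c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    def score(x):
        d_pos, d_neg = dist2(x, c_pos), dist2(x, c_neg)
        total = d_pos + d_neg
        return 0.5 if total == 0 else d_neg / total

    return score


def easy_ensemble(positives, negatives, n_subsets=10, seed=0):
    """Train one scorer per balanced subset and average their outputs."""
    scorers = [
        fit_centroid_scorer(p, n)
        for p, n in balanced_subsets(positives, negatives, n_subsets, seed)
    ]
    return lambda x: mean(s(x) for s in scorers)


# Hypothetical two-feature toy data: 3 positives, 20 negatives.
toy_positives = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
toy_negatives = [(0.01 * i, -0.01 * i) for i in range(20)]
predict = easy_ensemble(toy_positives, toy_negatives, n_subsets=5)
```

Because every learner sees a balanced subset, no single model is dominated by the majority class, while averaging over several draws still uses most of the negative examples.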
By using POCALI, we found that transcriptome features, such as secondary structure and differential expression, contributed strongly to CalncRNA predictions and that epigenomic features contributed moderately. Our evaluation revealed that POCALI performs better than other methods, especially in terms of sensitivity and the number of predictions. We also used multiple cancer phenotype and functional genomics datasets to evaluate the novel CalncRNAs predicted by POCALI and found that they have a strong relationship with cancer phenotypes, similar to known CalncRNAs.

2. Results

2.1. POCALI Predicts CalncRNAs Based on Known Cancer lncRNAs and Neutral lncRNAs

We developed the computational tool POCALI to predict CalncRNAs by integrating features from six categories (Epigenomics, Genomics, Transcriptomics, Phenotype, Network, and Mutation), collecting high‐quality training datasets, and selecting the best‐performing model from multiple classification algorithms. We also analyzed the contributions of these features to the prediction results and subsequently compared POCALI with other methods. Finally, we used functional datasets to validate the roles of novel CalncRNAs in cancer (Figure 1A).

Figure 1. Flowchart of the POCALI method. A) A systematic overview of the POCALI method. The workflow includes collecting training data, extracting features, selecting algorithms, training the model, analyzing the features' contributions, and evaluating the model and prediction results. B) Detailed information about algorithm selection. C) Detailed model training process.

Based on a literature search, we collected a total of 44 features that were likely to be predictive of CalncRNAs. These features either have known roles in predicting cancer PCGs or potential links to the discovery of CalncRNAs.[9, 18, 21, 23] We aimed to determine whether features that can be used to identify cancer PCGs can also predict CalncRNAs.
We categorized these features into six major types (Figure S1A and Table S1, Supporting Information):
a) 15 epigenomic features adapted from DORGE,[9] including the peak widths of 11 histone modifications, super enhancer percentages, promoter and gene body methylation, and the replication time S50 score;
b) 13 genomic features, including one feature related to gene length, 10 features adapted from CRlncRC,[21] one feature obtained from CADTAD,[17] and one k‐mer content feature[24] that has not previously been used to predict CalncRNAs;
c) six transcriptomics features, including two gene expression‐related features, which have been widely used to identify cancer‐related lncRNAs in many studies, and four secondary structure‐related features that may help identify CalncRNAs, as they have previously been used to identify gene/lncRNA essentiality and different types of RNA;[25, 26]
d) three network features describing lncRNA interactions with cancer‐related mRNAs, miRNAs, and proteins, which could hint at the potential roles of lncRNAs in cancer; and
e) four mutation features adapted from ExInAtor,[18] OncodriveFML,[23] and copy number variations (CNVs), which have previously been used to identify CalncRNAs.[27–31]
We used these features to annotate all lncRNAs.

CalncRNA prediction is a classification problem that requires a high‐quality training dataset, with reliable CalncRNAs as the positive set and lncRNAs unlikely to be CalncRNAs as the negative set. Therefore, we established strict criteria to select the training datasets. First, we considered only the intergenic lncRNAs present in the high‐quality curations from GENCODE[32] to eliminate the potential influence of PCGs on mutation or epigenomic features. This resulted in 10 746 lncRNA genes as references.
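The intergenic restriction described above can be sketched as a simple coordinate filter. The text does not state the exact rule used (for example, whether strand or a flanking distance is considered), so the strand‐agnostic overlap test, the tuple layout, and the gene names below are assumptions for illustration only.

```python
from collections import defaultdict


def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals [start, end] share at least one base."""
    return a_start <= b_end and b_start <= a_end


def intergenic_lncrnas(lncrnas, pcgs):
    """Return IDs of lncRNAs that overlap no protein-coding gene (PCG).

    Each gene is a (gene_id, chrom, start, end) tuple. Strand is ignored,
    which is an assumption of this sketch rather than a documented rule.
    """
    pcgs_by_chrom = defaultdict(list)
    for _, chrom, start, end in pcgs:
        pcgs_by_chrom[chrom].append((start, end))

    kept = []
    for gene_id, chrom, start, end in lncrnas:
        spans = pcgs_by_chrom.get(chrom, ())
        if not any(overlaps(start, end, s, e) for s, e in spans):
            kept.append(gene_id)
    return kept


# Hypothetical coordinates, not taken from GENCODE.
example_lncrnas = [
    ("lncA", "chr1", 100, 200),  # overlaps pcg1, so it is dropped
    ("lncB", "chr1", 500, 600),  # same chromosome, no overlap
    ("lncC", "chr2", 100, 200),  # no PCG on this chromosome
]
example_pcgs = [("pcg1", "chr1", 150, 300)]
intergenic = intergenic_lncrnas(example_lncrnas, example_pcgs)
# intergenic == ["lncB", "lncC"]
```

For a genome‐scale annotation, an interval tree or sorted sweep would replace the per‐chromosome linear scan, but the filtering logic stays the same.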