Abstract

   Long non‐coding RNAs (lncRNAs) are receiving increasing attention as
   biomarkers for cancer diagnosis and therapy. Although there are many
   computational methods to identify cancer lncRNAs, they do not
   comprehensively integrate multi‐omics features for predictions or
   systematically evaluate the contribution of each omics to the
   multifaceted landscape of cancer lncRNAs. In this study, an algorithm,
   POCALI, is developed to identify cancer lncRNAs by integrating 44 omics
   features across six categories. The contributions of different omics
   to identifying cancer lncRNAs are explored, as is the way each
   feature contributes to a single prediction. POCALI is evaluated
   and benchmarked against existing methods. Finally, the cancer
   phenotype and genomics characteristics of the predicted novel cancer
   lncRNAs are validated. POCALI identifies secondary structure and gene
   expression‐related features as strong predictors of cancer lncRNAs, and
   epigenomic features as moderate predictors. POCALI performs better
   than other methods, especially in terms of sensitivity, and predicts
   more candidates. Novel POCALI‐predicted cancer lncRNAs have strong
   relationships with cancer phenotypes, similar to known cancer lncRNAs.
   Overall, this study facilitates the identification of previously
   undetected cancer lncRNAs and the comprehensive exploration of the
   multifaceted feature contributions to cancer lncRNA prediction.

   Keywords: cancer lncRNA, computational biology, machine learning, model
   explanation, multi‐omics
     __________________________________________________________________

   Long non‐coding RNAs (lncRNAs) are receiving increasing attention as
   biomarkers for cancer diagnosis and therapy, highlighting the urgent
   need for computational methods to accelerate their comprehensive
   discovery. Here, to better predict and provide functional insight into
   cancer lncRNAs, a novel interpretable machine‐learning method (POCALI)
   is developed by leveraging 44 multi‐omics features in six categories.

1. Introduction

   Cancer is a complicated disease and a leading cause of death
   worldwide.^[ [34]^1 ^] Understanding the mechanisms of cell
   transformation is a fundamental goal in cancer research. A significant
   step toward achieving this aim involves identifying all the genes
   capable of driving tumors.^[ [35]^2 ^] Most studies have focused on the
   perspective of protein‐coding genes (PCGs) and mutation mechanisms in
   cancer gene discovery.^[ [36]^3 , [37]^4 ^] However, in recent years,
   evidence has shown that cancer development can also be driven without
   mutations, for instance through epigenomic alterations, which has
   promoted the discovery of cancer genes.^[ [38]^5 , [39]^6 , [40]^7 ,
   [41]^8 , [42]^9 ^] A growing body of research has also revealed the
   important role of non‐coding RNAs in cancer. These include long
   non‐coding RNAs (lncRNAs), which are longer than 200 nucleotides and
   have little protein‐coding potential. lncRNAs are involved in a series of
   cellular and biological processes, including chromatin architecture and
   gene regulation.^[ [43]^10 , [44]^11 ^] Their abnormal expression and
   mutations are closely associated with carcinogenesis, metastasis, and
   tumor stages.^[ [45]^12 ^] Intergenic lncRNAs constitute a prominent
   type of lncRNAs that are particularly useful for computational and
   experimental studies due to their lack of overlap with PCGs.

   Many powerful experimental technologies and computational tools have
   enabled extensive exploration of the role of lncRNAs in cancer.
   For example, CRISPR‐mediated interference (CRISPRi)‐based genome‐scale
   screening has led to the identification of 499 lncRNA loci that modify
   cell growth.^[ [46]^13 ^] LNA‐modified GapmeR antisense
   oligonucleotide (ASO) technology has been used to suppress the
   expression of 285 lncRNAs in human primary dermal fibroblasts and
   assess cellular and molecular phenotypes separately.^[ [47]^14 ^]
   Further, the Cancer LncRNA Census (CLC) and Lnc2Cancer initiatives have
   collected cancer lncRNAs through literature curation, providing a
   potential gold standard for predicting and evaluating cancer lncRNAs.
   Both CLC and Lnc2Cancer have been updated to version
   3.^[ [48]^15 , [49]^16 ^] Our previous work, CADTAD, identified core
   cancer driver lncRNAs by relating them to cancer driver PCGs.^[ [50]^17 ^]
   In addition, ExInAtor has been used to identify cancer driver lncRNAs
   based on mutation characteristics.^[ [51]^18 , [52]^19 ^]

   As the amount of biological data increases, data‐driven AI approaches
   can help researchers conduct more effective research. Notably, some of
   the methods adopted for the discovery of cancer lncRNAs involve the use
   of machine learning for predictions.^[ [53]^20 , [54]^21 , [55]^22 ^]
   Zhao et al. identified 707 potential cancer‐related lncRNAs by
   developing a computational method based on a naïve Bayesian
   classifier that integrates genome, regulome, and
   transcriptome data.^[ [56]^20 ^] CRlncRC, a random forest classifier
   that integrates genomic, expression, epigenomic, and network features,
   enabled the identification of 121 cancer‐related lncRNA candidates.^[
   [57]^21 ^] CRlncRC2, an improved version of CRlncRC, was developed by
   using the XGBoost framework, SMOTE‐based over‐sampling, and Laplacian
   Score‐based feature selection, leading to the identification of 439
   cancer‐related lncRNA candidates.^[ [58]^22 ^] However, Zhao's method
   used only eight simple features to represent lncRNA characteristics,
   and because CRlncRC uses expression and epigenomic features from each
   tissue separately, its explanations do not yield general,
   representative features for identifying cancer‐related lncRNAs.
   CRlncRC2 has the same disadvantage, albeit with feature selection.
   Further, researchers usually identify cancer driver PCGs based on
   mutations, and the use of such data with ExInAtor and OncodriveFML has
   led to progress in identifying cancer driver lncRNAs.^[ [59]^18 ,
   [60]^19 , [61]^23 ^] However, ExInAtor was observed to have limited
   prediction sensitivity, and it lacks an evaluation system for exploring how
   mutations contribute to cancer lncRNA identification. Since epigenomic
   features can also be used to identify cancer genes,^[ [62]^6 , [63]^8 ,
   [64]^9 ^] we investigated their utility in identifying cancer lncRNAs
   and whether other features could lead to the better identification of
   cancer lncRNAs.

   To address these objectives, we developed a method called POCALI
   (Prediction and insight On CAncer LncRNAs by Integrating multi‐omics
   data with machine learning) based on LightGBM with EasyEnsemble,
   trained on known and well‐defined cancer lncRNAs (CalncRNAs) and
   neutral lncRNAs (NeulncRNAs) selected using strict criteria. By using
   POCALI, we found that transcriptome features, such as secondary
   structure and differential expression, largely contributed to CalncRNA
   predictions and that epigenomic features moderately contributed to
   predictions. Our evaluation revealed that POCALI performs better than
   other methods, especially in terms of sensitivity and the number of
   predictions. We also used multiple cancer phenotype and functional
   genomics datasets to evaluate the novel CalncRNAs predicted by POCALI
   and found a strong relationship between them and cancer phenotypes,
   similar to known CalncRNAs.
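
   The resampling idea behind EasyEnsemble can be illustrated with a
   minimal sketch: the minority class is kept whole, the majority class
   is randomly undersampled into several balanced subsets, one base
   learner is trained per subset, and the learners' outputs are averaged.
   This is not the POCALI implementation. The function names
   (balanced_subsets, train_centroid_scorer, ensemble_score), the label
   encoding (1 for CalncRNA, 0 for NeulncRNA), and the toy
   nearest‐centroid base learner are all assumptions made for
   illustration; POCALI itself uses LightGBM as the base learner.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical encoding: (feature vector, label), label 1 = CalncRNA, 0 = NeulncRNA.
Sample = Tuple[List[float], int]
Scorer = Callable[[List[float]], float]


def balanced_subsets(positives: List[Sample], negatives: List[Sample],
                     n_subsets: int, seed: int = 0) -> List[List[Sample]]:
    """Pair every minority sample with an equally sized random draw
    (without replacement) from the majority class, n_subsets times."""
    rng = random.Random(seed)
    minority, majority = sorted((positives, negatives), key=len)
    return [list(minority) + rng.sample(majority, len(minority))
            for _ in range(n_subsets)]


def train_centroid_scorer(subset: List[Sample]) -> Scorer:
    """Toy stand-in for the base learner (an assumption, not LightGBM):
    returns 1.0 if a sample is closer to the positive-class centroid."""
    def centroid(rows: List[List[float]]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos = centroid([x for x, y in subset if y == 1])
    neg = centroid([x for x, y in subset if y == 0])

    def score(x: List[float]) -> float:
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        return 1.0 if d_pos < d_neg else 0.0

    return score


def ensemble_score(models: List[Scorer], x: List[float]) -> float:
    """Average the base learners' outputs (fraction voting 'CalncRNA')."""
    return sum(model(x) for model in models) / len(models)
```

   Because every subset contains all minority samples, each base learner
   sees a balanced training set without any known positive being
   discarded, which is one plausible reason such a scheme suits a setting
   with few curated CalncRNAs and many candidate negatives.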

2. Results

2.1. POCALI Predicts CalncRNAs Based on Known Cancer lncRNAs and Neutral
lncRNAs

   We developed the computational tool POCALI to predict CalncRNAs by
   integrating features from six categories (Epigenomics, Genomics,
   Transcriptomics, Phenotype, Network, and Mutation), collecting
   high‐quality training datasets, and selecting the best‐performing model
   from multiple classification algorithms. We also analyzed these
   features’ contributions to the prediction results and subsequently
   compared POCALI to other methods. Finally, we used functional
   datasets to validate the roles of novel CalncRNAs in cancer (Figure
   [65]1A).

Figure 1.

   Flowchart of the POCALI method. A) A systematic overview of the POCALI
   method. The workflow includes collecting training data, extracting
   features, selecting algorithms, training the model, analyzing the
   features’ contribution, and evaluating the model and prediction
   results. B) Detailed information about algorithm selection. C) Detailed
   model training process.

   Based on a literature search, we collected a total of 44 features that
   were likely to be predictive of CalncRNAs. These features either have
   known roles in predicting cancer PCGs or potential links to the
   discovery of CalncRNAs.^[ [67]^9 , [68]^18 , [69]^21 , [70]^23 ^] We
   aimed to determine whether some features that can be used to identify
   cancer PCGs can also predict CalncRNAs. We categorized these features
   into six major types (Figure [71]S1A and Table [72]S1, Supporting
   Information): a) 15 epigenomic features adapted from DORGE,^[ [73]^9 ^]
   including peak width of 11 histone modifications, super enhancer
   percentages, promoter and gene body methylation, and replication time
   S50 score; b) 13 genomic features including one feature related to gene
   length, 10 features adapted from CRlncRC,^[ [74]^21 ^] one feature
   obtained from CADTAD,^[ [75]^17 ^] and one feature about k‐mer
   content^[ [76]^24 ^] that has not been used to predict CalncRNAs; c)
   six transcriptomics features, including two gene expression‐related
   features, which have been widely used to identify cancer‐related
   lncRNAs in many studies, and four secondary structure‐related features
   that may help identify CalncRNAs, as they have been
   previously used to identify gene/lncRNA essentiality and different
   types of RNA;^[ [77]^25 , [78]^26 ^] d) three network features,
   including their interactions with cancer‐related mRNA, miRNA, and
   protein, which could hint at the potential roles of lncRNAs in cancer;
   and e) four mutation features adapted from ExInAtor,^[ [79]^18 ^]
   OncodriveFML,^[ [80]^23 ^] and copy number variations (CNVs), which
   have previously been used to identify CalncRNAs.^[ [81]^27 , [82]^28 ,
   [83]^29 , [84]^30 , [85]^31 ^] We used these features to annotate all
   lncRNAs.

   CalncRNA prediction is a classification problem that requires a
   high‐quality training dataset containing reliable CalncRNAs as the
   positive dataset and a dataset containing the lncRNAs unlikely to be
   CalncRNAs as the negative dataset. Therefore, we established strict
   criteria to select the training datasets. First, we considered only the
   intergenic lncRNAs present in the high‐quality curations from GENCODE^[
   [86]^32 ^] to eliminate the potential influence of PCGs on mutation or
   epigenomics. This resulted in 10 746 lncRNA genes as references.