Abstract Introduction MicroRNAs (miRNAs) regulate gene expression and play an important role in carcinogenesis through complex interactions with messenger RNAs (mRNAs) and long non-coding RNAs (lncRNAs). Despite their established influence on tumor progression and therapeutic resistance, the application of miRNA interaction networks for tumor tissue-of-origin (TOO) classification remains underexplored. Methods We developed a machine learning (ML) framework that integrates miRNA-mRNA-lncRNA interaction networks to classify tumors by their tissue of origin. Using transcriptomic profiles from 14 cancer types in The Cancer Genome Atlas (TCGA), we constructed co-expression networks and applied multiple feature selection techniques including recursive feature elimination (RFE), random forest (RF), Boruta, and linear discriminant analysis (LDA) to identify a minimal yet informative subset of miRNA features. Ensemble ML algorithms were trained and validated with stratified five-fold cross-validation for robust performance assessment across class distributions. Results Our models achieved an overall 99% classification accuracy, distinguishing 14 cancer types with high robustness and generalizability. A minimal set of 150 miRNAs selected via RFE resulted in optimal performance across all classifiers. Furthermore, in silico validation revealed that many of the top miRNAs, including miR-21-5p, miR-93-5p, and miR-10b-5p, were not only highly central in the network but also correlated with patient survival and drug response. In addition, functional enrichment analyses indicated significant involvement of miRNAs in pathways such as TGF-beta signaling, epithelial-mesenchymal transition, and immune modulation. Our comparative analysis demonstrated that models based on miRNA outperformed those using mRNA or lncRNA classifiers. Discussion Our integrated framework provides a biologically grounded, interpretable, and highly accurate approach for tumor tissue-of-origin classification. The identified miRNA biomarkers demonstrate strong translational potential, supported by clinical trial overlap, drug sensitivity data, and survival analyses. This work highlights the power of combining miRNA network biology with ML to improve precision oncology diagnostics and supports future development of liquid biopsy-based cancer classification. Keywords: miRNAs, network, machine learning, feature selection, tumor tissue origin, ensemble learning 1 Introduction Cancer is the second leading cause of death globally, accounting for nearly 9.7 million deaths as of 2022, and is projected to become the leading cause of premature mortality by the end of the century ([34]Bray et al., 2024; [35]Murthy et al., 2024). In the U.S., cancer incidence rates for 2025 are projected at 643.5 per 100,000 males and 581.4 per 100,000 females, with total expected deaths of 618,120, underscoring the increasing cancer burden ([36]Siegel et al., 2025). Despite considerable advancements in research, early detection with accurate classification remains a significant challenge due to non-specific symptoms, eventually leading to late-stage diagnoses and poor survival ([37]Crosby et al., 2022; [38]Pulumati et al., 2023; [39]Ali et al., 2021; [40]Lingasamy et al., 2019; [41]Lingasamy et al., 2021). MicroRNAs (miRNAs), small non-coding RNAs, typically 17–25 nucleotides long, have gained prominence as cancer biomarkers due to their role as oncogenes or tumor suppressors ([42]Peng and Croce, 2016; [43]Chakrabortty et al., 2023; [44]Pekarek et al., 2023). Identifying cancer-specific miRNA signatures is essential for understanding molecular mechanisms, enabling early-stage detection, and improving treatment outcomes ([45]Peng and Croce, 2016; [46]Yerukala Sathipati et al., 2023; [47]Paranjape et al., 2009; [48]Galvão-Lima et al., 2021). Dysregulated miRNAs significantly impact cellular and biological processes such as apoptosis, cell cycle regulation, metastasis, transcriptome, and interactions within the tumor microenvironment ([49]Peng and Croce, 2016; [50]Sempere et al., 2021; [51]Li et al., 2022). In fact, clinical trials targeting miRNAs (miR-16, miR-34a, miR-155, and miR-193a-3p) have shown promising therapeutic potential for various types of cancer ([52]Seyhan, 2024; [53]Kim and Croce, 2023; [54]Qian et al., 2024). In parallel, long non-coding RNAs (lncRNAs) are recognized as critical regulators of gene expression and tumor progression, thereby expanding the focus of non-coding RNA research in oncology ([55]Huang et al., 2022; [56]Ratti et al., 2020). Certain lncRNAs act as miRNA sponges or competing endogenous RNAs (ceRNAs) that regulate gene expression by competitively binding to shared miRNA-targets through miRNA response elements (MREs), thereby reducing the ability of miRNAs to suppress target mRNA transcripts ([57]Salmena et al., 2011). The advent of high-throughput RNA sequencing facilitates quantitative profiling of miRNA, mRNA, and lncRNA expression levels and aids in identifying MREs in the 3′ untranslated regions (UTRs) of target genes. Tools like TargetScan and miRanda help predict interactions between miRNAs and mRNAs/lncRNAs, revealing regulatory connections that affect cancer pathways ([58]Lewis et al., 2005; [59]Betel et al., 2008). Dysregulation of ceRNAs can lead to abnormal gene expression, promoting cancer progression, metastasis, drug resistance, and maintenance of cancer stem cells ([60]Salmena et al., 2011; [61]Tay et al., 2014; [62]Gutschner and Diederichs, 2012). lncRNAs such as MALAT1, a prognostic marker for metastasis in non-small cell lung cancer (NSCLC) ([63]Gutschner and Diederichs, 2012) and H19 enhance tumorigenesis by influencing tumor suppressor genes and oncogenes ([64]Zhang et al., 2022), while other lncRNAs like HOTAIR and DANCR facilitate cancer metastasis through ceRNA networks ([65]Cheng and Huang, 2021), underscoring their potential as therapeutic targets and biomarkers for cancer prognosis. Advances in sequencing technologies, supported by large-scale initiatives like The Cancer Genome Atlas (TCGA) ([66]Tomczak et al., 2015), have potentially enhanced our understanding of cancer genomics. These initiatives have identified miRNA and lncRNA profiles associated with tumor progression, metastasis, and therapy resistance, establishing them as promising biomarkers and therapeutic targets ([67]Dong et al., 2020; [68]Gao and Zhou, 2019; [69]Cho and Han, 2017). For example, miR-29b/c in gastric cancer and miR-186 and miR-34a in breast cancer play critical roles in modulating key pathways linked to tumor progression ([70]Gong et al., 2014; [71]Sun et al., 2019). Similarly, lncRNAs like HOTAIR and MALAT1 have been implicated in promoting invasion and metastasis ([72]Archit et al., 2024; [73]Zhou et al., 2021). In prostate cancer, the negative correlation between miR-142-5p and lncRNA ADAMTS9-AS1 has been found to facilitate tumor progression ([74]Archit et al., 2024; [75]Zhou et al., 2021). Integrating miRNA expression data into pan-cancer analyses can identify shared molecular signatures across diverse cancer types ([76]Mitra et al., 2020; [77]Lopez-Rincon et al., 2020; [78]Tan et al., 2019). Pan-cancer studies increasingly use miRNAs as biomarkers for cancer classification and TOO prediction, achieving 84.27% accuracy for cancer type classification ([79]Yerukala Sathipati et al., 2023) and 97% accuracy for predicting TOO in metastatic cancers using neural networks ([80]Raghu et al., 2024). Furthermore, treatment strategies guided by miRNAs based on support vector machines (SVM) showed a higher classification accuracy of 97.2% ([81]Cheerla and Gevaert, 2017). However, the above methods often overlook the intricate interactions of miRNAs within broader molecular networks, which is crucial for capturing the complexity of cancer biology. Thus, multi-omics models integrating miRNA, mRNA, and lncRNA data may offer promising potential for pan-cancer classification. To address the current limitations, our study utilized comprehensive machine learning (ML) methods to classify TOO using miRNAs identified from miRNA-mRNA-lncRNA interactions from TCGA datasets. Further, we identified a minimal set of 150 miRNA biomarkers using ensemble ML methods, achieving 99% accuracy in distinguishing 14 cancer types. Our findings are projected to emphasize the potential of integrating computational and biological approaches to advance precision oncology, enabling the development of innovative diagnostic tools and treatment strategies. 2 Materials and methods 2.1 Data collection and preprocessing Transcriptomic profiles, including miRNA-Seq (miRNA isoform) and RNA-Seq data, were obtained from the TCGA project via the Genomics Data Commons portal (GDC) ([82]Heath et al., 2021). Raw read counts for both solid tissue normal (NT) and primary solid tumor (TP) tissue samples were downloaded using the Bioconductor R package TCGAbiolinks (v2.32.0). The selection of cancer types was limited to those with at least 10 patient samples per cancer type, with data for both tumor and corresponding normal tissues included. To maintain consistency, only primary tumor samples were analyzed. The disease type classifications, along with their respective primary sites and TCGA project names, are summarized in [83]Table 1. A comprehensive breakdown of sample counts (NT and TP samples per cancer) for miRNA-Seq and RNA-Seq datasets, categorized by cancer type and tissue type, is provided in [84]Table 1. In total, the dataset comprised 14 cancer types, with 6,485 and 6,507 samples from miRNA-Seq and RNA-Seq distributed across tumor tissues and 640 and 660 samples from miRNA-Seq and RNA-Seq distributed across normal tissues, respectively. TABLE 1. Projects and cancer types from TCGA. Project Project name Disease type Primary site NT miRNA-Seq TP miRNA-Seq Total count miRNA-Seq NT RNA-Seq TP RNA-Seq Total count RNA-Seq TCGA-BLCA Bladder Urothelial Carcinoma Adenomas and Adenocarcinomas Bladder 19 417 436 19 412 431 Epithelial Neoplasms, NOS Squamous Cell Neoplasms Transitional Cell Papillomas and Carcinomas TCGA-BRCA Breast Invasive Carcinoma Adenomas and Adenocarcinomas Breast 104 1094 1198 113 1111 1224 Adnexal and Skin Appendage Neoplasms Basal Cell Neoplasms Complex Epithelial Neoplasms Cystic, Mucinous and Serous Neoplasms Ductal and Lobular Neoplasms Epithelial Neoplasms, NOS Fibroepithelial Neoplasms Squamous Cell Neoplasms TCGA-ESCA Esophageal Carcinoma squamous cell neoplasms Esophagus 13 186 199 13 184 197 adenomas and adenocarcinomas cystic, mucinous and serous neoplasms adenomas and adenocarcinomas Stomach squamous cell neoplasms TCGA-HNSC Head and Neck Squamous Cell Carcinoma squamous cell neoplasms Base of tongue 44 523 567 44 520 564 Bones, joints and articular cartilage of other and unspecified sites Floor of mouth Gum Hypopharynx Larynx Lip Oropharynx Other and ill-defined sites in lip, oral cavity and pharynx Other and unspecified parts of mouth Other and unspecified parts of tongue Palate Tonsil TCGA-KICH Kidney Chromophobe Adenomas and Adenocarcinomas Kidney 25 66 91 25 66 91 TCGA-KIRP Kidney Renal Papillary Cell Carcinoma Kidney 34 291 325 32 290 322 TCGA-KIRC Kidney Renal Clear Cell Carcinoma Kidney 71 544 615 72 541 613 TCGA-LIHC Liver Hepatocellular Carcinoma Adenomas and Adenocarcinomas Liver and intrahepatic bile ducts 50 372 422 50 371 421 TCGA-LUAD Lung Adenocarcinoma Acinar Cell Neoplasms Bronchus and lung 46 519 565 59 539 598 Adenomas and Adenocarcinomas Cystic, Mucinous and Serous Neoplasms TCGA-LUSC Lung Squamous Cell Carcinoma Squamous Cell Neoplasms Bronchus and lung 45 478 523 51 502 553 TCGA-STAD Stomach Adenocarcinoma Adenomas and Adenocarcinomas Stomach 45 446 491 36 412 448 Cystic, Mucinous and Serous Neoplasms TCGA-PRAD Prostate Adenocarcinoma Adenomas and Adenocarcinomas Prostate gland 52 498 550 52 501 553 Cystic, Mucinous and Serous Neoplasms Ductal and Lobular Neoplasms TCGA-THCA Thyroid Carcinoma Adenomas and Adenocarcinomas Thyroid gland 59 506 565 59 505 564 Epithelial Neoplasms, NOS TCGA-UCEC Uterine Corpus Endometrial Carcinoma Adenomas and Adenocarcinomas Corpus uteri 33 545 578 35 553 588 cystic, mucinous and serous neoplasms epithelial neoplasms, nos not reported Uterus, NOS 640 6485 7125 660 6507 7167 Total [85]Open in a new tab Projects and cancer types from TCGA, along with the sample count of miRNA/RNA, expression profiles per tissue and cancer types used in this study. (NT: Solid tissue normal TP: Primary tumor tissue). 2.2 miRNA network construction The construction of miRNA networks for each cancer was based on the methodology outlined earlier ([86]Lawarde et al., 2024). The procedure can be summarized as follows: 2.2.1 Differential expression analysis We conducted differential expression analysis between tumor and normal tissue using the R package DESeq2 (v1.44.0), for miRNA-Seq data and applied VST to visualize the data using t-SNE plot. Expression matrices for protein-coding genes and lncRNAs for each type of cancer were also extracted from the RNA-Seq dataset to perform differential expression analysis using the same DESeq2 package. 2.2.2 Network construction After identifying common patient samples shared between both miRNA-Seq and RNA-Seq datasets, we calculated Pearson correlation coefficients to construct a miRNA-mRNA-lncRNA co-expression correlation network. This network included miRNAs, mRNA, and lncRNAs that met the criteria |log2 fold change| ≥ 1, p-values adjusted using the Benjamini–Hochberg (BH) method was <0.05, and the correlation coefficient |R| ≥ 0.5. The R package igraph (v2.1.1) was utilized for network construction, and the fast greedy algorithm identified communities within the network. Additionally, the assortativity coefficient and the degree of the network were calculated using the igraph package. The miRNA features were obtained from the edge table of each cancer type, as shown in ([87]Supplementary Table S1). 2.3 Machine learning models to classify multiple cancer types based on TOO A classification model using interacting miRNAs was developed to categorize 27 different classes, including 7,125 samples (training samples 4,978), consisting of NT and TP samples and cancer types as detailed in [88]Table 1, using tree-based and ensemble machine learning techniques. A minimum sample requirement of 15 was established to ensure robust model training and testing, focusing exclusively on these samples. Due to insufficient data, NT samples from the esophageal carcinoma (ESCA) were excluded from further analysis. 2.3.1 Machine learning (ML) methods used for training We implemented four different machine-learning methods to train our classification models as described below: * 1.Random Forest (RF): RF builds multiple decision trees using bootstrapped samples and selected features randomly, improving accuracy and enhancing robustness while avoiding overfitting. Moreover, the RF method aggregates the results from individual trees to provide a more stable and accurate prediction. * 2.AdaBoost (Adaptive Boosting): AdaBoost sequentially combines weak classifiers, focusing on previously misclassified instances. However, AdaBoost is sensitive to noise and outliers despite effective reduction of bias, impacting the overall performance in certain datasets. * 3.XGBoost (Extreme Gradient Boosting): XGBoost refines gradient boosting through efficient parallel processing and regularization, making it particularly suitable for high-dimensional datasets. * 4.LightGBM (Light Gradient Boosting Machine): LightGBM is yet another gradient boosting approach that increases speed and memory efficiency by using histogram-based learning and leaf-wise growth techniques, making it particularly effective for larger datasets. Together, these ensemble methods used in our models leverage the strengths of multiple models, thereby enhancing predictive performance and robustness across various machine-learning tasks. 2.3.2 Training and test set The miRNA expression dataset, encompassing 14 cancer types and 27 classes, was divided into training and test sets with a 70:30 split. Specifically, 70% of the data was allocated for model training, while the remaining 30% was reserved for testing. The model was trained and tested using Python 3. A pipeline was constructed using the imblearn.pipeline module, which included StandardScaler from sklearn.preprocessing for feature scaling and the Synthetic Minority Over-sampling Technique (SMOTE) technique from imblearn.over_sampling to address class imbalance. This pipeline was used for training, integrating feature scaling and sample balancing. Further, all the prediction models were cross-validated using a 5-fold strategy with the StratifiedKFold method from the sklearn.model_selection module. The classification report and confusion matrix were generated using the classification_report and confusion_matrix functions from the sklearn.metric module. The Area Under the Curve (AUC) was calculated using the roc_auc_score function from the same module. All methods were implemented with default parameters, except for the AdaBoost method, which was tuned through hyperparameter adjustment. Specifically, we used a decision tree as the base estimator with a maximum depth of 5, set the algorithm to Stagewise Additive Modeling using a Multiclass Exponential loss function (SAMME), adjusted the learning rate to 1.2, and set the number of estimators to 300 for the AdaBoost model. 2.4 Feature selection We employed four feature selection methods on all interacting miRNAs to identify the most relevant predictors for the classification task. Recursive Feature Elimination (RFE) was used to iteratively remove the least important features based on model performance, effectively narrowing down the feature set. The Boruta method, a wrapper algorithm, was applied to determine the significance of features by comparing their importance to random permutations, ensuring only the most relevant features. Linear Discriminant Analysis (LDA) was also utilized to select features that maximally separate between classes, focusing on those contributing to the best class discrimination. Finally, the Random Forest (RF) method provided feature importance scores, allowing for the selection of features based on their contribution to the predictive power of the model accuracy. This comprehensive approach of feature selection improved model performance and minimized dimensionality, ensuring that only the most relevant features were utilized for multiclass classification. 2.5 Model evaluation metrics To evaluate the performance of each classification model, we used standard metrics, including accuracy, sensitivity, specificity, precision, F1-score, and AUC, in line with similar studies ([89]Modhukur et al., 2021; [90]Rahmani et al., 2023). The performance metrics were computed as follows: * •Precision = TP/(TP + FP) * •Recall/Sensitivity = TP/(TP + FN) * •F1-score = 2*TP/(2 * TP + FP + FN) * •Accuracy = (TP + TN)/(TP + TN + FP + FN) * •Sensitivity = TP/(TP + FN) * •AUC:AUC refers to the area under the Receiver Operating Characteristic (ROC) curve. AUC provides an aggregate measure of performance across all classification thresholds, indicating the model’s ability to distinguish between classes effectively. 2.6 Cross-validation of interacting miRNAs with literature and clinical trial data We manually compiled a comprehensive collection of miRNAs in cancer, miRNA isoforms in cancer, extracellular vesicular (EV) miRNAs, and clinical trial miRNAs from the literature ([91]Supplementary File S1). Additionally, we downloaded the miRNA-drug associations from the noncoRNA db ([92]Li et al., 2020) and miRNA genes from the Cancer miRNA Census (CMC miRNAs) from the published paper ([93]Suszynska et al., 2024). The CMC miRNA genes were mapped to miRNA IDs (miRBase v21) and overlapping miRNAs between CMC and all interacting miRNAs were identified. Our literature-derived compendium and drug-target association were visualized with Venn diagrams and pie charts using R packages ggplot2 (v3.5.1) and VennDiagram (v1.7.3). 2.7 Machine learning classifier comparison with other biomolecules: mRNAs and lncRNAs The LightGBM model was trained on three sets of mRNA features. The mRNA features were selected from the interactions between the miRNA-mRNA-lncRNA network. All mRNAs are significantly regulated in each cancer type (|log2FoldChange| ≥ 1 and adjust p-value with BH < 0.05) (all interactions are listed in [94]Supplementary Table S1). We used random number generation to pick the number of mRNA features from a total of 6207 interacting mRNAs. Two random numbers, 123 and 223, were selected from 100 to 200 and 200 to 300 random numbers. Similarly, for lncRNAs, we used random number generation to select lncRNA features from a total of 2245 lncRNAs to train the ML models. A total of 105 and 258 lncRNAs were selected from 100 to 200 and 200 to 300 random numbers. The training steps are followed in the same manner as mentioned for the miRNA models above. For both mRNAs and lncRNAs, we trained three models each. Two from random feature selection and one with all interacting mRNAs/lncRNAs. 2.8 Functional enrichment analysis We obtained experimentally validated gene targets of interacting miRNAs from TarBase, miRTarBase, and miRecords databases using the Bioconductor R package multiMiR (v1.26.0). Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were conducted for these gene targets using the Bioconductor package clusterProfiler (v4.12.6), and the results were visualized as a dot plot. Additionally, GO and KEGG enrichment analyses were performed for protein-coding and lncRNA genes obtained from each cancer type’s network. Finally, we conducted a separate enrichment analysis for the targets of common miRNAs compiled from all interacting miRNAs, miRNA compendium (including EVmiRNA list), and CMC ([95]Suszynska et al., 2024) miRNAs. 2.9 Survival and clinical prognosis analysis of interacting miRNAs To evaluate the prognostic potential of the identified interacting miRNAs, we performed univariate Cox-PH analysis using the MethSurv pipeline ([96]Modhukur et al., 2018). Patients were stratified into high- and low-expression groups based on the median expression level of each miRNA. The statistical significance of the association between miRNA expression and overall survival was assessed using the log-rank test. The proportional hazards assumption was verified using the Schoenfeld residuals test, and survival curves were visualized with the Kaplan-Meier (KM) plot. We used the R packages survival (v3.7.0) and survminer (v0.4.9) for survival analysis and visualizations, respectively. 3 Results 3.1 Workflow overview The overview of the workflow adopted in this study is illustrated in [97]Figure 1, which consists of three main sections, as summarized below. * (A) Data Collection and Preprocessing + i.Raw read counts were collected from TCGA for 14 cancer types, with a focus on miRNAs, protein-coding genes, and lncRNAs. + ii.Differential expression analysis was performed using DESeq2, comparing tumor and normal samples with strict significance thresholds (|log2 fold change| ≥ 1 and adjusted p-value <0.05). + iii.The raw counts were normalized using variance stabilizing transformation (VST) to reduce heteroscedasticity and improve comparability across samples. The VST-normalized data was used for downstream visualizations, including t-SNE plots. + iv.A Pearson correlation matrix was created to evaluate relationships between differentially regulated miRNAs, mRNAs, and lncRNAs to help identify potential interactions. + v.miRNA-mRNA-lncRNA network was constructed based on the aforementioned correlations, considering only interactions with a correlation coefficient (|R|) of 0.5 or higher. + vi.The network structure was analyzed further through community identification using the fast-greedy method, revealing clusters of interacting features. + vii.A total of 597 interacting miRNAs were selected for subsequent analysis. + viii.Survival analysis was performed using univariate Cox Proportional Hazards (Cox-PH) regression to assess the relationship between miRNA expression and patient survival. * (B) Feature Selection, Analysis, and Machine Learning + i.The raw miRNA counts were log2 transformed, quantile normalized, and batch effects removed. From the total preprocessed data, a subset of the quantile normalized count matrix of 597 interacting miRNA obtained from part (A) was used for the next steps. + ii.Dimensionality was reduced by using feature selection methods: RFE, RF, Boruta, and LDA. + iii.The data were split into 70% training and 30% testing sets, followed by feature scaling and application of SMOTE to address class imbalance. + iv.A multilabel classification model was used to classify normal and tumor tissues, employing machine learning algorithms including RF, Adaptive Boosting (AdaBoost), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and a voting classifier, along with feature importance evaluations. * (C) Validation with Literature and Functional Enrichment Analysis + i.The results were validated through comparisons with existing literature. FIGURE 1. [98]FIGURE 1 [99]Open in a new tab Overview of the study workflow. (A) Data preprocessing and network creation from TCGA data for 14 cancers were included in the study, resulting in 14 networks and 597 miRNAs selected from the highly correlated networks. (B) Feature selection and model training using network-derived features. (C) Cross-validation of selected features through literature and database comparisons. Functional enrichment analysis, including Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment, were used to identify miRNA-drug target associations and potential biomarkers in clinical trials. 3.2 Overview of interacting miRNAs and network properties In our analysis, a highly correlated network (|R| ≥ 0.5) of differentially regulated miRNA, mRNA, and lncRNA for the selected cancer types, applying significant thresholds of p. adjust <0.05 and |log2FoldChange| ≥1) was constructed, where the assortativity coefficient and degree of assortativity for networks ranged from −0.59 to −0.85 and −0.3 to −0.63, respectively. The negative assortativity coefficient indicates that the nodes tend to connect to other nodes with different properties, such as high-degree nodes (miRNAs) connecting with low-degree nodes (protein-coding genes and lncRNAs). After removing duplicates, a total of 597 unique miRNAs were compiled by combining miRNAs from networks of all 14 cancer types. The top ten miRNAs with the highest degree of centrality per cancer type are presented in [100]Table 2, with a detailed list of interactions per cancer shown in [101]Supplementary Table S1. Key miRNAs, including miR-145-3p/5p, miR-142-3p, miR-100-5p, miR-143-3p, and miR-199b-5p, exhibited the highest degree of centrality across multiple cancer types in our study, highlighting their potential involvement in oncogenic pathways. TABLE 2. Degree centrality of miRNAs in selected cancer types. BLCA (bladder urothelial carcinoma) Degree centrality BRCA (breast invasive carcinoma) Degree centrality hsa-miR-125b-5p 362 hsa-miR-190b 197 hsa-miR-145-3p 353 hsa-miR-155-5p 137 hsa-let-7c-5p 323 hsa-miR-934 135 hsa-miR-99a-5p 297 hsa-miR-18a-5p 118 hsa-miR-100-5p 295 hsa-miR-142-5p 106 hsa-miR-143-3p 282 hsa-miR-577 88 hsa-miR-145-5p 280 hsa-miR-379-5p 54 hsa-miR-6507-5p 242 hsa-miR-135b-5p 45 hsa-miR-200a-3p 225 hsa-miR-199b-5p 43 hsa-miR-141-3p 220 hsa-miR-142-3p 41 LIHC (Liver Hepatocellular Carcinoma) Degree centrality LUAD (Lung Adenocarcinoma) Degree centrality hsa-miR-105-5p 226 hsa-miR-34b-3p 263 hsa-miR-767-5p 224 hsa-miR-34c-3p 248 hsa-miR-4652-5p 189 hsa-miR-34b-5p 229 hsa-miR-4746-5p 174 hsa-miR-105-5p 108 hsa-miR-767-3p 142 hsa-miR-767-5p 108 hsa-miR-199a-3p 132 hsa-miR-150-5p 89 hsa-miR-199b-3p 132 hsa-miR-767-3p 80 hsa-miR-466 110 hsa-miR-548y 69 hsa-miR-199b-5p 109 hsa-miR-194-5p 65 hsa-miR-214-3p 86 hsa-miR-192-5p 61 ESCA (Esophageal Carcinoma) Degree centrality HNSC (Head and Neck Squamous Cell Carcinoma) Degree centrality hsa-miR-944 930 hsa-miR-133a-5p 297 hsa-miR-205-3p 833 hsa-miR-1-5p 277 hsa-miR-205-5p 813 hsa-miR-1-3p 264 hsa-miR-149-5p 731 hsa-miR-133b 261 hsa-miR-6499-3p 609 hsa-miR-133a-3p 257 hsa-miR-708-5p 588 hsa-miR-499a-5p 245 hsa-miR-708-3p 582 hsa-miR-206 229 hsa-miR-375 553 hsa-miR-381-3p 103 hsa-miR-224-5p 495 hsa-miR-9-5p 99 hsa-miR-452-5p 484 hsa-miR-193b-3p 77 hsa-miR-142-3p 146 hsa-miR-222-3p 632 hsa-miR-944 70 hsa-miR-221-3p 630 hsa-miR-203a-3p 41 hsa-miR-23b-3p 430 hsa-miR-100-5p 37 hsa-miR-27b-3p 421 hsa-miR-205-5p 33 hsa-miR-145-3p 377 hsa-miR-29c-3p 22 hsa-miR-133b 357 hsa-miR-7702 21 hsa-miR-96-5p 241 hsa-miR-148a-3p 19 hsa-miR-141-3p 223 hsa-miR-196a-5p 13 hsa-miR-143-3p 201 hsa-miR-511-5p 12 hsa-miR-6510-3p 196 KICH (Kidney Chromophobe) Degree centrality KIRC (Kidney Renal Clear Cell Carcinoma) Degree centrality hsa-miR-221-3p 594 hsa-miR-142-5p 277 hsa-miR-182-5p 474 hsa-miR-155-5p 247 hsa-miR-96-5p 432 hsa-miR-142-3p 133 hsa-miR-222-3p 398 hsa-miR-892b 128 hsa-miR-221-5p 365 hsa-miR-892c-3p 117 hsa-miR-30e-5p 332 hsa-miR-888-5p 111 hsa-miR-455-3p 327 hsa-miR-204-5p 102 hsa-miR-891a-5p 304 hsa-miR-891b 93 hsa-miR-222-5p 293 hsa-miR-892a 87 hsa-miR-29a-3p 240 hsa-miR-21-5p 67 STAD (Stomach Adenocarcinoma) Degree centrality THCA (Thyroid Carcinoma) Degree centrality hsa-miR-195-5p 605 hsa-miR-146b-3p 892 hsa-miR-100-5p 593 hsa-miR-21-5p 853 hsa-miR-145-3p 570 hsa-miR-146b-5p 853 hsa-miR-125b-5p 569 hsa-miR-7-2-3p 696 hsa-miR-145-5p 548 hsa-miR-1179 672 hsa-miR-133a-3p 528 hsa-miR-204-5p 562 hsa-let-7c-5p 500 hsa-miR-7156-5p 402 hsa-miR-218-5p 487 hsa-miR-375 401 hsa-miR-1-3p 480 hsa-miR-31-3p 374 hsa-miR-133b 448 hsa-miR-6860 366 hsa-miR-143-3p 263 hsa-miR-145-3p 185 hsa-miR-126-3p 262 hsa-miR-145-5p 172 hsa-miR-143-5p 225 hsa-miR-142-5p 94 hsa-miR-145-5p 191 hsa-miR-449a 93 hsa-miR-145-3p 188 hsa-miR-449c-5p 84 hsa-miR-223-3p 179 hsa-miR-449b-5p 81 hsa-miR-199a-3p 141 hsa-miR-449b-3p 81 hsa-miR-199b-3p 138 hsa-miR-199a-5p 80 hsa-miR-1-3p 122 hsa-miR-199b-5p 78 hsa-miR-4772-3p 122 hsa-miR-142-3p 78 [102]Open in a new tab Top 10 miRNAs, with the highest degree of centrality for each cancer type in the study. 3.3 Comparison of interacting and feature-selected miRNAs Feature selection methods identified the most representative miRNA signature for cancer-type classification when applied to 597 interacting miRNAs. The RF method identified 298 miRNAs of the most important features based on a median importance (≥0.00039) cutoff, whereas RFE identified 150 miRNAs. The Boruta method yielded 530 miRNA features, while LDA selected 352 miRNAs, achieving 90% cumulative importance. The miRNAs from each feature set are listed in [103]Supplementary Table S2, while [104]Table 3 shows the count of miRNAs in each feature set. The Venn diagram ([105]Supplementary Figure S1) shows the overlap among the 597 interacting miRNAs and those identified by the feature selection methods (RF, LDA, RFE, and Boruta). A total of 98 miRNAs were common between the 597 interacting miRNAs and miRNA feature sets identified by four feature selection methods. Additionally, 40 miRNAs were unique to the 597 miRNA feature set, while 52 miRNAs were shared among the RFE, Boruta, RF, and 597 feature sets. TABLE 3. Total number of miRNAs in each feature set. Feature set Interacting miRNAs RFE RF Boruta LDA No. Of miRNAs 597 150 298 530 352 [106]Open in a new tab 3.4 Performance of machine learning models To assess predictive performance, we trained a total of 25 ML models using five algorithms (RF, AdaBoost, XGBoost, LightGBM, and a voting classifier) across five feature sets (597 miRNAs, RFE-selected features, RF-selected features, Boruta-selected features, and LDA-selected features). The model performance was evaluated using precision, recall, and F1-score, with detailed classification results shown in [107]Table 4. TABLE 4. Machine learning model performances with each feature set. ML models Score 597 RFE RF Boruta LDA Precision Recall f1-score Precision Recall f1-score Precision Recall f1-score Precision Recall f1-score Precision Recall f1-score Random forest accuracy 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 macro avg 0.98 0.95 0.96 0.98 0.96 0.97 0.99 0.97 0.98 0.99 0.96 0.97 0.99 0.96 0.97 weighted avg 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 AdaBoost accuracy 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 macro avg 0.99 0.94 0.96 0.99 0.95 0.96 0.99 0.95 0.97 0.99 0.96 0.97 0.99 0.93 0.96 weighted avg 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 XGBoost accuracy 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 macro avg 0.98 0.95 0.96 0.97 0.95 0.96 0.98 0.96 0.97 0.97 0.95 0.96 0.98 0.95 0.96 weighted avg 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 LightGBM accuracy 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 macro avg 0.99 0.95 0.97 0.98 0.96 0.97 0.98 0.96 0.97 0.98 0.95 0.96 0.99 0.95 0.97 weighted avg 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 Ensemble classifier accuracy 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 macro avg 0.99 0.95 0.97 0.98 0.97 0.97 0.98 0.96 0.97 0.98 0.96 0.97 0.99 0.96 0.97 weighted avg 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 [108]Open in a new tab Precision, recall, f-score of classification accuracy for 5 machine learning methods, Random Forest, AdaBoost, XGBoost, LightGBM, and ensemble classifier used in the study. (RFE: recursive feature elimination, RF: random forest, LDA: Linear discriminant analysis). The RF model exhibited consistently high accuracy across all feature sets, achieving an average accuracy of 99.18% ± 0.0013. However, its performance was lowest for the bladder urothelial carcinoma (BLCA) solid tissue normal (NT) (BLCA-NT) class, where recall ranged between 50% and 60%, and F1 scores ranged from 60% to 80%, regardless of the feature set ([109]Supplementary Table S3). Similarly, the AdaBoost model achieved an overall accuracy of 98.93% ± 0.0017, with the Boruta feature set (530 miRNAs) outperforming others by yielding the highest accuracy of 99.16% ([110]Supplementary Table S4). The XGBoost model achieved an accuracy of 98.80% ± 0.0012, with the RF-selected feature set (298 miRNAs) providing the best performance with 99.02% accuracy ([111]Supplementary Table S5). For LightGBM models, the overall accuracy was 98.98% ± 0.0006. The RF, RFE, and LDA feature sets (298, 150, and 352 miRNAs, respectively) outperformed the 597 (accuracy = 99%) and Boruta feature sets (with 530 miRNAs, the accuracy = 99%). However, the LightGBM models exhibited lower recall for the BLCA-NT and the lung adenocarcinoma (LUAD-NT) classes ([112]Supplementary Table S6). Furthermore, we developed a voting classifier that combined RF, AdaBoost, XGBoost, and LightGBM models and achieved an average accuracy of 99.03% ± 0.0005. Notably, the RFE feature set (150 miRNAs) demonstrated particularly strong results, achieving 99% accuracy with both weighted and macro averages at 99%. A detailed comparison of model performance metrics as shown in [113]Tables 5–[114]9, and accuracy, precision, recall, F1-score, specificity, and AUC for each class and five feature sets are presented in [115]Figure 2. TABLE 5. Performance of Ensemble classifier with 597 miRNA features. Response class 597_precision 597_recall 597_f1-score 597_Specificity 597_AUC TCGA-BLCA-NT 1 0.67 0.8 1 1 TCGA-BLCA-TP 0.98 1 0.99 1 1 TCGA-BRCA-NT 1 0.84 0.91 1 0.98 TCGA-BRCA-TP 0.98 1 0.99 1 1 TCGA-ESCA-TP 1 1 1 1 1 TCGA-HNSC-NT 1 0.85 0.92 1 1 TCGA-HNSC-TP 0.99 1 0.99 1 1 TCGA-KICH-NT 1 1 1 1 1 TCGA-KICH-TP 1 1 1 1 1 TCGA-KIRC-NT 1 1 1 1 1 TCGA-KIRC-TP 1 1 1 1 1 TCGA-KIRP-NT 1 0.9 0.95 1 1 TCGA-KIRP-TP 0.99 1 0.99 1 1 TCGA-LIHC-NT 0.93 0.93 0.93 1 1 TCGA-LIHC-TP 0.99 0.99 0.99 1 1 TCGA-LUAD-NT 0.92 0.79 0.85 1 1 TCGA-LUAD-TP 0.98 0.99 0.99 1 1 TCGA-LUSC-NT 1 1 1 1 1 TCGA-LUSC-TP 1 1 1 1 1 TCGA-PRAD-NT 1 0.93 0.97 1 1 TCGA-PRAD-TP 0.99 1 1 1 1 TCGA-STAD-NT 0.92 0.86 0.89 1 1 TCGA-STAD-TP 0.99 0.99 0.99 1 1 TCGA-THCA-NT 1 0.89 0.94 1 1 TCGA-THCA-TP 0.99 1 0.99 1 1 TCGA-UCEC-NT 1 1 1 1 1 TCGA-UCEC-TP 1 1 1 1 1 Accuracy 0.99 0.99 0.99 macro avg 0.99 0.95 0.97 weighted avg 0.99 0.99 0.99 [116]Open in a new tab Precision, recall, f1-score, specify, and AUC, for ensemble classifier using 597 interacting miRNAs, as a feature set. TABLE 9. Performance of Ensemble classifier with LDA miRNA features. Response class LDA_precision LDA_recall LDA_f1-score LDA_Specificity LDA_AUC TCGA-BLCA-NT 1 0.83 0.91 1 1 TCGA-BLCA-TP 0.99 1 1 1 1 TCGA-BRCA-NT 0.96 0.84 0.9 1 0.99 TCGA-BRCA-TP 0.98 1 0.99 1 1 TCGA-ESCA-TP 1 1 1 1 1 TCGA-HNSC-NT 1 0.92 0.96 s1 1 TCGA-HNSC-TP 0.99 1 1 1 1 TCGA-KICH-NT 1 1 1 1 1 TCGA-KICH-TP 1 1 1 1 1 TCGA-KIRC-NT 1 1 1 1 1 TCGA-KIRC-TP 1 1 1 1 1 TCGA-KIRP-NT 1 0.9 0.95 1 1 TCGA-KIRP-TP 0.99 1 0.99 1 1 TCGA-LIHC-NT 0.92 0.8 0.86 1 1 TCGA-LIHC-TP 0.97 0.99 0.98 1 1 TCGA-LUAD-NT 1 0.93 0.96 1 1 TCGA-LUAD-TP 0.99 1 1 1 1 TCGA-LUSC-NT 1 1 1 1 1 TCGA-LUSC-TP 1 1 1 1 1 TCGA-PRAD-NT 1 0.87 0.93 1 1 TCGA-PRAD-TP 0.99 1 0.99 1 1 TCGA-STAD-NT 1 0.86 0.92 1 1 TCGA-STAD-TP 0.99 1 0.99 1 1 TCGA-THCA-NT 0.94 0.89 0.91 1 1 TCGA-THCA-TP 0.99 0.99 0.99 1 1 TCGA-UCEC-NT 1 1 1 1 1 TCGA-UCEC-TP 1 1 1 1 1 accuracy 0.99 0.99 0.99 macro avg 0.99 0.96 0.97 weighted avg 0.99 0.99 0.99 [117]Open in a new tab Precision, recall, f1-score, specify, and AUC, for ensemble classifier using 352 interacting miRNAs, as a feature set from the LDA, method of feature selection. FIGURE 2. [118]FIGURE 2 [119]Open in a new tab Barplot depicting ensemble model performance across five feature sets. (A) Precision and accuracy for each class. (B) Recall for each class. (C) F1-score for each class. (D) Specificity for each class. (E) Area under the curve (AUC) for each class. TABLE 6. Performance of Ensemble classifier with RFE miRNA features. Response class RFE_precision RFE_recall RFE_f1-score RFE_Specificity RFE_AUC TCGA-BLCA-NT 1 0.83 0.91 1 1 TCGA-BLCA-TP 0.99 1 1 1 1 TCGA-BRCA-NT 0.96 0.87 0.92 1 0.99 TCGA-BRCA-TP 0.99 1 0.99 1 1 TCGA-ESCA-TP 1 1 1 1 1 TCGA-HNSC-NT 1 1 1 1 1 TCGA-HNSC-TP 1 1 1 1 1 TCGA-KICH-NT 1 1 1 1 1 TCGA-KICH-TP 1 1 1 1 1 TCGA-KIRC-NT 1 1 1 1 1 TCGA-KIRC-TP 1 1 1 1 1 TCGA-KIRP-NT 1 0.9 0.95 1 1 TCGA-KIRP-TP 0.99 1 0.99 1 1 TCGA-LIHC-NT 0.93 0.93 0.93 1 1 TCGA-LIHC-TP 0.99 0.99 0.99 1 1 TCGA-LUAD-NT 0.87 0.93 0.9 1 1 TCGA-LUAD-TP 0.99 0.99 0.99 1 1 TCGA-LUSC-NT 1 1 1 1 1 TCGA-LUSC-TP 1 1 1 1 1 TCGA-PRAD-NT 0.88 1 0.94 1 1 TCGA-PRAD-TP 1 0.99 0.99 1 1 TCGA-STAD-NT 0.92 0.86 0.89 1 1 TCGA-STAD-TP 0.99 0.99 0.99 1 1 TCGA-THCA-NT 0.94 0.94 0.94 1 1 TCGA-THCA-TP 0.99 0.99 0.99 1 1 TCGA-UCEC-NT 0.9 1 0.95 1 1 TCGA-UCEC-TP 1 0.99 1 1 1 accuracy 0.99 0.99 0.99 macro avg 0.98 0.97 0.97 weighted avg 0.99 0.99 0.99 [120]Open in a new tab Precision, recall, f1-score, specify, and AUC, for ensemble classifier using 150 interacting miRNAs, as a feature set from the RFE, method of feature selection. TABLE 7. Performance of Ensemble classifier with random forest miRNA features. Response class RF_precision RF_recall RF_f1-score RF_Specificity RF_AUC TCGA-BLCA-NT 1 0.83 0.91 1 1 TCGA-BLCA-TP 0.99 1 1 1 1 TCGA-BRCA-NT 1 0.84 0.91 1 0.99 TCGA-BRCA-TP 0.98 1 0.99 1 1 TCGA-ESCA-TP 1 1 1 1 1 TCGA-HNSC-NT 1 0.92 0.96 1 1 TCGA-HNSC-TP 0.99 1 1 1 1 TCGA-KICH-NT 1 1 1 1 1 TCGA-KICH-TP 1 1 1 1 1 TCGA-KIRC-NT 1 1 1 1 1 TCGA-KIRC-TP 1 1 1 1 1 TCGA-KIRP-NT 1 0.9 0.95 1 1 TCGA-KIRP-TP 0.99 1 0.99 1 1 TCGA-LIHC-NT 0.94 1 0.97 1 1 TCGA-LIHC-TP 1 0.99 1 1 1 TCGA-LUAD-NT 0.93 0.93 0.93 1 1 TCGA-LUAD-TP 0.99 0.99 0.99 1 1 TCGA-LUSC-NT 1 1 1 1 1 TCGA-LUSC-TP 1 1 1 1 1 TCGA-PRAD-NT 0.93 0.93 0.93 1 1 TCGA-PRAD-TP 0.99 0.99 0.99 1 1 TCGA-STAD-NT 0.92 0.86 0.89 1 1 TCGA-STAD-TP 0.99 0.99 0.99 1 1 TCGA-THCA-NT 0.94 0.94 0.94 1 1 TCGA-THCA-TP 0.99 0.99 0.99 1 1 TCGA-UCEC-NT 1 0.89 0.94 1 1 TCGA-UCEC-TP 0.99 1 1 1 1 accuracy 0.99 0.99 0.99 macro avg 0.98 0.96 0.97 weighted avg 0.99 0.99 0.99 [121]Open in a new tab Precision, recall, f1-score, specify, and AUC, for ensemble classifier using 298 interacting miRNAs, as a feature set from the RF, method of feature selection. TABLE 8. Performance of Ensemble classifier with Boruta miRNA features. Response class Boruta_precision Boruta_recall Boruta_f1-score Boruta_Specificity Boruta_AUC TCGA-BLCA-NT 1 0.67 0.8 1 1 TCGA-BLCA-TP 0.98 1 0.99 1 1 TCGA-BRCA-NT 0.96 0.87 0.92 1 0.99 TCGA-BRCA-TP 0.99 1 0.99 1 1 TCGA-ESCA-TP 1 1 1 1 1 TCGA-HNSC-NT 1 1 1 1 1 TCGA-HNSC-TP 1 1 1 1 1 TCGA-KICH-NT 1 1 1 1 1 TCGA-KICH-TP 1 1 1 1 1 TCGA-KIRC-NT 1 1 1 1 1 TCGA-KIRC-TP 1 1 1 1 1 TCGA-KIRP-NT 1 0.9 0.95 1 1 TCGA-KIRP-TP 0.99 1 0.99 1 1 TCGA-LIHC-NT 0.93 0.93 0.93 1 1 TCGA-LIHC-TP 0.99 0.99 0.99 1 1 TCGA-LUAD-NT 0.93 0.93 0.93 1 1 TCGA-LUAD-TP 0.99 0.99 0.99 1 1 TCGA-LUSC-NT 1 0.93 0.96 1 1 TCGA-LUSC-TP 0.99 1 1 1 1 TCGA-PRAD-NT 0.93 0.93 0.93 1 1 TCGA-PRAD-TP 0.99 0.99 0.99 1 1 TCGA-STAD-NT 0.92 0.79 0.85 1 1 TCGA-STAD-TP 0.98 0.99 0.99 1 1 TCGA-THCA-NT 0.94 0.89 0.91 1 1 TCGA-THCA-TP 0.99 0.99 0.99 1 1 TCGA-UCEC-NT 1 1 1 1 1 TCGA-UCEC-TP 1 1 1 1 1 accuracy 0.99 0.99 0.99 macro avg 0.98 0.96 0.97 weighted avg 0.99 0.99 0.99 [122]Open in a new tab Precision, recall, f1-score, specify, and AUC, for ensemble classifier using 530 interacting miRNAs, as a feature set from the Boruta method of feature selection. The ensemble model using 597 miRNA features performed below 80% for the BLCA-NT (recall = 67%) and LUAD-NT (recall = 79%) classes. Similarly, the Boruta feature set model (530 miRNAs) also performed below 80% for the BLCA-NT class (recall = 67%) and the stomach adenocarcinoma (STAD-NT) class (recall = 79%). In contrast, the RFE feature set (150 miRNAs) showed superior performance for several classes, including breast invasive carcinoma (BRCA-NT/TP), esophageal carcinoma (ESCA-TP), head and neck squamous cell carcinoma (HNSC-NT/TP), kidney chromophobe (KICH-NT/TP), kidney renal clear cell carcinoma (KIRC-NT/TP), kidney renal papillary cell carcinoma (KIRP-NT/TP) as compared to the RF (298 miRNAs) and LDA model (352 miRNAs). The RF model performed better in the liver hepatocellular carcinoma (LIHC-NT) class than the RFE and LDA models. For the LUAD-NT/TP, STAD-NT/TP, and uterine corpus endometrial carcinoma (UCEC-NT/TP) classes, the LDA-based model outperformed the RFE and RF feature set-based models. The RFE feature set model demonstrated similar performance to the RF feature set model for THCA-NT, however it outperformed the RF model for the UCEC-NT class. A bar plot comparing ensemble model performance using the RFE, RF, and LDA feature sets is shown in [123]Figures 2A–E. The confusion matrix plot ([124]Figures 3A, B; [125]Supplementary Figures S2A–C) highlights the ensemble classifier’s true classification counts per cancer type across all feature sets. Additionally, t-SNE projections of the 597-miRNA feature set and the RFE feature set (150 miRNAs) are shown in [126]Figures 3C, D, respectively. FIGURE 3. [127]FIGURE 3 [128]Open in a new tab Confusion matrix and t-SNE plot, depicting models’ separability (A) Confusion matrix of the ensemble model trained with 597 miRNA features. (B) Confusion matrix of the ensemble model trained with RFE (150 miRNA) features. (C) t-SNE projection of 597 miRNAs across 14 cancer types, illustrating models’ separability. (D) t-SNE projection of 150 miRNAs (RFE feature set) across 14 cancer types, illustrating model separability. 3.5 Feature importance analysis and survival outcomes In our feature importance analysis, we evaluated the contributions of individual miRNAs to the predictive models. The most important features, according to the RF, AdaBoost, XGBoost, and LightGBM models, respectively, highlight the topmost impactful miRNA for each model in [129]Figures 4A–E. Notably, several miRNAs, including miR-520d-5p, miR-520a-3p, miR-520e, miR-892c-3p, miR-892b, miR-105-3p, miR-215-3p, miR-10b-5p, miR-139-5p, miR-21-5p, miR-93-5p, miR-4778-3p, miR-30c-2-3p, and miR-204-5p, emerged as common top features across models, suggesting their significant role in cancer progression. The top features for each trained model, along with their interacting genes and lncRNAs from the network, are highlighted in [130]Supplementary Table S7. FIGURE 4. [131]FIGURE 4 [132]Open in a new tab Bar plot depicting feature importance from the 4 ML methods. (A) Random forest model with the random forest-based feature set. (B) AdaBoost model with the Boruta feature set. (C) XGBoost model with the random forest feature set. (D) LightGBM model with the RFE feature set. (E) LightGBM model with the random forest feature set. (F) LightGBM model with the LDA feature set. The survival analysis results further emphasized the prognostic potential of these miRNAs in various cancer types. For example, higher expression of miR-204-5p in BRCA correlated with improved survival outcomes (HR < 1; p < 0.0001), whereas lower expression of miR-105-5p was linked to poorer prognosis (HR > 1; p < 0.0001) for patients ([133]Figures 5A, B). In UCEC, patients with elevated miR-93-5p and miR-1301-3p expression levels exhibited a median survival of approximately 120 months (HR > 1), indicating poor prognostic markers ([134]Figures 5C, D). Similarly, in KIRC, high expression of miR-10b-5p and miR-139-5p correlated with better survival than the low expression group (HR < 1; p < 0.0001), while elevated miR-21-5p levels predicted worse survival outcomes (HR > 1; p < 0.0001) ([135]Figures 5E, F). In LIHC, high miR-139-5p expression significantly improved survival outcomes compared to low expression levels, as indicated by the statistical significance (p < 0.0001) and the hazard ratio (HR = 0.42) ([136]Figure 5G). Collectively, these results highlight the prognostic relevance of these miRNAs, suggesting their potential as cancer prognostic biomarkers. FIGURE 5. [137]FIGURE 5 [138]Open in a new tab Survival curve depicting prognostic capabilities of miRNA biomarkers using Kaplan-Meier plot for (A–B) Breast Invasive Carcinoma (BRCA) (C–D) Uterine Corpus Endometrial Carcinoma (UCEC). (E–G) Kidney Renal Papillary Cell Carcinoma (KIRC) and (H) Liver Hepatocellular Carcinoma (LIHC). The interaction networks of top miRNA features are plotted for UCEC, BRCA, and LUAD ([139]Figures 6A–C). In UCEC and BRCA, miR-499bc-5p demonstrated the highest degree of centrality, connecting 81 nodes in UCEC and 35 in BRCA. In UCEC, miRNA interacted with both upregulated and downregulated genes, while in BRCA, its network connections were confined to upregulated genes ([140]Figures 6A, B). The networks in all three cancers were sparse with miRNAs linked to multiple genes. MiR-139-5p and let-7c-5p are both downregulated and were associated with the expression of downregulated genes, with degree centralities of 29 and 35, respectively. In the LUAD network, miR-93-5p had a degree centrality of 8, whereas miR-30a-3p had a degree centrality of 15, indicating similar regulatory patterns ([141]Figure 6C). These interactions further emphasize the important co-regulatory roles of miRNAs in cancer progression and survival. FIGURE 6. [142]FIGURE 6 [143]Open in a new tab miRNA-mRNA-lncRNA network visualization. (A) Network of miR-499b-5p in UCEC. (B–D) Network of miR-499b-5p, miR-139-5p, and let-7c-5p in BRCA. (E–F) Network in LUAD for miR-93-5p, miR-30a-3p. The log2 fold change values highlight node borders: green for downregulated and red for upregulated miRNAs, mRNAs, and lncRNAs. Node shapes represent miRNAs (circles), protein-coding genes (squares), and lncRNAs (c-shaped squares). 3.6 Overlap of predictive miRNAs with literature compendium, clinical trials, and drug target associations Our findings revealed a substantial overlap between the predictive miRNA biomarkers identified in this study and those reported in existing literature, ongoing clinical trials, and the CMC, underscoring their relevance in cancer-related research ([144]Figure 7A). Specifically, a total of 159 predictive miRNAs matched entries in the manually curated miRNA literature-derived compendium ([145]Supplementary Table S8). Of these, 126 miRNAs have been reported as being EV in various cancers. Furthermore, 202 miRNAs from the CMC list corresponded to our identified features. Across the four miRNA datasets, including interacting miRNAs (597 miRNAs), literature-derived miRNAs (miRNA isoforms and EV miRNAs), and miRNAs from the CMC list, 63 miRNAs were common across all datasets, highlighting their significant relevance in cancer progression. FIGURE 7. [146]FIGURE 7 [147]Open in a new tab Overlapping miRNAs with miRNA compendium, CMC db, EV miRNAs and noncoRNA db and KEGG enrichment analysis. (A) Venn diagram depicting overlapping miRNAs features with miRNA compendium and CMC database and EV miRNAs (B) Dot plot depicting KEGG pathway enrichment analysis. (C) Pie chart depicting miRNAs sensitive to drugs and the experimental evidence of drug-target associations from noncoRNA db. Several specific miRNAs identified in our analysis have well-established associations with cancer. For example, miR-106a-3p/5p is recognized for targeting PTEN in prostate cancer, while the miR-125 cluster significantly influences the same type of cancer by targeting p53 inhibitors ([148]Wu et al., 2024). Likewise, both the miR-200 and miR-204-3p/5p clusters are associated with thyroid cancer, and miR-21-3p/5p has connections to both lung and thyroid cancers ([149]Cabané et al., 2024; [150]Carrà et al., 2024). The common miRNAs found in EVs play significant roles in cancer progression by influencing the growth and proliferation of tumors. Specifically, the miR-155 cluster and the miR-221 cluster are associated with renal cell carcinoma and breast cancer, respectively ([151]Das et al., 2019; [152]Liwei et al., 2021; [153]Parashar et al., 2024). Additionally, the miR-181c, the miR-200 cluster, and the miR-221 cluster have been implicated in breast cancer metastasis ([154]Das et al., 2019; [155]Parashar et al., 2024; [156]Tominaga et al., 2015; [157]Le et al., 2014). In terms of angiogenesis, miR-181b-5p in esophageal squamous cell carcinoma, as well as miR-143 and miR-145 clusters in lung cancers, and miR-15a and miR-181b in renal carcinomas, all influence this vital process ([158]Parashar et al., 2024; [159]Wang et al., 2020; [160]Lawson et al., 2017; [161]Lopatina et al., 2019). Furthermore, miR-31-5p and miR-222-3p from renal cell carcinoma and non-small cell lung cancer are involved in chemotherapy resistance ([162]Parashar et al., 2024; [163]He et al., 2020; [164]Wei et al., 2017). Conversely, miR-503 in breast cancer cells has been shown to enhance sensitivity to chemotherapy by disrupting cell proliferation and invasion ([165]Parashar et al., 2024; [166]Bovy et al., 2015). These miRNAs are highlighted as crucial elements in our interaction network, as shown in the [167]Supplementary Table S7. Our analysis identified a considerable number of miRNAs that are currently being investigated in clinical trials for cancer therapies, as shown in [168]Table 10, underscoring their translational potential ([169]Supplementary Table S9). We also evaluated the overlap between the identified miRNA features and those in the noncoRNA db, as shown in the pie chart ([170]Figure 7C) revealed associations with drug targets. Among the 617 miRNAs associated with drug targets, we found 186 miRNAs in validated associations that were resistant to cancer therapies, while 157 miRNAs were linked to treatment sensitivity. Overall, our analysis revealed 256 predicted resistance associations and 18 predicted sensitivity associations. A complete overview of the drug-target associations is provided in [171]Supplementary Table S10. TABLE 10. miRNAs in ongoing clinical trial studies. miRNAs in clinical trial Cancer hsa-miR-10b-3p Glioblastoma hsa-miR-10b-5p Glioblastoma hsa-miR-1307-3p pancreatic cancer hsa-miR-1307-5p pancreatic cancer hsa-miR-146a-5p lung cancer hsa-miR-155-3p lymphoma, breast cancer hsa-miR-155-5p lymphoma, breast cancer hsa-miR-16-1-3p lung cancer hsa-miR-16-2-3p lung cancer hsa-miR-18a-3p breast cancer hsa-miR-18a-5p breast cancer hsa-miR-193a-3p advanced solid tumors hsa-miR-211-5p ovarian cancer hsa-miR-218-5p lung cancer hsa-miR-22-3p lung cancer hsa-miR-29b-2-5p lung cancer hsa-miR-29b-3p lung cancer hsa-miR-34a-5p Renal cell carcinoma, non small cell lung cancer (NSCLC), liver cancer hsa-miR-7-2-3p lung cancer, gastric cancer hsa-miR-7-5p lung cancer, gastric cancer hsa-miR-9-3p Lung cancer hsa-miR-9-5p lung cancer [172]Open in a new tab 3.7 Functional enrichment analysis To explore the biological roles of the 597 interacting miRNAs identified in our study, we conducted GO and KEGG pathway enrichment analyses on their experimentally validated targets. The findings from GO analysis revealed significant enrichment in biological processes critical for cancer progression, including cellular adhesion, differentiation, organelle localization, embryonic organ development, renal and kidney development, T-cell differentiation, and DNA replication ([173]Supplementary Figure S3). Further, KEGG pathway enrichment analysis was performed on the targets of all 597 miRNAs, the 150 miRNA features selected by the RFE method, and the 63 miRNAs shared across the 597 miRNAs, the literature compendium, EV miRNAs, and CMC miRNAs ([174]Figure 7B). The top enriched pathways resulting from the above-mentioned analysis included cellular senescence, Hippo signaling, FoxO signaling, MAPK signaling, TNF signaling, and pathways related to Human Papillomavirus (HPV) infections. These enriched pathways highlight the central role of miRNAs in key signaling cascades implicated in cancer biology. Detailed data from the GO and KEGG enrichment analyses are presented in [175]Supplementary Tables S11–14. miRNA interactions common to validated and predicted interactions for 597 miRNAs, extracted from the databases such as miRTarBase, TarBase, miRecords, Pictar, and Diana, -obtained from multiMiR R package were shown in [176]Supplementary Tables S15, 16. 3.8 Classification performance of mRNA/lncRNA classifiers The lightGBM method was used to classify 14 cancer types using interacting mRNAs and lncRNAs as the features. The classification results for mRNA/lncRNA-based ML models are shown in [177]Supplementary Tables S17, 18. The models had an overall accuracy of 98%–99% in both cases. However, they did not perform as well when classifying some cancer types. For the mRNA models, all three models showed lower sensitivity for normal samples in these classes: BLCA-NT, HNSC-NT, LUAD-NT, PRAD-NT, STAD-NT, and UCEC-NT. The recall values for these classes were below 80%. Additionally, the F1-score was lower than 80% for the following classes: BLCA-NT, HNSC-NT, PRAD-NT, and STAD-NT. Similarly, the lncRNA feature models had lower precision, in the case of KICH-NT and BLCA-NT, lower recall/sensitivity, and F1 -score for BLCA-NT, PRAD-NT, STAD-NT, and UCEC-NT (<80%). The same LightGBM models trained on miRNA features performed better in classifying these normal samples than the mRNA or lncRNA feature sets. 4 Discussion The current study aimed to identify a minimal set of miRNA biomarkers capable of distinguishing primary cancer types based on their TOO while considering the complex interactions among miRNAs, mRNAs, and lncRNAs. Correlation-based networks infer regulatory relationships by analyzing co-expression profiles, which capture molecular interactions in cancer progression ([178]Zheng et al., 2020; [179]Yang et al., 2022), rather than relying on sequence analysis of RNA interactions ([180]Adinolfi et al., 2019; [181]Bheemireddy et al., 2022). The availability of extensive and standardized expression datasets from the TCGA makes correlation-based methods particularly effective for constructing robust and biologically meaningful co-expression networks in cancer studies. By integrating these interactions, our approach provides a more biologically relevant and robust set of miRNA signatures, enhancing the potential for early cancer detection. Our ensemble learning framework, combining Random Forest, AdaBoost, XGBoost, and LightGBM, achieved an impressive 99% accuracy in classifying 14 distinct cancer types based on TOO. We visualized the clustering of cancer samples according to tissue types using t-SNE plots ([182]Figures 3C, D), which demonstrated higher discrimination power. Despite achieving high overall accuracy in TOO prediction, our study revealed some variations in model performance for specific cancer types ([183]Figures 2A–E). These variations can be attributed to factors such as molecular complexity. Gastric cancer, specifically STAD, is often challenging to classify in molecular studies ([184]Cao et al., 2022). In contrast, cancers with well-characterized molecular profiles, such as BRCA and lung cancer (e.g., LUSC) ([185]Koboldt et al., 2012; [186]Thennavan et al., 2021; [187]Hammerman et al., 2012), exhibited higher and more consistent accuracy across all models and feature selection methods. This consistent performance for BRCA and lung cancers highlights the critical role of distinct molecular signatures in improving classification accuracy. These results suggest the need for tailored optimization strategies to enhance classification outcomes, especially for complex and heterogeneous cancer types like gastric cancer. While previous studies have demonstrated promising results in tumor origin classification, they often face limitations in accurately classifying specific cancer types or lack comprehensive biological validation. For instance, Raghu et al. (2024) ([188]Raghu et al., 2024) achieved a high accuracy (97%) for tumor origin detection, however, their method struggled with cancers like uterine (77% with decision tree) and esophagus (33.3% with decision tree and 83% with deep learning) cancers, highlighting limitations in certain cancer classifications. Similarly, Tang et al. (2018) ([189]Tang et al., 2018) used miRNA and DNA methylation markers, achieving ∼91% and ∼96% accuracy, respectively, but relying solely on single-layer data. Another comparative study demonstrates that DNA methylation profiles, particularly when analyzed using LASSO and neural network models, offer the highest predictive accuracy, ∼97.77% for tumor tissue origin detection compared to mRNA, microRNA, and lncRNA expression profiles​ ([190]Feng and Wang, 2024). [191]Lopez-Rincon et al. (2020) focused on comprehensive feature selection with ensemble methods providing minimal miRNAs for classification. A study by Matsuzaki et al. (2023) on serum miRNomes for predicting the TOO in early-stage cancers showed an 88% accuracy across all stages ([192]Matsuzaki et al., 2023). Unlike the studies mentioned earlier, our research integrates miRNA-mRNA-lncRNA interactions identified from co-expression networks, crucial for understanding cancer initiation and pathways, as demonstrated in other cancer studies ([193]Dong et al., 2020; [194]Zheng et al., 2020; [195]Naghsh-Nilchi et al., 2022; [196]Gao et al., 2021). Our approach includes thorough in silico validation using CMC, analysis of survival markers, assessment of drug sensitivity, and relevance to clinical trials. This multi-layered approach provides more biologically relevant insights, positioning our study as a more comprehensive tool for cancer classification and therapeutic planning. Our methodological approach demonstrated the power of Artificial Intelligence (AI) in complex multiclass classification. Our feature selection process identified several important miRNAs. These include miR-21-5p, miR-93-5p, and miR-10b-5p ([197]Qian et al., 2024; [198]Yan et al., 2021; [199]Pan et al., 2021). These miRNAs are linked to critical tumorigenic processes. These processes include immune modulation, epithelial-mesenchymal transition, angiogenesis, and chemoresistance ([200]Pavlíková et al., 2022). We also conducted an in silico validation. This validation revealed overlaps between these miRNA features and drug-target associations. This highlights their dual role in regulating drug sensitivity ([201]Seyhan, 2024; [202]Si et al., 2019) and chemoresistance ([203]Pavlíková et al., 2022). Overall, these miRNAs have an influence on essential processes. These processes include apoptosis, immune response, and therapy resistance. This underscores their potential to guide personalized cancer treatments ([204]Mishra et al., 2016). Functional enrichment analysis, including GO and pathway analysis of miRNA targets, uncovered significant KEGG pathways and GO terms. These terms are associated with both normal biological processes (e.g., embryonic organ development, the establishment of organelle localization, DNA replication), tissue differentiation (e.g., mononuclear cell differentiation, renal system development), and cancer-specific mechanisms involved in cancer development (e.g., T cell differentiation). As highlighted in previous studies ([205]Khatun et al., 2024; [206]Khatun et al., 2021), our study provides an intricate association between HPV and gynecological cancers by incorporating advanced machine learning approaches and rigorous in silico validation methods. Our findings emphasize the role of various cellular mechanisms in cancer development and progression, along with key cancer pathways ([207]Figure 7B), which are consistent with previous studies ([208]Xing et al., 2016; [209]Andrés-León et al., 2017). The top miRNA features identified by our ML models ([210]Supplementary Table S7) were associated with patient prognosis, with several of those implicated in ongoing clinical trials, consistent with findings from previous studies ([211]Seyhan, 2024; [212]Kim and Croce, 2023; [213]Hanna et al., 2019). For instance, RNA-based therapies targeting miR-21-5p have addressed immune infiltration and poor prognosis in KIRC ([214]Rhim et al., 2022; [215]Jenike and Halushka, 2021; [216]Wang et al., 2022). miR-93-5p enhances radiosensitivity by increasing apoptosis in breast cancer ([217]Pan et al., 2021) while promoting tumor progression in the bladder ([218]Yuan et al., 2023) and esophageal carcinoma cells ([219]Xu, 2019). miR-204-5p acts as a tumor suppressor in laryngeal squamous cell carcinoma (LSCC) ([220]Gao et al., 2017; [221]Fan et al., 2023), targets anti-apoptotic protein BCL2 in prostate cancer (PCa) ([222]Lin et al., 2017) and serves as an early diagnostic biomarker in endometrial cancer (EC) ([223]Wu et al., 2022). miR-10b-5p regulates gastric cancer (GC) fibroblast interactions via the TGFβ signaling pathway ([224]Yan et al., 2021), while miR-1301-3p is a potential therapeutic target for thyroid papillary carcinoma ([225]Qiao et al., 2021), gastric cancer ([226]Luo et al., 2021), and endometrial cancer ([227]Lu et al., 2021). Overall, these findings highlight the multifaceted role of miRNAs in distinguishing TOO as diagnostic biomarkers and potential therapeutic targets, offering unifying translational tools for leveraging circulating miRNAs for personalized medicine across pan-cancers/various cancer types. 4.1 Strengths and limitations Our comprehensive study has several notable strengths. The inclusion of 14 cancer types ensures broader applicability and cost-effectiveness. This was complemented by TCGA data, which provided a larger sample size, enhancing the reliability and generalizability of our findings. The integration of advanced ML models with biologically informed feature selection and a multi-validation approach, comprising functional enrichment analyses and clinical trial associations, collectively enhances the robustness of our analytical framework. Furthermore, the identification of key miRNAs with significant diagnostic potential emphasizes the translational relevance of this study. By accounting for complex molecular interactions and addressing gaps in existing studies, our study offers improved diagnostic precision. Despite these potential strengths, certain limitations persist. First, the complexity of miRNA interaction networks poses challenges for experimental validation. Our study relied exclusively on TCGA data, which, while comprehensive, may not fully represent the heterogeneity of cancer subtypes, particularly in rare cases. Additionally, a limitation of this study is the lack of detailed subtype information and metastatic samples, as our analysis was restricted to TCGA-derived primary tumor datasets. Future work will aim to incorporate these aspects to enhance the resolution and applicability of the classification model. Incorporating multiple clinical cohorts and more comprehensive clinical data could further improve our understanding of the role of these miRNA biomarkers in cancer. Finally, while the use of solid tissue samples offers valuable insights, their inherent heterogeneity limits the clinical translation of miRNA biomarkers. Future studies incorporating liquid biopsy data and multi-omics approaches could enhance the translational potential of our findings. 5 Conclusion and future research In summary, our study demonstrated the potential of integrating biologically relevant miRNA features with advanced ML approaches to achieve high accuracy in TOO prediction. Through in silico validation, including functional enrichment analysis, survival analysis, clinical trial associations, and drug sensitivity correlations, we highlighted the biological significance and therapeutic potential of the identified miRNAs. These findings emphasize the importance of integrating computational approaches with biological insights to improve the robustness of cancer diagnostics and treatment. Although the predictive power is promising, further experimental validation is warranted to confirm the clinical relevance of these miRNAs, ultimately advancing precision oncology and improving patient care. Future studies should explore the application of miRNAs in precisely classifying cancer subtypes and accurately determining the origins of metastatic tumors using samples from solid tissues or bodily fluids. Acknowledgments