Abstract Head and neck carcinoma (HNSC) is often diagnosed at advanced stage, incurring poor patient outcome. Despite of advances in chemoradiation and surgery approaches, limited improvements in survival rates of HNSC have been observed over the last decade. Accumulating evidences have demonstrated the importance of microRNAs (miRNAs) in carcinogenesis. In this context, we sought to identify a miRNA signature associated with the survival time in patients with HNSC. This study proposed a survival estimation method called HNSC-Sig that identified a miRNA signature consists of 25 miRNAs associated with the survival in 133 patients with HNSC. HNSC-Sig achieved 10-fold cross validation a mean correlation coefficient and a mean absolute error of 0.85 ± 0.01 and 0.46 ± 0.02 years, respectively, between actual and estimated survival times. The survival analysis revealed that five miRNAs, hsa-miR-3605-3p, hsa-miR-629-3p, hsa-miR-3127-5p, hsa-miR-497-5p, and hsa-miR-374a-5p, were significantly associated with prognosis in patients with HNSC. Comparing the relative expression difference of top 10 prioritized miRNAs, eight miRNAs, hsa-miR-629-3p, hsa-miR-3127-5p, hsa-miR-221-3p, hsa-miR-501-5p, hsa-miR-491-5p, hsa-miR-149-3p, hsa-miR-3934-5p, and hsa-miR-3170, were significantly expressed between cancer and normal groups. In addition, biological relevance, disease association, and target interactions of the miRNA signature were discussed. Our results suggest that identified miRNA signature have potential to serve as biomarker for diagnosis and clinical practice in HNSC. Keywords: Machine learning, miRNA signature, Cancer survival, Head and neck cancer 1. Introduction Head and neck squamous cell carcinomas (HNSCs) are a heterogenic malignancies that can involve multiple cellular origins and sites within the head and neck region. HNSCs are the most common malignancy accounting for nearly 430,000 deaths worldwide each year [[25]1]. The major risk factors associated with HNSCs are including, tobacco, alcohol consumption [[26]2] and human papillomavirus (HPV) infection [[27]3]. Treatment of HNSCs involves chemotherapy, radiotherapy and surgical eradication. However, the heterogeneity of HNSCs, diversity of treatment approaches and individual patient response to these variables, make it difficult to define the outcome of the treatment. Though, advancements in the treatment strategies, concerning aged patients and individual response, these modalities affects quality of life [[28]4,[29]5] and radiotherapy resistance remains a major challenge for HNSC treatment. The toxic nature of treatment and poor survival rates, highlighting the need for effective therapies using the specific biomarkers to improve the treatment outcome. MicroRNA (miRNA) brings new insights to cancer pathobiology and involves in the regulation of gene expression [[30]6]. MiRNAs take part in many processes that are crucial for cancer progression, such as proliferation, migration and apoptosis. Several studies demonstrated that miRNA expression is dysregulated in human cancers [[31]7,[32]8]. There are various studies that have identified miRNAs for the prediction of progression in HNSC and explored the roles of miRNAs as biomarkers in cancer detection and prognosis. Upregulation of miR-21, miR-200c and miR-34a and downregulation of miR-375 was observed in tumors of HNSCs [[33]9]. Inhibition of miR-16 suppress cell migration in laryngeal cancer cell line and targets zyxin and promotes cell mobility [[34]10]. The expression of miR-375 is associated with localization of the tumors and alcohol consumption in patients with laryngeal squamous cell carcinoma [[35]11]. Hsa-miR-21 expression was consistently reported to be significantly associated with poor survival in HNSCs [[36]12]. A meta-analysis on HNSCs identified significant miRNAs that are up and downregulated in HNSCs and decreased expression of 26 miRNAs were associated with poor survival [[37]13]. Decreased expression level of serum miR-9 was found to be associated with poor prognosis in patients with oral squamous cell carcinoma [[38]14]. Altered expression of miRNAs were used to predict recurrence in patients with HNSC [[39]15]. Expression levels of hsa-miR-205, hsa-let-7d and hsa-miR-616 were associated with poor survival in patients with HNSC [[40]16,[41]17]. Some miRNAs are identified as potential biomarkers in HNSC [[42]18]. Additionally, there are some miRNAs that are associated with risk factors and treatment modalities of HNSCs, such as HPV associated miRNAs [[43]19] and radiotherapy associated extra cellular miRNAs [[44]20]. Therefore, miRNAs have been using as prognostic biomarker in HNSC. Computational models have been using for prediction of prognosis in HNSCs. For instance, Bryce et al. developed an artificial neural network (ANN) model to predict survival in patients with HNSCs using clinical factors [[45]21]. Recently, Jurmeister et al. developed machine learning-based models to distinguish lung cancer and HNSCs using DNA methylation data, in which support vector machine (SVM)-based model obtained a promising predictive accuracy for classifying HNSCs [[46]22]. Reddy et al. utilized Random Forest, gradient boosted decision trees, and logistic regression models to predict the acute radiation toxicities in patients with HNSC [[47]23]. Jiang et al. used ridge logistic regression to predict xerostomia after radiation therapy in patients with HNSC [[48]24]. A set of machine learning models were used to detect the cancer in surgical specimens from 36 patients with HNSC [[49]25]. A deep learning based model has been used to detect HNSC from computed tomography (CT) scan images [[50]26]. A convolutional neural network-based frame work was utilized to predict patient outcome based on their CT images [[51]27]. However, machine learning models based on miRNA expression profiles for estimating survival in HNSC are limited. Accordingly, our purpose is to utilize machine learning model to identify a miRNA signature that associated with survival patients with HNSC. In this study, we utilized an optimized support vector regression (SVR) model incorporated with inheritable bi-objective combinatorial genetic algorithm (IBCGA) as a feature selection algorithm to estimate the survival time as well as identify a miRNA signature that is associated with survival in patients with HNSC. The optimization technique was adopted from our previous work [[52][28], [53][29], [54][30]]. The major contributions of this work includes identifying a set of survival associated miRNA signature; robust feature set selection using feature appearance score (FAS); survival validation of the identified miRNAs using Kaplan-Meier (KM) survival analysis; biological relevance of the miRNAs was analyzed using Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) annotations; and relative miRNA expression difference and coexpression analysis of the miRNAs in HNSC and their importance in cancers were discussed further. The manuscript is organized as follows: Section [55]1 provides background and related work, Section [56]2 describes the dataset, development of the HNSC-Sig method, and the bioinformatics tools employed in the study. Section [57]3 covers the identification of the miRNA signature for estimating survival time in HNSC patients, biological pathway analysis of the miRNA signature, gene ontology annotations, gene targets, miRNA disease association, and expression difference analysis. Section [58]4 discusses the roles of the identified miRNA signature in HNSC, the potential and limitations of utilizing miRNA signatures in cancer diagnosis and prognosis, and concludes with final remarks. 2. Materials and methods 2.1. Dataset Initially, the dataset consisted of 523 HNSC samples were retrieved from TCGA database, and the miRNA profiling was performed using the Illumina HiSeq 2000 miRNA sequencing platform. We filtered the dataset based on the availability of survival information and miRNA expression profiling. Finally, the dataset consisted of 133 patients with 498 miRNA expression profiles of HNSC were further considered for the analysis. 2.2. HNSC-Sig We establish the HNSC-Sig method based on SVR incorporated with optimal feature selection algorithm called IBCGA. The optimization technique has been successfully applied in various cancer survival estimations [[59]28,[60][31], [61][32], [62][33], [63][34], [64][35]]. HNSC-Sig was designed to identify a survival associated miRNA signature as well as estimate the survival time in patients with HNSC. Recent years, SVMs playing increasingly prominent roles in solving several biological problems, especially in cancer prognosis and survival predictions [[65]36]. The SVR implemented in this study was used from the LibSVM package [[66]37]. 2.2.1. An optimal feature selection method IBCGA Feature selection algorithms are crucial in reducing dimensionality when working with datasets that have a large number of features. Gu et al. proposed a new method for transient stability assessment of power systems, combining kernelized fuzzy rough sets and the memetic algorithm [[67]38]. Li et al. proposed a novel OS-extreme learning machine with binary Jaya-based feature selection [[68]39]. In this study, we used optimal feature selection algorithm IBCGA, which is potential at solving large parameter optimization problems and can select minimum set of miRNAs as signature from a large number of miRNA expression profiles (n = 498) while maximizing the prediction performance. Here, we used correlation coefficient (CC) [Equation [69](1)] and mean absolute error (MAE) as the estimation measures to evaluate the prediction performance. [MATH: CC=i=1N(xix)(yiy)i=1N(xix)2< mrow>[i=1N(yiy)2] :MATH] (1) where x[i] and y[i] are actual and are the predicted survival times of the ith miRNA, [MATH: x :MATH] and [MATH: y :MATH] are the corresponding means, and N is the total number of patients in the cohort. The MAE is defined in Equation [70](2). [MATH: MAE=1Ni=1N|yixi|2 :MATH] (2) Here, we used genetic algorithm (GA) terminology to represents chromosomes and genes. The chromosome of IBCGA comprises 498 genes and three 4-bit genes for encoding, γ, C, and ν for the ν-SVR. The encoded chromosomes were designed as described in previous studies [[71]36,[72]40]. The IBCGA can simultaneously obtain a set of solutions, X[r], where r = r[start], r[start + 1] … r[end] in a single run. The detailed steps involved in IBCGA can be found in Supplementary method. 2.2.2. Robust feature set selection The robust set of miRNA signature among 30 independent runs in HNSC-Sig has the highest FAS obtained using the following procedure [Equation [73](3)]. * Step 1: Perform N independent runs of HNSC-Sig by maximizing accuracy of 10-CV for obtaining N miRNA signatures. There are [MATH: Pa :MATH] features in the a-th signatures, a = 1, …, N. * Step 2: FAS is calculated as follows: + a) Calculate the Feature appearance score f(a) for each feature ‘a’ that ever presents in the N sets of miRNAs. + b) Calculate the score P[a], t = 1, …, N where s[i] is the i-th feature in the a-th solution: [MATH: Fa=i =1mt< /msub>f(si)/pa :MATH] (3) * Step 3: Output the a-th feature set with the highest appearance score F[a]. 2.3. Biological significance analysis KEGG pathway GO annotation analysis was performed using DIANA tools. The DIANA-microT web server provided the predicted miRNA targets for the pathway analysis. The p-value threshold was set to 0.05 and Fishers’s exact test (hypergeometric distribution) was used for the enrichment analysis. 2.4. miRNA-target interaction network We constructed the miRNA-target interaction network using Cytoscape v3.7.2. For the prediction of target genes miRTarBase and TargetScan predictions were used presented in the CyTargetLinker application of the Cytoscape. The interaction threshold was adjusted to two units. 3. Results 3.1. MiRNA signature selection and survival estimation The dataset consisted of 133 patients with HNSC from The cancer genome atlas (TCGA) database. The optimized method HNSC-sig was incorporated with feature selection algorithm IBCGA to identify a set of survival miRNAs in patients with HNSC. HNSC-Sig identified an average 31 miRNAs and achieved a 10-fold cross-validation (10-CV) mean correlation coefficient and mean absolute error of 0.85 ± 0.01 and 0.46 ± 0.02 years, respectively, between actual and predicted survival time. We performed 30 independent runs of HNSC-sig to select the robust feature set. The system flow-chart of HNSC-Sig is depicted in [74]Fig. 1. To select a robust feature set, FAS was measured for each independent run of the HNSC-Sig. The highest FAS indicates that frequency of these features are higher compared to the lower FAS. The highest FAS of HNSC-sig was 8.48 and the 25 miRNAs were selected as a signature. HNSC-sig with highest FAS obtained a correlation and a mean absolute error of 0.86 and 0.42, respectively, between actual and predicted survival time. The FAS for each independent run is shown in [75]Fig. 2a. Fig. 1. [76]Fig. 1 [77]Open in a new tab System flow chart. Step 1: collection of miRNA expression data from TCGA data portal, Step: HNSC-Sig methodology, and Step 3: miRNA signature discovery by estimation of survival time, KM-survival curves, KEGG and GO annotation analysis, target network analysis, and expression difference analysis. Fig. 2. [78]Fig. 2 [79]Open in a new tab Robust miRNA signature identification. (A) Robust signature selection using Feature appearance score (FAS) measurement for each independent run of HNSC-Sig. The highest FAS obtained is 8.48 at the 9th independent run, (B) estimation performance of the robust miRNA signature, and (C) KM survival plots for hsa-miR-3605-3p, hsa-miR-629-3p, hsa-miR-3127-5p, hsa-miR-497-5p, and hsa-miR-374a-5p. The prediction performance of HNSC-Sig was compared with some regression methods, such as the least absolute shrinkage and selection operator (LASSO) and elastic net. HNSC-Sig exhibited superior prediction performance compared to LAASO and elastic net methods. The comparison of prediction performance results are shown in [80]Table 1. The prediction performance of HNSC-Sig is shown in [81]Fig. 2b. Correlation plots for HNSC-sig, LASSO, and elastic net are shown in [82]Supplementary Fig. S1. Table 1. HNSC-Sig prediction performance comparison. Method Correlation coefficient Mean absolute error in years Features selected LASSO 0.62 0.69 22 Elastic net 0.82 0.53 55 HNSC-Sig 0.88 0.43 29 HNSC-Sig-FAS 0.86 0.42 25 HNSC-Sig-Mean 0.85 ± 0.01 0.46 ± 0.02 31.70 ± .75 [83]Open in a new tab The miRNAs of the signature were ranked based on main effect difference (MED) analysis [[84]41]. The MED analysis gives score for each miRNA based on its estimation capability, hence, the miRNA with a highest MED score possess higher importance while estimating the survival time. The list of miRNAs of the signature and corresponding MED scores are shown in [85]Supplementary Table S1. 3.2. Survival validation of the miRNAs The miRNAs of the signature were validated using KM survival analysis. Survival outcome was explored considering p-values. A p-value < 0.05 was considered to be statistically significant. Five of the miRNA signature, hsa-miR-3605-3p, hsa-miR-629-3p, hsa-miR-3127-5p, hsa-miR-497-5p, and hsa-miR-374a-5p, were significantly associated with prognosis in patients with HNSC. These five miRNAs, hsa-miR-3605-3p, hsa-miR-629-3p, hsa-miR-3127-5p, hsa-miR-497-5p, and hsa-miR-374a-5p, obtained p-values of 0.046, 0.048, 0.048, 0.0095, and 0.044, respectively. KM survival curves for these five miRNAs are shown in [86]Fig. 2c. 3.3. Biological pathway analysis of the miRNA signature Biological significance of the identified miRNA signature was analyzed using KEGG pathways. The identified miRNA signature enriched in various pathways, the more significant enriched pathways (P < 0.005) including, fatty acid biosynthesis, lysine degradation, hippo signaling pathway, adherens junction, proteoglycans in cancer, fatty acid metabolism, ECM-receptor interaction, pathways in cancer, prostate cancer, glioma, steroid biosynthesis, and p53 signaling pathway. The identified miRNA signature involved in fatty acid biosynthesis (p < E−365), lysine degradation (p = 1.14E−13), hippo signaling pathway (p = 4.01E−13), adherens junction (p = 1.12E−12), proteoglycans in cancer (p = 1.98E−12), fatty acid metabolism (p = 8.20E−09), ECM-receptor interaction (p = 6.70E−08), pathways in cancer (p = 0.0006), prostate cancer (p = 0.0008), glioma (p = 0.0011), steroid biosynthesis (p = 0.0045), and p53 signaling pathway (p = 0.007) and targets 2, 20, 63, 38, 88, 9, 16, 124, 45, 28, 3, and 38 genes, respectively. The details of KEGG pathways and number of target genes are described in [87]Table 2. The KEGG pathway enrichment analysis is depicted in [88]Supplementary Fig. S2. Table 2. The miRNA signature involved in significant KEGG pathways. KEGG pathway miRNAs Target genes p-value Proteoglycans in cancer 20 112 1.211E−16 Adherens junction 20 49 2.932E−12 Viral carcinogenesis 20 104 7.936E−09 Hepatitis B 19 78 2.281E−07 Protein processing in endoplasmic reticulum 21 95 8.18E−07 TGF-beta signaling pathway 17 47 9.288E−07 Pancreatic cancer 19 42 9.288E−07 Bacterial invasion of epithelial cells 20 48 9.288E−07 Hippo signaling pathway 21 72 1.983E−06 Cell cycle 21 70 4.562E−06 Colorectal cancer 20 41 6.81E−06 Prostate cancer 21 56 1.037E−05 Estrogen signaling pathway 20 55 1.306E−05 Acute myeloid leukemia 17 38 1.677E−05 Glycosaminoglycan biosynthesis - keratan sulfate 12 10 2.874E−05 Pathways in cancer 21 184 3.088E−05 Focal adhesion 21 108 3.515E−05 Fatty acid biosynthesis 9 5 4.006E−05 Oocyte meiosis 17 59 4.028E−05 Glioma 19 37 4.028E−05 Endocytosis 21 100 4.028E−05 p53 signaling pathway 19 43 6.164E−05 N-Glycan biosynthesis 14 27 6.176E−05 Renal cell carcinoma 20 43 7.457E−05 Neurotrophin signaling pathway 21 67 7.712E−05 Chronic myeloid leukemia 18 45 7.842E−05 Shigellosis 20 39 8.722E−05 Prolactin signaling pathway 18 42 0.0001111 Non-small cell lung cancer 17 33 0.0001449 mTOR signaling pathway 18 38 0.000156 FoxO signaling pathway 20 72 0.0001875 Lysine degradation 17 25 0.0001893 Small cell lung cancer 19 49 0.0003562 Sphingolipid signaling pathway 19 60 0.0003722 Ubiquitin mediated proteolysis 19 74 0.0003919 Endometrial cancer 19 32 0.0004231 Insulin signaling pathway 21 74 0.0004354 AMPK signaling pathway 20 67 0.0004477 ErbB signaling pathway 21 49 0.0004477 HIF-1 signaling pathway 20 58 0.0006434 Thyroid cancer 17 19 0.0007077 Signaling pathways regulating pluripotency of stem cells 20 72 0.0007951 [89]Open in a new tab 3.4. Gene ontology annotations GO annotations of the miRNA signature was analyzed in terms of biological processes, cellular components and molecular functions. The miRNA signature involved more significant biological pathways were epidermal growth factor receptor signaling pathway, fibroblast growth factor receptor signaling pathway, RNA metabolic process, DNA metabolic process, transcription, DNA-templated, mRNA metabolic process, immune system process, Fc-epsilon receptor signaling pathway, nucleobase-containing compound catabolic process, and Fc-gamma receptor signaling pathway involved in phagocytosis, by targeting 2120 genes. The miRNA signature involved in more significant cellular components were protein complex, nucleoplasm, cytosol, organelle, microtubule organizing center, and focal adhesion, by targeting 6887 genes. The miRNA signature involved in various molecular functions, the most significant molecular functions includes, enzyme regulator activity, protein binding transcription factor activity, nucleic acid binding transcription factor activity, poly(A) RNA binding, enzyme binding, RNA binding, ion binding, cytoskeletal protein binding, transcription corepressor activity, and transcription factor binding, by targeting 5098 genes. The details of the miRNA signature involved in GO annotations are listed in [90]Supplementary Table S2. The GO enrichment analysis is shown in [91]Supplementary Fig. S3. 3.5. Identifying miRNA signature gene targets To investigate the possible genes targeted by the miRNA signature, we constructed a miRNA-target interaction network using Cytoscape v3.7.2. The miRNA-target network was constructed using predicted and validated targets by TargetScan Homo sapiens version 6.2 and miRTarBase v4.4 and MicroCosm v5. In which, miRTarBase collects target genes experimentally validated by western blot, microarrays, reporter assay, and next generation sequencing experiments [[92]42]. TargetScan predicts biological targets of miRNAs [[93]43], and MicroCosm contains computationally predicted targets for microRNAs, which are obtained from the miRBase sequence database and EnsEMBL.There are totally 20,832 target interactions in the network, in which TargetScan, miRTarBase and MicroCosm predicted and validated targets were 8313, 195, 12,324, respectively. For the better visualization overlap threshold was adjusted and in the constructed miRNA-target network ([94]Fig. 3A), TargetScan, miRTarBase and MicroCosm targeted interactions were 1019, 107, and 981, respectively. Fig. 3. [95]Fig. 3 [96]Open in a new tab (A) miRNA-Target interaction using Cytoscape. The target genes of the miRNA signature were predicted using the miRTarBase, TargetScan, and MicroCosm. In this network, microRNAs and target genes are shown as red circles and pink rounded hexagons, respectively. The predicted microRNA–target interactions were visualized in orange, blue and violet using CyTargetLinker. (B) gene enrichment analysis predictions using miRTarBase, and (C) gene enrichment analysis predictions using TargetScan. (For interpretation of the references to colour in this