Abstract
Head and neck carcinoma (HNSC) is often diagnosed at advanced stage,
incurring poor patient outcome. Despite of advances in chemoradiation
and surgery approaches, limited improvements in survival rates of HNSC
have been observed over the last decade. Accumulating evidences have
demonstrated the importance of microRNAs (miRNAs) in carcinogenesis. In
this context, we sought to identify a miRNA signature associated with
the survival time in patients with HNSC. This study proposed a survival
estimation method called HNSC-Sig that identified a miRNA signature
consists of 25 miRNAs associated with the survival in 133 patients with
HNSC. HNSC-Sig achieved 10-fold cross validation a mean correlation
coefficient and a mean absolute error of 0.85 ± 0.01 and 0.46 ± 0.02
years, respectively, between actual and estimated survival times. The
survival analysis revealed that five miRNAs, hsa-miR-3605-3p,
hsa-miR-629-3p, hsa-miR-3127-5p, hsa-miR-497-5p, and hsa-miR-374a-5p,
were significantly associated with prognosis in patients with HNSC.
Comparing the relative expression difference of top 10 prioritized
miRNAs, eight miRNAs, hsa-miR-629-3p, hsa-miR-3127-5p, hsa-miR-221-3p,
hsa-miR-501-5p, hsa-miR-491-5p, hsa-miR-149-3p, hsa-miR-3934-5p, and
hsa-miR-3170, were significantly expressed between cancer and normal
groups. In addition, biological relevance, disease association, and
target interactions of the miRNA signature were discussed. Our results
suggest that identified miRNA signature have potential to serve as
biomarker for diagnosis and clinical practice in HNSC.
Keywords: Machine learning, miRNA signature, Cancer survival, Head and
neck cancer
1. Introduction
Head and neck squamous cell carcinomas (HNSCs) are a heterogenic
malignancies that can involve multiple cellular origins and sites
within the head and neck region. HNSCs are the most common malignancy
accounting for nearly 430,000 deaths worldwide each year [[25]1]. The
major risk factors associated with HNSCs are including, tobacco,
alcohol consumption [[26]2] and human papillomavirus (HPV) infection
[[27]3]. Treatment of HNSCs involves chemotherapy, radiotherapy and
surgical eradication. However, the heterogeneity of HNSCs, diversity of
treatment approaches and individual patient response to these
variables, make it difficult to define the outcome of the treatment.
Though, advancements in the treatment strategies, concerning aged
patients and individual response, these modalities affects quality of
life [[28]4,[29]5] and radiotherapy resistance remains a major
challenge for HNSC treatment. The toxic nature of treatment and poor
survival rates, highlighting the need for effective therapies using the
specific biomarkers to improve the treatment outcome.
MicroRNA (miRNA) brings new insights to cancer pathobiology and
involves in the regulation of gene expression [[30]6]. MiRNAs take part
in many processes that are crucial for cancer progression, such as
proliferation, migration and apoptosis. Several studies demonstrated
that miRNA expression is dysregulated in human cancers [[31]7,[32]8].
There are various studies that have identified miRNAs for the
prediction of progression in HNSC and explored the roles of miRNAs as
biomarkers in cancer detection and prognosis. Upregulation of miR-21,
miR-200c and miR-34a and downregulation of miR-375 was observed in
tumors of HNSCs [[33]9]. Inhibition of miR-16 suppress cell migration
in laryngeal cancer cell line and targets zyxin and promotes cell
mobility [[34]10]. The expression of miR-375 is associated with
localization of the tumors and alcohol consumption in patients with
laryngeal squamous cell carcinoma [[35]11]. Hsa-miR-21 expression was
consistently reported to be significantly associated with poor survival
in HNSCs [[36]12]. A meta-analysis on HNSCs identified significant
miRNAs that are up and downregulated in HNSCs and decreased expression
of 26 miRNAs were associated with poor survival [[37]13]. Decreased
expression level of serum miR-9 was found to be associated with poor
prognosis in patients with oral squamous cell carcinoma [[38]14].
Altered expression of miRNAs were used to predict recurrence in
patients with HNSC [[39]15]. Expression levels of hsa-miR-205,
hsa-let-7d and hsa-miR-616 were associated with poor survival in
patients with HNSC [[40]16,[41]17]. Some miRNAs are identified as
potential biomarkers in HNSC [[42]18]. Additionally, there are some
miRNAs that are associated with risk factors and treatment modalities
of HNSCs, such as HPV associated miRNAs [[43]19] and radiotherapy
associated extra cellular miRNAs [[44]20]. Therefore, miRNAs have been
using as prognostic biomarker in HNSC.
Computational models have been using for prediction of prognosis in
HNSCs. For instance, Bryce et al. developed an artificial neural
network (ANN) model to predict survival in patients with HNSCs using
clinical factors [[45]21]. Recently, Jurmeister et al. developed
machine learning-based models to distinguish lung cancer and HNSCs
using DNA methylation data, in which support vector machine (SVM)-based
model obtained a promising predictive accuracy for classifying HNSCs
[[46]22]. Reddy et al. utilized Random Forest, gradient boosted
decision trees, and logistic regression models to predict the acute
radiation toxicities in patients with HNSC [[47]23]. Jiang et al. used
ridge logistic regression to predict xerostomia after radiation therapy
in patients with HNSC [[48]24]. A set of machine learning models were
used to detect the cancer in surgical specimens from 36 patients with
HNSC [[49]25]. A deep learning based model has been used to detect HNSC
from computed tomography (CT) scan images [[50]26]. A convolutional
neural network-based frame work was utilized to predict patient outcome
based on their CT images [[51]27]. However, machine learning models
based on miRNA expression profiles for estimating survival in HNSC are
limited. Accordingly, our purpose is to utilize machine learning model
to identify a miRNA signature that associated with survival patients
with HNSC.
In this study, we utilized an optimized support vector regression (SVR)
model incorporated with inheritable bi-objective combinatorial genetic
algorithm (IBCGA) as a feature selection algorithm to estimate the
survival time as well as identify a miRNA signature that is associated
with survival in patients with HNSC. The optimization technique was
adopted from our previous work [[52][28], [53][29], [54][30]]. The
major contributions of this work includes identifying a set of survival
associated miRNA signature; robust feature set selection using feature
appearance score (FAS); survival validation of the identified miRNAs
using Kaplan-Meier (KM) survival analysis; biological relevance of the
miRNAs was analyzed using Kyoto Encyclopedia of Genes and Genomes
(KEGG) and Gene Ontology (GO) annotations; and relative miRNA
expression difference and coexpression analysis of the miRNAs in HNSC
and their importance in cancers were discussed further.
The manuscript is organized as follows: Section [55]1 provides
background and related work, Section [56]2 describes the dataset,
development of the HNSC-Sig method, and the bioinformatics tools
employed in the study. Section [57]3 covers the identification of the
miRNA signature for estimating survival time in HNSC patients,
biological pathway analysis of the miRNA signature, gene ontology
annotations, gene targets, miRNA disease association, and expression
difference analysis. Section [58]4 discusses the roles of the
identified miRNA signature in HNSC, the potential and limitations of
utilizing miRNA signatures in cancer diagnosis and prognosis, and
concludes with final remarks.
2. Materials and methods
2.1. Dataset
Initially, the dataset consisted of 523 HNSC samples were retrieved
from TCGA database, and the miRNA profiling was performed using the
Illumina HiSeq 2000 miRNA sequencing platform. We filtered the dataset
based on the availability of survival information and miRNA expression
profiling. Finally, the dataset consisted of 133 patients with 498
miRNA expression profiles of HNSC were further considered for the
analysis.
2.2. HNSC-Sig
We establish the HNSC-Sig method based on SVR incorporated with optimal
feature selection algorithm called IBCGA. The optimization technique
has been successfully applied in various cancer survival estimations
[[59]28,[60][31], [61][32], [62][33], [63][34], [64][35]]. HNSC-Sig was
designed to identify a survival associated miRNA signature as well as
estimate the survival time in patients with HNSC. Recent years, SVMs
playing increasingly prominent roles in solving several biological
problems, especially in cancer prognosis and survival predictions
[[65]36]. The SVR implemented in this study was used from the LibSVM
package [[66]37].
2.2.1. An optimal feature selection method IBCGA
Feature selection algorithms are crucial in reducing dimensionality
when working with datasets that have a large number of features. Gu et
al. proposed a new method for transient stability assessment of power
systems, combining kernelized fuzzy rough sets and the memetic
algorithm [[67]38]. Li et al. proposed a novel OS-extreme learning
machine with binary Jaya-based feature selection [[68]39]. In this
study, we used optimal feature selection algorithm IBCGA, which is
potential at solving large parameter optimization problems and can
select minimum set of miRNAs as signature from a large number of miRNA
expression profiles (n = 498) while maximizing the prediction
performance. Here, we used correlation coefficient (CC) [Equation
[69](1)] and mean absolute error (MAE) as the estimation measures to
evaluate the prediction performance.
[MATH: CC=∑i=1N(xi−x‾)(yi−y‾)⌊∑i=1N(xi−x‾)2⌋<
mrow>[∑i=1N(yi−y‾)2] :MATH]
(1)
where x[i] and y[i] are actual and are the predicted survival times of
the ith miRNA,
[MATH: x‾
:MATH]
and
[MATH: y‾
:MATH]
are the corresponding means, and N is the total number of patients in
the cohort. The MAE is defined in Equation [70](2).
[MATH: MAE=1N∑i=1N|yi−xi|2 :MATH]
(2)
Here, we used genetic algorithm (GA) terminology to represents
chromosomes and genes. The chromosome of IBCGA comprises 498 genes and
three 4-bit genes for encoding, γ, C, and ν for the ν-SVR. The encoded
chromosomes were designed as described in previous studies
[[71]36,[72]40]. The IBCGA can simultaneously obtain a set of
solutions, X[r], where r = r[start], r[start + 1] … r[end] in a single
run. The detailed steps involved in IBCGA can be found in Supplementary
method.
2.2.2. Robust feature set selection
The robust set of miRNA signature among 30 independent runs in HNSC-Sig
has the highest FAS obtained using the following procedure [Equation
[73](3)].
* Step 1: Perform N independent runs of HNSC-Sig by maximizing
accuracy of 10-CV for obtaining N miRNA signatures. There are
[MATH: Pa :MATH]
features in the a-th signatures, a = 1, …, N.
* Step 2: FAS is calculated as follows:
+ a)
Calculate the Feature appearance score f(a) for each feature
‘a’ that ever presents in the N sets of miRNAs.
+ b)
Calculate the score P[a], t = 1, …, N where s[i] is the i-th
feature in the a-th solution:
[MATH: Fa=∑i
=1mt<
/msub>f(si)/pa :MATH]
(3)
* Step 3: Output the a-th feature set with the highest appearance
score F[a].
2.3. Biological significance analysis
KEGG pathway GO annotation analysis was performed using DIANA tools.
The DIANA-microT web server provided the predicted miRNA targets for
the pathway analysis. The p-value threshold was set to 0.05 and
Fishers’s exact test (hypergeometric distribution) was used for the
enrichment analysis.
2.4. miRNA-target interaction network
We constructed the miRNA-target interaction network using Cytoscape
v3.7.2. For the prediction of target genes miRTarBase and TargetScan
predictions were used presented in the CyTargetLinker application of
the Cytoscape. The interaction threshold was adjusted to two units.
3. Results
3.1. MiRNA signature selection and survival estimation
The dataset consisted of 133 patients with HNSC from The cancer genome
atlas (TCGA) database. The optimized method HNSC-sig was incorporated
with feature selection algorithm IBCGA to identify a set of survival
miRNAs in patients with HNSC. HNSC-Sig identified an average 31 miRNAs
and achieved a 10-fold cross-validation (10-CV) mean correlation
coefficient and mean absolute error of 0.85 ± 0.01 and 0.46 ± 0.02
years, respectively, between actual and predicted survival time. We
performed 30 independent runs of HNSC-sig to select the robust feature
set. The system flow-chart of HNSC-Sig is depicted in [74]Fig. 1. To
select a robust feature set, FAS was measured for each independent run
of the HNSC-Sig. The highest FAS indicates that frequency of these
features are higher compared to the lower FAS. The highest FAS of
HNSC-sig was 8.48 and the 25 miRNAs were selected as a signature.
HNSC-sig with highest FAS obtained a correlation and a mean absolute
error of 0.86 and 0.42, respectively, between actual and predicted
survival time. The FAS for each independent run is shown in [75]Fig.
2a.
Fig. 1.
[76]Fig. 1
[77]Open in a new tab
System flow chart. Step 1: collection of miRNA expression data from
TCGA data portal, Step: HNSC-Sig methodology, and Step 3: miRNA
signature discovery by estimation of survival time, KM-survival curves,
KEGG and GO annotation analysis, target network analysis, and
expression difference analysis.
Fig. 2.
[78]Fig. 2
[79]Open in a new tab
Robust miRNA signature identification. (A) Robust signature selection
using Feature appearance score (FAS) measurement for each independent
run of HNSC-Sig. The highest FAS obtained is 8.48 at the 9th
independent run, (B) estimation performance of the robust miRNA
signature, and (C) KM survival plots for hsa-miR-3605-3p,
hsa-miR-629-3p, hsa-miR-3127-5p, hsa-miR-497-5p, and hsa-miR-374a-5p.
The prediction performance of HNSC-Sig was compared with some
regression methods, such as the least absolute shrinkage and selection
operator (LASSO) and elastic net. HNSC-Sig exhibited superior
prediction performance compared to LAASO and elastic net methods. The
comparison of prediction performance results are shown in [80]Table 1.
The prediction performance of HNSC-Sig is shown in [81]Fig. 2b.
Correlation plots for HNSC-sig, LASSO, and elastic net are shown in
[82]Supplementary Fig. S1.
Table 1.
HNSC-Sig prediction performance comparison.
Method Correlation coefficient Mean absolute error in years Features
selected
LASSO 0.62 0.69 22
Elastic net 0.82 0.53 55
HNSC-Sig 0.88 0.43 29
HNSC-Sig-FAS 0.86 0.42 25
HNSC-Sig-Mean 0.85 ± 0.01 0.46 ± 0.02 31.70 ± .75
[83]Open in a new tab
The miRNAs of the signature were ranked based on main effect difference
(MED) analysis [[84]41]. The MED analysis gives score for each miRNA
based on its estimation capability, hence, the miRNA with a highest MED
score possess higher importance while estimating the survival time. The
list of miRNAs of the signature and corresponding MED scores are shown
in [85]Supplementary Table S1.
3.2. Survival validation of the miRNAs
The miRNAs of the signature were validated using KM survival analysis.
Survival outcome was explored considering p-values. A p-value < 0.05
was considered to be statistically significant. Five of the miRNA
signature, hsa-miR-3605-3p, hsa-miR-629-3p, hsa-miR-3127-5p,
hsa-miR-497-5p, and hsa-miR-374a-5p, were significantly associated with
prognosis in patients with HNSC. These five miRNAs, hsa-miR-3605-3p,
hsa-miR-629-3p, hsa-miR-3127-5p, hsa-miR-497-5p, and hsa-miR-374a-5p,
obtained p-values of 0.046, 0.048, 0.048, 0.0095, and 0.044,
respectively. KM survival curves for these five miRNAs are shown in
[86]Fig. 2c.
3.3. Biological pathway analysis of the miRNA signature
Biological significance of the identified miRNA signature was analyzed
using KEGG pathways. The identified miRNA signature enriched in various
pathways, the more significant enriched pathways (P < 0.005) including,
fatty acid biosynthesis, lysine degradation, hippo signaling pathway,
adherens junction, proteoglycans in cancer, fatty acid metabolism,
ECM-receptor interaction, pathways in cancer, prostate cancer, glioma,
steroid biosynthesis, and p53 signaling pathway. The identified miRNA
signature involved in fatty acid biosynthesis (p < E−365), lysine
degradation (p = 1.14E−13), hippo signaling pathway (p = 4.01E−13),
adherens junction (p = 1.12E−12), proteoglycans in cancer
(p = 1.98E−12), fatty acid metabolism (p = 8.20E−09), ECM-receptor
interaction (p = 6.70E−08), pathways in cancer (p = 0.0006), prostate
cancer (p = 0.0008), glioma (p = 0.0011), steroid biosynthesis
(p = 0.0045), and p53 signaling pathway (p = 0.007) and targets 2, 20,
63, 38, 88, 9, 16, 124, 45, 28, 3, and 38 genes, respectively. The
details of KEGG pathways and number of target genes are described in
[87]Table 2. The KEGG pathway enrichment analysis is depicted in
[88]Supplementary Fig. S2.
Table 2.
The miRNA signature involved in significant KEGG pathways.
KEGG pathway miRNAs Target genes p-value
Proteoglycans in cancer 20 112 1.211E−16
Adherens junction 20 49 2.932E−12
Viral carcinogenesis 20 104 7.936E−09
Hepatitis B 19 78 2.281E−07
Protein processing in endoplasmic reticulum 21 95 8.18E−07
TGF-beta signaling pathway 17 47 9.288E−07
Pancreatic cancer 19 42 9.288E−07
Bacterial invasion of epithelial cells 20 48 9.288E−07
Hippo signaling pathway 21 72 1.983E−06
Cell cycle 21 70 4.562E−06
Colorectal cancer 20 41 6.81E−06
Prostate cancer 21 56 1.037E−05
Estrogen signaling pathway 20 55 1.306E−05
Acute myeloid leukemia 17 38 1.677E−05
Glycosaminoglycan biosynthesis - keratan sulfate 12 10 2.874E−05
Pathways in cancer 21 184 3.088E−05
Focal adhesion 21 108 3.515E−05
Fatty acid biosynthesis 9 5 4.006E−05
Oocyte meiosis 17 59 4.028E−05
Glioma 19 37 4.028E−05
Endocytosis 21 100 4.028E−05
p53 signaling pathway 19 43 6.164E−05
N-Glycan biosynthesis 14 27 6.176E−05
Renal cell carcinoma 20 43 7.457E−05
Neurotrophin signaling pathway 21 67 7.712E−05
Chronic myeloid leukemia 18 45 7.842E−05
Shigellosis 20 39 8.722E−05
Prolactin signaling pathway 18 42 0.0001111
Non-small cell lung cancer 17 33 0.0001449
mTOR signaling pathway 18 38 0.000156
FoxO signaling pathway 20 72 0.0001875
Lysine degradation 17 25 0.0001893
Small cell lung cancer 19 49 0.0003562
Sphingolipid signaling pathway 19 60 0.0003722
Ubiquitin mediated proteolysis 19 74 0.0003919
Endometrial cancer 19 32 0.0004231
Insulin signaling pathway 21 74 0.0004354
AMPK signaling pathway 20 67 0.0004477
ErbB signaling pathway 21 49 0.0004477
HIF-1 signaling pathway 20 58 0.0006434
Thyroid cancer 17 19 0.0007077
Signaling pathways regulating pluripotency of stem cells 20 72
0.0007951
[89]Open in a new tab
3.4. Gene ontology annotations
GO annotations of the miRNA signature was analyzed in terms of
biological processes, cellular components and molecular functions. The
miRNA signature involved more significant biological pathways were
epidermal growth factor receptor signaling pathway, fibroblast growth
factor receptor signaling pathway, RNA metabolic process, DNA metabolic
process, transcription, DNA-templated, mRNA metabolic process, immune
system process, Fc-epsilon receptor signaling pathway,
nucleobase-containing compound catabolic process, and Fc-gamma receptor
signaling pathway involved in phagocytosis, by targeting 2120 genes.
The miRNA signature involved in more significant cellular components
were protein complex, nucleoplasm, cytosol, organelle, microtubule
organizing center, and focal adhesion, by targeting 6887 genes. The
miRNA signature involved in various molecular functions, the most
significant molecular functions includes, enzyme regulator activity,
protein binding transcription factor activity, nucleic acid binding
transcription factor activity, poly(A) RNA binding, enzyme binding, RNA
binding, ion binding, cytoskeletal protein binding, transcription
corepressor activity, and transcription factor binding, by targeting
5098 genes. The details of the miRNA signature involved in GO
annotations are listed in [90]Supplementary Table S2. The GO enrichment
analysis is shown in [91]Supplementary Fig. S3.
3.5. Identifying miRNA signature gene targets
To investigate the possible genes targeted by the miRNA signature, we
constructed a miRNA-target interaction network using Cytoscape v3.7.2.
The miRNA-target network was constructed using predicted and validated
targets by TargetScan Homo sapiens version 6.2 and miRTarBase v4.4 and
MicroCosm v5. In which, miRTarBase collects target genes experimentally
validated by western blot, microarrays, reporter assay, and next
generation sequencing experiments [[92]42]. TargetScan predicts
biological targets of miRNAs [[93]43], and MicroCosm contains
computationally predicted targets for microRNAs, which are obtained
from the miRBase sequence database and EnsEMBL.There are totally 20,832
target interactions in the network, in which TargetScan, miRTarBase and
MicroCosm predicted and validated targets were 8313, 195, 12,324,
respectively. For the better visualization overlap threshold was
adjusted and in the constructed miRNA-target network ([94]Fig. 3A),
TargetScan, miRTarBase and MicroCosm targeted interactions were 1019,
107, and 981, respectively.
Fig. 3.
[95]Fig. 3
[96]Open in a new tab
(A) miRNA-Target interaction using Cytoscape. The target genes of the
miRNA signature were predicted using the miRTarBase, TargetScan, and
MicroCosm. In this network, microRNAs and target genes are shown as red
circles and pink rounded hexagons, respectively. The predicted
microRNA–target interactions were visualized in orange, blue and violet
using CyTargetLinker. (B) gene enrichment analysis predictions using
miRTarBase, and (C) gene enrichment analysis predictions using
TargetScan. (For interpretation of the references to colour in this