Abstract

   MicroRNAs (miRNAs) are pivotal biomarkers for cancer screening.
   Identifying distinctive expression patterns of miRNAs in specific
   cancer types can serve as an effective strategy for classification and
   characterization. However, the development of a minimal signature of
   miRNAs for accurate cancer classification remains challenging, hindered
   by the lack of integrated approaches that systematically analyse miRNA
   expression levels alongside their associated biological
   pathways. In this study, we present a comprehensive integrative
   approach that utilizes transcriptomic data from lung, breast, and
   melanoma cancer cell lines to identify specific expression patterns. By
   combining bioinformatics, dimensionality reduction techniques, machine
   learning, and experimental validation, we pinpoint miRNAs linked to
   critical biological pathways. Our results demonstrate a highly
   significant differentiation of cancer types, achieving 100 %
   classification accuracy with minimal training time using a streamlined
   miRNA signature. Validation of the miRNA profile confirms that each of
   the three identified miRNAs regulates distinct biological pathways with
   minimal overlap. This specificity highlights their unique roles in
   tumour biology and sets the stage for further exploration of miRNA
   interactions and their contributions to tumourigenesis across diverse
   cancer types. Our work paves the way for multi-cancer classification,
   emphasizing the transformative potential of miRNA research in oncology.
   Beyond advancing the understanding of tumour biology, our step-by-step
   guide offers a robust tool for a wide range of users to investigate
   precise diagnostics and promising therapeutic strategies.

   Keywords: MicroRNA, Omics, RNA-seq, Machine learning classifiers,
   Cancer cells


1. Introduction

   MicroRNAs (miRNAs) are small (∼18–25 nucleotides), non-coding RNA
   molecules that play essential roles in cellular regulation at both
   post-transcriptional and translational levels [30][1], [31][2].
   Approximately 3 % of human genes encode miRNAs [32][2], underscoring
   their significance in fundamental biological processes. Through the RNA
   interference pathway, miRNAs exert their effects by binding to the 3’
   untranslated region (UTR) of target messenger RNAs (mRNAs), leading to
   either mRNA degradation or translational suppression [33][3]. This
   regulation influences crucial cellular functions, including cell cycle
   control, proliferation, apoptosis, and stress responses [34][4],
   [35][5]. Interestingly, the regulatory role of miRNAs can be compared
   to a ‘blockchain’, where networks maintain secure and precise
   information flow.

   In particular, miRNA dysregulation has been implicated in various
   cancers, with specific miRNAs exhibiting preferential expression in
   distinct tumour types [36][6], [37][7]. Advances in omics research have
   facilitated the identification of diagnostic biomarkers and complex
   biomolecular networks that shed light on cancer mechanisms [38][8],
   [39][9], [40][10]. miRNAs are central to these networks, acting as key
   regulators of gene expression and influencing cellular functions
   critical to cancer biology [41][11], [42][12], [43][13]. Numerous
   specialized databases provide extensive information on miRNA functions,
   interaction networks, and disease associations, significantly advancing
   our understanding of their roles and applications [44][14], [45][15],
   [46][16].

   Pivotal studies demonstrate that a high-panel miRNA classifier (e.g.,
   48 miRNAs) can classify tumours based on tissue origin with 90 %
   accuracy, showcasing the potential of miRNAs as diagnostic tools
   [47][17], [48][18], [49][19]. However, this approach underscores the
   challenge of reducing model complexity while maintaining high
   predictive accuracy, which is crucial for clinical applications.
   Building on this foundation, the exploration of minimal miRNA
   signatures for cancer classification addresses the common need to
   balance simplicity and diagnostic precision [50][20].

   Despite the significant progress in miRNA research, the precise
   mechanisms and functions of miRNAs across different cancers remain
   unclear. Variations in miRNA expression patterns, combined with the
   complexity of biomolecular networks, present substantial challenges in
   achieving a unified understanding of their roles, particularly in the
   context of cancer [51][20]. This highlights the need for continued
   research to develop advanced methods for data interpretation and
   feature analysis, as well as to retrieve and identify critical
   molecular signatures. New approaches are essential for uncovering the
   specific functions of miRNAs across diverse cancer types and will
   contribute to a deeper understanding of cancer biology and the
   development of targeted therapies [52][21], [53][22].

   To address these challenges, the development and application of
   innovative integrated approaches are crucial, with machine learning
   (ML) emerging as a pivotal methodology [54][23], [55][24], [56][25],
   [57][26], [58][27], [59][28]. ML enables the analysis of complex omics
   datasets by applying feature selection and reduction techniques to
   identify the most informative features from large datasets. This
   process enhances model architecture, mitigates overfitting, and
   improves both interpretability and efficiency [60][28], [61][29],
   [62][30], [63][31]. Additionally, methods such as univariate analysis
   and principal component analysis (PCA), when combined with ML, have
   proven effective in identifying key features to differentiate cell
   types and providing comprehensive insights into pathological states.
   These advancements are essential for developing personalized therapies
   [64][32], [65][33], [66][34], [67][35], [68][36], [69][37], [70][38].

   Recent studies [71][39], [72][40] demonstrate that ML can significantly
   improve tumour classification accuracy compared to traditional methods.
   However, the variability in results highlights the need for
   higher-quality data, more interpretable models, and better integration
   of analytical outcomes. To ensure the applicability of these methods in
   clinical contexts, it is crucial to validate simplified ML models,
   ensuring good predictive accuracy while maintaining a focus on the
   essential biological aspects of cancer mechanisms [73][41], [74][42].
   Although miRNAs are recognized as useful cancer biomarkers, to our
   knowledge, no predictive methods exist to identify a minimal miRNA
   signature that can classify different cancer types, such as melanoma,
   lung, and breast cancer. In this regard, the identification of a
   minimal miRNA signature for cancer classification holds promise for
   improving early diagnosis and enhancing the efficiency and
   cost-effectiveness of targeted therapies [75][43], [76][44], [77][45],
   [78][46], [79][47], [80][48]. The identification of distinctive miRNAs
   is significant, feasible, and clinically relevant, especially if their
   specific target genes are also identified [81][49], [82][50]. This
   would allow the mapping of their regulatory networks and reveal
   differences or similarities among tumours, leading to a more precise
   and informative classification [83][51], [84][52], [85][53], [86][54],
   [87][55], [88][56], [89][57]. Integrating advanced ML approaches with
   miRNA expression data and molecular pathway analyses could provide a
   promising strategy for identifying minimal miRNA signatures that
   improve early diagnosis and guide targeted therapies for various types
   of cancer.

   Our work aims to develop a tool for analysing miRNAs as biomarkers for
   tumour classification, focusing on minimal miRNA signatures and
   leveraging advanced techniques such as machine learning and pathway
   analysis to improve predictive accuracy. To achieve this, we conducted
   a meta-analysis to identify miRNAs consistently dysregulated across
   multiple tumour studies. This was followed by pathway enrichment
   analysis using a miRNA-centric network visual analytics platform
   (miRNet) and KEGG (Kyoto Encyclopedia of Genes and Genomes) to
   determine the biological pathways impacted by the investigated miRNAs.
   By mapping the target genes of these miRNAs to KEGG pathways, we
   identified several significantly enriched pathways associated with
   tumourigenesis. These findings suggest that the identified miRNAs play
   a pivotal role in modulating crucial pathways, thereby contributing to
   cancer development and progression. We processed both publicly
   available datasets and laboratory-generated data to assess the
   predictive performance of the signature ([90]Fig. 1). Our study
   identified a miRNA signature—miR-103a-3p, miR-10b-5p, and
   miR-27b-3p—capable of distinguishing different cancer types with high
   analytical validity and 100 % classification accuracy; notably, even a
   single miRNA can classify the distinct cell classes with an accuracy
   of 94 %. Achieving such high accuracy in a reduced context demonstrates
   the robustness of our method and paves the way for promising results
   with larger and more diverse datasets. Furthermore, we demonstrated
   that each of these miRNAs has unique targets contributing to distinct
   biological pathways, highlighting specific miRNA interactions and
   uncovering unique mechanisms of tumourigenesis.

Fig. 1.


   Minimal miRNA signature investigation process. This study utilized
   three cancer cell lines as proof-of-concept models. Omics data obtained
   from the Gene Expression Omnibus (GEO) were processed through the
   miRDeep2 pipeline. Simultaneously, qRT-PCR was conducted in the
   laboratory to validate biologically significant miRNAs. These datasets
   were subsequently integrated into an ML-based classification system,
   which identified a minimal common miRNA signature capable of
   classifying the different cell lines.

   In this study, we use cell lines as a tool to refine and validate our
   methods, which will be applied to more complex biological samples.
   Despite the limitations of cell-line models in capturing the full
   complexity of human tumours, correlation studies have shown a strong
   relationship between cell lines and primary tumours [93][58], [94][59],
   [95][60], [96][61]. Additionally, we demonstrate that cell lines
   exhibit cellular heterogeneity, assessed using Shannon entropy [97][62]
   based on our data. This measure quantifies the variability in miRNA
   expression, providing insights into their molecular diversity, although
   it does not fully capture the heterogeneity of human tumours.

   In this context, as a proof-of-concept, we classified lung, breast, and
   melanoma cancer cell lines based on miRNA expression profiles following
   careful pre-filtering of miRNAs, where only those consistently
   expressed across all cells (A375, MCF7, and A549) were selected for
   analysis. This approach minimized noise and ensured the robustness of
   the results, improving the accuracy of the classifications.

   The novelty and significance of this article lie in its integrative
   approach, which combines machine learning (ML) techniques for feature
   classification and identification with the analysis of miRNA-regulated
   pathways to define a minimal miRNA signature. While our approach shows
   promise for future clinical applications, further validation is
   required to confirm its utility in real clinical samples from patients.
   Nevertheless, this contribution serves as a proof of concept, offering
   a practical and widely applicable approach for identifying a minimal
   miRNA signature for more efficient and cost-effective application of
   miRNAs as biomarkers in cancer management.

2. Materials and methods

2.1. RNAseq data analyses

   This proof-of-concept study focuses on miRNA expression in three human
   cancer cell lines: A375 (melanoma), A549 (lung cancer), and MCF-7
   (breast cancer) (see [98]Fig. 1, [99]Fig. 2B, and [100]Table S2). A
   database pipeline was optimized for miRNA read alignment to obtain
   reads per million (RPM) expression levels for different cell lines (see
   [101]Fig. 2A and [102]Table S3, as well as
   [103]https://github.com/David-UniNA/OMICSdataPIPELINE). First, raw
   reads were downloaded from the Gene Expression Omnibus (GEO)
   [[104]https://www.ncbi.nlm.nih.gov/geo] and Sequence Read Archive
   (SRA), a repository maintained by the National Center for Biotechnology
   Information (NCBI) [[105]https://www.ncbi.nlm.nih.gov/sra] [106][88],
   [107][89]. Note that we pre-filtered the omics data, retaining only
   miRNAs that were consistently expressed across all investigated cell
   lines.
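
   As a rough illustration of the RPM scaling and the pre-filtering step,
   the following Python sketch normalizes toy read counts and keeps only
   the miRNAs detected in every sample. The function names, the toy
   counts, the zero threshold, and the use of total mapped miRNA reads as
   the denominator are assumptions made for illustration; they are not
   taken from the published pipeline.

```python
def reads_per_million(counts):
    """Scale the raw read counts of one sample to reads per million (RPM).

    Assumption: the denominator is the total number of reads mapped to
    miRNAs in that sample.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {mirna: 1e6 * n / total for mirna, n in counts.items()}


def consistently_expressed(samples, min_rpm=0.0):
    """Return the miRNAs whose RPM exceeds min_rpm in every sample."""
    shared = None
    for rpm in samples.values():
        expressed = {m for m, value in rpm.items() if value > min_rpm}
        shared = expressed if shared is None else shared & expressed
    return sorted(shared or [])


# Hypothetical read counts for two runs (not real data).
raw = {
    "A375_run": {"miR-27b-3p": 500, "miR-10b-5p": 300, "miR-204-5p": 200},
    "MCF7_run": {"miR-27b-3p": 100, "miR-10b-5p": 900, "miR-204-5p": 0},
}
rpm = {run: reads_per_million(c) for run, c in raw.items()}
kept = consistently_expressed(rpm)
print(kept)
```

   In this toy case the miRNA with zero reads in one run is dropped, which
   mirrors the idea of restricting the analysis to miRNAs present in all
   investigated cell lines.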

Fig. 2.


   Classification outcome for omics dataset using the fine KNN classifier.
   A Comprehensive bioinformatics pipeline to process and analyse the
   omics data ([110]https://github.com/David-UniNA/OMICSdataPIPELINE). The
   GRCh38p14 reference genome was prepared using the bowtie tool. Publicly
   available datasets (SRA_RUN) were mapped to the reference genome with
   appropriate adapter sequences. Subsequently, reads per million (RPM)
   values were calculated for unique sequences mapped to hairpin and
   mature miRNAs. B Heatmap visualizing the expression patterns of
   consistently expressed miRNAs across all cell types. The dendrogram on
   top reveals distinct clusters within the omics dataset, highlighting
   potential subgroups ([111]Figs. S1 and [112]S4). C Scatter plot
   illustrates the relationship between the two most significant miRNA
   expressions ([113]Table S5). The confusion matrix demonstrates 100 %
   accuracy in classifying all 47 experiments. D Receiver operating
   characteristic (ROC) curve indicates 100 % specificity of the
   classification algorithm. E Parallel coordinate plot visualizes the
   z-scores of the 10 most significant miRNAs for each sample. Green, red,
   and blue colours represent A549, MCF7, and A375 omics data,
   respectively.

   Next, the miRDeep2 tool [114][90], which identifies miRNAs from deep
   sequencing data, was used to align public miRNA-seq reads from A375
   cells. In more detail (see [115]Fig. 2A), we downloaded the public
   experiments as ‘fastq’ files with the SRA toolkit 3.0.7
   [[116]https://github.com/ncbi/sra-tools] using the following command,
   ‘fastq-dump --split-files SRA_RUN’, where SRA_RUN was changed according
   to the experiment run name ([117]Table S2). Next, we used the mapping
   function of the miRDeep2 toolbox as follows,
   ‘mirdeep2-master/src/mapper.pl SRA_RUN.fastq -e -h -i -j -l 18 -k
   CUT_ADAPT -m -p Bowtie/B -q -s SRA_RUNreads.fa -t SRA_RUNreadsVSgenome
   -v -n’ (description in [118]Table S3). This step requires the adapter
   trimming sequence (reported in [119]Table S3) and a previously built
   bowtie index (bowtie-build 1.1.1), which was generated with the
   following command,
   ‘mirdeep2-master/essentials/bowtie-1.1.1/bowtie-build
   GRCh38.p14.genome.fa B’. Bowtie is a short-read aligner geared toward
   quickly aligning large sets of short DNA sequences to large genomes,
   such as the ‘GRCh38p14’ human genome assembly. The last step of the
   read alignment was the quantification of reads using the quantifier
   function of miRDeep2 together with the hairpin and mature database
   files from miRmine, with the following command,
   ‘mirdeep2-master/src/quantifier.pl -p hairpin.fa -m mature.fa -r
   SRA_RUNreads.fa -t hsa -k’ (description in [120]Table S4).
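
   When many runs are processed, the three commands above can be
   assembled programmatically. The Python sketch below only builds and
   prints the command lines for one run and does not execute them. The
   helper name pipeline_commands and its arguments are hypothetical; the
   flags, paths, and file names are copied from the commands quoted in the
   text.

```python
import shlex


def pipeline_commands(sra_run, adapter, bowtie_index="Bowtie/B"):
    """Return the download, mapping and quantification commands for one run.

    The commands are returned as argument lists and are not executed.
    """
    reads = f"{sra_run}reads.fa"
    download = ["fastq-dump", "--split-files", sra_run]
    mapping = [
        "mirdeep2-master/src/mapper.pl", f"{sra_run}.fastq",
        "-e", "-h", "-i", "-j", "-l", "18", "-k", adapter, "-m",
        "-p", bowtie_index, "-q", "-s", reads,
        "-t", f"{sra_run}readsVSgenome", "-v", "-n",
    ]
    quantify = [
        "mirdeep2-master/src/quantifier.pl",
        "-p", "hairpin.fa", "-m", "mature.fa",
        "-r", reads, "-t", "hsa", "-k",
    ]
    return [download, mapping, quantify]


# SRA_RUN and CUT_ADAPT are the placeholders used in the text.
for command in pipeline_commands("SRA_RUN", "CUT_ADAPT"):
    print(shlex.join(command))
```

   Keeping each command as an argument list makes it straightforward to
   hand it to a process runner later while avoiding shell quoting issues
   with adapter sequences or run names.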

2.2. Cancer miRNA meta-analysis and pathways network

   A systematic review and meta-analysis were conducted by searching
   PubMed for publications investigating diagnostic circulating miRNAs in
   three cancer types: melanoma, lung, and breast cancers [121][63],
   [122][64], [123][65], [124][66], [125][67], [126][68], [127][69],
   [128][70], [129][71], [130][72], [131][73], [132][74], [133][75],
   [134][76], [135][77], [136][78], [137][79], [138][80], [139][81],
   [140][82], [141][83], [142][84], [143][85], [144][86]. The miRNet
   interface was utilized to integrate miRNAs into a network, allowing
   visualization of relationships between selected miRNAs and their
   associated pathways. To identify target genes, the miRTarBase v8.0
   database was used [145][87]. A 4.0-degree filter and betweenness
   centrality were applied to all network nodes to exclude those with low
   centrality measures. Additionally, the shortest path filter and a
   minimum network configuration were used to retain only the essential
   connectivity signatures within the network. Enrichment analysis was
   performed using hypergeometric tests, with adjustments for the false
   discovery rate (FDR), based on the KEGG pathways. A protein-protein
   interaction (PPI) network was constructed following the conversion of
   gene names from [146]Table S14 into protein names using UniProt
   [[147]https://www.uniprot.org/]. The resulting protein list was
   uploaded to STRING: functional protein association networks
   [[148]https://string-db.org/], where K-means clustering was employed to
   identify predefined clusters based on protein interactions.
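
   The enrichment step can be illustrated with a small Python sketch of a
   one-sided hypergeometric test followed by a Benjamini-Hochberg FDR
   adjustment. This is a generic textbook formulation under the
   assumption that enrichment is tested as over-representation; the
   function names and the numbers are illustrative, and the exact
   procedure implemented in miRNet may differ.

```python
from math import comb


def hypergeom_pvalue(hits, targets, pathway_size, background):
    """Probability of observing at least `hits` pathway genes.

    `targets` genes are drawn from `background` genes, of which
    `pathway_size` belong to the pathway.
    """
    upper = min(targets, pathway_size)
    tail = sum(
        comb(pathway_size, k) * comb(background - pathway_size, targets - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(background, targets)


def benjamini_hochberg(pvalues):
    """Return FDR-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 3 pathways tested against the same target list.
raw_p = [hypergeom_pvalue(8, 50, 100, 2000),
         hypergeom_pvalue(3, 50, 120, 2000),
         hypergeom_pvalue(1, 50, 80, 2000)]
print(benjamini_hochberg(raw_p))
```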

2.3. Cell samples miRNAs expression levels

   Cell culture: MCF7 cells were plated at a density of 20,000 cells/cm^2
   in flasks of 75 and grown in culture medium (MEM, 10 % FBS, 1 %
   L-GLUT, 2 % Pen/Strep). A375 cells were plated at a density of 20,000
   cells/cm^2 in flasks of 75 and grown in culture medium (DMEM, 10 %
   FBS, 1 % L-GLUT, 2 % Pen/Strep). A549 cells were plated at a density of
   20,000 cells/cm^2 in flasks of 75 and grown in culture medium (HAM F12
   K, 10 % FBS, 1 % L-GLUT, 2 % Pen/Strep). Cell lines were obtained from
   the American Type Culture Collection (ATCC; Rockville, MD, USA). MEM:
   Minimum Essential Medium. FBS: Fetal Bovine Serum. PBS:
   Phosphate-Buffered Saline. L-GLUT: L-Glutamine. Pen/Strep:
   Penicillin/Streptomycin. DMEM: Dulbecco’s Modified Eagle’s Medium. HAM
   F12 K: Ham's F-12K (Kaighn's) Medium.

   miRNA extraction: for cellular miRNA extraction, the cells were
   pelleted and washed in PBS, and miRNAs were purified from the cell
   pellet using the mirVana™ miRNA isolation kit from ThermoFisher
   Scientific according to the manufacturer’s protocol. Before extraction,
   samples were spiked with 1 μL of 1 nM Cel-miR-39-3p synthetic miRNA to
   check the extraction and purification procedure. The extracts
   were measured using a Nanodrop 2000 spectrophotometer, and qRT-PCR was
   performed on RNA extracts from A375, MCF7 and A549 cells. For each
   miRNA, the qRT-PCR was performed in duplicate; data represent the mean
   of three different technical replicates.

   Reverse transcriptase and qRT-PCR: for the qRT-PCR assay, 2 µL of the
   sample at different concentrations was reverse transcribed using the
   TaqMan™ Advanced miRNA cDNA Synthesis Kit - ThermoFisher Scientific. In
   detail, the mature miRNA sequence was extended using 3’ poly-A tailing
   and 5’ ligation of an adaptor sequence. For cDNA synthesis, universal
   primers were used to recognize the universal sequences on both the 5’
   and 3’ extended ends of the mature miRNA. All mature miRNAs in the
   sample were reverse transcribed to cDNA. Then 5 µL of cDNA was used as
   a template for the TaqMan™ Advanced miRNA Assay; each cDNA was tested
   in triplicate. The TaqMan™ PCR was performed in a CFX-96 real-time thermal
   cycler (Bio-Rad Laboratories, Inc., Hercules, CA, USA) with the
   following protocol: 95 °C for 20 sec, followed by 40 cycles of 95 °C
   for 3 sec and 60 °C for 30 sec. We evaluated the expression of the
   following miRNAs: miR-191-5p, miR-21-5p, miR-25-3p, miR-93-5p,
   miR-363-5p, miR-16-5p, miR-362-5p, miR-210-3p, miR-27b-3p, miR-185-5p,
   miR-10b-5p, miR-106b-5p, miR-204-5p, miR-103a-3p.

2.4. Data analysis and statistical analysis

   All qRT-PCR results are presented as mean values from three independent
   experiments, with each miRNA tested in duplicate. The coefficient of
   variation (CV) was calculated for each miRNA in a single sample and
   considered non-informative if CV ≥ 4 %. To minimize analytical
   variability, data were normalized to the spike-in Cel-miR-39-3p to
   obtain ΔCt values (Ct of each miRNA − Ct of Cel-miR-39-3p). For further
   reliability, Ct-values were globally normalized to their mean value
   (Figure S6B).
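
   The two quantities defined above can be written out as a short Python
   sketch. The Ct readings are hypothetical, and the use of the sample
   standard deviation for the CV is an assumption, since the text does not
   state which estimator was used.

```python
from statistics import mean, stdev

CV_THRESHOLD = 4.0  # percent, the cutoff stated in the text


def coefficient_of_variation(ct_values):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * stdev(ct_values) / mean(ct_values)


def is_informative(ct_values):
    """A measurement is kept when its CV is below the 4 % cutoff."""
    return coefficient_of_variation(ct_values) < CV_THRESHOLD


def delta_ct(ct_mirna, ct_spike_in):
    """Delta Ct: mean Ct of the miRNA minus mean Ct of the spike-in."""
    return mean(ct_mirna) - mean(ct_spike_in)


# Hypothetical duplicate readings for one miRNA and the spike-in control.
ct_target = [25.0, 25.2]
ct_spike = [20.0, 20.2]
print(round(delta_ct(ct_target, ct_spike), 2), is_informative(ct_target))
```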

   An ML approach and hierarchical clustering were employed using MATLAB
   (R2023b, MathWorks). Feature scores were calculated using an
   ANOVA-based feature ranking algorithm to identify the most significant
   miRNAs for classification. All classification procedures were repeated
   three times.
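
   To make the idea of ANOVA-based feature ranking concrete, the Python
   sketch below computes a one-way ANOVA F statistic per miRNA and sorts
   the miRNAs by it. This is an illustrative stand-in, not the MATLAB
   routine used in the study: ranking directly by the F statistic is an
   assumption, and the feature names and values are invented.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(expression, labels):
    """Sort features from most to least class-discriminating.

    expression: {feature name: [one value per sample]}
    labels:     [class label per sample], in the same sample order
    """
    classes = sorted(set(labels))
    scores = {}
    for feature, values in expression.items():
        groups = [[v for v, lab in zip(values, labels) if lab == c]
                  for c in classes]
        scores[feature] = anova_f(groups)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical expression values for nine samples (three per cell line).
labels = ["A375"] * 3 + ["MCF7"] * 3 + ["A549"] * 3
expression = {
    "flat": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "informative": [1, 2, 3, 4, 5, 6, 7, 8, 9],
}
print(rank_features(expression, labels))
```

   For a fixed number of classes and samples, sorting by the F statistic
   gives the same order as sorting by the ANOVA p-value, so the sketch
   conveys the ranking principle even if the score is reported on a
   different scale.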

   Hierarchical clustering was performed to visualize relationships
   between samples based on miRNA expression profiles. A dendrogram was
   constructed using the 'average' linkage method and Euclidean distance
   metric. The optimal number of clusters was determined based on the
   dendrogram and the number of cell classes. The cutoff for the
   dendrogram was calculated as the median of the last two distances in
   the linkage matrix and visualized using the 'dendrogram' function
   ([149]Figs. S1 and [150]S2). Heatmaps of miRNA RPM values were
   generated using the 'clustergram' function in MATLAB, with row and
   column clustering and K-nearest neighbours imputation. To evaluate the
   performance of various ML algorithms, a 5-fold cross-validation
   approach was used. Classification accuracy was assessed using confusion
   matrices and parallel coordinate plots. All calculations were performed
   on a DELL Alien R9 computer with an i9–9900K CPU.
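
   The cross-validation scheme can be sketched in a few lines of Python.
   The example pairs a 5-fold split with a one-nearest-neighbour
   classifier using Euclidean distance, which is how we read the "fine
   KNN" setting; that reading, the unstratified fold assignment, and the
   toy data are assumptions, and the study itself used MATLAB.

```python
import math
import random


def nearest_label(train, query):
    """Predict the label of the closest training sample (1-NN)."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]


def cross_validated_accuracy(samples, folds=5, seed=0):
    """Mean held-out accuracy over `folds` folds.

    samples: list of (feature_vector, label) pairs.
    """
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    correct = 0
    for f in range(folds):
        test = shuffled[f::folds]  # every folds-th sample is held out
        train = [s for i, s in enumerate(shuffled) if i % folds != f]
        correct += sum(nearest_label(train, x) == y for x, y in test)
    return correct / len(shuffled)


# Hypothetical, well-separated two-feature profiles (five per cell line).
toy = [((centre + 0.1 * i, centre), name)
       for centre, name in [(0.0, "A375"), (10.0, "MCF7"), (20.0, "A549")]
       for i in range(5)]
print(cross_validated_accuracy(toy))
```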

   Cell heterogeneity was investigated with the Shannon diversity
   index (SDI), which measures the diversity in a dataset, accounting for
   both richness (number of cell types) and evenness (how evenly the
   individual experiments are distributed among the cell lines). The
   statistical difference between the three cell lines was tested with the
   Kruskal-Wallis test.
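
   For reference, the Shannon index of a set of non-negative abundances
   with proportions p_i is H = −Σ p_i ln p_i. The Python sketch below
   evaluates it for toy expression profiles; applying it to the RPM
   values of a single experiment is an assumption about how the index was
   used, and the numbers are invented.

```python
import math


def shannon_index(values):
    """Shannon diversity H = -sum(p * ln p) over the non-zero entries."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must contain a positive entry")
    proportions = [v / total for v in values if v > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical profiles: evenly spread versus dominated by one miRNA.
even_profile = [250.0, 250.0, 250.0, 250.0]
skewed_profile = [970.0, 10.0, 10.0, 10.0]
print(shannon_index(even_profile), shannon_index(skewed_profile))
```

   An even profile reaches the maximum value ln(n) for n entries, whereas
   a profile dominated by a single entry approaches zero, which is why the
   index is sensitive to both richness and evenness.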

3. Results and discussion

3.1. Omics data analysis using supervised classification approaches

   The main goal of this study is to accurately classify lung (A549),
   breast (MCF7), and melanoma (A375) cancer cells using a minimal common
   set of miRNAs. To ensure data quality and consistency, we
   applied a robust bioinformatic pipeline (miRDeep2, [151]Fig. 2A) for
   preprocessing and normalizing publicly available omics data. The
   heatmap of the processed data ([152]Fig. 2B) illustrates the expression
   patterns of consistently expressed miRNAs across the different cell
   types, aiding in the identification of potential biomarkers and
   classification signatures. In addition, the heatmap indicates high cell
   heterogeneity. Therefore, we performed an SDI analysis, which revealed
   an overlap between the MCF7 and A549 experiments ([153]Fig. S4).

   Furthermore, dendrogram analysis ([154]Fig. S1) identified distinct
   clusters within the omics dataset, highlighting the heterogeneity
   identified by SDI calculations ([155]Fig. S4). To address the
   challenges posed by this heterogeneity and achieve accurate
   classification, we applied several supervised ML approaches. Initially,
   we found that only the ensemble classifier with subspace k-nearest
   neighbours (KNN) and the medium neural network classifier achieved
   100 % prediction accuracy using all miRNAs ([156]Table S6). To reduce
   the feature space and identify the most informative miRNAs, we employed
   a feature ranking algorithm based on one-way ANOVA. By focusing on the
   top 10 miRNAs with the highest feature scores, we were able to maintain
   100 % classification accuracy using the ensemble classifier with
   subspace KNN ([157]Fig. 2C–E, [158]Table S7). This highlights the
   importance of feature selection in improving model performance.

   The limited number of experiments in the omics dataset can influence
   the performance of classification models. To address this, we conducted
   a thorough evaluation of different cross-validation folds (3−10) and
   testing data percentages (10–20 %). Because increasing the number of
   folds or the testing data percentage did not result in significant
   improvements in model performance, we selected k = 5 for
   cross-validation and 10 % for testing data, as these are commonly used
   values in classification tasks.

3.2. Detection of key miRNAs in cancer pathway within the omics dataset

   After obtaining a suitable classifier model for the omics data, we
   focused on identifying a miRNA classification signature to pinpoint
   critical biomarkers that are biologically relevant and directly
   involved in cancer processes. Our goal is to validate the selected
   miRNAs with top-ranked feature scores, confirming their role in
   cancer-associated biological pathways and their potential as biomarkers
   for tumour cell diagnosis and classification. To achieve this, we have
   integrated omics data with functional analysis to ensure that the
   chosen miRNA signature not only accurately identifies cancer types but
   also reflects the underlying biological mechanisms and relevant
   pathological pathways.

   To identify miRNAs consistently dysregulated across the studied cancer
   types, we performed a meta-analysis. KEGG pathway enrichment analysis
   was then used to map miRNAs to relevant biological pathways, with a
   focus on those involved in cancer progression such as cell cycle
   regulation, apoptosis, and proliferation.

   Using miRNet and miRTarBase v9.0 [159][84], we have constructed a
   focused miRNA-gene interaction network ([160]Fig. 3A) that highlights
   the interactions between miRNAs (represented as orange hexagons) and
   their target genes (represented as pink and blue circles). Central
   miRNAs, miR-16-5p, miR-103a-3p, and miR-25-3p, exhibit extensive
   connections, indicating their significant regulatory roles. These
   interaction patterns indicate that key miRNAs may serve as potential
   biomarkers or therapeutic targets while specific target genes are
   crucial nodes in cancer pathways, providing insights for future
   research and therapeutic strategies.

Fig. 3.


   Comprehensive analysis of miRNA interactions and their roles in cancer
   pathways. A miRNA-Gene Interaction Network Revealing Key Regulators in
   Cancer: This network highlights the interactions between 11 key miRNAs
   (orange hexagons) and their target genes (blue and pink circles). Blue
   circles represent genes identified as hits within cancer pathways,
   while pink circles represent other correlated genes. For detailed
   information, refer to [163]Tables S13–S15. B The Venn diagram
   illustrates the shared and unique target genes among the 14 selected
   miRNAs, indicating their coordinated regulatory roles in various
   pathways. C The bar graph represents the enrichment of target genes in
   different cancer-related pathways. The significance of enrichment is
   indicated by the -Log(p-value). D STRING interaction network among
   targets in cancer pathway ([164]Table S14). E The bar graph displays
   the ΔCt values of the 14 selected miRNAs across different cell lines
   (A375, MCF7, A549). These values indicate differential expression
   patterns among the cell lines.

   Following the identification of the target genes, we have created a
   Venn diagram to assess gene interactions ([165]Fig. 3B). The diagram
   reveals that most miRNAs share common target genes, with miR-16-5p
   notably overlapping with all selected miRNAs. This suggests a
   coordinated regulatory role among these miRNAs, potentially influencing
   key cellular functions such as proliferation, the cell cycle, or stress
   responses, within a complex network that may provide redundancy and
   flexibility in gene control. Additionally, this analysis could reveal
   links to specific diseases, pointing to potential biomarkers or
   therapeutic targets for further study.

   To further confirm the biological relevance of the selected miRNAs, we
   performed an enrichment analysis using the KEGG database. This analysis
   revealed significant associations with cancer-related pathways
   ([166]Fig. 3C, D and [167]Tables S13–S16). Notably, the ‘Pathways in
   Cancer’ term emerged as predominant and highly significant, with
   extremely low p-values, underscoring the strong involvement of the
   identified genes in oncological processes. Additionally, pathways like
   Cell Cycle and Focal Adhesion were also significantly enriched,
   suggesting potential impacts on fundamental cellular mechanisms. These
   findings emphasize a robust connection with tumour-related processes
   and provide valuable insights for future research directions.

   Moreover, as illustrated in the diagram in [168]Fig. 3D, the proteins
   encoded by genes associated with cancer pathways (see [169]Table S14)
   exhibit a high degree of interconnectivity. Specifically, this is
   evident in the protein interaction network generated using the STRING
   database (Search Tool for the Retrieval of Interacting Genes/Proteins;
   [170]https://string-db.org/), which reveals associations organized into
   three distinct clusters. Cluster 1 includes proteins such as TP53 and
   EGFR, which are central to cell cycle regulation and apoptosis, forming
   a critical regulatory network. Cluster 2 features FGF2, a key player in
   angiogenesis and cell differentiation. Cluster 3 encompasses RUNX1T1, a
   transcription factor involved in hematopoiesis and associated with
   leukemia.

   To enhance the reliability of the miRNA signature, we have conducted
   quantitative reverse transcription polymerase chain reaction (qRT-PCR)
   analyses on the selected miRNAs using cell samples in our laboratory
   (cell samples data). The qRT-PCR results ([171]Fig. 3E) have been
   compared with RNA-seq data, and this comparison confirmed the
   expression patterns observed in the RNA-seq analysis. This validation
   is essential to corroborate the differential expression of the selected
   miRNAs across the cancer cell lines studied.

3.3. Integrated analysis and classification of miRNA data: insights from
qRT-PCR and omics datasets

   At this stage, we focused on the 14 selected miRNAs from qRT-PCR due to
   their higher reliability and precision. The ΔCt values from qRT-PCR
   have been used to train our ML model, leveraging these accurate
   measurements to enhance predictive performance. This approach ensures
   that the prediction of the model is based on robust and validated miRNA
   expression data ([172]Fig. 4A).

Fig. 4.


   Heatmap comparison for cell samples versus omics data and corresponding
   importance score values. A Workflow for cell sample data via real-time
   PCR includes RNA extraction, cDNA synthesis, qRT-PCR analysis, and data
   visualization. B Heatmap of cell samples investigated in our laboratory.
   The dendrogram shows 3 classes, which correspond to the investigated
   cell lines (for more details see [175]Fig. S2). C Heatmap from omics
   data of the biologically relevant miRNAs. The dendrogram shows several
   cell clusters (for more details see [176]Fig. S1). D Importance score
   of miRNAs calculated with the feature ranking algorithm ANOVA using the
   Classification Learner App from MATLAB. Note that not all miRNAs from
   cell samples are available in the omics dataset (miR-204-5p and
   miR-363-5p).

   We generated heatmaps to visualize miRNA expression levels from cell
   sample experiments (ΔCt values from qRT-PCR) ([177]Fig. 4B) and
   compared these with RPM values from omics datasets ([178]Fig. 4C). The
   heatmap based on cell samples reveals three distinct classes
   corresponding to the cell lines, whereas the omics dataset ([179]Fig.
   4C) displays significant heterogeneity within individual cell classes.

   To refine our model, we evaluated the importance scores of the
   selected miRNAs using an ANOVA-based feature ranking algorithm
   ([180]Fig. 4D). In both datasets, we identified miR-27b-3p as a key
   predictor for cancer cell classification. However, the importance
   scores of other miRNAs differed between the datasets, most notably
   miR-106b-5p and miR-185-5p in the omics data, and miR-10b-5p and
   miR-204-5p in the cell samples. Due to the high heterogeneity in the
   omics data, we also utilized cell sample data (ΔCt values from qRT-PCR)
   for training and validating our ML model, integrating both datasets to
   enhance the model's performance and robustness.
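
   The idea behind ANOVA-based feature ranking can be sketched in a few
   lines of Python: for each feature, a one-way ANOVA F-statistic
   compares the variance between classes with the variance within
   classes, and features are sorted by that score. This is a simplified
   sketch and not the MATLAB implementation; the score reported by the
   Classification Learner App may be scaled differently (for example,
   derived from p-values). Feature names and values are hypothetical.

```python
def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature across class labels."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n = len(values)
    k = len(groups)
    grand_mean = sum(values) / n
    # Variation of class means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of samples around their own class mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(table, labels):
    """Return (names sorted by decreasing F, dict of F-statistics)."""
    scores = {name: anova_f(vals, labels) for name, vals in table.items()}
    return sorted(scores, key=scores.get, reverse=True), scores


# Hypothetical expression table: three samples per cell line.
labels = ["A549"] * 3 + ["MCF7"] * 3 + ["A375"] * 3
table = {
    "feature_a": [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8],
    "feature_b": [3.0, 5.0, 4.0, 4.0, 3.0, 5.0, 5.0, 4.0, 3.0],
}
order, scores = rank_features(table, labels)
print(order)
```

   In this toy table, feature_a separates the three classes and ranks
   first, while feature_b has identical class means and scores zero.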

3.4. Optimization of miRNA classification: dimensionality reduction and
feature selection in omics and cell sample data

   We investigated which miRNA classification method is suitable for
   both cell sample and omics data simultaneously, using only the
   biologically relevant miRNAs previously mentioned ([181]Fig. 5). In
   cell sample experiments, one classification method failed (quadratic
   discriminant), while 15 others achieved 100 % accuracy with a training
   time of less than 1 s ([182]Table S8). The omics dataset showed a
   similar outcome using all biologically relevant miRNAs, with one
   failed method (quadratic discriminant) and 11 methods reaching 100 %
   accuracy ([183]Table S9). Consequently, we selected the fine KNN
   classifier as the most suitable algorithm based on calculation speed
   and prediction accuracy (training time ∼7.1 s with ∼610 observations
   per second), using Bayesian optimization with ‘expected improvement
   per second plus’ as the acquisition function and 30 iterations.
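
   For readers without MATLAB, the core of a nearest-neighbour
   classifier can be sketched in plain Python. The 'fine KNN' preset is
   understood here to use a single nearest neighbour with Euclidean
   distance; that reading, the leave-one-out validation scheme, and the
   sample values are assumptions of this sketch. The Bayesian
   hyperparameter optimization described above is not reproduced.

```python
import math


def predict_1nn(train_x, train_y, query):
    """Return the label of the training point closest to query."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def loo_accuracy(x, y):
    """Leave-one-out accuracy: each sample is predicted from all others."""
    hits = 0
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += predict_1nn(train_x, train_y, x[i]) == y[i]
    return hits / len(x)


# Hypothetical two-feature expression profiles, three per cell line.
x = [
    (0.1, 9.8), (0.3, 10.1), (0.2, 9.9),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
    (9.9, 0.2), (10.1, 0.1), (10.0, 0.3),
]
y = ["A549"] * 3 + ["MCF7"] * 3 + ["A375"] * 3

acc = loo_accuracy(x, y)
print(f"leave-one-out accuracy: {acc:.2f}")
```

   With well-separated classes such as these, every held-out sample
   finds a neighbour of its own class; overlapping classes would lower
   the accuracy.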

Fig. 5.

   miRNA expression of omics (A) versus cell sample (B) data. The parallel
   coordinate plot shows the relationship (z-score) between the most
   biologically significant miRNAs. The miRNAs with the highest feature
   ranking score are shown as scatter plots for C omics and F cell sample
   data. D The miRNA prediction was tested for cell samples with the
   best-ranked cell sample features (Cell), for the omics dataset with
   the best-ranked omics features without considering biologically
   relevant miRNAs (Omics), and for the omics dataset with only the
   best-ranked biologically relevant features (Omics Cell selected).
   E Venn diagram showing the overlap of target genes among three
   specific miRNAs: miR-10b-5p, miR-103a-3p, and miR-27b-3p. Each circle
   represents the set of target genes regulated by one miRNA. miR-10b-5p
   has 7 unique target genes, miR-103a-3p has 10 unique target genes, and
   miR-27b-3p has 7 unique target genes. Additionally, two target genes
   are shared between miR-10b-5p and miR-27b-3p, four target genes are
   shared between miR-103a-3p and miR-27b-3p, one target gene is shared
   between miR-10b-5p and miR-103a-3p, and one target gene is common to
   all three miRNAs. This diagram highlights the specific and overlapping
   regulatory roles of these miRNAs on their target genes. G The
   prediction accuracy was plotted versus the number of selected
   features, with features added in order of decreasing feature ranking
   score. For the parallel plots, green, red and blue indicate A549, MCF7
   and A375 cells, respectively.

   Next, we investigated the minimum number of miRNAs required to
   correctly classify the omics dataset. We performed principal
   component analysis (PCA) to assess the dimensionality of the omics
   data. The scree plot indicates that two components are needed to
   explain 94.7 % of the variance in the dataset ([186]Fig. S5B). Using
   only the miRNA with the highest importance score for the omics data
   (miR-106b-5p, [187]Fig. 4C) resulted in a minimum of 2 misclassified
   experiments out of 47 (average of 3 independent classifications,
   [188]Table S10). This 18.2 % false negative rate for A549 cells is
   likely due to the high experimental heterogeneity. In fact, the
   classification process reveals that prediction accuracy depends
   significantly on the selected A549 data points. Adding the second
   miRNA (miR-185-5p) to the first achieved perfect prediction accuracy
   for all investigated experiments ([189]Fig. 5D, Omics; [190]Table S11).
   Furthermore, we investigated the best-performing miRNA signature in
   cell samples by analysing the principal components of the cell sample
   data. Two components represent 97.7 % of the data ([191]Fig. S5A). We
   performed the same prediction procedure using the three most important
   cell sample features that are also present in the omics dataset
   (miR-27b-3p, miR-103a-3p, and miR-10b-5p). Using only the miRNA with
   the highest importance score (miR-27b-3p) resulted in 3 misclassified
   experiments, while adding the next-ranked feature (miR-10b-5p) led to
   perfect classification accuracy. Interestingly, adding the
   third-ranked feature (miR-103a-3p) resulted in the misclassification
   of one experiment. Surprisingly, the combination of the second- and
   third-ranked miRNAs (miR-10b-5p with miR-103a-3p), without the
   top-ranked miRNA (miR-27b-3p), achieved perfect classification, as
   did the combination of all three miRNAs ([192]Fig. 5G, [193]Table S12).
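
   The search for a minimal signature described above amounts to
   scoring every subset of the ranked features. A hedged Python sketch
   of that procedure follows, using a leave-one-out 1-nearest-neighbour
   error count as a stand-in for the classifier; the feature names
   (f1 to f3), class labels, values, and validation scheme are all
   hypothetical and chosen only to show that a single feature can fail
   where a pair succeeds.

```python
import math
from itertools import combinations


def loo_errors(rows, labels, cols):
    """Count leave-one-out 1-NN misclassifications using only `cols`."""
    pts = [tuple(r[c] for c in cols) for r in rows]
    errors = 0
    for i, p in enumerate(pts):
        # Nearest neighbour among all other samples.
        j = min(
            (k for k in range(len(pts)) if k != i),
            key=lambda k: math.dist(pts[k], p),
        )
        errors += labels[j] != labels[i]
    return errors


features = ["f1", "f2", "f3"]  # placeholder feature names
labels = ["X"] * 3 + ["Y"] * 3 + ["Z"] * 3
rows = [
    (0.0, 0.05, 1.0), (0.2, 0.25, 3.0), (0.1, 0.15, 2.0),  # class X
    (5.0, 0.00, 2.0), (5.2, 0.20, 1.0), (5.4, 0.10, 3.0),  # class Y
    (5.1, 9.00, 3.0), (5.3, 9.20, 2.0), (5.5, 9.10, 1.0),  # class Z
]

# Evaluate every non-empty feature subset.
results = {}
for size in range(1, len(features) + 1):
    for cols in combinations(range(len(features)), size):
        names = tuple(features[c] for c in cols)
        results[names] = loo_errors(rows, labels, cols)

for names, err in results.items():
    print(names, "misclassified:", err)
```

   In this toy set, f1 alone confuses classes Y and Z and f2 alone
   confuses X and Y, yet the pair separates all three classes, which
   mirrors the qualitative pattern reported for the miRNA signatures.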

   The two best-ranked miRNAs from the cell samples were plotted as
   scatter plots for both the omics dataset ([194]Fig. 5C) and the cell
   samples ([195]Fig. 5F). These plots clearly demonstrate that more than
   one miRNA (two or three) is needed to classify the different cell
   classes with 100 % accuracy because of significant data overlap;
   however, a single miRNA can still classify the cell classes with an
   accuracy of about 94 %. Moreover, we observed that the positions of
   the miR-10b-5p omics data points depend on their absolute RPM values
   for A549 experiments. In other words, higher RPM values result in
   significantly higher miR-10b-5p values ([196]Table S2 and [197]Fig.
   S3). This observation suggests that further data normalization could
   improve the classification accuracy of heterogeneous miRNA datasets.
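
   To make the normalization point concrete, the sketch below shows the
   standard reads-per-million (RPM) scaling followed by one possible
   additional step, a per-sample z-score of log2-transformed values.
   The second step is only an example of 'further normalization' and is
   not a method used in this study; the counts and names are
   hypothetical.

```python
import math
from statistics import mean, pstdev


def to_rpm(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {name: c / total * 1e6 for name, c in counts.items()}


def log2_zscore(rpm):
    """Log2-transform RPM values and standardize them within one sample."""
    logged = {name: math.log2(v + 1.0) for name, v in rpm.items()}
    mu = mean(list(logged.values()))
    sd = pstdev(list(logged.values()))
    return {name: (v - mu) / sd for name, v in logged.items()}


# Two hypothetical samples with identical composition but a tenfold
# difference in sequencing depth.
shallow = {"mir_a": 100, "mir_b": 300, "mir_c": 600}
deep = {"mir_a": 1000, "mir_b": 3000, "mir_c": 6000}

rpm_shallow = to_rpm(shallow)
rpm_deep = to_rpm(deep)
z_shallow = log2_zscore(rpm_shallow)
print(rpm_shallow)
print(z_shallow)
```

   RPM scaling removes the depth difference between the two samples;
   residual sample-specific shifts would need a further step such as
   the within-sample standardization shown here.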

   We also investigated the relationship between feature selection and
   prediction accuracy by selecting the best-ranked biologically relevant
   and non-biologically relevant miRNAs ([198]Fig. 5G). With the two
   best-ranked miRNAs, a prediction accuracy of 100 % was achieved for
   all combinations of selected miRNAs.

3.5. Integrated miRNA expression and target gene analysis for enhanced cancer
classification

   The identification of a miRNA signature enhances the specificity,
   sensitivity, robustness, and reliability of cancer analysis. Our
   qRT-PCR experiments in cell culture reveal a clear differential
   expression of three key miRNAs across various cancer cell lines. For
   instance, miR-103a-3p is highly expressed in MCF7 cells (ΔCt = −0.55),
   less expressed in A549 cells (ΔCt = 2.26), and is almost undetectable
   in A375 cells (ΔCt = 14.75). Conversely, miR-10b-5p is not expressed in
   MCF7 cells but shows notable expression in A375 (ΔCt = −0.74) and A549
   (ΔCt = 2.26). miR-27b-3p is predominantly expressed in A375 cells while
   being absent in MCF7 and A549 cells (ΔCt = 14.75 for both). These
   findings, depicted in [199]Fig. 3 and [200]Fig. S6, underscore the
   differential expression of our miRNA signature across cell lines,
   suggesting potential regulatory roles specific to each cell type. This
   insight is crucial for understanding the regulatory mechanisms
   involving these miRNAs and their implications in cancer biology.

   Our approach has successfully identified key miRNAs that are associated
   with significant cancer-related targets. Investigating their
   interactions with oncogenic or tumour suppressor pathways could provide
   valuable insights into the role of these miRNAs in carcinogenesis. The
   Venn diagram analysis ([201]Fig. 5E) of target genes for the three
   miRNAs reveals several key observations. miR-10b-5p and miR-27b-3p
   each have 7 unique target genes for specific tumour cell lines.
   miR-103a-3p has the highest number of unique targets (10), indicating
   a distinct role in a third tumour type. Only 1 gene is common to all
   three miRNAs, with others showing partial overlap.

   The low number of common target genes suggests that these miRNAs
   regulate largely distinct pathways, further enhancing their utility for
   tumour classification. The distinct regulatory pathways support the
   specificity of this classification method, with each miRNA, through its
   unique set of target genes, playing a key role in differentiating and
   classifying various tumour cell lines. In summary, the Venn diagram
   supports the use of these three miRNAs for tumour classification due to
   their specific target genes and minimal overlap in regulatory pathways.
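
   The region counts of such a three-set Venn diagram follow directly
   from set operations. The Python sketch below rebuilds sets with the
   sizes given in the caption of Fig. 5E, assuming that the pairwise
   counts exclude the one gene common to all three miRNAs. The gene
   identifiers are placeholders, not real target genes.

```python
def venn3(a, b, c):
    """Return the sizes of the seven regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all": len(a & b & c),
    }


def ids(prefix, n):
    """Create n placeholder gene identifiers (hypothetical names)."""
    return {f"{prefix}_{i}" for i in range(n)}


common = ids("shared_all", 1)
s_10b_27b = ids("shared_10b_27b", 2)
s_103a_27b = ids("shared_103a_27b", 4)
s_10b_103a = ids("shared_10b_103a", 1)

# a = miR-10b-5p, b = miR-103a-3p, c = miR-27b-3p
mir_10b = ids("only_10b", 7) | s_10b_27b | s_10b_103a | common
mir_103a = ids("only_103a", 10) | s_103a_27b | s_10b_103a | common
mir_27b = ids("only_27b", 7) | s_10b_27b | s_103a_27b | common

regions = venn3(mir_10b, mir_103a, mir_27b)
total_genes = len(mir_10b | mir_103a | mir_27b)
print(regions, total_genes)
```

   Under that assumption, most of the distinct target genes fall into
   the three 'only' regions, which is the basis for the limited-overlap
   argument made above.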

   Our integrated approach, combining miRNA-seq data and qRT-PCR
   validation, represents a significant advancement in cancer diagnostics.
   This model has the potential to revolutionize early diagnosis and
   treatment strategies, offering deeper insights into miRNA biology in
   tumours and its clinical applications. As such, it stands as a
   promising innovation in molecular oncology and personalized medicine.

4. Conclusions

   Recent advancements in cancer diagnostics have increasingly leveraged
   the power of ML and omics data to enhance tumour classification and
   identify predictive biomarkers. miRNA profiling has emerged as a
   promising tool, enabling non-invasive detection and precise molecular
   characterization of cancers. Despite these advancements, achieving high
   accuracy and reliability across diverse cancer types and datasets
   remains challenging.

   In this proof-of-concept study, we first presented a step-by-step guide
   to obtain miRNA data from publicly available cell experiments. Next, we
   classified lung, breast, and melanoma cancer cell data from cell line
   experiments. Cell lines were chosen for this proof of concept because
   of the controlled environmental conditions of cell culture. We
   prefiltered the omics dataset and identified 619 consistently
   expressed mature miRNAs from an initial pool of 2822. Feature-ranking
   based classification with ML achieved 100 % classification accuracy
   for ensemble classifiers with subspace K-nearest neighbour (KNN) and
   neural networks. This high level of accuracy indicates the potential
   for minimal common miRNA patterns to be integrated into clinical
   diagnostics, offering a non-invasive method for precise tumour
   classification that could greatly enhance early detection and
   personalized treatment strategies. A low number of analysed miRNAs
   reduces qRT-PCR costs, sample analysis time, and cell identification
   time. Furthermore, by integrating miRNA data with functional analysis,
   we conducted meta-analysis and KEGG pathway enrichment, identifying
   critical biomarkers such as miR-103a-3p, miR-10b-5p, and miR-27b-3p.
   These biomarkers were validated through qRT-PCR, confirming their
   distinct expression profiles and demonstrating that high-importance
   miRNAs from qRT-PCR data significantly enhance predictive performance,
   addressing key challenges in cancer diagnostics.

   This work not only provides insights into miRNA-based cancer
   classification but also sets the stage for future research in
   personalized diagnostics and treatment strategies. The fact that
   intra-cell-line heterogeneity was present in the analysed omics
   dataset underlines the capability of our method for more challenging
   classification tasks. Future investigations could focus on refining
   data normalization techniques, expanding miRNA-gene interaction
   networks, and applying these methods to other cell lines or
   patient-derived tumour samples, ultimately translating these findings
   into real-world clinical applications. Such advancements have the potential
   to revolutionize molecular oncology by offering more precise and
   individualized diagnostic and therapeutic options.

Author statement

     * The authors declare that this manuscript is original, has not been
       previously published, and is not being submitted to another journal
       for publication.
     * The authors declare no relevant conflicts of interest in relation
       to this work.
     * All authors have read and approved the final manuscript and agree
       to its submission to this journal.
     * This work was funded by AIRC (Fondazione Italiana per la Ricerca
       sul Cancro), Grant IG 2020 n. 24623.

CRediT authorship contribution statement

   Sabrina Napoletano: Writing – review & editing, Writing – original
   draft, Visualization, Validation, Methodology, Investigation, Formal
   analysis, Data curation, Conceptualization. Paolo Antonio Netti:
   Writing – review & editing, Supervision. David Dannhauser: Writing –
   review & editing, Visualization, Validation, Software, Methodology,
   Investigation, Formal analysis, Data curation. Filippo Causa: Writing –
   review & editing, Supervision, Project administration, Funding
   acquisition, Conceptualization. The authors Sabrina Napoletano and
   David Dannhauser equally contributed to this paper.

Declaration of Competing Interest

   The authors declare no conflict of interest.

Acknowledgements