Abstract

MicroRNAs (miRNAs) are pivotal biomarkers for cancer screening. Identifying distinctive expression patterns of miRNAs in specific cancer types can serve as an effective strategy for classification and characterization. However, the development of a minimal signature of miRNAs for accurate cancer classification remains challenging, hindered by the lack of integrated approaches that systematically analyse miRNA expression levels alongside their associated biological pathways. In this study, we present a comprehensive integrative approach that utilizes transcriptomic data from lung, breast, and melanoma cancer cell lines to identify specific expression patterns. By combining bioinformatics, dimensionality reduction techniques, machine learning, and experimental validation, we pinpoint miRNAs linked to critical biological pathways. Our results demonstrate a highly significant differentiation of cancer types, achieving 100 % classification accuracy with minimal training time using a streamlined miRNA signature. Validation of the miRNA profile confirms that each of the three identified miRNAs regulates distinct biological pathways with minimal overlap. This specificity highlights their unique roles in tumour biology and sets the stage for further exploration of miRNA interactions and their contributions to tumourigenesis across diverse cancer types. Our work paves the way for multi-cancer classification, emphasizing the transformative potential of miRNA research in oncology. Beyond advancing the understanding of tumour biology, our step-by-step guide offers a robust tool for a wide range of users to investigate precise diagnostics and promising therapeutic strategies.

Keywords: MicroRNA, Omics, RNA-seq, Machine learning classifiers, Cancer cells

Graphical Abstract (graphic file: ga1.jpg)

1. Introduction

MicroRNAs (miRNAs) are small (∼18–25 nucleotides), non-coding RNA molecules that play essential roles in cellular regulation at both post-transcriptional and translational levels [1], [2]. Approximately 3 % of human genes encode miRNAs [2], underscoring their significance in fundamental biological processes. Through the RNA interference pathway, miRNAs exert their effects by binding to the 3’ untranslated region (UTR) of target messenger RNAs (mRNAs), leading to either mRNA degradation or translational suppression [3]. This regulation influences crucial cellular functions, including cell cycle control, proliferation, apoptosis, and stress responses [4], [5]. Interestingly, the regulatory role of miRNAs can be compared to a ‘blockchain’, where networks maintain secure and precise information flow. In particular, miRNA dysregulation has been implicated in various cancers, with specific miRNAs exhibiting preferential expression in distinct tumour types [6], [7]. Advances in omics research have facilitated the identification of diagnostic biomarkers and complex biomolecular networks that shed light on cancer mechanisms [8], [9], [10]. miRNAs are central to these networks, acting as key regulators of gene expression and influencing cellular functions critical to cancer biology [11], [12], [13]. Numerous specialized databases provide extensive information on miRNA functions, interaction networks, and disease associations, significantly advancing our understanding of their roles and applications [14], [15], [16]. Pivotal studies demonstrate that a large-panel miRNA classifier (e.g., 48 miRNAs) can classify tumours based on tissue origin with 90 % accuracy, showcasing the potential of miRNAs as diagnostic tools [17], [18], [19].
However, this approach underscores the challenge of reducing model complexity while maintaining high predictive accuracy, which is crucial for clinical applications. Building on this foundation, the exploration of minimal miRNA signatures for cancer classification addresses the common need to balance simplicity and diagnostic precision [20]. Despite the significant progress in miRNA research, the precise mechanisms and functions of miRNAs across different cancers remain unclear. Variations in miRNA expression patterns, combined with the complexity of biomolecular networks, present substantial challenges in achieving a unified understanding of their roles, particularly in the context of cancer [20]. This highlights the need for continued research to develop advanced methods for data interpretation and feature analysis, as well as to retrieve and identify critical molecular signatures. New approaches are essential for uncovering the specific functions of miRNAs across diverse cancer types and will contribute to a deeper understanding of cancer biology and the development of targeted therapies [21], [22].

To address these challenges, the development and application of innovative integrated approaches are crucial, with machine learning (ML) emerging as a pivotal methodology [23], [24], [25], [26], [27], [28]. ML enables the analysis of complex omics datasets by applying feature selection and reduction techniques to identify the most informative features from large datasets. This process enhances model architecture, mitigates overfitting, and improves both interpretability and efficiency [28], [29], [30], [31]. Additionally, methods such as univariate analysis and principal component analysis (PCA), when combined with ML, have proven effective in identifying key features to differentiate cell types and providing comprehensive insights into pathological states.
These advancements are essential for developing personalized therapies [32], [33], [34], [35], [36], [37], [38]. Recent studies [39], [40] demonstrate that ML can significantly improve tumour classification accuracy compared to traditional methods. However, the variability in results highlights the need for higher-quality data, more interpretable models, and better integration of analytical outcomes. To ensure the applicability of these methods in clinical contexts, it is crucial to validate simplified ML models, ensuring good predictive accuracy while maintaining a focus on the essential biological aspects of cancer mechanisms [41], [42].

Although miRNAs are recognized as useful cancer biomarkers, to our knowledge, no predictive methods exist to identify a minimal miRNA signature that can classify different cancer types, such as melanoma, lung, and breast cancer. In this regard, the identification of a minimal miRNA signature for cancer classification holds promise for improving early diagnosis and enhancing the efficiency and cost-effectiveness of targeted therapies [43], [44], [45], [46], [47], [48]. The identification of distinctive miRNAs is significant, feasible, and clinically relevant, especially if their specific target genes are also identified [49], [50]. This would allow the mapping of their regulatory networks and reveal differences or similarities among tumours, leading to a more precise and informative classification [51], [52], [53], [54], [55], [56], [57]. Integrating advanced ML approaches with miRNA expression data and molecular pathway analyses could provide a promising strategy for identifying minimal miRNA signatures that improve early diagnosis and guide targeted therapies for various types of cancer.
Our work aims to develop a tool for analysing miRNAs as biomarkers for tumour classification, focusing on minimal miRNA signatures and leveraging advanced techniques such as machine learning and pathway analysis to improve predictive accuracy. To achieve this, we conducted a meta-analysis to identify miRNAs consistently dysregulated across multiple tumour studies. This was followed by pathway enrichment analysis using a miRNA-centric network visual analytics platform (miRNet) and KEGG (Kyoto Encyclopedia of Genes and Genomes) to determine the biological pathways impacted by the investigated miRNAs. By mapping the target genes of these miRNAs to KEGG pathways, we identified several significantly enriched pathways associated with tumourigenesis. These findings suggest that the identified miRNAs play a pivotal role in modulating crucial pathways, thereby contributing to cancer development and progression. We processed both publicly available datasets and laboratory-generated data to assess the predictive performance of the signature (Fig. 1). Our study identified a miRNA signature (miR-103a-3p, miR-10b-5p, and miR-27b-3p) capable of distinguishing different cancer types with high analytical validity and 100 % classification accuracy; even a single miRNA can still classify distinct cell classes with an accuracy of 94 %. Achieving such high accuracy in a reduced context demonstrates the robustness of our method and paves the way for promising results with larger and more diverse datasets. Furthermore, we demonstrated that each of these miRNAs has unique targets contributing to distinct biological pathways, highlighting specific miRNA interactions and uncovering unique mechanisms of tumourigenesis.

Fig. 1. Minimal miRNA signature investigation process. This study utilized three cancer cell lines as proof-of-concept models. Omics data obtained from the Gene Expression Omnibus (GEO) were processed through the miRDeep2 pipeline.
Simultaneously, qRT-PCR was conducted in the laboratory to validate biologically significant miRNAs. These datasets were subsequently integrated into an ML-based classification system, which identified a minimal common miRNA signature capable of classifying the different cell lines.

In this study, we use cell lines as a tool to refine and validate our methods, which will be applied to more complex biological samples. Despite the limitations of cell-line models in capturing the full complexity of human tumours, correlation studies have shown a strong relationship between cell lines and primary tumours [58], [59], [60], [61]. Additionally, we demonstrate that cell lines exhibit cellular heterogeneity, assessed using Shannon entropy [62] based on our data. This measure quantifies the variability in miRNA expression, providing insights into their molecular diversity, although it does not fully capture the heterogeneity of human tumours. In this context, as a proof of concept, we classified lung, breast, and melanoma cancer cell lines based on miRNA expression profiles following careful pre-filtering of miRNAs, where only those consistently expressed across all cells (A375, MCF7, and A549) were selected for analysis. This approach minimized noise and ensured the robustness of the results, improving the accuracy of the classifications. The novelty and significance of this article lie in its integrative approach, which combines machine learning (ML) techniques for feature classification and identification with the analysis of miRNA-regulated pathways to define a minimal miRNA signature. While our approach shows promise for future clinical applications, further validation is required to confirm its utility in real clinical samples from patients.
Nevertheless, this contribution serves as a proof of concept, offering a practical and widely applicable approach for identifying a minimal miRNA signature for more efficient and cost-effective application of miRNAs as biomarkers in cancer management.

2. Materials and methods

2.1. RNAseq data analyses

This proof-of-concept study focuses on miRNA expression in three human cancer cell lines: A375 (melanoma), A549 (lung cancer), and MCF-7 (breast cancer) (see Fig. 1, Fig. 2B, and Table S2). A database pipeline was optimized for miRNA read alignment to obtain reads per million (RPM) expression levels for different cell lines (see Fig. 2A and Table S3, as well as https://github.com/David-UniNA/OMICSdataPIPELINE). First, raw reads were downloaded from the Gene Expression Omnibus (GEO) [https://www.ncbi.nlm.nih.gov/geo] and Sequence Read Archive (SRA), a repository maintained by the National Center for Biotechnology Information (NCBI) [https://www.ncbi.nlm.nih.gov/sra] [88], [89]. Note that we pre-filtered the omics data so that only miRNAs consistently expressed across all investigated cells were considered.

Fig. 2. Classification outcome for omics dataset using the fine KNN classifier. A Comprehensive bioinformatics pipeline to process and analyse the omics data (https://github.com/David-UniNA/OMICSdataPIPELINE). The GRCh38.p14 reference genome was prepared using the Bowtie tool. Publicly available datasets (SRA_RUN) were mapped to the reference genome with appropriate adapter sequences. Subsequently, reads per million (RPM) values were calculated for unique sequences mapped to hairpin and mature miRNAs. B Heatmap visualizing the expression patterns of consistently expressed miRNAs across all cell types. The dendrogram on top reveals distinct clusters within the omics dataset, highlighting potential subgroups (Figs. S1 and S4).
C Scatter plot illustrating the relationship between the two most significant miRNA expressions (Table S5). The confusion matrix demonstrates 100 % accuracy in classifying all 47 experiments. D Receiver operating characteristic (ROC) curve indicating 100 % specificity of the classification algorithm. E Parallel coordinate plot visualizing the z-scores of the 10 most significant miRNAs for each sample. Green, red, and blue colours represent A549, MCF7, and A375 omics data, respectively.

Next, the miRDeep2 tool [90], which identifies miRNAs from deep sequencing data, was used to align public miRNA-seq reads from A375 cells. In more detail (see Fig. 2A), we downloaded the public experiments as ‘fastq’ files with the SRA toolkit 3.0.7 [https://github.com/ncbi/sra-tools] using the command ‘fastq-dump --split-files SRA_RUN’, where SRA_RUN was changed according to the experiment run name (Table S2). Next, we used the mapping function of the miRDeep2 toolbox as follows: ‘mirdeep2-master/src/mapper.pl SRA_RUN.fastq -e -h -i -j -l 18 -k CUT_ADAPT -m -p Bowtie/B -q -s SRA_RUNreads.fa -t SRA_RUNreadsVSgenome -v -n’ (description in Table S3). This step requires the adapter trimming sequence (reported in Table S3) and a previously built Bowtie index (bowtie-build 1.1.1), which was created with the command ‘mirdeep2-master/essentials/bowtie-1.1.1/bowtie-build GRCh38.p14.genome.fa B’. Bowtie is a short-read aligner geared toward quickly aligning large sets of short DNA sequences to large genomes, such as the ‘GRCh38.p14’ human genome assembly. The last step of the read alignment was the quantification of reads using the quantifier function of miRDeep2 together with the hairpin and mature database files from miRmine, with the command ‘mirdeep2-master/src/quantifier.pl -p hairpin.fa -m mature.fa -r SRA_RUNreads.fa -t hsa -k’ (description in Table S4).

2.2. Cancer miRNA meta-analysis and pathways network

A systematic review and meta-analysis were conducted by searching PubMed for publications investigating diagnostic circulating miRNAs in three cancer types: melanoma, lung, and breast cancers [63], [64], [65], [66], [67], [68], [69], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82], [83], [84], [85], [86]. The miRNet interface was utilized to integrate miRNAs into a network, allowing visualization of relationships between selected miRNAs and their associated pathways. To identify target genes, the miRTarBase v8.0 database was used [87]. A 4.0-degree filter and betweenness centrality were applied to all network nodes to exclude those with low centrality measures. Additionally, the shortest path filter and a minimum network configuration were used to retain only the essential connectivity signatures within the network. Enrichment analysis was performed using hypergeometric tests, with adjustments for the false discovery rate (FDR), based on the KEGG pathways. A protein-protein interaction (PPI) network was constructed following the conversion of gene names from Table S14 into protein names using UniProt [https://www.uniprot.org/]. The resulting protein list was uploaded to STRING: functional protein association networks [https://string-db.org/], where K-means clustering was employed to group the proteins into a predefined number of clusters based on protein interactions.

2.3. Cell samples miRNAs expression levels

Cell culture: MCF7 cells were plated at a density of 20,000 cells/cm^2 in 75 cm^2 flasks and grown in culture medium (MEM, 10 % FBS, 1 % L-GLUT, 2 % Pen/Strep). A375 cells were plated at a density of 20,000 cells/cm^2 in 75 cm^2 flasks and grown in culture medium (DMEM, 10 % FBS, 1 % L-GLUT, 2 % Pen/Strep).
A549 cells were plated at a density of 20,000 cells/cm^2 in 75 cm^2 flasks and grown in culture medium (HAM F12 K, 10 % FBS, 1 % L-GLUT, 2 % Pen/Strep). Cell lines were obtained from the American Type Culture Collection (ATCC; Rockville, MD, USA). MEM: Minimum Essential Medium. FBS: Fetal Bovine Serum. PBS: Phosphate-Buffered Saline. L-GLUT: L-Glutamine. Pen/Strep: Penicillin/Streptomycin. DMEM: Dulbecco’s Modified Eagle’s Medium. HAM F12 K: Ham's F-12K (Kaighn's) Medium.

miRNA extraction: for cellular miRNA extraction, the cells were pelleted and washed in PBS, and miRNAs were purified from the cell pellet using the mirVana™ miRNA isolation kit from ThermoFisher Scientific according to the manufacturer's protocol. Before extraction, samples were spiked with 1 μL of 1 nM Cel-miR-39-3p synthetic miRNA to verify the extraction and purification procedure. The extracts were measured using a Nanodrop 2000 spectrophotometer, and qRT-PCR was performed on RNA extracts from A375, MCF7 and A549 cells. For each miRNA, the qRT-PCR was performed in duplicate; data represent the mean of three different technical replicates.

Reverse transcription and qRT-PCR: for the qRT-PCR assay, 2 µL of the sample at different concentrations was reverse transcribed using the TaqMan™ Advanced miRNA cDNA Synthesis Kit (ThermoFisher Scientific). In detail, the mature miRNA sequence was extended using 3’ poly-A tailing and 5’ ligation of an adaptor sequence. For cDNA synthesis, universal primers were used to recognize the universal sequences on both the 5’ and 3’ extended ends of the mature miRNA. All mature miRNAs in the sample were reverse transcribed to cDNA. Then 5 µL of cDNA was used as a template for the TaqMan™ Advanced miRNA Assay; each cDNA was tested in triplicate. The TaqMan™ PCR was performed in a CFX-96 real-time thermal cycler (Bio-Rad Laboratories, Inc., Hercules, CA, USA) with the following protocol: 95 °C for 20 sec, followed by 40 cycles of 95 °C for 3 sec and 60 °C for 30 sec.
We evaluated the expression of the following miRNAs: miR-191-5p, miR-21-5p, miR-25-3p, miR-93-5p, miR-363-5p, miR-16-5p, miR-362-5p, miR-210-3p, miR-27b-3p, miR-185-5p, miR-10b-5p, miR-106b-5p, miR-204-5p, miR-103a-3p.

2.4. Data analysis and statistical analysis

All qRT-PCR results are presented as mean values from three independent experiments, with each miRNA tested in duplicate. The coefficient of variation (CV) was calculated for each miRNA in a single sample, and the measurement was considered non-informative if CV ≥ 4 %. To minimize analytical variability, data were normalized to the spike-in Cel-miR-39-3p to obtain ΔCt values (Ct of each miRNA - Ct of Cel-miR-39-3p). For further reliability, Ct values were globally normalized to their mean value (Fig. S6B). An ML approach and hierarchical clustering were employed using MATLAB (R2023b, MathWorks). Feature scores were calculated using an ANOVA-based feature ranking algorithm to identify the most significant miRNAs for classification. All classification procedures were repeated three times. Hierarchical clustering was performed to visualize relationships between samples based on miRNA expression profiles. A dendrogram was constructed using the 'average' linkage method and Euclidean distance metric. The optimal number of clusters was determined based on the dendrogram and the number of cell classes. The cutoff for the dendrogram was calculated as the median of the last two distances in the linkage matrix and visualized using the 'dendrogram' function (Figs. S1 and S2). Heatmaps of miRNA RPM values were generated using the 'clustergram' function in MATLAB, with row and column clustering and K-nearest neighbours imputation. To evaluate the performance of various ML algorithms, a 5-fold cross-validation approach was used. Classification accuracy was assessed using confusion matrices and parallel coordinate plots. All calculations were performed on a DELL Alien R9 computer with an i9-9900K CPU.
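The CV filter and the spike-in normalization described in this section can be illustrated with a minimal sketch. It assumes that the CV is the sample standard deviation divided by the mean of replicate Ct values and that ΔCt is computed from replicate means; the function names and the Ct values are hypothetical and serve only as an illustration, not as the authors' MATLAB code or measured data.

```python
from statistics import mean, stdev

def coefficient_of_variation(ct_values):
    """Percent CV of replicate Ct values (sample standard deviation / mean * 100)."""
    return stdev(ct_values) / mean(ct_values) * 100.0

def delta_ct(ct_target, ct_spike_in):
    """Delta-Ct: mean Ct of the target miRNA minus mean Ct of the spike-in."""
    return mean(ct_target) - mean(ct_spike_in)

# Hypothetical replicate Ct values, for illustration only (not measured data).
ct_mir27b = [24.1, 24.3, 24.2]   # target miRNA
ct_spike = [20.0, 20.2, 20.1]    # Cel-miR-39-3p spike-in

cv = coefficient_of_variation(ct_mir27b)
informative = cv < 4.0           # CV >= 4 % is treated as non-informative
dct = delta_ct(ct_mir27b, ct_spike)
```

With these illustrative numbers the replicate spread is well below the 4 % threshold, so the measurement would be kept and its ΔCt passed on to clustering and classification.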
The cell heterogeneity was investigated with the Shannon diversity index (SDI), which measures the diversity in a dataset, accounting for both richness (number of cell types) and evenness (how evenly the individual experiments are distributed among the cell lines). The statistical difference between the three cell lines was tested with the Kruskal-Wallis test.

3. Results and discussion

3.1. Omics data analysis using supervised classification approaches

The main goal of this study is to accurately classify lung (A549), breast (MCF7), and melanoma (A375) cancer cells using a minimal common set of miRNAs. To ensure data quality and consistency, we applied a robust bioinformatic pipeline (miRDeep2, Fig. 2A) for preprocessing and normalizing publicly available omics data. The heatmap of the processed data (Fig. 2B) illustrates the expression patterns of consistently expressed miRNAs across the different cell types, aiding in the identification of potential biomarkers and classification signatures. In addition, the heatmap indicates high cell heterogeneity. We therefore performed an SDI analysis, which revealed overlapping data between MCF7 and A549 experiments (Fig. S4). Furthermore, dendrogram analysis (Fig. S1) identified distinct clusters within the omics dataset, highlighting the heterogeneity identified by SDI calculations (Fig. S4). To address the challenges posed by this heterogeneity and achieve accurate classification, we applied several supervised ML approaches. Initially, we found that only the ensemble classifier with subspace k-nearest neighbours (KNN) and the medium neural network-based classifier achieved 100 % prediction accuracy using all miRNAs (Table S6). To reduce the feature space and identify the most informative miRNAs, we employed a feature ranking algorithm based on one-way ANOVA.
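As an illustration of ANOVA-based feature ranking, the sketch below computes the one-way ANOVA F statistic for each feature and sorts the features by it. This is a simplified stand-in rather than the MATLAB routine used in the study: it ranks by the raw F statistic, which gives the same order as ranking by p-value only when all features share the same group sizes. The feature names and expression values are hypothetical.

```python
from statistics import mean

def anova_f(groups):
    """One-way ANOVA F statistic for one feature; `groups` holds one list of values per class."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical expression values: three classes, three replicates each.
features = {
    "miR-A": [[1.0, 1.1, 0.9], [3.0, 3.1, 2.9], [5.0, 5.1, 4.9]],  # separates the classes
    "miR-B": [[1.0, 3.0, 5.0], [1.1, 3.1, 4.9], [0.9, 2.9, 5.1]],  # does not separate them
}

# Highest F statistic first.
ranking = sorted(features, key=lambda name: anova_f(features[name]), reverse=True)
```

A feature whose class means differ strongly relative to the within-class spread receives a high score, which is the property exploited when only the top-ranked miRNAs are retained.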
By focusing on the top 10 miRNAs with the highest feature scores, we were able to maintain 100 % classification accuracy using the ensemble classifier with subspace KNN (Fig. 2C–E, Table S7). This highlights the importance of feature selection in improving model performance. The limited number of experiments in the omics dataset can influence the performance of classification models. To address this, we conducted a thorough evaluation of different cross-validation folds (3–10) and testing data percentages (10–20 %). Since increasing the number of folds or the testing data percentage did not result in significant improvements in model performance, we selected k = 5 for cross-validation and 10 % for testing data, as these are commonly used values in classification tasks.

3.2. Detection of key miRNAs in cancer pathway within the omics dataset

After obtaining a suitable classifier model for the omics data, we focused on identifying a miRNA classification signature to pinpoint critical biomarkers that are biologically relevant and directly involved in cancer processes. Our goal is to validate the selected miRNAs with top-ranked feature scores, confirming their role in cancer-associated biological pathways and their potential as biomarkers for tumour cell diagnosis and classification. To achieve this, we integrated omics data with functional analysis to ensure that the chosen miRNA signature not only accurately identifies cancer types but also reflects the underlying biological mechanisms and relevant pathological pathways. To identify miRNAs consistently dysregulated across the studied cancer types, we performed a meta-analysis. KEGG pathway enrichment analysis was then used to map miRNAs to relevant biological pathways, with a focus on those involved in cancer progression such as cell cycle regulation, apoptosis, and proliferation. Using miRNet and miRTarBase v9.0 [84], we constructed a focused miRNA-gene interaction network (Fig.
3A) that highlights the interactions between miRNAs (represented as orange hexagons) and their target genes (represented as pink and blue circles). Central miRNAs (miR-16-5p, miR-103a-3p, and miR-25-3p) exhibit extensive connections, indicating their significant regulatory roles. These interaction patterns indicate that key miRNAs may serve as potential biomarkers or therapeutic targets, while specific target genes are crucial nodes in cancer pathways, providing insights for future research and therapeutic strategies.

Fig. 3. Comprehensive analysis of miRNA interactions and their roles in cancer pathways. A miRNA-gene interaction network revealing key regulators in cancer: this network highlights the interactions between 11 key miRNAs (orange hexagons) and their target genes (blue and pink circles). Blue circles represent genes identified as hits within cancer pathways, while pink circles represent other correlated genes. For detailed information, refer to Tables S13–S15. B The Venn diagram illustrates the shared and unique target genes among the 14 selected miRNAs, indicating their coordinated regulatory roles in various pathways. C The bar graph represents the enrichment of target genes in different cancer-related pathways. The significance of enrichment is indicated by the -Log(p-value). D STRING interaction network among targets in cancer pathway (Table S14). E The bar graph displays the ΔCt values of the 14 selected miRNAs across different cell lines (A375, MCF7, A549). These values indicate differential expression patterns among the cell lines.

Following the identification of the target genes, we created a Venn diagram to assess gene interactions (Fig. 3B). The diagram reveals that most miRNAs share common target genes, with miR-16-5p notably overlapping with all selected miRNAs.
This suggests a coordinated regulatory role among these miRNAs, potentially influencing key cellular functions such as proliferation, the cell cycle, or stress responses, within a complex network that may provide redundancy and flexibility in gene control. Additionally, this analysis could reveal links to specific diseases, pointing to potential biomarkers or therapeutic targets for further study.

To further confirm the biological relevance of the selected miRNAs, we performed an enrichment analysis using the KEGG database. This analysis revealed significant associations with cancer-related pathways (Fig. 3C, D and Tables S13–S16). Notably, pathways such as Pathways in Cancer emerged as predominant and highly significant, with extremely low p-values, underscoring the strong involvement of the identified genes in oncological processes. Additionally, pathways like Cell Cycle and Focal Adhesion were also significantly enriched, suggesting potential impacts on fundamental cellular mechanisms. These findings emphasize a robust connection with tumour-related processes and provide valuable insights for future research directions. Moreover, as illustrated in the diagram in Fig. 3D, the proteins encoded by genes associated with cancer pathways (see Table S14) exhibit a high degree of interconnectivity. Specifically, this is evident in the protein interaction network generated using the STRING database (Search Tool for the Retrieval of Interacting Genes/Proteins; https://string-db.org/), which reveals associations organized into three distinct clusters. Cluster 1 includes proteins such as TP53 and EGFR, which are central to cell cycle regulation and apoptosis, forming a critical regulatory network. Cluster 2 features FGF2, a key player in angiogenesis and cell differentiation. Cluster 3 encompasses RUNX1T1, a transcription factor involved in hematopoiesis and associated with leukemia.
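The hypergeometric enrichment test with FDR adjustment behind the pathway analysis above can be sketched as follows. The Benjamini-Hochberg procedure is assumed here as the FDR correction, since the text does not name the specific method; the function names are hypothetical, and the code is an illustration rather than the miRNet implementation.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when n target genes are drawn from N genes, K of which belong to the pathway."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical toy numbers: 10 genes in total, 5 in the pathway, 5 targets, all 5 are hits.
p_toy = hypergeom_pvalue(10, 5, 5, 5)
q_toy = benjamini_hochberg([0.01, 0.04, 0.03])
```

A small adjusted p-value indicates that the miRNA target list contains more pathway genes than expected by chance, which is the quantity plotted as -Log(p-value) in an enrichment bar graph.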
To enhance the reliability of the miRNA signature, we conducted quantitative reverse transcription polymerase chain reaction (qRT-PCR) analyses on the selected miRNAs using cell samples in our laboratory (cell sample data). The qRT-PCR results (Fig. 3E) were compared with RNA-seq data, and this comparison confirmed the expression patterns observed in the RNA-seq analysis. This validation is essential to corroborate the differential expression of the selected miRNAs across the cancer cell lines studied.

3.3. Integrated analysis and classification of miRNA data: insights from qRT-PCR and omics datasets

At this stage, we focused on the 14 selected miRNAs from qRT-PCR due to their higher reliability and precision. The ΔCt values from qRT-PCR were used to train our ML model, leveraging these accurate measurements to enhance predictive performance. This approach ensures that the prediction of the model is based on robust and validated miRNA expression data (Fig. 4A).

Fig. 4. Heatmap comparison for cell samples versus omics data and corresponding importance score values. A Workflow for cell sample data via real-time PCR, including RNA extraction, cDNA synthesis, qRT-PCR analysis, and data visualization. B Heatmap of cell samples investigated in our laboratory. The dendrogram shows 3 classes, which correspond to the investigated cell lines (for more details see Fig. S2). C Heatmap from omics data of the biologically relevant miRNAs. The dendrogram shows several cell clusters (for more details see Fig. S1). D Importance score of miRNAs calculated with the feature ranking algorithm ANOVA using the Classification Learner App from MATLAB. Note that not all miRNAs from cell samples are available in the omics dataset (miR-204-5p and miR-363-5p).

We generated heatmaps to visualize miRNA expression levels from cell sample experiments (ΔCt values from qRT-PCR) (Fig.
4B) and compared these with RPM values from omics datasets (Fig. 4C). The heatmap based on cell samples reveals three distinct classes corresponding to the cell lines, whereas the omics dataset (Fig. 4C) displays significant heterogeneity within individual cell classes. To refine our model, we evaluated the importance scores of the selected miRNAs using an ANOVA-based feature ranking algorithm (Fig. 4D). In both datasets we identified miR-27b-3p as a key predictor for cancer cell classification. However, there were differences in the importance scores of other miRNAs between the datasets: miR-106b-5p and miR-185-5p from omics, and miR-10b-5p and miR-204-5p from cell samples. Due to the high heterogeneity in the omics data, we also utilized cell sample data (ΔCt values from qRT-PCR) for training and validating our ML model, integrating both datasets to enhance the model's performance and robustness.

3.4. Optimization of miRNA classification: dimensionality reduction and feature selection in omics and cell sample data

We investigated which miRNA classification method is simultaneously suitable for both cell sample and omics data, using only the biologically relevant miRNAs previously mentioned (Fig. 5). In cell sample experiments, one classification method failed (quadratic discriminant), while 15 others achieved 100 % accuracy with a training time of less than 1 sec (Table S8). The omics dataset showed a similar outcome, with one failed method (quadratic discriminant) and 11 performing with 100 % accuracy (Table S9) using all biologically relevant miRNAs. Consequently, we selected the fine KNN classifier as the most suitable algorithm based on calculation speed and prediction accuracy (training time ∼7.1 sec with ∼610 observations per second), using Bayesian optimization with ‘expected improvement per second plus’ as the acquisition function and 30 iterations.

Fig. 5. miRNA expression of omics (A) versus cell sample (B) data. The parallel coordinate plot shows the relationship (z-score) between the most biologically significant miRNAs. The miRNAs with the highest feature ranking score are shown as scatter plots for C omics and F cell sample data. D The miRNA prediction was tested for cell samples with the best-ranked cell sample features (Cell), for the omics dataset with the best-ranked omics features without considering biologically relevant miRNAs (Omics), and for the omics dataset with only the best-ranked biologically relevant features (Omics Cell selected). E Venn diagram showing the overlap of target genes among three specific miRNAs: miR-10b-5p, miR-103a-3p, and miR-27b-3p. Each circle represents the set of target genes regulated by one miRNA. miR-10b-5p has 7 unique target genes, miR-103a-3p has 10 unique target genes, and miR-27b-3p has 7 unique target genes. Additionally, two target genes are shared between miR-10b-5p and miR-27b-3p, four target genes are shared between miR-103a-3p and miR-27b-3p, one target gene is shared between miR-10b-5p and miR-103a-3p, and one target gene is common to all three miRNAs. This diagram highlights the specific and overlapping regulatory roles of these miRNAs on their target genes. G The prediction accuracy was plotted versus selected features using decreasing feature ranking score for higher feature numbers. For the parallel plots, green, red and blue indicate A549, MCF7 and A375 cells, respectively.

Next, we investigated the minimum number of miRNAs required to correctly classify the omics dataset. We performed principal component analysis (PCA) to assess the dimensionality of the omics data. The scree plot indicates that two components are needed to represent 94.7 % of the dataset (Fig. S5B). Using only the miRNA with the highest importance score for omics data (miR-106b-5p, Fig.
4C) resulted in a minimum misclassification of 2 out of 47 experiments (average of 3 independent classifications, Table S10). This 18.2 % false negative rate for A549 cells is likely due to the high experimental heterogeneity. In fact, the classification process reveals that prediction accuracy depends significantly on the selected A549 data points. Adding the second miRNA (miR-185-5p) to the first achieves perfect prediction accuracy for all investigated experiments (Fig. 5D, Omics; Table S11).

Furthermore, we investigated the best-performing miRNA signature in cell samples by analysing the principal components of the cell sample data. Two components represent 97.7 % of the data (Fig. S5A). We performed the same prediction procedure using the three most important cell sample features that are also present in the omics dataset (miR-27b-3p, miR-103a-3p, and miR-10b-5p). Using the miRNA with the highest importance score (miR-27b-3p) resulted in a misclassification of 3 experiments, while adding the next-ranked feature (miR-10b-5p) led to perfect classification accuracy. Interestingly, adding the third-ranked feature (miR-103a-3p) resulted in a misclassification of one experiment. Surprisingly, the combination of the second- and third-ranked miRNAs (miR-10b-5p with miR-103a-3p), without the top-ranked miRNA (miR-27b-3p), achieved perfect classification, as did the combination of all three miRNAs (Fig. 5G, Table S12). The two best-ranked miRNAs from the cell samples are shown as scatter plots for both the omics dataset (Fig. 5C) and the cell samples (Fig. 5F). These plots clearly demonstrate the need for more than one miRNA (two or three) to classify the different cell classes with 100 % accuracy due to significant data overlap; however, a single miRNA can still classify the cell classes with an accuracy of about 94 %.
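As an illustration of the rank-then-add-features procedure described above, the following minimal Python sketch ranks three miRNAs by a one-way ANOVA F-score and reports the leave-one-out accuracy of a 1-nearest-neighbour classifier as features are added in ranking order. The ΔCt values are hypothetical toy numbers, not the measurements of this study. The 1-nearest-neighbour rule and the leave-one-out validation are assumptions standing in for the fine KNN model and the validation scheme of the Classification Learner App, and all function names are illustrative.

```python
from collections import defaultdict

# Hypothetical toy ΔCt values (NOT the study's data), three replicates per cell line.
FEATURES = ["miR-27b-3p", "miR-10b-5p", "miR-103a-3p"]
SAMPLES = [
    ("A549", [14.70, 2.5, 2.0]),
    ("A549", [14.80, 2.1, 2.6]),
    ("A549", [14.76, 2.3, 2.3]),
    ("MCF7", [14.75, 14.0, -0.5]),
    ("MCF7", [14.72, 14.4, -0.2]),
    ("MCF7", [14.82, 14.2, -0.8]),
    ("A375", [1.00, -0.6, 14.5]),
    ("A375", [1.10, -0.9, 14.0]),
    ("A375", [1.04, -0.7, 14.3]),
]


def anova_f(values, labels):
    """One-way ANOVA F-statistic of a single feature across the classes."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_features(samples):
    """Feature names sorted by decreasing ANOVA F-score."""
    labels = [label for label, _ in samples]
    scores = {
        name: anova_f([x[i] for _, x in samples], labels)
        for i, name in enumerate(FEATURES)
    }
    return sorted(FEATURES, key=scores.get, reverse=True)


def loo_accuracy(samples, selected):
    """Leave-one-out accuracy of a 1-nearest-neighbour rule on the selected features."""
    cols = [FEATURES.index(name) for name in selected]
    hits = 0
    for i, (label, x) in enumerate(samples):
        nearest = min(
            (j for j in range(len(samples)) if j != i),
            key=lambda j: sum((x[c] - samples[j][1][c]) ** 2 for c in cols),
        )
        hits += samples[nearest][0] == label
    return hits / len(samples)


ranked = rank_features(SAMPLES)
accuracy_by_size = {
    k: loo_accuracy(SAMPLES, ranked[:k]) for k in range(1, len(ranked) + 1)
}
for k, acc in accuracy_by_size.items():
    print(k, ranked[:k], f"{acc:.2f}")
```

In this toy set the top-ranked feature separates A375 from the other two lines but cannot distinguish A549 from MCF7, so a second feature is required for perfect accuracy, which mirrors the qualitative behaviour reported for the cell sample data.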
Moreover, we observed that the positions of the miR-10b-5p omics data points depend on their absolute RPM values for A549 experiments. In other words, higher RPM values result in significantly higher miR-10b-5p values (Table S2 and Fig. S3). This observation suggests that further data normalization could improve the classification accuracy of heterogeneous miRNA datasets. Nevertheless, we investigated the relationship between feature selection and prediction accuracy by selecting the best-ranked biologically and non-biologically relevant miRNAs (Fig. 5G). A prediction accuracy of 100 % was achieved with all combinations of selected miRNAs using the two best-ranked miRNAs.

3.5. Integrated miRNA expression and target gene analysis for enhanced cancer classification

The identification of a miRNA signature enhances the specificity, sensitivity, robustness, and reliability of cancer analysis. Our qRT-PCR experiments in cell culture reveal a clear differential expression of three key miRNAs across various cancer cell lines. For instance, miR-103a-3p is highly expressed in MCF7 cells (ΔCt = −0.55), less expressed in A549 cells (ΔCt = 2.26), and almost undetectable in A375 cells (ΔCt = 14.75). Conversely, miR-10b-5p is not expressed in MCF7 cells but shows notable expression in A375 (ΔCt = −0.74) and A549 (ΔCt = 2.26) cells. miR-27b-3p is predominantly expressed in A375 cells while being absent in MCF7 and A549 cells (ΔCt = 14.75 for both). These findings, depicted in Fig. 3 and Fig. S6, underscore the differential expression of our miRNA signature across cell lines, suggesting potential regulatory roles specific to each cell type. This insight is crucial for understanding the regulatory mechanisms involving these miRNAs and their implications in cancer biology. Our approach has successfully identified key miRNAs that are associated with significant cancer-related targets.
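To make the ΔCt differences quoted above easier to compare, the sketch below converts them to a linear scale. It assumes the common convention ΔCt = Ct(miRNA) − Ct(reference), so that relative abundance is proportional to 2^(−ΔCt), and it assumes 100 % amplification efficiency. The three values are those reported in the text for miR-103a-3p; the convention and the function name are assumptions, since the normalization used in the study is not restated in this section.

```python
# ΔCt values for miR-103a-3p as quoted in the text.
DELTA_CT = {"MCF7": -0.55, "A549": 2.26, "A375": 14.75}


def relative_abundance(delta_ct):
    """Linear-scale abundance relative to the reference (assumes 2^-ΔCt holds)."""
    return 2.0 ** (-delta_ct)


abundance = {cell: relative_abundance(d) for cell, d in DELTA_CT.items()}
fold_mcf7_vs_a549 = abundance["MCF7"] / abundance["A549"]

for cell, value in abundance.items():
    print(f"{cell}: {value:.3g}")
print(f"MCF7 / A549 fold difference: {fold_mcf7_vs_a549:.1f}")
```

Under these assumptions, the 2.81-cycle gap between MCF7 and A549 corresponds to roughly a seven-fold difference in abundance, whereas the gap to A375 spans several orders of magnitude.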
Investigating their interactions with oncogenic or tumour suppressor pathways could provide valuable insights into the role of these miRNAs in carcinogenesis. The Venn diagram analysis (Fig. 5E) of target genes for the three miRNAs reveals several key observations. miR-10b-5p and miR-27b-3p each have 7 unique target genes for specific tumour cell lines, while miR-103a-3p has the highest number of unique targets (10), indicating a distinct role in a third tumour type. Only 1 gene is common to all three miRNAs, with others showing partial overlap. The low number of common target genes suggests that these miRNAs regulate largely distinct pathways, further enhancing their utility for tumour classification. The distinct regulatory pathways support the specificity of this classification method, with each miRNA, through its unique set of target genes, playing a key role in differentiating and classifying various tumour cell lines. In summary, the Venn diagram supports the use of these three miRNAs for tumour classification due to their specific target genes and minimal overlap in regulatory pathways. Our integrated approach, combining miRNA-seq data and qRT-PCR validation, represents a significant advancement in cancer diagnostics. This model has the potential to revolutionize early diagnosis and treatment strategies, offering deeper insights into miRNA biology in tumours and its clinical applications. As such, it stands as a promising innovation in molecular oncology and personalized medicine.

4. Conclusions

Recent advancements in cancer diagnostics have increasingly leveraged the power of ML and omics data to enhance tumour classification and identify predictive biomarkers. miRNA profiling has emerged as a promising tool, enabling non-invasive detection and precise molecular characterization of cancers. Despite these advancements, achieving high accuracy and reliability across diverse cancer types and datasets remains challenging.
In this proof-of-concept study, we first presented a step-by-step guide to obtain miRNA data from publicly available cell experiments. Next, we classified lung, breast, and melanoma cancer cell data from cell line experiments. Cell lines were chosen for this proof-of-concept because of their controlled environmental conditions. Nevertheless, we prefiltered the omics dataset and identified 619 consistently expressed mature miRNAs from an initial pool of 2822. Feature-ranking-based classification with ML achieved 100 % classification accuracy for ensemble classifiers with subspace K-nearest neighbour (KNN) and neural networks. This high level of accuracy indicates the potential for minimal common miRNA patterns to be integrated into clinical diagnostics, offering a non-invasive method for precise tumour classification that could greatly enhance early detection and personalized treatment strategies. A low number of analysed miRNAs reduces qRT-PCR costs, sample analysis times and cell identification times. In addition, by integrating miRNA data with functional analysis, we conducted a meta-analysis and KEGG pathway enrichment, identifying critical biomarkers such as miR-103a-3p, miR-10b-5p, and miR-27b-3p. These biomarkers were validated through qRT-PCR, confirming their distinct expression profiles and demonstrating that high-importance miRNAs from qRT-PCR data significantly enhance predictive performance, addressing key challenges in cancer diagnostics. This work not only provides insights into miRNA-based cancer classification but also sets the stage for future research in personalized diagnostics and treatment strategies. The presence of intra-cell-line heterogeneity in the analysed omics dataset underlines the capability of our method to handle more challenging classification tasks.
Future investigations could focus on refining data normalization techniques, expanding miRNA-gene interaction networks, and applying these methods to other cell lines or patient-derived tumour samples, ultimately translating these findings into real-world clinical applications. Such advancements have the potential to revolutionize molecular oncology by offering more precise and individualized diagnostic and therapeutic options.

Author statement

 * The authors declare that this manuscript is original, has not been previously published, and is not being submitted to another journal for publication.
 * The authors declare no relevant conflicts of interest in relation to this work.
 * All authors have read and approved the final manuscript and agree to its submission to this journal.
 * This work was funded by AIRC (Fondazione Italiana per la Ricerca sul Cancro), Grant IG 2020 n. 24623.

CRediT authorship contribution statement

Sabrina Napoletano: Writing – review & editing, Writing – original draft, Visualization, Validation, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Paolo Antonio Netti: Writing – review & editing, Supervision. David Dannhauser: Writing – review & editing, Visualization, Validation, Software, Methodology, Investigation, Formal analysis, Data curation. Filippo Causa: Writing – review & editing, Supervision, Project administration, Funding acquisition, Conceptualization. The authors Sabrina Napoletano and David Dannhauser contributed equally to this paper.

Declaration of Competing Interest

The authors declare no conflict of interest.

Acknowledgements