Abstract

Objective: Invasive lung cancer staging poses significant challenges, often requiring painful and costly biopsy procedures. This study aims to identify non-invasive biomarkers for detecting bronchogenic carcinoma and its stages by analyzing gene expression data with bioinformatics and machine learning techniques. By leveraging these computational methods, we seek to eliminate the need for surgical intervention in the diagnostic process.

Methods: We utilized the TCGA-LUAD dataset, which includes gene expression data from healthy and cancerous samples. To identify robust biomarkers, we applied eight metaheuristic algorithms for feature selection, combined with four classification methods and two data fusion techniques to optimize performance.

Results: Our approach achieved 100% accuracy in distinguishing healthy samples from cancerous ones, outperforming existing methods that reported 97% accuracy. Notably, while prior methods have struggled to separate bronchogenic carcinoma stages effectively, our approach achieved an accuracy of approximately 77% in stage classification. Furthermore, using gene enrichment methods, we identified 5, 7, and 16 diagnostic biomarker candidates for stages I, II, and III & IV, respectively.

Conclusion: This study demonstrates that integrating bioinformatics, gene set enrichment, and biological pathway analysis can enable non-invasive diagnostics for bronchogenic carcinoma stages. These findings hold promise for developing alternatives to traditional, invasive staging systems, potentially improving patient outcomes and reducing healthcare costs.

Supplementary Information: The online version contains supplementary material available at 10.1007/s12672-025-02395-5.

Keywords: Biomarker, Bronchogenic carcinoma, Feature selection algorithms, Information fusion, Machine learning

Introduction

Lung cancer is a multifaceted disease that demands comprehensive research to unravel its intricate molecular mechanisms.
Among its subtypes, bronchogenic carcinoma stands out as one of the most widespread and aggressive malignancies globally [1]. Studies suggest that the prognosis of non-small cell lung cancer (NSCLC) depends strongly on the stage of disease progression: early diagnosis leads to higher survival rates. Patients with early-stage lung cancer often do not exhibit obvious symptoms [2] and, as a result, miss the optimal window for early treatment. Compounding this issue is the highly destructive nature of tumor metastasis, which remains a leading contributor to the elevated mortality associated with the disease. Molecular biomarkers can therefore be a valuable tool in cancer detection. Because the absence of clear symptoms makes early-stage NSCLC difficult to diagnose, early detection and treatment have become central therapeutic goals for lung cancer, urgently requiring reliable biomarkers for diagnosis and prognosis. Advances in bioinformatics have enabled the effective analysis and discovery of cancer genes, making the identification of biomarkers for early detection and treatment feasible. Research such as the present study is therefore crucial for detecting the early stages of the disease and preventing further progression. Additionally, because treatments are stage-specific, accurate diagnosis of disease stages is necessary: once the stage is known, an effective treatment for that stage can be proposed and prescribed. With significant advancements in sequencing technology in recent years, identifying genetic characteristics from gene expression data of healthy and cancer samples, such as RNA sequencing (RNA-seq), has shown great promise in cancer diagnosis.
Numerous studies show that bioinformatics methods are used to analyze comprehensive gene expression data and identify cancer-related biomarkers for cancer prevention, diagnosis, and treatment [3–6]. Various metaheuristic algorithms are also used for data analysis, and validated data mining algorithms help increase the performance of analyses on healthcare data [7]. Metaheuristics have a wide variety of applications in medical services, including improved classification systems, effective diagnosis systems, and increased diagnosis rates for various diseases [8]. Treatments have also been improved using these techniques, reducing complications for patients with long-term diseases. Some diseases have the disadvantage that an undiagnosed patient may unknowingly transmit the disease to many others; early diagnosis can prevent this spread. Metaheuristic techniques can be broadly divided into two categories: single-solution algorithms and population-based algorithms [9]. In single-solution algorithms, one candidate solution is iteratively modified until an optimal solution is established. In population-based algorithms, a set of solutions is randomly generated, and the values of all solutions are updated iteratively, with the best solution refined over successive iterations [10]. Metaheuristic algorithms such as artificial immune systems, cat swarm optimization, the firefly algorithm, the genetic algorithm, gray wolf optimization, glowworm swarm optimization, grasshopper optimization, crow search, tabu search, ant colony optimization, the bee algorithm, the chimpanzee optimization algorithm, and many others are very useful for feature extraction and selection in disease detection and early diagnosis [11]. The proposed method consists of two sections.
In the first section, gene expression data are used to detect and differentiate the four stages of bronchogenic carcinoma. In this study, RNA-seq data are used, which include gene expression features specific to bronchogenic carcinoma. In the second section, a set of metaheuristic algorithms, along with data fusion methods, is used to select the features that best separate the four stages of the disease. The search for the best features can be carried out using various methods, such as exhaustive search, optimization-based search, or metaheuristic algorithms; the process continues until the best feature set is obtained according to the evaluation criteria or a termination condition is met [12]. Feature selection methods based on evolutionary and metaheuristic algorithms have advantages over other approaches, including the ability to discover complex and nonlinear features, flexibility across different problems, reduction of model dimensionality and complexity, and applicability to large-scale problems [13]. Classification algorithms then assign the samples to the four disease stages, which serve as the classes of this study, and evaluation is carried out using accuracy, precision, and other metrics. The main purpose of a classification algorithm is to make accurate predictions; to this end, various algorithms have been developed, including decision trees, k-nearest neighbors, support vector machines, and naive Bayes, each with its own characteristics for classifying features. In this study, an appropriate and accurate feature selection from approximately 18,000 genes is presented to optimally differentiate the stages of bronchogenic carcinoma.
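To see why exhaustive search is only an option for tiny problems, consider a toy sketch (the gene names and the scoring function are hypothetical stand-ins; in the actual pipeline, a subset's score would be a classifier's cross-validated accuracy). With six genes, all 2^6 - 1 = 63 non-empty subsets can be enumerated; with ~18,000 genes, the 2^18000 subsets cannot, which is what motivates metaheuristic search:

```python
from itertools import combinations

# Toy relevance scores for six hypothetical genes; in the real pipeline the
# score of a subset would be a classifier's cross-validated accuracy.
relevance = {"g1": 0.9, "g2": 0.1, "g3": 0.7, "g4": 0.2, "g5": 0.8, "g6": 0.3}

def score(subset):
    # Reward relevant genes, penalize subset size (favors small gene panels).
    return sum(relevance[g] for g in subset) - 0.25 * len(subset)

genes = list(relevance)
best = max(
    (set(c) for r in range(1, len(genes) + 1) for c in combinations(genes, r)),
    key=score,
)
# Exhaustive enumeration is feasible here only because the search space is tiny.
print(sorted(best))
```

Metaheuristics replace this full enumeration with a guided sampling of the subset space, trading the guarantee of optimality for tractability.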
Subsequently, by examining and comparing four classification methods, namely naive Bayes (NB) [14], support vector machine (SVM) [15], k-nearest neighbors (KNN) [16], and decision tree (DT) [17], and utilizing data fusion techniques, the most reliable and optimal biomarkers for identifying the stages of bronchogenic carcinoma are proposed. The impetus for this research lies in the execution of 32 combined trials: eight distinct metaheuristic algorithms, each integrated with the four classification methods, to explore their synergistic effects. This approach ensures that the feature selection process achieves the highest possible accuracy. Few published studies report results combining eight feature selection algorithms with four classifiers to introduce biomarkers that separate the different stages of the complex disease of bronchogenic lung cancer. The resulting gene sets, serving as biomarkers for classifying and identifying bronchogenic carcinoma stages, are ultimately validated with greater reliability using enrichment methods and biological pathways. Certain gaps were identified in previous research, particularly a scarcity of studies on biomarkers that accurately indicate disease stages, especially in cancer staging. This research aims to identify biomarkers that differentiate lung cancer stages through the optimization of metaheuristic algorithms. The primary objective of the feature selection problem is to reduce the dimensionality of the feature set while preserving performance accuracy. Although various methods have been developed for dataset classification, metaheuristic algorithms have garnered significant attention for their effectiveness in addressing diverse optimization challenges.
A key advantage of these algorithms is their problem-independent nature, allowing them to function as general solvers capable of providing optimal solutions across complex search spaces. Consequently, they can be adapted, with minimal modification, to problems such as the inaccurate identification of lung cancer biomarkers while achieving satisfactory accuracy. The rest of this paper is arranged as follows: Sect. 2 provides a comprehensive literature review, highlighting relevant studies and methodologies. Section 3 details the proposed method, including data sources and analytical techniques. Section 4 presents the results of our analyses, followed by a discussion in Sect. 5 that interprets the findings and their implications. Finally, Sect. 6 concludes the paper, summarizing key insights and acknowledging the limitations of the study.

Literature review

Previous investigations followed one of two approaches. The first performs differentially expressed gene (DEG) analysis, a commonly used technique for analyzing RNA-seq data that identifies genes differentially expressed across two or more sample groups, followed by a metaheuristic method to select an optimal subset of the identified genes. The second approach employs a combination of multiple metaheuristic algorithms to select genes from the outset of the study, exploiting the unique strengths of each algorithm for precise gene selection decisions. The earlier studies below follow the first approach, whereas the subsequent papers adopt the second, which combines metaheuristic algorithms. M. Jansi Ran et al. proposed a two-stage algorithm for selecting informative genes in cancer data classification.
In the first stage, mutual information (MI)-based gene selection was applied, selecting only those genes highly informative about cancer [18]. Genes with high mutual information were passed to the second stage, in which a genetic algorithm (GA) identified and selected the optimal set of genes required for accurate classification; SVM was used for classification. The proposed MI-GA gene selection approach was applied to colon, lung, and ovarian cancer datasets. Rabia Musheer Aziz et al. proposed a hybrid machine learning (ML) framework based on the nature-inspired cuckoo search (CS) algorithm combined with the artificial bee colony (ABC) algorithm [19]. These algorithms were used to balance the exploration and exploitation phases of the ABC and GA algorithms in the search process. In preprocessing, independent component analysis (ICA) was applied to extract important genes from the dataset. Then, the proposed gene selection algorithms, coupled with the NB classifier and leave-one-out cross-validation (LOOCV), were used to find a small set of informative genes that maximize classification accuracy. For a comprehensive performance study, the proposed algorithms were applied to six benchmark gene expression datasets. The experimental comparison showed that the proposed framework (an ICA and CS-based hybrid algorithm with an NB classifier) performed a deeper search during the iterative process, which helped avoid premature convergence and yielded better results than the previously published feature selection algorithm for the NB classifier. Mohamed Tehnan et al. proposed a novel model called the voting-based enhanced binary Ebola optimization search algorithm (VBEOSA) [20].
This algorithm combines the binary Ebola optimization search algorithm (BEOSA) [21] with six established metaheuristics, namely differential evolution (DE) [22], ant colony optimization (ACO) [23], particle swarm optimization (PSO) [24], GA [25], simulated annealing (SA) [26], and tabu search (TS) [27], for feature selection and classification. Classification was improved using a voting-based model applied to a lung cancer gene expression dataset. In that study, VBEOSA was used to identify 50 important genes associated with lung cancer; further exploration through protein-protein interaction (PPI) analysis identified a group of 10 hub genes that serve as lung cancer biomarkers. Iqbal et al. focused on early lung cancer detection using an ACO algorithm combined with a deep neural network (DNN) [28]. The ACO algorithm, inspired by the foraging behavior of ants, was used for feature selection, which helps simplify the model and improve performance; the selected features were fed into the DNN to obtain the final results. That study focused on DNNs because of their strong performance in modeling data relationships and handling large datasets. Pirgazi et al. investigated an efficient hybrid filter-wrapper metaheuristic-based gene selection method for high-dimensional datasets [29]. Their two-stage hybrid algorithm, based on the shuffled frog leaping algorithm (SFLA), combines filtering and wrapping for efficient feature selection. Over the past decade, high-dimensional data in bioinformatics have received great attention, and Sangjin Kim et al. investigated alternative strategies for overcoming the associated challenges and identifying truly important features [30].
They categorized filtering methods into two groups: individual ranking and feature subset selection methods. They proposed a novel filter ranking method using an elastic net penalty with sure independence screening (SIS) based on resampling techniques to overcome these challenges. The method was applied to gene expression data for colon and lung cancer to assess classification performance and identify genes truly associated with these cancers; the results showed that the proposed PF method consistently identified the most promising lung cancer-related features. Shib Sankar et al. investigated 4,127 samples covering 9 types of cancer from NGS data, using ABC, ACO, DE, and PSO for feature selection and an SVM with an RBF kernel, tuned via tenfold cross-validation, for classification. The highest classification accuracy, 99.10%, was achieved by the ABC algorithm on brain lower-grade glioma data [31]. In another study, Shib Sankar et al. combined PSO with SVM [32]. They compared their method with joint mutual information (JMI) + SVM, minimum redundancy maximum relevance (mRMR) + SVM, and conditional mutual information maximization (CMIM) + SVM, and selected PSO + SVM as the best algorithm, with the highest accuracy of 97.80%. Based on these results, the second approach was used for gene selection in this study. Table 1 compares previous studies from 2019 to 2024, summarizing details such as publication year and the algorithms used.

Table 1. Comparative analysis of studies (2019–2024): key details and algorithms

- M. Jansi Ran et al. [18]. Method: MI-GA gene selection, with MI-based selection in the first stage and GA-based selection in the second, verified using an SVM classifier. Advantages: the MI-GA combination provides more accurate measurements, and this hybridization can reduce the complexity of classification models and obtain much better results. Limitations: microarray data and early stage only.
- Shib Sankar et al. [31]. Method: DESeq2 to select differentially expressed genes; ABC, ACO, DE, and PSO for optimization; SVM for classification. Advantages: genes selected by ABC, ACO, DE, and PSO are generally independent; variation in modeling performance among the optimizers leads to minutely overlapping gene sets irrespective of the dataset. Limitations: uses brain lower-grade glioma data.
- Sangjin Kim et al. [30]. Method: a new filter ranking method (PF) using the elastic net penalty with SIS based on a resampling technique. Advantages: SIS-LASSO, SIS-MCP, and SIS-SCAD with the proposed filtering achieved superior accuracy, AUROC, geometric mean, and true positive detection compared with the marginal maximum likelihood ranking method (MMLR). Limitations: low power in detecting truly important variables and in classification prediction.
- Pirgazi et al. [29]. Method: a hybrid of Incremental Wrapper Subset Selection with replacement (IWSSr) and SFLA to select effective features in large-scale gene datasets. Advantages: properly detects relationships between features and removes redundant and irrelevant features from the selected set. Limitations: uses small samples.
- Rabia Musheer Aziz et al. [19]. Method: the CS algorithm incorporated with ABC in the exploitation and exploration of the GA, with the NB classifier and LOOCV. Advantages: the hybrid algorithm performs a deeper search in the iterative process, avoiding premature convergence and producing better results than previously published feature selection algorithms. Limitations: not applied to other data types such as RNA-seq.
- Tehnan I. A. Mohamed et al. [33]. Method: VBEOSA, derived from the metaheuristic BEOSA (modeled on the Ebola virus infection mechanism) with binary optimization, using SVM, DT, KNN, random forest (RF), multilayer perceptron (MLP), and Gaussian naive Bayes (GNB) classifiers. Advantages: findings bear significant implications for enhancing the diagnosis, prognosis, and development of therapeutic strategies for lung cancer. Limitations: does not identify biomarker genes for cancer stages.
- S. Iqbal et al. [28]. Method: lung cancer detection based on ACO and DNN. Advantages: ACO and DL models can help in the early detection of lung cancer and other diseases, saving lives and reducing costs and the healthcare burden. Limitations: accuracy below 100% (the present study achieved 100% accuracy in diagnosing lung cancer).
- Ali Mahmoud Ali et al. [34]. Method: cat swarm optimization and an SVM approach for multi-omics data. Advantages: fills the gap left by studies that failed to explain which specific features affect classification results. Limitations: integrating SVM with SMOTE or CSO and K-means consumes significant computational time, which may not be practical in environments with limited computational power.
- Ali Mahmoud Ali et al. [35]. Method: the intersection of omics and AI models. Advantages: ML and DL algorithms can increase cancer diagnostic accuracy, speed, and cost-effectiveness. Limitations: the omics community must address the gap between theory and practice by developing relevant applications.
- Ihsan Jasim Hussein et al. [36]. Method: features generated by edge operators are used as input to train three ANN classifiers, with a fusion technique making the decision over the entire feature set. Advantages: performed on gynecological ultrasound images to identify suspicious objects or cases for analyzing the probability of cancer. Limitations: noise in images of breast and ovarian tumors.

An examination of all these articles indicated a deficiency in precise disease staging using gene expression data to compare disease stages, likely attributable to small sample sizes or insufficient accuracy in differentiating case from healthy samples. In the present research, however, the accuracy of distinguishing diseased samples from healthy ones was nearly 100%.
Additionally, bronchogenic carcinoma stages were identified through comparative analysis, i.e., comparing Stage I with Stage II, Stage I with Stage III & IV, and finally Stage II with Stage III & IV. In each case, biomarkers specific to the corresponding disease stage were identified using eight metaheuristic algorithms, classification was performed using four algorithms, and accuracy and precision were assessed. The biomarkers were further validated through active biological pathways, confirming their precision in distinguishing the different stages of bronchogenic carcinoma; the proposed biomarkers provided reliable differentiation of the four stages of the disease. Ali Mahmoud Ali et al. [34] attempt to fill the gap left by previous studies that failed to explain which specific features affect classification results. Their work aims to improve feature selection, thereby increasing the significance and accuracy of cancer type description using a powerful multi-omics data clustering framework: cat swarm optimization (CSO) feature selection to isolate relevant features for prediction, K-means for clustering the dataset, and a nonlinear SVM. However, according to that paper's methodology, integrating SVM with SMOTE or CSO and K-means consumes a significant amount of computational time, which may not be beneficial in environments with limited computational power. Ali Mahmoud Ali et al. [35] provide unique insights into the intersection of omics and AI models; one of the main findings of that article is that clinicians are looking for a comprehensive measure for the use of artificial intelligence in processing omics data. The challenge noted in that paper is that healthcare companies rarely deploy DL applications in real-world contexts.
Therefore, the omics community must address the gap between theory and practice by developing relevant applications; in research related to a simple medical validation strategy, DL features, feature data, and definitions should be kept in mind. In another paper, Ihsan Jasim Hussein et al. [36] focus on image texture enhancement using a novel treatment that reduces speckle noise while preserving the most important information. Features generated by edge operators are used as input to train three ANN classifiers, and a fusion technique makes the decision using the entire feature set. Overall, 95.87% accuracy is achieved on an ovarian tumor data collection. The method was applied to gynecological ultrasound images to identify suspicious objects or cases with health consequences for analyzing the probability of cancer. To resolve ambiguity in the research method, the approach could be enhanced by using Miter filters to improve the image and reduce speckle noise for breast and ovarian tumors. Alzheimer's disease (AD) [37] constitutes a significant global health issue: although its prevalence keeps rising, there are still no effective drugs to treat it. Recently, deep learning (DL) techniques have been used increasingly to diagnose AD, with applications including natural language processing (NLP), drug repurposing, classification, and identification. One important finding is that convolutional neural networks (CNNs) are the models most often used in AD research, and Python is the language most often used for DL work. Shiva Toumaj et al. [37] conducted a study on AD and reported that, alarmingly, 88.9% of the examined studies did not address security issues.
This shows an immediate need to build DL models for AD with greater awareness of security issues. Even though progress has been made, important issues such as privacy, accessibility, fairness, and transferability still need to be dealt with to ensure that AI is used in healthcare responsibly and effectively. Nowadays, ML has reached a high level of achievement in many contexts. The importance of DL in Internet of Things (IoT)-based bio- and medical informatics lies in its ability to analyze and interpret large amounts of complex and diverse data in real time, providing insights that can improve healthcare outcomes and increase efficiency in the healthcare industry. Such research provides valuable guidance for further work on DL in medical and bioinformatics problems, and it has been shown that some aspects, such as security and convergence time, are underexplored in the literature [38].

Proposed method

This study mainly aims to identify gene markers for detecting different stages of lung cancer. Given that lung cancer was recognized as the second leading cause of cancer-related deaths in 2020 (26), this research aims to identify marker genes that can separate healthy and cancer samples and distinguish the different stages of lung cancer using RNA-seq gene expression data. The study is implemented in two steps; a schematic of the research is shown in Fig. 1.

Fig. 1. Schematic view of the study

In the first step, the required data are obtained by downloading RNA-seq samples of lung tissue from TCGA. These samples include gene expression data from both healthy and cancerous lung tissues. Normalization and removal of redundant data are performed, and the data are extracted in the form of a matrix, normalized using the trimmed mean of M-values (TMM) method.
The second step focuses on feature selection, for which eight metaheuristic methods are used: ABC [24] and ACO [23], related to swarm intelligence; the water cycle algorithm (WCA) [39] and SA [26], related to natural phenomena; harmony search (HS) [40] and teaching-learning-based optimization (TLBO) [41], inspired by human behaviors; and DE [22] and GA [25], related to evolutionary optimization. Figure 2 shows the classification of the metaheuristic methods used.

Fig. 2. Classification of the metaheuristic optimization algorithms used

The samples used in this study are divided into four categories: Healthy, Stage I, Stage II, and Stage III & IV [42]. To ensure a comprehensive feature extraction process, features are extracted in six different combinations using the metaheuristic algorithms: three combinations compare healthy samples against Stage I, Stage II, and Stage III & IV, respectively, while the other three compare Stage I vs. Stage II, Stage I vs. Stage III & IV, and Stage II vs. Stage III & IV. Numerous methods have been developed for dataset classification, with metaheuristic algorithms gaining significant attention for their effectiveness in solving a wide range of optimization problems. These algorithms are broadly categorized into four groups based on their underlying principles: swarm intelligence, natural phenomena, human inspiration, and evolutionary algorithms. In this study, two representative algorithms were selected from each category, resulting in a total of eight metaheuristic approaches.
These include the artificial bee colony (ABC) and ant colony optimization (ACO), which fall under swarm intelligence; the water cycle algorithm (WCA) and simulated annealing (SA), associated with natural phenomena; harmony search (HS) and teaching-learning-based optimization (TLBO), inspired by human behaviors; and differential evolution (DE) and the genetic algorithm (GA), which are rooted in evolutionary optimization. Together, these algorithms provide a comprehensive framework for exploring optimization challenges. Metaheuristic algorithms use different strategies to discover and optimize subsets of features, each specifying a mechanism for solving feature selection. For example, the GA mimics natural selection: it starts with a population of random feature subsets, then iterates through selection, crossover (recombination of feature subsets), and mutation (random changes in subsets) to create better feature sets, using a fitness function to evaluate the quality of each subset and guide the search toward optimal solutions. Metaheuristics differ along four main axes: 1. Population-based vs. single-solution approaches: population-based algorithms, such as GA, particle swarm optimization (PSO), and ACO, maintain a diverse set of solutions, enabling simultaneous exploration of multiple regions of the search space. In contrast, single-solution algorithms like SA and tabu search (TS) operate on a single solution at a time, emphasizing localized refinement and optimization. 2. Exploration vs. exploitation: some algorithms, like PSO and ABC, emphasize a combination of exploration (searching new regions) and exploitation (refining good solutions); others, like SA and TS, concentrate the search around a local region. 3.
Search mechanism: algorithms like GA and ACO use probabilistic transitions (crossover, mutation, pheromones), while PSO and DE rely on arithmetic update rules. 4. Convergence: algorithms like PSO converge faster due to their more continuous nature, while GA and ACO may require more generations or iterations to converge but can handle complex search spaces thanks to their diversity-driven exploration. In this study, eight methods, ABC, HS, DE, ACO, GA, SA, WCA, and TLBO, are used for feature selection. Each method and its main parameters are as follows:

- ABC: inspired by the food-searching behavior of bees. The range of values is [0,1], the selection threshold is set to 0.5, and the maximum number of trials without improvement is set to 5.
- HS: inspired by musical tuning; uses the pitch adjustment rate (PAR = 0.05), the harmony memory consideration rate (HMCR = 0.7), and the search bandwidth (bw = 0.2) for optimization.
- DE: a combination of mutation and selection; the crossover rate (CR = 0.9) and the amplification factor (F = 0.5) are its main parameters.
- ACO: the search path is guided by pheromone deposition. Pheromone control (α = 1), heuristic utility control (β = 0.1), and the pheromone evaporation rate (ρ = 0.2) are used, with the initial pheromone value set to 1.
- GA: uses the crossover rate (CR = 0.8) and mutation rate (MR = 0.01) to optimize the features.
- SA: works by gradually decreasing the temperature; the cooling rate (c = 0.93) and the initial temperature (T0 = 100) are its main parameters.
- WCA: simulates the movement of rivers; the initial population size is 50, the number of rivers is 4, and the maximum number of iterations is 100.
- TLBO: imitates the learning process; the population size is 50, and the maximum number of iterations is 100.
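The GA-based feature selection described above can be sketched in a minimal form. This is a toy illustration, not the study's implementation: the per-gene relevance scores and the size penalty are hypothetical stand-ins for a classifier's cross-validated accuracy, the problem is shrunk to 12 genes, and the mutation rate is raised above the study's 0.01 to suit the tiny problem size (the crossover rate 0.8 matches the study):

```python
import random

random.seed(42)

N_GENES = 12            # toy problem; the study works with ~18,000 genes
CR, MR = 0.8, 0.05      # CR = 0.8 as in the study; MR raised for this toy size
POP, GENERATIONS = 30, 60

# Hypothetical per-gene relevance; in the study, fitness would instead be the
# cross-validated accuracy of a classifier trained on the selected genes.
relevance = [0.9, 0.1, 0.8, 0.2, 0.7, 0.1, 0.6, 0.2, 0.9, 0.1, 0.3, 0.2]

def fitness(mask):
    # Reward relevant genes, penalize subset size (favors small gene panels).
    return sum(r for r, m in zip(relevance, mask) if m) - 0.25 * sum(mask)

def tournament(pop):
    # Binary tournament selection: the fitter of two random individuals wins.
    a, b = random.sample(pop, 2)
    return max(a, b, key=fitness)

pop = [[random.randint(0, 1) for _ in range(N_GENES)] for _ in range(POP)]
best = max(pop, key=fitness)
for _ in range(GENERATIONS):
    nxt = []
    while len(nxt) < POP:
        p1, p2 = tournament(pop), tournament(pop)
        if random.random() < CR:                 # one-point crossover
            cut = random.randrange(1, N_GENES)
            child = p1[:cut] + p2[cut:]
        else:
            child = p1[:]
        # Bit-flip mutation with probability MR per gene.
        child = [1 - g if random.random() < MR else g for g in child]
        nxt.append(child)
    pop = nxt
    best = max(best, max(pop, key=fitness), key=fitness)  # keep best-ever

selected = [i for i, m in enumerate(best) if m]
print(selected, fitness(best))
```

Swapping the toy fitness for a classifier's cross-validated accuracy on the masked gene-expression matrix turns this sketch into a wrapper feature selector of the kind combined with NB, SVM, DT, and KNN in this study.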
In all methods, the feature values are scaled to the range [0, 1], and the maximum number of iterations is set to 100. To confirm these settings, a systematic search over a range of candidate values was conducted for the parameters of each algorithm, and the optimal values found agreed with those reported in this study. Features are extracted using these eight methods for each of the six pairwise comparisons. The quality of the extracted features is evaluated using the NB, SVM, DT, and KNN classifiers. The models were trained using 10 repetitions of fivefold cross-validation, and the results averaged over these runs are reported; this practice plays an important role in the stability of the models designed in this research. Additionally, the Borda count method is used for decision fusion. Borda count is an extended version of majority voting and is the most commonly used method for unsupervised rank-level fusion [[95]43]. In this voting scheme, each matcher ranks a fixed set of candidate identities in order of preference: the top-ranked candidate receives N votes, the second-ranked candidate N − 1 votes, and so on. The votes from all matchers are then summed for each candidate, and the candidate with the highest total is designated the winner [[96]44]. In our setting, each classifier plays the role of a matcher and the classes are the candidates. The confusion matrix is used when the output includes two or more classes [[97]45]. The metrics of accuracy, recall (sensitivity), specificity, precision, and F-measure were used. The receiver operating characteristic (ROC) curve is an evaluation metric for binary classification problems: it plots the true positive rate (TPR) against the false positive rate (FPR) at different threshold values, essentially separating the signal from the noise.
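The Borda scheme just described can be sketched in a few lines; the class labels in the toy call are hypothetical and stand in for the candidate decisions produced by the classifiers.

```python
from collections import defaultdict

def borda_count(rankings):
    """Fuse ranked candidate lists with the Borda count.

    Each ranking lists candidates from most to least preferred; the
    top candidate of an N-long ranking receives N votes, the second
    N - 1 votes, and so on. The candidate with the most total votes
    is declared the winner."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for rank, candidate in enumerate(ranking):
            scores[candidate] += n - rank
    return max(scores, key=scores.get), dict(scores)

# Toy example: three classifiers rank two candidate classes.
winner, scores = borda_count([
    ["cancer", "healthy"],   # classifier 1 prefers "cancer"
    ["cancer", "healthy"],   # classifier 2 agrees
    ["healthy", "cancer"],   # classifier 3 dissents
])
```

Here "cancer" collects 2 + 2 + 1 = 5 votes against 4 for "healthy", so the fused decision is "cancer" even though one classifier dissented.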
The area under the curve (AUC) measures the classifier's ability to discriminate between classes: the higher the AUC, the better the model distinguishes the healthy and cancer classes. The AUC is a number between zero and one and indicates how well the test method separates true positive (TP) from true negative (TN) results. A value close to one means that the ROC curve lies well above the diagonal, the true positive rate is high, and the test has good diagnostic power. An AUC near 0.5 indicates that the true positive rate and false positive rate are equal, and values below 0.5 indicate that the false positive rate exceeds the true positive rate. The role of information fusion methods is to integrate the outputs of the base classifiers, allowing them to complement one another and compensate for each other's weaknesses. The data for this study were downloaded and preprocessed by extracting RNA-Seq samples from the TCGA database through the TCGAbiolinks package in R. The TCGA-LUAD project data were obtained via a query that included primary tumor samples and solid normal tissue, using GENCODE v38 for accurate gene annotation. Following the download, the gene expression data were processed together with the patients' clinical information to prepare for the subsequent analysis steps. To enhance data quality, only protein-coding genes were included in the analysis, and genes with low expression levels were eliminated. The TMM (trimmed mean of M-values) method in the edgeR package was used to normalize the data and correct for variation caused by differences in read counts between samples. Finally, a log transformation was applied, and the normalized values were saved for use in the subsequent analysis steps.
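The filtering and log-transformation steps can be illustrated with a simplified stand-in. Note the assumption: the study itself performs TMM normalization with edgeR in R, whereas this sketch uses plain counts-per-million only to show the shape of the low-expression filter and log step.

```python
import numpy as np

def filter_and_log_cpm(counts, min_cpm=1.0, min_samples=3):
    """Simplified stand-in for the preprocessing described above:
    drop weakly expressed genes, normalize for library size, and
    log-transform. (The study uses edgeR's TMM normalization in R;
    plain counts-per-million is used here only for illustration.)

    counts: genes x samples matrix of raw read counts."""
    lib_sizes = counts.sum(axis=0)                      # total reads per sample
    cpm = counts / lib_sizes * 1e6                      # counts per million
    keep = (cpm >= min_cpm).sum(axis=1) >= min_samples  # expressed often enough
    return np.log2(cpm[keep] + 1.0), keep
```

Genes that never reach `min_cpm` in at least `min_samples` samples are removed before normalization downstream, which is the usual rationale for this filter: near-zero counts carry mostly noise.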
Afterwards, 584 RNA-Seq samples from healthy individuals and lung cancer patients, each containing 18,043 features, were divided into four groups: Healthy, Stage I, Stage II, and Stage III & IV. To evaluate the performance of the feature selection algorithms and the selected features, the samples were assessed in six pairwise categories: Healthy vs. Stage I, Healthy vs. Stage II, Healthy vs. Stage III & IV, Stage I vs. Stage II, Stage I vs. Stage III & IV, and, finally, Stage II vs. Stage III & IV. The features were determined using the eight feature selection methods, and the performance of the selected features was evaluated by the four classifiers. Additionally, the Borda count decision fusion method was used to combine the 32 decisions (eight feature sets × four classifiers) made at each stage.

Results

Evaluation dataset

The dataset included 584 RNA-Seq samples, with the expression of 18,043 genes measured in each sample: 59 healthy samples, 292 Stage I samples, 123 Stage II samples, 84 Stage III samples, and 26 Stage IV samples. Given the small number of Stage IV samples and their clinical similarity to Stage III, Stage III and Stage IV were combined into Stage III & IV, comprising a total of 110 RNA-Seq samples.

Evaluation results

After applying the eight feature selection methods to the data, the features were identified, and the performance of the selected features was evaluated using the four classifiers. Given the high dimensionality of RNA-seq data (18,043 features), the use of metaheuristic methods for feature selection is well justified: these algorithms can efficiently search a very large feature space and identify an optimal subset of important features. The metaheuristic methods employed use stochastic search mechanisms and evolutionary optimization to find optimal combinations of features.
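The grouping and the six pairwise comparisons can be set up as follows; this is a sketch using only the group sizes from the text (the expression matrix itself is omitted).

```python
from itertools import combinations
import numpy as np

# Group sizes from the text (Stage III and IV merged into one group).
groups = {"Healthy": 59, "Stage I": 292, "Stage II": 123, "Stage III & IV": 110}
labels = np.concatenate([np.repeat(name, n) for name, n in groups.items()])

# The six pairwise classification tasks evaluated in the study.
tasks = {}
for a, b in combinations(groups, 2):
    mask = np.isin(labels, [a, b])             # rows belonging to this pair
    tasks[(a, b)] = (mask, labels[mask] == a)  # row mask + binary target
```

Each task is then handed to the feature selection and classification pipeline independently, which is why the selected gene sets differ between comparisons.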
These methods are especially advantageous when the search space is very large and classical feature selection algorithms (such as filter-based methods or regression-based feature selection) are limited. Additionally, the Borda count decision fusion method was used to combine the 32 decisions made at each stage.

Healthy group vs. Stage I lung cancer

Using the aforementioned feature selection methods, the features providing the best distinction between the healthy group and Stage I lung cancer were identified. An important issue at this stage was the imbalance in the number of samples between the healthy group and the cancer stages. To address it, given that there were only 59 healthy samples, 59 samples were randomly drawn from each class in each of 20 rounds. In each round, the features with the best separability were selected; finally, the features that appeared in at least 80% of the rounds were retained. Table [98]2 presents the feature selection algorithms, the number of selected features, and the performance of the classifiers for each feature set. Note that each classifier was run ten times, and the results presented in Table [99]2 represent the average of the ten runs. In this stage, the data of healthy and Stage I lung cancer samples were used. Table 2.
Results of evaluating the performance of different feature selection algorithms on healthy and Stage I lung cancer samples using the NB, SVM, DT, and KNN classifiers, along with Borda count decision fusion, for the eight feature sets.

| FS method | Selected features | Classifier | Accuracy | Sensitivity | Specificity | Precision | F-measure |
| ABC | 775 | NB | 0.904 | 0.906 | 0.906 | 0.923 | 0.903 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.939 | 0.938 | 0.938 | 0.945 | 0.939 |
| | | KNN | 0.974 | 0.973 | 0.973 | 0.976 | 0.974 |
| HS | 656 | NB | 0.930 | 0.931 | 0.931 | 0.939 | 0.929 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.939 | 0.938 | 0.938 | 0.942 | 0.939 |
| | | KNN | 0.965 | 0.964 | 0.964 | 0.970 | 0.965 |
| DE | 942 | NB | 0.939 | 0.940 | 0.940 | 0.947 | 0.938 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.983 | 0.983 | 0.983 | 0.984 | 0.983 |
| | | KNN | 0.983 | 0.982 | 0.982 | 0.985 | 0.982 |
| ACO | 566 | NB | 0.930 | 0.930 | 0.930 | 0.946 | 0.929 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.974 | 0.973 | 0.973 | 0.976 | 0.974 |
| | | KNN | 0.983 | 0.982 | 0.982 | 0.985 | 0.982 |
| GA | 387 | NB | 0.965 | 0.967 | 0.967 | 0.968 | 0.965 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| | | KNN | 0.974 | 0.973 | 0.973 | 0.977 | 0.974 |
| SA | 213 | NB | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| | | SVM | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| | | DT | 0.965 | 0.965 | 0.965 | 0.967 | 0.965 |
| | | KNN | 0.983 | 0.982 | 0.982 | 0.985 | 0.982 |
| WCA | 354 | NB | 0.939 | 0.939 | 0.939 | 0.945 | 0.939 |
| | | SVM | 0.991 | 0.991 | 0.991 | 0.992 | 0.991 |
| | | DT | 0.930 | 0.930 | 0.930 | 0.933 | 0.930 |
| | | KNN | 0.965 | 0.964 | 0.964 | 0.970 | 0.965 |
| TLBO | 1559 | NB | 0.926 | 0.928 | 0.928 | 0.938 | 0.925 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.965 | 0.964 | 0.964 | 0.967 | 0.965 |
| | | KNN | 0.965 | 0.964 | 0.964 | 0.970 | 0.965 |
| Borda count decision fusion | | | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

Table [101]2 indicates the performance of the selected features. As shown in the last row of Table [102]2, the Borda count method was used for decision fusion to improve the performance of the classifiers. This method ranks the decisions made by the classifiers: the decision with the highest frequency receives a higher rank, and the final result is evaluated based on the weight assigned to each rank.
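The balanced-resampling scheme described above (equal numbers of samples per class drawn over 20 rounds, keeping the features chosen in at least 80% of the rounds) can be sketched as follows; `select_fn` is a placeholder for any of the eight metaheuristic selectors, not an actual function from the study.

```python
import numpy as np

def stable_features(select_fn, X, y, n_rounds=20, keep_frac=0.8, seed=0):
    """Balanced-resampling stability filter: each round draws an equal
    number of samples per class (the size of the smallest class, e.g.
    the 59 healthy samples), runs a feature selector on that balanced
    subset, and finally keeps the features chosen in at least
    `keep_frac` of the rounds.

    select_fn(X, y) -> boolean mask over the feature columns."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(X.shape[1])
    classes = np.unique(y)
    n_per_class = min(int((y == c).sum()) for c in classes)
    for _ in range(n_rounds):
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), n_per_class, replace=False)
            for c in classes
        ])
        votes += select_fn(X[idx], y[idx])   # boolean mask counted as 0/1
    return votes >= keep_frac * n_rounds
```

The 80% threshold trades sensitivity for robustness: a gene that is selected only under particular subsamples is treated as an artifact of the class imbalance rather than a stable marker.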
The Borda count decision fusion method improved performance, achieving 100% accuracy and precision. In almost all cases, the SVM classifier outperformed the other methods. Table [103]2 shows the number of features selected by each feature selection method. Next, the features selected by all methods were aggregated: features chosen by at least 4 of the 8 feature selection methods (608 features) and by at least 5 methods (42 features) were used for classification (Table [104]3). Accordingly, feature fusion was performed, and the different classification algorithms were run 10 times on the training and test data. Table [105]3 shows the results of the classifiers for the 608 and 42 features, which represent genes expressed in healthy and Stage I lung cancer RNA-seq samples. Figure [106]3 shows the ROC curves calculated for the features in Table [107]3; as the plots show, all four classifiers distinguish the healthy and cancer classes with almost 100% accuracy.

Table 3. Results of four classifiers based on the combination of the selected features shown in Table [108]2.

| FS method | Selected features | Classifier | Accuracy | SD of accuracy | Sensitivity | Precision | F-measure |
| At least 4 methods | 608 | NB | 0.965 | ±0.034 | 0.966 | 0.969 | 0.967 |
| | | SVM | 0.991 | ±0.015 | 0.991 | 0.992 | 0.991 |
| | | DT | 0.983 | ±0.023 | 0.983 | 0.983 | 0.983 |
| | | KNN | 0.957 | ±0.031 | 0.957 | 0.961 | 0.959 |
| At least 5 methods | 42 | NB | 1.000 | ±0.003 | 1.000 | 1.000 | 1.000 |
| | | SVM | 0.991 | ±0.012 | 0.992 | 0.992 | 0.992 |
| | | DT | 0.974 | ±0.029 | 0.974 | 0.974 | 0.974 |
| | | KNN | 0.983 | ±0.017 | 0.983 | 0.983 | 0.983 |

Fig. 3. ROC curve plots based on the combination of selected features on healthy and Stage I lung cancer samples.
In this figure, the first row is plotted for the 608 features and the second row for the 42 features in the bottom rows of Table [112]3.

Healthy group vs. Stage II lung cancer

Features that effectively classify healthy and Stage II lung cancer RNA-seq samples were selected. For this purpose, the eight metaheuristic algorithms were used for feature selection, along with the four classifiers. Table [113]4 presents the feature selection algorithms, the number of features, and the performance of the classifiers for each feature set. Each classifier was run ten times; the results presented in Table [114]4 represent the average of these runs. In this stage, the data of healthy and Stage II lung cancer samples were used. Table [115]4 demonstrates the performance of the selected features. As shown in the last row of the table, the Borda count algorithm, a decision ranking method, was used for decision fusion to improve the performance of the classifiers. In almost all cases, the SVM classifier outperformed the other methods. Next, the features selected by all methods were aggregated: features chosen by at least 4 of the 8 feature selection methods (681 features) and by at least 5 methods (55 features) were used for classification. The different classification algorithms were run 10 times on the training and test data. Table [116]5 shows the results of the classifiers for the 681 and 55 features. Figure [117]4 shows the ROC curves calculated for the features in Table [118]5; all four classifiers distinguish the healthy and cancer classes with almost 100% accuracy. Table 4.
Results of evaluating the performance of different feature selection algorithms on healthy and Stage II lung cancer samples using the NB, SVM, DT, and KNN classifiers, along with Borda count decision fusion, for the eight feature sets.

| FS method | Selected features | Classifier | Accuracy | Sensitivity | Specificity | Precision | F-measure |
| ABC | 739 | NB | 0.917 | 0.917 | 0.917 | 0.931 | 0.916 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.939 | 0.939 | 0.939 | 0.944 | 0.939 |
| | | KNN | 0.974 | 0.973 | 0.973 | 0.977 | 0.974 |
| HS | 691 | NB | 0.939 | 0.939 | 0.939 | 0.952 | 0.937 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.974 | 0.973 | 0.973 | 0.976 | 0.974 |
| | | KNN | 0.957 | 0.957 | 0.957 | 0.963 | 0.956 |
| DE | 1074 | NB | 0.907 | 0.908 | 0.908 | 0.929 | 0.903 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.957 | 0.957 | 0.957 | 0.958 | 0.956 |
| | | KNN | 0.965 | 0.965 | 0.965 | 0.969 | 0.965 |
| ACO | 628 | NB | 0.935 | 0.936 | 0.936 | 0.950 | 0.933 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.965 | 0.966 | 0.966 | 0.969 | 0.965 |
| | | KNN | 0.974 | 0.973 | 0.973 | 0.976 | 0.974 |
| GA | 379 | NB | 0.922 | 0.921 | 0.921 | 0.936 | 0.920 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.983 | 0.982 | 0.982 | 0.985 | 0.982 |
| | | KNN | 0.983 | 0.983 | 0.983 | 0.984 | 0.983 |
| SA | 935 | NB | 0.917 | 0.918 | 0.918 | 0.934 | 0.916 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.948 | 0.946 | 0.946 | 0.954 | 0.947 |
| | | KNN | 0.965 | 0.965 | 0.965 | 0.969 | 0.965 |
| WCA | 349 | NB | 0.930 | 0.930 | 0.930 | 0.943 | 0.929 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| | | KNN | 0.983 | 0.983 | 0.983 | 0.984 | 0.983 |
| TLBO | 793 | NB | 0.943 | 0.944 | 0.944 | 0.947 | 0.943 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.957 | 0.956 | 0.956 | 0.962 | 0.956 |
| | | KNN | 0.939 | 0.939 | 0.939 | 0.949 | 0.938 |
| Borda count decision fusion | | | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

Table 5.
Results of four classifiers based on the combination of the selected features shown in Table [120]4.

| FS method | Selected features | Classifier | Accuracy | SD of accuracy | Sensitivity | Precision | F-measure |
| At least 4 methods | 681 | NB | 0.930 | ±0.034 | 0.930 | 0.939 | 0.934 |
| | | SVM | 1.000 | ±0.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 1.000 | ±0.003 | 1.000 | 1.000 | 1.000 |
| | | KNN | 0.965 | ±0.024 | 0.967 | 0.968 | 0.967 |
| At least 5 methods | 55 | NB | 0.983 | ±0.011 | 0.983 | 0.984 | 0.983 |
| | | SVM | 0.965 | ±0.021 | 0.965 | 0.970 | 0.967 |
| | | DT | 0.939 | ±0.043 | 0.939 | 0.948 | 0.943 |
| | | KNN | 0.957 | ±0.031 | 0.958 | 0.960 | 0.959 |

Fig. 4. ROC curve plots based on the combination of selected features on healthy and Stage II lung cancer samples. In this figure, the first row is plotted for the 681 features and the second row for the 55 features in the bottom rows of Table [124]5.

Healthy group vs. Stage III and IV lung cancer

In this stage, data of healthy and Stage III & IV lung cancer samples were used to select suitable features and classify them. Table [125]6 presents the feature selection algorithms, the number of features, and the performance of the classifiers for each feature set. Each classifier was run ten times; the results presented in the following table represent the average of these runs. Table [126]6 demonstrates the performance of the selected features. As shown in the last row of the table, decision fusion was used to enhance the performance of the classifiers: the accuracy of the Borda count method was 100%, indicating the quality of the selected features. In almost all cases, the SVM classifier outperformed the other methods. Table 6.
Results of evaluating the performance of different feature selection algorithms on healthy and Stage III & IV lung cancer samples using the NB, SVM, DT, and KNN classifiers, along with Borda count decision fusion, for the eight feature sets.

| FS method | Classifier | Accuracy | Sensitivity | Specificity | Precision | F-measure |
| ABC | NB | 0.922 | 0.923 | 0.923 | 0.928 | 0.921 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.948 | 0.948 | 0.948 | 0.949 | 0.948 |
| | KNN | 0.974 | 0.975 | 0.975 | 0.976 | 0.974 |
| HS | NB | 0.922 | 0.923 | 0.923 | 0.933 | 0.921 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.991 | 0.991 | 0.991 | 0.992 | 0.991 |
| | KNN | 0.983 | 0.983 | 0.983 | 0.983 | 0.983 |
| DE | NB | 0.922 | 0.922 | 0.922 | 0.930 | 0.920 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.922 | 0.921 | 0.921 | 0.928 | 0.921 |
| | KNN | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| ACO | NB | 0.913 | 0.914 | 0.914 | 0.929 | 0.911 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.965 | 0.964 | 0.964 | 0.967 | 0.965 |
| | KNN | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| GA | NB | 0.904 | 0.905 | 0.905 | 0.915 | 0.903 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| | KNN | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| SA | NB | 0.902 | 0.902 | 0.902 | 0.914 | 0.900 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.974 | 0.974 | 0.974 | 0.976 | 0.974 |
| | KNN | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| WCA | NB | 0.922 | 0.921 | 0.921 | 0.929 | 0.921 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.957 | 0.956 | 0.956 | 0.959 | 0.956 |
| | KNN | 0.974 | 0.974 | 0.974 | 0.976 | 0.974 |
| TLBO | NB | 0.948 | 0.947 | 0.947 | 0.958 | 0.946 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.957 | 0.956 | 0.956 | 0.957 | 0.956 |
| | KNN | 0.983 | 0.983 | 0.983 | 0.985 | 0.983 |
| Borda count decision fusion | | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

Next, the features selected by all methods were aggregated. The features chosen by at least 4 of the 8 feature selection methods (587 features) and by at least 5 methods (35 features) are shown in Table [128]7. Figure [129]5 shows the ROC curves calculated for the features in Table [130]7; as the plots show, all four classifiers distinguish the healthy and cancer classes with almost 100% accuracy.
Table 7. Results of four classifiers based on the combination of the selected features shown in Table [131]6.

| FS method | Selected features | Classifier | Accuracy | SD of accuracy | Sensitivity | Precision | F-measure |
| At least 4 methods | 587 | NB | 0.930 | ±0.022 | 0.929 | 0.937 | 0.933 |
| | | SVM | 0.991 | ±0.014 | 0.991 | 0.992 | 0.991 |
| | | DT | 0.974 | ±0.019 | 0.974 | 0.974 | 0.974 |
| | | KNN | 0.983 | ±0.014 | 0.983 | 0.984 | 0.983 |
| At least 5 methods | 35 | NB | 1.000 | ±0.002 | 1.000 | 1.000 | 1.000 |
| | | SVM | 0.991 | ±0.011 | 0.991 | 0.992 | 0.991 |
| | | DT | 0.948 | ±0.036 | 0.946 | 0.954 | 0.950 |
| | | KNN | 0.906 | ±0.042 | 0.906 | 0.925 | 0.915 |

Fig. 5. ROC curve plots based on the combination of selected features on healthy and Stage III & IV lung cancer samples. In this figure, the first row is plotted for the 587 features and the second row for the 35 features in the bottom rows of Table [135]7.

Stage I vs. Stage II lung cancer

Based on the results obtained in the previous sections, healthy samples and the different stages of lung cancer could be distinguished with high accuracy using the extracted features. However, to better evaluate the performance of the methods, the same steps were applied to separate Stage I from Stage II. First, the eight feature selection algorithms were applied to the 292 Stage I samples and 123 Stage II samples. Then, to evaluate the performance of the extracted features, the NB, SVM, DT, and KNN classification algorithms were used. Table [136]1 in Supplementary Table S1 presents the results of feature selection and classification using these selected features. Additionally, the Borda count method was applied for decision fusion, the results of which are also displayed in Table [137]1 of Supplementary Table S1.
To evaluate the performance of the selected features in the feature fusion state, features that had been selected by at least 6, 7, and 8 feature selection methods were selected and then evaluated using classifiers of NB, SVM, DT, and KNN. Table [138]2 in Supplementary Table S1 shows the results of feature fusion and evaluation using different classification methods. Names of these features (genes) are provided in Supplementary Table S2. Supplementary Figure S4 indicates the PCA plot of all features (18,043 features), as well as that of 361 and 23 selected features for Stage I and Stage II lung cancer samples. Figure S1 in Supplementary Figures shows ROC curve plots, which are calculated for the features in Table [139]2 in Supplementary Table S1. Stage I vs. Stage III & IV lung cancer Features of Stage I and Stage III & IV lung cancer were evaluated. First, eight feature selection algorithms were applied to 292 Stage I samples and 110 Stage III & IV samples. Then, to evaluate the performance of the extracted features, classification algorithms of NB, SVM, DT, and KNN were used. Table [140]3 in Supplementary Table S1 presents the results of feature selection and classification using these selected features. Additionally, the Borda count method was applied for decision fusion, the results of which are displayed in Table [141]3 in Supplementary Table S1. To evaluate the performance of the selected features in the feature fusion state, features that had been selected by at least 6, 7, and 8 feature selection methods were selected and then evaluated using classifiers of NB, SVM, DT, and KNN. Table [142]4 in Supplementary Table S1 shows the results of feature fusion and evaluation using different classification methods. Names of these features (genes) are provided in Supplementary Table S2. Supplementary Figure S5 indicates the PCA plot of all features (18,043 features), as well as that of 321 and 33 selected features for Stage I and Stage III & IV lung cancer samples. 
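The "selected by at least k of the eight methods" feature fusion used throughout these comparisons can be sketched as a simple vote count; the gene names in the toy example below are arbitrary placeholders, not the study's selected genes.

```python
from collections import Counter

def fuse_features(selected_sets, min_methods):
    """Feature fusion by vote counting: pool the subsets returned by
    the individual selectors and keep the genes chosen by at least
    `min_methods` of them (e.g. at least 4, 5, 6, 7, or 8 of the
    eight metaheuristics)."""
    votes = Counter(g for s in selected_sets for g in set(s))
    return {g for g, v in votes.items() if v >= min_methods}

# Toy example with three hypothetical selectors:
sets = [{"EGFR", "KRAS", "TP53"}, {"EGFR", "KRAS"}, {"EGFR", "ALK"}]
common = fuse_features(sets, min_methods=2)   # genes picked by >= 2 selectors
```

Raising the threshold shrinks the fused set sharply (as seen in the drop from hundreds of genes at "at least 4" to a few dozen at "at least 5"), trading coverage for consensus.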
Figure S2 in Supplementary Figures shows ROC curve plots, which are calculated for the features in Table [143]4 in Supplementary Table S1. Stage II vs. Stage III & IV lung cancer In this stage, eight feature selection algorithms were first applied to 123 Stage II samples and 110 Stage III & IV samples. Then, to evaluate the performance of the extracted features, classification algorithms of NB, SVM, DT, and KNN were used. Table [144]5 in Supplementary Table S1 presents the results of feature selection and classification using these selected features. Additionally, the Borda count method was applied for decision fusion, the results of which are displayed in Table [145]5 in Supplementary Table S1. To evaluate the performance of the selected features in the feature fusion state, features that had been selected by at least 6, 7, and 8 feature selection methods were selected and then evaluated using classifiers of NB, SVM, DT, and KNN. Table [146]6 in Supplementary Table S1 shows the results of feature fusion and evaluation using different classification methods. Names of these features (genes) are provided in Supplementary Table S2. Supplementary Figure S6 indicates the PCA plot of all features (18,043 features), as well as that of 367 and 31 selected features for Stage II and Stage III & IV lung cancer samples. Figure S3 in Supplementary Figures shows ROC curve plots, which are calculated for the features in Table [147]6 in Supplementary Table S1. Figure [148]6 shows the PCA plot of healthy and Stage I lung cancer samples in three scenarios: All features, 608 selected features, and 42 selected features. As could be observed, all three scenarios effectively separated healthy and cancer samples, but the 608-feature scenario performed the best. Names of these features (genes) are listed in Supplementary Table S2. Based on the results shown in Tables [149]2, [150]3, and Fig. 
[151]6, it could be concluded that the selected features can separate healthy samples from Stage I lung cancer samples using RNA-seq data with an accuracy of nearly 100%.

Fig. 6. PCA plot using the 608 and 42 features selected by the eight feature selection methods, as well as all 18,043 features, for healthy and Stage I lung cancer samples.

Also, Fig. [154]7 displays the PCA plot of healthy and Stage II lung cancer samples in three scenarios: all features, 681 selected features, and 55 selected features. All three scenarios effectively separated healthy and cancer samples, but the 681-feature scenario performed the best. The 681 and 55 selected features represent genes expressed in healthy and Stage II lung cancer RNA-seq samples. Names of these features (genes) are listed in Supplementary Table S2.

Fig. 7. PCA plot using the 681 and 55 features selected by the eight feature selection methods, as well as all 18,043 features, for healthy and Stage II lung cancer samples.

Additionally, Fig. [157]8 illustrates the PCA plot of healthy and Stage III & IV lung cancer samples in three scenarios: all features, 587 selected features, and 35 selected features. All three scenarios effectively separated healthy and cancer samples, but the 587-feature scenario performed the best, achieving the highest classification accuracy. Names of these features (genes) are listed in Supplementary Table S2.

Fig. 8. PCA plot using the 587 and 35 features selected by the eight feature selection methods shown in Table [160]7, as well as all 18,043 features, for healthy and Stage III & IV lung cancer samples.

Figures S4, S5, and S6 illustrate the PCA diagrams for the comparisons between lung cancer stages.
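PCA projections of the kind shown in these figures can be computed directly from a centered expression matrix via the SVD. The matrix below is synthetic stand-in data (the actual plots use the selected gene subsets from TCGA-LUAD); the class shift of 3.0 across 10 genes is an assumption chosen only to make two visible clusters.

```python
import numpy as np

# Synthetic stand-in: 100 samples x 50 genes, with the first 50 samples
# ("class 1") up-shifted in 10 informative genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
X[:50, :10] += 3.0

Xc = X - X.mean(axis=0)                   # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T                    # first two principal components
```

Plotting `coords` colored by class reproduces the qualitative picture in the figures: when the selected genes carry the class signal, the two groups separate along the first principal component.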
Figure S4 presents the PCA diagram for Stage I and Stage II lung cancer samples in three cases: 361 selected features, 23 selected features, and all 18,043 features. All three cases effectively separate Stage I and Stage II samples, with the 361-feature case performing the best owing to its favorable number of features and high classification accuracy. Similarly, Figure S5 displays the PCA diagram for Stage I and Stage III & IV lung cancer samples with 321 selected features, 33 selected features, and all 18,043 features. Again, all three cases successfully distinguish Stage I from Stage III & IV samples, with the 321-feature case being the most effective for the same reasons. Lastly, Figure S6 depicts the PCA diagram for Stage II and Stage III & IV lung cancer samples with 367 selected features, 31 selected features, and all 18,043 features. All three cases clearly separate Stage II from Stage III & IV samples, with the 367-feature case demonstrating superior performance.

Analyzing the obtained results

Comparing the different cancer stages showed that Stage I vs. Stage III & IV had higher classification accuracy than Stage I vs. Stage II or Stage II vs. Stage III & IV. In other words, stages that are closer to each other are separated with lower accuracy than stages that are farther apart. To analyze the results obtained in the feature selection section, another method, data correlation, was used for feature selection. In this method, features that had a high correlation with the other features were selected: correlation thresholds between −0.8 and 0.8 were considered, and the features with the highest correlation with the other features were selected. These features were then ranked according to their correlation, and classification was performed on the data (executed 10 times).
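The correlation-based baseline is described only briefly, so the following sketch is one plausible reading of it, not the authors' exact procedure: compute the feature-feature correlation matrix, count for each feature how many of its correlations exceed the ±0.8 thresholds in magnitude, and keep the most strongly correlated features.

```python
import numpy as np

def correlation_ranked_features(X, low=-0.8, high=0.8, top_k=60):
    """One interpretation of the correlation-based baseline: rank
    features by how many of their pairwise correlations fall outside
    [low, high], then keep the top_k. top_k = 60 mirrors the Stage I
    vs. Stage II setting quoted in Table 9."""
    corr = np.corrcoef(X, rowvar=False)          # features x features
    np.fill_diagonal(corr, 0.0)                  # ignore self-correlation
    strength = ((corr <= low) | (corr >= high)).sum(axis=0)
    return np.argsort(strength)[::-1][:top_k]    # most-connected features first
```

Note that for 18,043 features the full correlation matrix is large (about 326 million entries), which is consistent with the count reported below.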
The total number of pairwise correlations was 18,043 × 18,043 = 325,549,849. Table [161]8 shows the distribution of correlations between the 18,043 features for the Stage I vs. Stage II, Stage I vs. Stage III & IV, and Stage II vs. Stage III & IV samples, expressed as percentages at the thresholds −0.8, −0.5, 0.5, and 0.8. Table [162]9 evaluates the results of feature selection by the data correlation method for the same three sample pairs, including the best selection and the average of 50 runs.

Table 8. Analysis of correlation between 18,043 features in different lung cancer stages.

| Samples | Correlation ≤ −0.8 | Correlation ≤ −0.5 | Correlation ≥ 0.5 | Correlation ≥ 0.8 |
| Stage I and Stage II | 0.01% | 0.84% | 0.26% | 0% |
| Stage I and Stage III & IV | 0.01% | 0.77% | 0.22% | 0% |
| Stage II and Stage III & IV | 0.01% | 0.4% | 0.05% | 0% |

Table 9. Results of feature selection using the data correlation method for the stage pairs (best results and average results of 50 runs).

Stage I and Stage II samples (60 features)
| | Accuracy | Sensitivity | Precision | F-measure |
| Best results | 0.7108 | 0.5124 | 0.6080 | 0.4527 |
| | 0.7108 | 0.5000 | 0.3554 | 0.4155 |
| | 0.7229 | 0.6444 | 0.6571 | 0.6494 |
| | 0.7229 | 0.5332 | 0.6958 | 0.4917 |
| Average of 50 runs | 0.5343 | 0.5363 | 0.5302 | 0.5068 |
| | 0.6252 | 0.5267 | 0.5125 | 0.5135 |
| | 0.5885 | 0.5131 | 0.5132 | 0.5111 |
| | 0.6774 | 0.4928 | 0.4656 | 0.4372 |

Stage I and Stage III & IV samples (230 features)
| | Accuracy | Sensitivity | Precision | F-measure |
| Best results | 0.6750 | 0.6912 | 0.6540 | 0.6484 |
| | 0.7125 | 0.6324 | 0.6364 | 0.6343 |
| | 0.7375 | 0.6638 | 0.6687 | 0.6661 |
| | 0.7625 | 0.5823 | 0.7800 | 0.5767 |
| Average of 50 runs | 0.5944 | 0.5953 | 0.5767 | 0.5602 |
| | 0.6554 | 0.5486 | 0.5428 | 0.5387 |
| | 0.6237 | 0.5350 | 0.5341 | 0.5323 |
| | 0.7134 | 0.5165 | 0.5664 | 0.5513 |

Stage II and Stage III & IV samples (241 features)
| | Accuracy | Sensitivity | Precision | F-measure |
| Best results | 0.6522 | 0.6477 | 0.6548 | 0.6462 |
| | 0.6304 | 0.6307 | 0.6304 | 0.6303 |
| | 0.5435 | 0.5417 | 0.5419 | 0.5415 |
| | 0.5652 | 0.5511 | 0.5888 | 0.5054 |
| Average of 50 runs | 0.5041 | 0.5071 | 0.5074 | 0.5000 |
| | 0.5088 | 0.5063 | 0.5046 | 0.4995 |
| | 0.5059 | 0.5045 | 0.5043 | 0.5013 |
| | 0.4944 | 0.4852 | 0.4825 | 0.4614 |

Based on the results obtained from the correlation method, the feature selection results of the approach proposed in this study (the eight metaheuristic methods) were equal to or better than those of alternative methods such as correlation-based feature selection, supporting the validity of the approach used in this study. To validate and verify the obtained biomarkers, enrichment analysis was performed. This pathway-based approach helps researchers gain mechanistic insight into previously generated gene lists. Our gene list comprised the target genes selected during the first and second steps, analyzed to better understand the function of these genes in the cell. Pathway enrichment analysis identifies biological pathways that are enriched in the gene list beyond what would be expected by chance. Therefore, if the output of the second step indicates that the genes identified at this stage have a specific function and are involved in a particular pathway, the result should be statistically significant, ensuring that the identified gene set is significantly correlated and not randomly selected. The procedure was carried out as follows: first, a list of pathways associated with lung cancer, as reported in papers from 2022 to 2023, was compiled (Table [165]10) (References are listed