Abstract

Objective: Invasive lung cancer staging poses significant challenges, often requiring painful and costly biopsy procedures. This study aims to identify non-invasive biomarkers for detecting bronchogenic carcinoma and its stages by analyzing gene expression data with bioinformatics and machine learning techniques. By leveraging these computational methods, we seek to eliminate the need for surgical intervention in the diagnostic process.

Methods: We utilized the TCGA-LUAD dataset, which includes gene expression data from healthy and cancerous samples. To identify robust biomarkers, we applied eight metaheuristic algorithms for feature selection, combined with four classification methods and two data fusion techniques to optimize performance.

Results: Our approach achieved 100% accuracy in distinguishing healthy samples from cancerous ones, outperforming existing methods that reported 97% accuracy. Notably, while prior methods have struggled to separate bronchogenic carcinoma stages effectively, our approach achieved an accuracy of approximately 77% in stage classification. Furthermore, using gene enrichment methods, we identified 5, 7, and 16 diagnostic biomarker candidates for stages I, II, and III & IV, respectively.

Conclusion: This study demonstrates that integrating bioinformatics, gene set enrichment, and biological pathway analysis can enable non-invasive diagnostics for bronchogenic carcinoma stages. These findings hold promise for developing alternatives to traditional, invasive staging systems, potentially improving patient outcomes and reducing healthcare costs.

Supplementary Information: The online version contains supplementary material available at 10.1007/s12672-025-02395-5.

Keywords: Biomarker, Bronchogenic carcinoma, Feature selection algorithms, Information fusion, Machine learning

Introduction

Lung cancer is a multifaceted disease that demands comprehensive research to unravel its intricate molecular mechanisms.
Among its subtypes, bronchogenic carcinoma stands out as one of the most widespread and aggressive malignancies globally [1]. Studies suggest that the prognosis of non-small cell lung cancer (NSCLC) depends strongly on the stage of disease progression: early diagnosis leads to higher survival rates. Patients with early-stage lung cancer often do not exhibit obvious symptoms [2] and, as a result, miss the optimal window for early treatment. Compounding this issue is the highly destructive nature of tumor metastasis, which remains a leading contributor to the elevated mortality associated with the disease. Molecular biomarkers can therefore be a valuable tool in cancer detection. Because the absence of clear symptoms makes early-stage NSCLC difficult to diagnose, early detection and treatment have become central therapeutic goals for lung cancer, urgently requiring reliable biomarkers for diagnosis and prognosis. Advances in bioinformatics have enabled the effective analysis and discovery of cancer genes, making the identification of biomarkers for early detection and treatment feasible. Research such as the present study is therefore crucial for detecting the early stages of the disease and preventing further progression. Additionally, because treatments are stage-specific, accurate diagnosis of disease stages is necessary: once the stage is known, an effective treatment for that stage can be proposed and prescribed. With significant advancements in sequencing technology in recent years, identifying genetic characteristics from gene expression data of healthy and cancer samples, such as RNA sequencing (RNA-seq), has shown great promise in cancer diagnosis.
Numerous studies show that bioinformatics methods are used to analyze comprehensive gene expression data and identify cancer-related biomarkers for cancer prevention, diagnosis, and treatment [3–6]. Various metaheuristic algorithms are also used for data analysis, and validated data mining algorithms help increase the performance of analyses on healthcare data [7]. Metaheuristics have a wide variety of applications in medical services, including improved classification systems, effective diagnosis systems, and increased diagnosis rates for various diseases [8]. Treatments have also been improved using these techniques, reducing complications for patients with long-term diseases. Some diseases have the disadvantage that an undiagnosed patient may unknowingly transmit the disease to many others; early diagnosis can prevent this spread. Metaheuristic techniques can be broadly divided into two categories: single-solution algorithms and population-based algorithms [9]. In single-solution algorithms, one candidate solution is iteratively modified until an optimal solution is established. In population-based algorithms, a set of solutions is randomly generated, and the values of all solutions are updated iteratively, with the best solution refined over successive iterations [10]. Metaheuristic algorithms such as artificial immune systems, cat swarm optimization, the firefly algorithm, the genetic algorithm, gray wolf optimization, glowworm swarm optimization, grasshopper optimization, crow search, tabu search, ant colony optimization, the bee algorithm, the chimpanzee optimization algorithm, and many others are very useful for feature extraction and selection in disease detection and early diagnosis [11]. The proposed method consists of two sections.
In the first section, gene expression data are used to detect and differentiate the four stages of bronchogenic carcinoma. In this study, RNA-seq data are used, which include gene expression features specific to bronchogenic carcinoma. In the second section, a set of metaheuristic algorithms, along with data fusion methods, is used to select the features that best separate the four stages of the disease. The search for the best features can be carried out using various methods, such as exhaustive search, optimization-based search, or metaheuristic algorithms; the process continues until the best feature set is obtained according to the evaluation criteria or a termination condition is met [12]. Feature selection methods based on evolutionary and metaheuristic algorithms have advantages over other approaches, including the ability to discover complex and nonlinear features, flexibility across different problems, reduction of model dimensionality and complexity, and applicability to large-scale problems [13]. Classification algorithms then assign the samples to the four disease stages, which serve as the classes of this study, and evaluation is carried out using accuracy, precision, and other metrics. The main purpose of a classification algorithm is to make accurate predictions; to this end, various algorithms have been developed, including decision trees, k-nearest neighbors, support vector machines, and naive Bayes, each with its own characteristics for classifying features. In this study, an appropriate and accurate feature selection from approximately 18,000 genes is presented to optimally differentiate the stages of bronchogenic carcinoma.
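To see why exhaustive search is only an option for tiny problems, consider a toy sketch (the gene names and the scoring function are hypothetical stand-ins; in the actual pipeline, a subset's score would be a classifier's cross-validated accuracy). With six genes, all 2^6 - 1 = 63 non-empty subsets can be enumerated; with ~18,000 genes, the 2^18000 subsets cannot, which is what motivates metaheuristic search:

```python
from itertools import combinations

# Toy relevance scores for six hypothetical genes; in the real pipeline the
# score of a subset would be a classifier's cross-validated accuracy.
relevance = {"g1": 0.9, "g2": 0.1, "g3": 0.7, "g4": 0.2, "g5": 0.8, "g6": 0.3}

def score(subset):
    # Reward relevant genes, penalize subset size (favors small gene panels).
    return sum(relevance[g] for g in subset) - 0.25 * len(subset)

genes = list(relevance)
best = max(
    (set(c) for r in range(1, len(genes) + 1) for c in combinations(genes, r)),
    key=score,
)
# Exhaustive enumeration is feasible here only because the search space is tiny.
print(sorted(best))
```

Metaheuristics replace this full enumeration with a guided sampling of the subset space, trading the guarantee of optimality for tractability.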
Subsequently, by examining and comparing four classification methods, namely naive Bayes (NB) [14], support vector machine (SVM) [15], k-nearest neighbors (KNN) [16], and decision tree (DT) [17], and utilizing data fusion techniques, the most reliable and optimal biomarkers for identifying the stages of bronchogenic carcinoma are proposed. The impetus for this research lies in the execution of 32 combined trials: eight distinct metaheuristic algorithms, each integrated with the four classification methods, to explore their synergistic effects. This approach ensures that the feature selection process achieves the highest possible accuracy. Few published studies report results combining eight feature selection algorithms with four classifiers to introduce biomarkers that separate the different stages of the complex disease of bronchogenic lung cancer. The resulting gene sets, serving as biomarkers for classifying and identifying bronchogenic carcinoma stages, are ultimately validated with greater reliability using enrichment methods and biological pathways. Certain gaps were identified in previous research, particularly a scarcity of studies on biomarkers that accurately indicate disease stages, especially in cancer staging. This research aims to identify biomarkers that differentiate lung cancer stages through the optimization of metaheuristic algorithms. The primary objective of the feature selection problem is to reduce the dimensionality of the feature set while preserving performance accuracy. Although various methods have been developed for dataset classification, metaheuristic algorithms have garnered significant attention for their effectiveness in addressing diverse optimization challenges.
A key advantage of these algorithms is their problem-independent nature, allowing them to function as general solvers capable of providing optimal solutions across complex search spaces. Consequently, they can be adapted, with minimal modification, to problems such as the inaccurate identification of lung cancer biomarkers while achieving satisfactory accuracy. The rest of this paper is arranged as follows: Sect. 2 provides a comprehensive literature review, highlighting relevant studies and methodologies. Section 3 details the proposed method, including data sources and analytical techniques. Section 4 presents the results of our analyses, followed by a discussion in Sect. 5 that interprets the findings and their implications. Finally, Sect. 6 concludes the paper, summarizing key insights and acknowledging the limitations of the study.

Literature review

Previous investigations followed one of two approaches. The first performs differentially expressed gene (DEG) analysis, a commonly used technique for analyzing RNA-seq data that identifies genes differentially expressed across two or more sample groups, followed by a metaheuristic method to select an optimal subset of the identified genes. The second approach employs a combination of multiple metaheuristic algorithms to select genes from the outset of the study, exploiting the unique strengths of each algorithm for precise gene selection decisions. The earlier studies below follow the first approach, whereas the subsequent papers adopt the second, which combines metaheuristic algorithms. M. Jansi Ran et al. proposed a two-stage algorithm for selecting informative genes in cancer data classification.
In the first stage, mutual information (MI)-based gene selection was applied, selecting only those genes highly informative about cancer [18]. Genes with high mutual information were passed to the second stage, in which a genetic algorithm (GA) identified and selected the optimal set of genes required for accurate classification; SVM was used for classification. The proposed MI-GA gene selection approach was applied to colon, lung, and ovarian cancer datasets. Rabia Musheer Aziz et al. proposed a hybrid machine learning (ML) framework based on the nature-inspired cuckoo search (CS) algorithm combined with the artificial bee colony (ABC) algorithm [19]. These algorithms were used to balance the exploration and exploitation phases of the ABC and GA algorithms in the search process. In preprocessing, independent component analysis (ICA) was applied to extract important genes from the dataset. Then, the proposed gene selection algorithms, coupled with the NB classifier and leave-one-out cross-validation (LOOCV), were used to find a small set of informative genes that maximize classification accuracy. For a comprehensive performance study, the proposed algorithms were applied to six benchmark gene expression datasets. The experimental comparison showed that the proposed framework (an ICA and CS-based hybrid algorithm with an NB classifier) performed a deeper search during the iterative process, which helped avoid premature convergence and yielded better results than the previously published feature selection algorithm for the NB classifier. Mohamed Tehnan et al. proposed a novel model called the voting-based enhanced binary Ebola optimization search algorithm (VBEOSA) [20].
This algorithm combines the binary Ebola optimization search algorithm (BEOSA) [21] with six established metaheuristics, namely differential evolution (DE) [22], ant colony optimization (ACO) [23], particle swarm optimization (PSO) [24], GA [25], simulated annealing (SA) [26], and tabu search (TS) [27], for feature selection and classification. Classification was improved using a voting-based model applied to a lung cancer gene expression dataset. In that study, VBEOSA was used to identify 50 important genes associated with lung cancer; further exploration through protein-protein interaction (PPI) analysis identified a group of 10 hub genes that serve as lung cancer biomarkers. Iqbal et al. focused on early lung cancer detection using an ACO algorithm combined with a deep neural network (DNN) [28]. The ACO algorithm, inspired by the foraging behavior of ants, was used for feature selection, which helps simplify the model and improve performance; the selected features were fed into the DNN to obtain the final results. That study focused on DNNs because of their strong performance in modeling data relationships and handling large datasets. Pirgazi et al. investigated an efficient hybrid filter-wrapper metaheuristic-based gene selection method for high-dimensional datasets [29]. Their two-stage hybrid algorithm, based on the shuffled frog leaping algorithm (SFLA), combines filtering and wrapping for efficient feature selection. Over the past decade, high-dimensional data in bioinformatics have received great attention, and Sangjin Kim et al. investigated alternative strategies for overcoming the associated challenges and identifying truly important features [30].
They categorized filtering methods into two groups: individual ranking and feature subset selection methods. They proposed a novel filter ranking method using an elastic net penalty with sure independence screening (SIS) based on resampling techniques to overcome these challenges. The method was applied to gene expression data for colon and lung cancer to assess classification performance and identify genes truly associated with these cancers; the results showed that the proposed PF method consistently identified the most promising lung cancer-related features. Shib Sankar et al. investigated 4,127 samples covering 9 types of cancer from NGS data, using ABC, ACO, DE, and PSO for feature selection and an SVM with an RBF kernel, tuned via tenfold cross-validation, for classification. The highest classification accuracy, 99.10%, was achieved by the ABC algorithm on brain lower-grade glioma data [31]. In another study, Shib Sankar et al. combined PSO with SVM [32]. They compared their method with joint mutual information (JMI) + SVM, minimum redundancy maximum relevance (mRMR) + SVM, and conditional mutual information maximization (CMIM) + SVM, and selected PSO + SVM as the best algorithm, with the highest accuracy of 97.80%. Based on these results, the second approach was used for gene selection in this study. Table 1 compares previous studies from 2019 to 2024, summarizing details such as publication year and the algorithms used.

Table 1. Comparative analysis of studies (2019–2024): key details and algorithms

- M. Jansi Ran et al. [18]. Method: MI-GA gene selection, with MI-based selection in the first stage and GA-based selection in the second, verified using an SVM classifier. Advantages: the MI-GA combination provides more accurate measurements, and this hybridization can reduce the complexity of classification models and obtain much better results. Limitations: microarray data and early stage only.
- Shib Sankar et al. [31]. Method: DESeq2 to select differentially expressed genes; ABC, ACO, DE, and PSO for optimization; SVM for classification. Advantages: genes selected by ABC, ACO, DE, and PSO are generally independent; variation in modeling performance among the optimizers leads to minutely overlapping gene sets irrespective of the dataset. Limitations: uses brain lower-grade glioma data.
- Sangjin Kim et al. [30]. Method: a new filter ranking method (PF) using the elastic net penalty with SIS based on a resampling technique. Advantages: SIS-LASSO, SIS-MCP, and SIS-SCAD with the proposed filtering achieved superior accuracy, AUROC, geometric mean, and true positive detection compared with the marginal maximum likelihood ranking method (MMLR). Limitations: low power in detecting truly important variables and in classification prediction.
- Pirgazi et al. [29]. Method: a hybrid of Incremental Wrapper Subset Selection with replacement (IWSSr) and SFLA to select effective features in large-scale gene datasets. Advantages: properly detects relationships between features and removes redundant and irrelevant features from the selected set. Limitations: uses small samples.
- Rabia Musheer Aziz et al. [19]. Method: the CS algorithm incorporated with ABC in the exploitation and exploration of the GA, with the NB classifier and LOOCV. Advantages: the hybrid algorithm performs a deeper search in the iterative process, avoiding premature convergence and producing better results than previously published feature selection algorithms. Limitations: not applied to other data types such as RNA-seq.
- Tehnan I. A. Mohamed et al. [33]. Method: VBEOSA, derived from the metaheuristic BEOSA (modeled on the Ebola virus infection mechanism) with binary optimization, using SVM, DT, KNN, random forest (RF), multilayer perceptron (MLP), and Gaussian naive Bayes (GNB) classifiers. Advantages: findings bear significant implications for enhancing the diagnosis, prognosis, and development of therapeutic strategies for lung cancer. Limitations: does not identify biomarker genes for cancer stages.
- S. Iqbal et al. [28]. Method: lung cancer detection based on ACO and DNN. Advantages: ACO and DL models can help in the early detection of lung cancer and other diseases, saving lives and reducing costs and the healthcare burden. Limitations: accuracy below 100% (the present study achieved 100% accuracy in diagnosing lung cancer).
- Ali Mahmoud Ali et al. [34]. Method: cat swarm optimization and an SVM approach for multi-omics data. Advantages: fills the gap left by studies that failed to explain which specific features affect classification results. Limitations: integrating SVM with SMOTE or CSO and K-means consumes significant computational time, which may not be practical in environments with limited computational power.
- Ali Mahmoud Ali et al. [35]. Method: the intersection of omics and AI models. Advantages: ML and DL algorithms can increase cancer diagnostic accuracy, speed, and cost-effectiveness. Limitations: the omics community must address the gap between theory and practice by developing relevant applications.
- Ihsan Jasim Hussein et al. [36]. Method: features generated by edge operators are used as input to train three ANN classifiers, with a fusion technique making the decision over the entire feature set. Advantages: performed on gynecological ultrasound images to identify suspicious objects or cases for analyzing the probability of cancer. Limitations: noise in images of breast and ovarian tumors.

An examination of all these articles indicated a deficiency in precise disease staging using gene expression data to compare disease stages, likely attributable to small sample sizes or insufficient accuracy in differentiating case from healthy samples. In the present research, however, the accuracy of distinguishing diseased samples from healthy ones was nearly 100%.
Additionally, bronchogenic carcinoma stages were identified through comparative analysis, i.e., comparing Stage I with Stage II, Stage I with Stage III & IV, and finally Stage II with Stage III & IV. In each case, biomarkers specific to the corresponding disease stage were identified using eight metaheuristic algorithms, classification was performed using four algorithms, and accuracy and precision were assessed. The biomarkers were further validated through active biological pathways, confirming their precision in distinguishing the different stages of bronchogenic carcinoma; the proposed biomarkers provided reliable differentiation of the four stages of the disease. Ali Mahmoud Ali et al. [34] attempt to fill the gap left by previous studies that failed to explain which specific features affect classification results. Their work aims to improve feature selection, thereby increasing the significance and accuracy of cancer type description using a powerful multi-omics data clustering framework: cat swarm optimization (CSO) feature selection to isolate relevant features for prediction, K-means for clustering the dataset, and a nonlinear SVM. However, according to that paper's methodology, integrating SVM with SMOTE or CSO and K-means consumes a significant amount of computational time, which may not be beneficial in environments with limited computational power. Ali Mahmoud Ali et al. [35] provide unique insights into the intersection of omics and AI models; one of the main findings of that article is that clinicians are looking for a comprehensive measure for the use of artificial intelligence in processing omics data. The challenge noted in that paper is that healthcare companies rarely deploy DL applications in real-world contexts.
Therefore, the omics community must address the gap between theory and practice by developing relevant applications; in research related to a simple medical validation strategy, DL features, feature data, and definitions should be kept in mind. In another paper, Ihsan Jasim Hussein et al. [36] focus on image texture enhancement using a novel treatment that reduces speckle noise while preserving the most important information. Features generated by edge operators are used as input to train three ANN classifiers, and a fusion technique makes the decision using the entire feature set. Overall, 95.87% accuracy is achieved on an ovarian tumor data collection. The method was applied to gynecological ultrasound images to identify suspicious objects or cases with health consequences for analyzing the probability of cancer. To resolve ambiguity in the research method, the approach could be enhanced by using Miter filters to improve the image and reduce speckle noise for breast and ovarian tumors. Alzheimer's disease (AD) [37] constitutes a significant global health issue: although its prevalence keeps rising, there are still no effective drugs to treat it. Recently, deep learning (DL) techniques have been used increasingly to diagnose AD, with applications including natural language processing (NLP), drug repurposing, classification, and identification. One important finding is that convolutional neural networks (CNNs) are the models most often used in AD research, and Python is the language most often used for DL work. Shiva Toumaj et al. [37] conducted a study on AD and reported that, alarmingly, 88.9% of the examined studies did not address security issues.
This shows an immediate need to build DL models for AD with greater awareness of security issues. Even though progress has been made, important issues such as privacy, accessibility, fairness, and transferability still need to be dealt with to ensure that AI is used in healthcare responsibly and effectively. Nowadays, ML has reached a high level of achievement in many contexts. The importance of DL in Internet of Things (IoT)-based bio- and medical informatics lies in its ability to analyze and interpret large amounts of complex and diverse data in real time, providing insights that can improve healthcare outcomes and increase efficiency in the healthcare industry. Such research provides valuable guidance for further work on DL in medical and bioinformatics problems, and it has been shown that some aspects, such as security and convergence time, are underexplored in the literature [38].

Proposed method

This study mainly aims to identify gene markers for detecting different stages of lung cancer. Given that lung cancer was recognized as the second leading cause of cancer-related deaths in 2020 (26), this research aims to identify marker genes that can separate healthy and cancer samples and distinguish the different stages of lung cancer using RNA-seq gene expression data. The study is implemented in two steps; a schematic of the research is shown in Fig. 1.

Fig. 1. Schematic view of the study

In the first step, the required data are obtained by downloading RNA-seq samples of lung tissue from TCGA. These samples include gene expression data from both healthy and cancerous lung tissues. Normalization and removal of redundant data are performed, and the data are extracted in the form of a matrix, normalized using the trimmed mean of M-values (TMM) method.
The second step focuses on feature selection, for which eight metaheuristic methods are used: ABC [24] and ACO [23], related to swarm intelligence; the water cycle algorithm (WCA) [39] and SA [26], related to natural phenomena; harmony search (HS) [40] and teaching-learning-based optimization (TLBO) [41], inspired by human behaviors; and DE [22] and GA [25], related to evolutionary optimization. Figure 2 shows the classification of the metaheuristic methods used.

Fig. 2. Classification of the metaheuristic optimization algorithms used

The samples used in this study are divided into four categories: Healthy, Stage I, Stage II, and Stage III & IV [42]. To ensure a comprehensive feature extraction process, features are extracted in six different combinations using the metaheuristic algorithms: three combinations compare healthy samples against Stage I, Stage II, and Stage III & IV, respectively, while the other three compare Stage I vs. Stage II, Stage I vs. Stage III & IV, and Stage II vs. Stage III & IV. Numerous methods have been developed for dataset classification, with metaheuristic algorithms gaining significant attention for their effectiveness in solving a wide range of optimization problems. These algorithms are broadly categorized into four groups based on their underlying principles: swarm intelligence, natural phenomena, human inspiration, and evolutionary algorithms. In this study, two representative algorithms were selected from each category, resulting in a total of eight metaheuristic approaches.
These include the artificial bee colony (ABC) and ant colony optimization (ACO), which fall under swarm intelligence; the water cycle algorithm (WCA) and simulated annealing (SA), associated with natural phenomena; harmony search (HS) and teaching-learning-based optimization (TLBO), inspired by human behaviors; and differential evolution (DE) and the genetic algorithm (GA), which are rooted in evolutionary optimization. Together, these algorithms provide a comprehensive framework for exploring optimization challenges. Metaheuristic algorithms use different strategies to discover and optimize subsets of features, each specifying a mechanism for solving feature selection. For example, the GA mimics natural selection: it starts with a population of random feature subsets, then iterates through selection, crossover (recombination of feature subsets), and mutation (random changes in subsets) to create better feature sets, using a fitness function to evaluate the quality of each subset and guide the search toward optimal solutions. Metaheuristics differ along four main axes: 1. Population-based vs. single-solution approaches: population-based algorithms, such as GA, particle swarm optimization (PSO), and ACO, maintain a diverse set of solutions, enabling simultaneous exploration of multiple regions of the search space. In contrast, single-solution algorithms like SA and tabu search (TS) operate on a single solution at a time, emphasizing localized refinement and optimization. 2. Exploration vs. exploitation: some algorithms, like PSO and ABC, emphasize a combination of exploration (searching new regions) and exploitation (refining good solutions); others, like SA and TS, concentrate the search around a local region. 3.
Search mechanism: algorithms like GA and ACO use probabilistic transitions (crossover, mutation, pheromones), while PSO and DE rely on arithmetic update rules. 4. Convergence: algorithms like PSO converge faster due to their more continuous nature, while GA and ACO may require more generations or iterations to converge but can handle complex search spaces thanks to their diversity-driven exploration. In this study, eight methods, ABC, HS, DE, ACO, GA, SA, WCA, and TLBO, are used for feature selection. Each method and its main parameters are as follows:

- ABC: inspired by the food-searching behavior of bees. The range of values is [0,1], the selection threshold is set to 0.5, and the maximum number of trials without improvement is set to 5.
- HS: inspired by musical tuning; uses the pitch adjustment rate (PAR = 0.05), the harmony memory consideration rate (HMCR = 0.7), and the search bandwidth (bw = 0.2) for optimization.
- DE: a combination of mutation and selection; the crossover rate (CR = 0.9) and the amplification factor (F = 0.5) are its main parameters.
- ACO: the search path is guided by pheromone deposition. Pheromone control (α = 1), heuristic utility control (β = 0.1), and the pheromone evaporation rate (ρ = 0.2) are used, with the initial pheromone value set to 1.
- GA: uses the crossover rate (CR = 0.8) and mutation rate (MR = 0.01) to optimize the features.
- SA: works by gradually decreasing the temperature; the cooling rate (c = 0.93) and the initial temperature (T0 = 100) are its main parameters.
- WCA: simulates the movement of rivers; the initial population size is 50, the number of rivers is 4, and the maximum number of iterations is 100.
- TLBO: imitates the learning process; the population size is 50, and the maximum number of iterations is 100.
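The GA-based feature selection described above can be sketched in a minimal form. This is a toy illustration, not the study's implementation: the per-gene relevance scores and the size penalty are hypothetical stand-ins for a classifier's cross-validated accuracy, the problem is shrunk to 12 genes, and the mutation rate is raised above the study's 0.01 to suit the tiny problem size (the crossover rate 0.8 matches the study):

```python
import random

random.seed(42)

N_GENES = 12            # toy problem; the study works with ~18,000 genes
CR, MR = 0.8, 0.05      # CR = 0.8 as in the study; MR raised for this toy size
POP, GENERATIONS = 30, 60

# Hypothetical per-gene relevance; in the study, fitness would instead be the
# cross-validated accuracy of a classifier trained on the selected genes.
relevance = [0.9, 0.1, 0.8, 0.2, 0.7, 0.1, 0.6, 0.2, 0.9, 0.1, 0.3, 0.2]

def fitness(mask):
    # Reward relevant genes, penalize subset size (favors small gene panels).
    return sum(r for r, m in zip(relevance, mask) if m) - 0.25 * sum(mask)

def tournament(pop):
    # Binary tournament selection: the fitter of two random individuals wins.
    a, b = random.sample(pop, 2)
    return max(a, b, key=fitness)

pop = [[random.randint(0, 1) for _ in range(N_GENES)] for _ in range(POP)]
best = max(pop, key=fitness)
for _ in range(GENERATIONS):
    nxt = []
    while len(nxt) < POP:
        p1, p2 = tournament(pop), tournament(pop)
        if random.random() < CR:                 # one-point crossover
            cut = random.randrange(1, N_GENES)
            child = p1[:cut] + p2[cut:]
        else:
            child = p1[:]
        # Bit-flip mutation with probability MR per gene.
        child = [1 - g if random.random() < MR else g for g in child]
        nxt.append(child)
    pop = nxt
    best = max(best, max(pop, key=fitness), key=fitness)  # keep best-ever

selected = [i for i, m in enumerate(best) if m]
print(selected, fitness(best))
```

Swapping the toy fitness for a classifier's cross-validated accuracy on the masked gene-expression matrix turns this sketch into a wrapper feature selector of the kind combined with NB, SVM, DT, and KNN in this study.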
In all methods, the feature values are scaled to the range [0, 1], and the maximum number of iterations is set to 100. To confirm these settings, a systematic search over a range of candidate values was conducted for the parameters of each algorithm, and the optimal values found agreed with those reported in this study. Features are extracted using these eight methods for each of the six pairwise comparisons. The quality of the extracted features is evaluated using the NB, SVM, DT, and KNN classifiers. The models were trained using 10 repetitions of fivefold cross-validation, and the results averaged over these runs are reported; this practice plays an important role in the stability of the models designed in this research. Additionally, the Borda count method is used for decision fusion. Borda count is an extended version of majority voting and is the most commonly used method for unsupervised rank-level fusion [[95]43]. In this voting scheme, each matcher ranks a fixed set of candidate identities in order of preference: the top-ranked candidate receives N votes, the second-ranked candidate N − 1 votes, and so on. The votes from all matchers are then summed for each candidate, and the candidate with the highest total is designated the winner [[96]44]. In our setting, each classifier plays the role of a matcher and the classes are the candidates. The confusion matrix is used when the output includes two or more classes [[97]45]. The metrics of accuracy, recall (sensitivity), specificity, precision, and F-measure were used. The receiver operating characteristic (ROC) curve is an evaluation metric for binary classification problems: it plots the true positive rate (TPR) against the false positive rate (FPR) at different threshold values, essentially separating the signal from the noise.
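The Borda scheme just described can be sketched in a few lines; the class labels in the toy call are hypothetical and stand in for the candidate decisions produced by the classifiers.

```python
from collections import defaultdict

def borda_count(rankings):
    """Fuse ranked candidate lists with the Borda count.

    Each ranking lists candidates from most to least preferred; the
    top candidate of an N-long ranking receives N votes, the second
    N - 1 votes, and so on. The candidate with the most total votes
    is declared the winner."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for rank, candidate in enumerate(ranking):
            scores[candidate] += n - rank
    return max(scores, key=scores.get), dict(scores)

# Toy example: three classifiers rank two candidate classes.
winner, scores = borda_count([
    ["cancer", "healthy"],   # classifier 1 prefers "cancer"
    ["cancer", "healthy"],   # classifier 2 agrees
    ["healthy", "cancer"],   # classifier 3 dissents
])
```

Here "cancer" collects 2 + 2 + 1 = 5 votes against 4 for "healthy", so the fused decision is "cancer" even though one classifier dissented.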
The area under the curve (AUC) measures the classifier's ability to discriminate between classes: the higher the AUC, the better the model distinguishes the healthy and cancer classes. The AUC is a number between zero and one and indicates how well the test method separates true positive (TP) from true negative (TN) results. A value close to one means that the ROC curve lies well above the diagonal, the true positive rate is high, and the test has good diagnostic power. An AUC near 0.5 indicates that the true positive rate and false positive rate are equal, and values below 0.5 indicate that the false positive rate exceeds the true positive rate. The role of information fusion methods is to integrate the outputs of the base classifiers, allowing them to complement one another and compensate for each other's weaknesses. The data for this study were downloaded and preprocessed by extracting RNA-Seq samples from the TCGA database through the TCGAbiolinks package in R. The TCGA-LUAD project data were obtained via a query that included primary tumor samples and solid normal tissue, using GENCODE v38 for accurate gene annotation. Following the download, the gene expression data were processed together with the patients' clinical information to prepare for the subsequent analysis steps. To enhance data quality, only protein-coding genes were included in the analysis, and genes with low expression levels were eliminated. The TMM (trimmed mean of M-values) method in the edgeR package was used to normalize the data and correct for variation caused by differences in read counts between samples. Finally, a log transformation was applied, and the normalized values were saved for use in the subsequent analysis steps.
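The filtering and log-transformation steps can be illustrated with a simplified stand-in. Note the assumption: the study itself performs TMM normalization with edgeR in R, whereas this sketch uses plain counts-per-million only to show the shape of the low-expression filter and log step.

```python
import numpy as np

def filter_and_log_cpm(counts, min_cpm=1.0, min_samples=3):
    """Simplified stand-in for the preprocessing described above:
    drop weakly expressed genes, normalize for library size, and
    log-transform. (The study uses edgeR's TMM normalization in R;
    plain counts-per-million is used here only for illustration.)

    counts: genes x samples matrix of raw read counts."""
    lib_sizes = counts.sum(axis=0)                      # total reads per sample
    cpm = counts / lib_sizes * 1e6                      # counts per million
    keep = (cpm >= min_cpm).sum(axis=1) >= min_samples  # expressed often enough
    return np.log2(cpm[keep] + 1.0), keep
```

Genes that never reach `min_cpm` in at least `min_samples` samples are removed before normalization downstream, which is the usual rationale for this filter: near-zero counts carry mostly noise.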
Afterwards, 584 RNA-Seq samples from healthy individuals and lung cancer patients, each containing 18,043 features, were divided into four groups: Healthy, Stage I, Stage II, and Stage III & IV. To evaluate the performance of the feature selection algorithms and the selected features, the samples were assessed in six pairwise categories: Healthy vs. Stage I, Healthy vs. Stage II, Healthy vs. Stage III & IV, Stage I vs. Stage II, Stage I vs. Stage III & IV, and, finally, Stage II vs. Stage III & IV. The features were determined using the eight feature selection methods, and the performance of the selected features was evaluated by the four classifiers. Additionally, the Borda count decision fusion method was used to combine the 32 decisions (eight feature sets × four classifiers) made at each stage.

Results

Evaluation dataset

The dataset included 584 RNA-Seq samples, with the expression of 18,043 genes measured in each sample: 59 healthy samples, 292 Stage I samples, 123 Stage II samples, 84 Stage III samples, and 26 Stage IV samples. Given the small number of Stage IV samples and their clinical similarity to Stage III, Stage III and Stage IV were combined into Stage III & IV, comprising a total of 110 RNA-Seq samples.

Evaluation results

After applying the eight feature selection methods to the data, the features were identified, and the performance of the selected features was evaluated using the four classifiers. Given the high dimensionality of RNA-seq data (18,043 features), the use of metaheuristic methods for feature selection is well justified: these algorithms can efficiently search a very large feature space and identify an optimal subset of important features. The metaheuristic methods employed use stochastic search mechanisms and evolutionary optimization to find optimal combinations of features.
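The grouping and the six pairwise comparisons can be set up as follows; this is a sketch using only the group sizes from the text (the expression matrix itself is omitted).

```python
from itertools import combinations
import numpy as np

# Group sizes from the text (Stage III and IV merged into one group).
groups = {"Healthy": 59, "Stage I": 292, "Stage II": 123, "Stage III & IV": 110}
labels = np.concatenate([np.repeat(name, n) for name, n in groups.items()])

# The six pairwise classification tasks evaluated in the study.
tasks = {}
for a, b in combinations(groups, 2):
    mask = np.isin(labels, [a, b])             # rows belonging to this pair
    tasks[(a, b)] = (mask, labels[mask] == a)  # row mask + binary target
```

Each task is then handed to the feature selection and classification pipeline independently, which is why the selected gene sets differ between comparisons.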
These methods are especially advantageous when the search space is very large and classical feature selection algorithms (such as filter-based methods or regression-based feature selection) are limited. Additionally, the Borda count decision fusion method was used to combine the 32 decisions made at each stage.

Healthy group vs. Stage I lung cancer

Using the aforementioned feature selection methods, the features providing the best distinction between the healthy group and Stage I lung cancer were identified. An important issue at this stage was the imbalance in the number of samples between the healthy group and the cancer stages. To address it, given that there were only 59 healthy samples, 59 samples were randomly drawn from each class in each of 20 rounds. In each round, the features with the best separability were selected; finally, the features that appeared in at least 80% of the rounds were retained. Table [98]2 presents the feature selection algorithms, the number of selected features, and the performance of the classifiers for each feature set. Note that each classifier was run ten times, and the results presented in Table [99]2 represent the average of the ten runs. In this stage, the data of healthy and Stage I lung cancer samples were used. Table 2.
Results of evaluating the performance of different feature selection algorithms on healthy and Stage I lung cancer samples using the NB, SVM, DT, and KNN classifiers, along with Borda count decision fusion, for the eight feature sets.

| FS method | Selected features | Classifier | Accuracy | Sensitivity | Specificity | Precision | F-measure |
| ABC | 775 | NB | 0.904 | 0.906 | 0.906 | 0.923 | 0.903 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.939 | 0.938 | 0.938 | 0.945 | 0.939 |
| | | KNN | 0.974 | 0.973 | 0.973 | 0.976 | 0.974 |
| HS | 656 | NB | 0.930 | 0.931 | 0.931 | 0.939 | 0.929 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.939 | 0.938 | 0.938 | 0.942 | 0.939 |
| | | KNN | 0.965 | 0.964 | 0.964 | 0.970 | 0.965 |
| DE | 942 | NB | 0.939 | 0.940 | 0.940 | 0.947 | 0.938 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.983 | 0.983 | 0.983 | 0.984 | 0.983 |
| | | KNN | 0.983 | 0.982 | 0.982 | 0.985 | 0.982 |
| ACO | 566 | NB | 0.930 | 0.930 | 0.930 | 0.946 | 0.929 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.974 | 0.973 | 0.973 | 0.976 | 0.974 |
| | | KNN | 0.983 | 0.982 | 0.982 | 0.985 | 0.982 |
| GA | 387 | NB | 0.965 | 0.967 | 0.967 | 0.968 | 0.965 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| | | KNN | 0.974 | 0.973 | 0.973 | 0.977 | 0.974 |
| SA | 213 | NB | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| | | SVM | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| | | DT | 0.965 | 0.965 | 0.965 | 0.967 | 0.965 |
| | | KNN | 0.983 | 0.982 | 0.982 | 0.985 | 0.982 |
| WCA | 354 | NB | 0.939 | 0.939 | 0.939 | 0.945 | 0.939 |
| | | SVM | 0.991 | 0.991 | 0.991 | 0.992 | 0.991 |
| | | DT | 0.930 | 0.930 | 0.930 | 0.933 | 0.930 |
| | | KNN | 0.965 | 0.964 | 0.964 | 0.970 | 0.965 |
| TLBO | 1559 | NB | 0.926 | 0.928 | 0.928 | 0.938 | 0.925 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.965 | 0.964 | 0.964 | 0.967 | 0.965 |
| | | KNN | 0.965 | 0.964 | 0.964 | 0.970 | 0.965 |
| Borda count decision fusion | | | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

Table [101]2 indicates the performance of the selected features. As shown in the last row of Table [102]2, the Borda count method was used for decision fusion to improve the performance of the classifiers. This method ranks the decisions made by the classifiers: the decision with the highest frequency receives a higher rank, and the final result is evaluated based on the weight assigned to each rank.
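The balanced-resampling scheme described above (equal numbers of samples per class drawn over 20 rounds, keeping the features chosen in at least 80% of the rounds) can be sketched as follows; `select_fn` is a placeholder for any of the eight metaheuristic selectors, not an actual function from the study.

```python
import numpy as np

def stable_features(select_fn, X, y, n_rounds=20, keep_frac=0.8, seed=0):
    """Balanced-resampling stability filter: each round draws an equal
    number of samples per class (the size of the smallest class, e.g.
    the 59 healthy samples), runs a feature selector on that balanced
    subset, and finally keeps the features chosen in at least
    `keep_frac` of the rounds.

    select_fn(X, y) -> boolean mask over the feature columns."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(X.shape[1])
    classes = np.unique(y)
    n_per_class = min(int((y == c).sum()) for c in classes)
    for _ in range(n_rounds):
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), n_per_class, replace=False)
            for c in classes
        ])
        votes += select_fn(X[idx], y[idx])   # boolean mask counted as 0/1
    return votes >= keep_frac * n_rounds
```

The 80% threshold trades sensitivity for robustness: a gene that is selected only under particular subsamples is treated as an artifact of the class imbalance rather than a stable marker.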
The Borda count decision fusion method improved performance, achieving 100% accuracy and precision. In almost all cases, the SVM classifier outperformed the other methods. Table [103]2 shows the number of features selected by each feature selection method. Next, the features selected by all methods were aggregated: features chosen by at least 4 of the 8 feature selection methods (608 features) and by at least 5 methods (42 features) were used for classification (Table [104]3). Accordingly, feature fusion was performed, and the different classification algorithms were run 10 times on the training and test data. Table [105]3 shows the results of the classifiers for the 608 and 42 features, which represent genes expressed in healthy and Stage I lung cancer RNA-seq samples. Figure [106]3 shows the ROC curves calculated for the features in Table [107]3; as the plots show, all four classifiers distinguish the healthy and cancer classes with almost 100% accuracy.

Table 3. Results of four classifiers based on the combination of the selected features shown in Table [108]2.

| FS method | Selected features | Classifier | Accuracy | SD of accuracy | Sensitivity | Precision | F-measure |
| At least 4 methods | 608 | NB | 0.965 | ±0.034 | 0.966 | 0.969 | 0.967 |
| | | SVM | 0.991 | ±0.015 | 0.991 | 0.992 | 0.991 |
| | | DT | 0.983 | ±0.023 | 0.983 | 0.983 | 0.983 |
| | | KNN | 0.957 | ±0.031 | 0.957 | 0.961 | 0.959 |
| At least 5 methods | 42 | NB | 1.000 | ±0.003 | 1.000 | 1.000 | 1.000 |
| | | SVM | 0.991 | ±0.012 | 0.992 | 0.992 | 0.992 |
| | | DT | 0.974 | ±0.029 | 0.974 | 0.974 | 0.974 |
| | | KNN | 0.983 | ±0.017 | 0.983 | 0.983 | 0.983 |

Fig. 3. ROC curve plots based on the combination of selected features on healthy and Stage I lung cancer samples.
In this figure, the first row is plotted for the 608 features and the second row for the 42 features in the bottom rows of Table [112]3.

Healthy group vs. Stage II lung cancer

Features that effectively classify healthy and Stage II lung cancer RNA-seq samples were selected. For this purpose, the eight metaheuristic algorithms were used for feature selection, along with the four classifiers. Table [113]4 presents the feature selection algorithms, the number of features, and the performance of the classifiers for each feature set. Each classifier was run ten times; the results presented in Table [114]4 represent the average of these runs. In this stage, the data of healthy and Stage II lung cancer samples were used. Table [115]4 demonstrates the performance of the selected features. As shown in the last row of the table, the Borda count algorithm, a decision ranking method, was used for decision fusion to improve the performance of the classifiers. In almost all cases, the SVM classifier outperformed the other methods. Next, the features selected by all methods were aggregated: features chosen by at least 4 of the 8 feature selection methods (681 features) and by at least 5 methods (55 features) were used for classification. The different classification algorithms were run 10 times on the training and test data. Table [116]5 shows the results of the classifiers for the 681 and 55 features. Figure [117]4 shows the ROC curves calculated for the features in Table [118]5; all four classifiers distinguish the healthy and cancer classes with almost 100% accuracy. Table 4.
Results of evaluating the performance of different feature selection algorithms on healthy and Stage II lung cancer samples using the NB, SVM, DT, and KNN classifiers, along with Borda count decision fusion, for the eight feature sets.

| FS method | Selected features | Classifier | Accuracy | Sensitivity | Specificity | Precision | F-measure |
| ABC | 739 | NB | 0.917 | 0.917 | 0.917 | 0.931 | 0.916 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.939 | 0.939 | 0.939 | 0.944 | 0.939 |
| | | KNN | 0.974 | 0.973 | 0.973 | 0.977 | 0.974 |
| HS | 691 | NB | 0.939 | 0.939 | 0.939 | 0.952 | 0.937 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.974 | 0.973 | 0.973 | 0.976 | 0.974 |
| | | KNN | 0.957 | 0.957 | 0.957 | 0.963 | 0.956 |
| DE | 1074 | NB | 0.907 | 0.908 | 0.908 | 0.929 | 0.903 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.957 | 0.957 | 0.957 | 0.958 | 0.956 |
| | | KNN | 0.965 | 0.965 | 0.965 | 0.969 | 0.965 |
| ACO | 628 | NB | 0.935 | 0.936 | 0.936 | 0.950 | 0.933 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.965 | 0.966 | 0.966 | 0.969 | 0.965 |
| | | KNN | 0.974 | 0.973 | 0.973 | 0.976 | 0.974 |
| GA | 379 | NB | 0.922 | 0.921 | 0.921 | 0.936 | 0.920 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.983 | 0.982 | 0.982 | 0.985 | 0.982 |
| | | KNN | 0.983 | 0.983 | 0.983 | 0.984 | 0.983 |
| SA | 935 | NB | 0.917 | 0.918 | 0.918 | 0.934 | 0.916 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.948 | 0.946 | 0.946 | 0.954 | 0.947 |
| | | KNN | 0.965 | 0.965 | 0.965 | 0.969 | 0.965 |
| WCA | 349 | NB | 0.930 | 0.930 | 0.930 | 0.943 | 0.929 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| | | KNN | 0.983 | 0.983 | 0.983 | 0.984 | 0.983 |
| TLBO | 793 | NB | 0.943 | 0.944 | 0.944 | 0.947 | 0.943 |
| | | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 0.957 | 0.956 | 0.956 | 0.962 | 0.956 |
| | | KNN | 0.939 | 0.939 | 0.939 | 0.949 | 0.938 |
| Borda count decision fusion | | | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

Table 5.
Results of four classifiers based on the combination of the selected features shown in Table [120]4.

| FS method | Selected features | Classifier | Accuracy | SD of accuracy | Sensitivity | Precision | F-measure |
| At least 4 methods | 681 | NB | 0.930 | ±0.034 | 0.930 | 0.939 | 0.934 |
| | | SVM | 1.000 | ±0.000 | 1.000 | 1.000 | 1.000 |
| | | DT | 1.000 | ±0.003 | 1.000 | 1.000 | 1.000 |
| | | KNN | 0.965 | ±0.024 | 0.967 | 0.968 | 0.967 |
| At least 5 methods | 55 | NB | 0.983 | ±0.011 | 0.983 | 0.984 | 0.983 |
| | | SVM | 0.965 | ±0.021 | 0.965 | 0.970 | 0.967 |
| | | DT | 0.939 | ±0.043 | 0.939 | 0.948 | 0.943 |
| | | KNN | 0.957 | ±0.031 | 0.958 | 0.960 | 0.959 |

Fig. 4. ROC curve plots based on the combination of selected features on healthy and Stage II lung cancer samples. In this figure, the first row is plotted for the 681 features and the second row for the 55 features in the bottom rows of Table [124]5.

Healthy group vs. Stage III and IV lung cancer

In this stage, data of healthy and Stage III & IV lung cancer samples were used to select suitable features and classify them. Table [125]6 presents the feature selection algorithms, the number of features, and the performance of the classifiers for each feature set. Each classifier was run ten times; the results presented in the following table represent the average of these runs. Table [126]6 demonstrates the performance of the selected features. As shown in the last row of the table, decision fusion was used to enhance the performance of the classifiers: the accuracy of the Borda count method was 100%, indicating the quality of the selected features. In almost all cases, the SVM classifier outperformed the other methods. Table 6.
Results of evaluating the performance of different feature selection algorithms on healthy and Stage III & IV lung cancer samples using the NB, SVM, DT, and KNN classifiers, along with Borda count decision fusion, for the eight feature sets.

| FS method | Classifier | Accuracy | Sensitivity | Specificity | Precision | F-measure |
| ABC | NB | 0.922 | 0.923 | 0.923 | 0.928 | 0.921 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.948 | 0.948 | 0.948 | 0.949 | 0.948 |
| | KNN | 0.974 | 0.975 | 0.975 | 0.976 | 0.974 |
| HS | NB | 0.922 | 0.923 | 0.923 | 0.933 | 0.921 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.991 | 0.991 | 0.991 | 0.992 | 0.991 |
| | KNN | 0.983 | 0.983 | 0.983 | 0.983 | 0.983 |
| DE | NB | 0.922 | 0.922 | 0.922 | 0.930 | 0.920 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.922 | 0.921 | 0.921 | 0.928 | 0.921 |
| | KNN | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| ACO | NB | 0.913 | 0.914 | 0.914 | 0.929 | 0.911 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.965 | 0.964 | 0.964 | 0.967 | 0.965 |
| | KNN | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| GA | NB | 0.904 | 0.905 | 0.905 | 0.915 | 0.903 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| | KNN | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| SA | NB | 0.902 | 0.902 | 0.902 | 0.914 | 0.900 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.974 | 0.974 | 0.974 | 0.976 | 0.974 |
| | KNN | 0.991 | 0.992 | 0.992 | 0.992 | 0.991 |
| WCA | NB | 0.922 | 0.921 | 0.921 | 0.929 | 0.921 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.957 | 0.956 | 0.956 | 0.959 | 0.956 |
| | KNN | 0.974 | 0.974 | 0.974 | 0.976 | 0.974 |
| TLBO | NB | 0.948 | 0.947 | 0.947 | 0.958 | 0.946 |
| | SVM | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| | DT | 0.957 | 0.956 | 0.956 | 0.957 | 0.956 |
| | KNN | 0.983 | 0.983 | 0.983 | 0.985 | 0.983 |
| Borda count decision fusion | | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |

Next, the features selected by all methods were aggregated. The features chosen by at least 4 of the 8 feature selection methods (587 features) and by at least 5 methods (35 features) are shown in Table [128]7. Figure [129]5 shows the ROC curves calculated for the features in Table [130]7; as the plots show, all four classifiers distinguish the healthy and cancer classes with almost 100% accuracy.
Table 7. Results of four classifiers based on the combination of the selected features shown in Table [131]6.

| FS method | Selected features | Classifier | Accuracy | SD of accuracy | Sensitivity | Precision | F-measure |
| At least 4 methods | 587 | NB | 0.930 | ±0.022 | 0.929 | 0.937 | 0.933 |
| | | SVM | 0.991 | ±0.014 | 0.991 | 0.992 | 0.991 |
| | | DT | 0.974 | ±0.019 | 0.974 | 0.974 | 0.974 |
| | | KNN | 0.983 | ±0.014 | 0.983 | 0.984 | 0.983 |
| At least 5 methods | 35 | NB | 1.000 | ±0.002 | 1.000 | 1.000 | 1.000 |
| | | SVM | 0.991 | ±0.011 | 0.991 | 0.992 | 0.991 |
| | | DT | 0.948 | ±0.036 | 0.946 | 0.954 | 0.950 |
| | | KNN | 0.906 | ±0.042 | 0.906 | 0.925 | 0.915 |

Fig. 5. ROC curve plots based on the combination of selected features on healthy and Stage III & IV lung cancer samples. In this figure, the first row is plotted for the 587 features and the second row for the 35 features in the bottom rows of Table [135]7.

Stage I vs. Stage II lung cancer

Based on the results obtained in the previous sections, healthy samples and the different stages of lung cancer could be distinguished with high accuracy using the extracted features. However, to better evaluate the performance of the methods, the same steps were applied to separate Stage I from Stage II. First, the eight feature selection algorithms were applied to the 292 Stage I samples and 123 Stage II samples. Then, to evaluate the performance of the extracted features, the NB, SVM, DT, and KNN classification algorithms were used. Table [136]1 in Supplementary Table S1 presents the results of feature selection and classification using these selected features. Additionally, the Borda count method was applied for decision fusion, the results of which are also displayed in Table [137]1 of Supplementary Table S1.
To evaluate the performance of the selected features in the feature fusion state, features that had been selected by at least 6, 7, and 8 feature selection methods were selected and then evaluated using classifiers of NB, SVM, DT, and KNN. Table [138]2 in Supplementary Table S1 shows the results of feature fusion and evaluation using different classification methods. Names of these features (genes) are provided in Supplementary Table S2. Supplementary Figure S4 indicates the PCA plot of all features (18,043 features), as well as that of 361 and 23 selected features for Stage I and Stage II lung cancer samples. Figure S1 in Supplementary Figures shows ROC curve plots, which are calculated for the features in Table [139]2 in Supplementary Table S1. Stage I vs. Stage III & IV lung cancer Features of Stage I and Stage III & IV lung cancer were evaluated. First, eight feature selection algorithms were applied to 292 Stage I samples and 110 Stage III & IV samples. Then, to evaluate the performance of the extracted features, classification algorithms of NB, SVM, DT, and KNN were used. Table [140]3 in Supplementary Table S1 presents the results of feature selection and classification using these selected features. Additionally, the Borda count method was applied for decision fusion, the results of which are displayed in Table [141]3 in Supplementary Table S1. To evaluate the performance of the selected features in the feature fusion state, features that had been selected by at least 6, 7, and 8 feature selection methods were selected and then evaluated using classifiers of NB, SVM, DT, and KNN. Table [142]4 in Supplementary Table S1 shows the results of feature fusion and evaluation using different classification methods. Names of these features (genes) are provided in Supplementary Table S2. Supplementary Figure S5 indicates the PCA plot of all features (18,043 features), as well as that of 321 and 33 selected features for Stage I and Stage III & IV lung cancer samples. 
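The "selected by at least k of the eight methods" feature fusion used throughout these comparisons can be sketched as a simple vote count; the gene names in the toy example below are arbitrary placeholders, not the study's selected genes.

```python
from collections import Counter

def fuse_features(selected_sets, min_methods):
    """Feature fusion by vote counting: pool the subsets returned by
    the individual selectors and keep the genes chosen by at least
    `min_methods` of them (e.g. at least 4, 5, 6, 7, or 8 of the
    eight metaheuristics)."""
    votes = Counter(g for s in selected_sets for g in set(s))
    return {g for g, v in votes.items() if v >= min_methods}

# Toy example with three hypothetical selectors:
sets = [{"EGFR", "KRAS", "TP53"}, {"EGFR", "KRAS"}, {"EGFR", "ALK"}]
common = fuse_features(sets, min_methods=2)   # genes picked by >= 2 selectors
```

Raising the threshold shrinks the fused set sharply (as seen in the drop from hundreds of genes at "at least 4" to a few dozen at "at least 5"), trading coverage for consensus.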
Figure S2 in Supplementary Figures shows ROC curve plots, which are calculated for the features in Table [143]4 in Supplementary Table S1. Stage II vs. Stage III & IV lung cancer In this stage, eight feature selection algorithms were first applied to 123 Stage II samples and 110 Stage III & IV samples. Then, to evaluate the performance of the extracted features, classification algorithms of NB, SVM, DT, and KNN were used. Table [144]5 in Supplementary Table S1 presents the results of feature selection and classification using these selected features. Additionally, the Borda count method was applied for decision fusion, the results of which are displayed in Table [145]5 in Supplementary Table S1. To evaluate the performance of the selected features in the feature fusion state, features that had been selected by at least 6, 7, and 8 feature selection methods were selected and then evaluated using classifiers of NB, SVM, DT, and KNN. Table [146]6 in Supplementary Table S1 shows the results of feature fusion and evaluation using different classification methods. Names of these features (genes) are provided in Supplementary Table S2. Supplementary Figure S6 indicates the PCA plot of all features (18,043 features), as well as that of 367 and 31 selected features for Stage II and Stage III & IV lung cancer samples. Figure S3 in Supplementary Figures shows ROC curve plots, which are calculated for the features in Table [147]6 in Supplementary Table S1. Figure [148]6 shows the PCA plot of healthy and Stage I lung cancer samples in three scenarios: All features, 608 selected features, and 42 selected features. As could be observed, all three scenarios effectively separated healthy and cancer samples, but the 608-feature scenario performed the best. Names of these features (genes) are listed in Supplementary Table S2. Based on the results shown in Tables [149]2, [150]3, and Fig. 
[151]6, it could be concluded that the selected features can separate healthy samples from Stage I lung cancer samples using RNA-seq data with an accuracy of nearly 100%.

Fig. 6. PCA plot using the 608 and 42 features selected by the eight feature selection methods, as well as all 18,043 features, for healthy and Stage I lung cancer samples.

Also, Fig. [154]7 displays the PCA plot of healthy and Stage II lung cancer samples in three scenarios: all features, 681 selected features, and 55 selected features. All three scenarios effectively separated healthy and cancer samples, but the 681-feature scenario performed the best. The 681 and 55 selected features represent genes expressed in healthy and Stage II lung cancer RNA-seq samples. Names of these features (genes) are listed in Supplementary Table S2.

Fig. 7. PCA plot using the 681 and 55 features selected by the eight feature selection methods, as well as all 18,043 features, for healthy and Stage II lung cancer samples.

Additionally, Fig. [157]8 illustrates the PCA plot of healthy and Stage III & IV lung cancer samples in three scenarios: all features, 587 selected features, and 35 selected features. All three scenarios effectively separated healthy and cancer samples, but the 587-feature scenario performed the best, achieving the highest classification accuracy. Names of these features (genes) are listed in Supplementary Table S2.

Fig. 8. PCA plot using the 587 and 35 features selected by the eight feature selection methods shown in Table [160]7, as well as all 18,043 features, for healthy and Stage III & IV lung cancer samples.

Figures S4, S5, and S6 illustrate the PCA diagrams for the comparisons between lung cancer stages.
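PCA projections of the kind shown in these figures can be computed directly from a centered expression matrix via the SVD. The matrix below is synthetic stand-in data (the actual plots use the selected gene subsets from TCGA-LUAD); the class shift of 3.0 across 10 genes is an assumption chosen only to make two visible clusters.

```python
import numpy as np

# Synthetic stand-in: 100 samples x 50 genes, with the first 50 samples
# ("class 1") up-shifted in 10 informative genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
X[:50, :10] += 3.0

Xc = X - X.mean(axis=0)                   # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T                    # first two principal components
```

Plotting `coords` colored by class reproduces the qualitative picture in the figures: when the selected genes carry the class signal, the two groups separate along the first principal component.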
Figure S4 presents the PCA diagram for Stage I and Stage II lung cancer samples in three cases: 361 selected features, 23 selected features, and all 18,043 features. All three cases effectively separate Stage I and Stage II samples, with the 361-feature case performing the best owing to its favorable number of features and high classification accuracy. Similarly, Figure S5 displays the PCA diagram for Stage I and Stage III & IV lung cancer samples with 321 selected features, 33 selected features, and all 18,043 features. Again, all three cases successfully distinguish Stage I from Stage III & IV samples, with the 321-feature case being the most effective for the same reasons. Lastly, Figure S6 depicts the PCA diagram for Stage II and Stage III & IV lung cancer samples with 367 selected features, 31 selected features, and all 18,043 features. All three cases clearly separate Stage II from Stage III & IV samples, with the 367-feature case demonstrating superior performance.

Analyzing the obtained results

Comparing the different cancer stages showed that Stage I vs. Stage III & IV had higher classification accuracy than Stage I vs. Stage II or Stage II vs. Stage III & IV. In other words, stages that are closer to each other are separated with lower accuracy than stages that are farther apart. To analyze the results obtained in the feature selection section, another method, data correlation, was used for feature selection. In this method, features that had a high correlation with the other features were selected: correlation thresholds between −0.8 and 0.8 were considered, and the features with the highest correlation with the other features were selected. These features were then ranked according to their correlation, and classification was performed on the data (executed 10 times).
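The correlation-based baseline is described only briefly, so the following sketch is one plausible reading of it, not the authors' exact procedure: compute the feature-feature correlation matrix, count for each feature how many of its correlations exceed the ±0.8 thresholds in magnitude, and keep the most strongly correlated features.

```python
import numpy as np

def correlation_ranked_features(X, low=-0.8, high=0.8, top_k=60):
    """One interpretation of the correlation-based baseline: rank
    features by how many of their pairwise correlations fall outside
    [low, high], then keep the top_k. top_k = 60 mirrors the Stage I
    vs. Stage II setting quoted in Table 9."""
    corr = np.corrcoef(X, rowvar=False)          # features x features
    np.fill_diagonal(corr, 0.0)                  # ignore self-correlation
    strength = ((corr <= low) | (corr >= high)).sum(axis=0)
    return np.argsort(strength)[::-1][:top_k]    # most-connected features first
```

Note that for 18,043 features the full correlation matrix is large (about 326 million entries), which is consistent with the count reported below.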
The total number of pairwise correlations was 18,043 × 18,043 = 325,549,849. Table [161]8 shows the distribution of correlations between the 18,043 features for the Stage I vs. Stage II, Stage I vs. Stage III & IV, and Stage II vs. Stage III & IV samples, expressed as percentages at the thresholds −0.8, −0.5, 0.5, and 0.8. Table [162]9 evaluates the results of feature selection by the data correlation method for the same three sample pairs, including the best selection and the average of 50 runs.

Table 8. Analysis of correlation between 18,043 features in different lung cancer stages.

| Samples | Correlation ≤ −0.8 | Correlation ≤ −0.5 | Correlation ≥ 0.5 | Correlation ≥ 0.8 |
| Stage I and Stage II | 0.01% | 0.84% | 0.26% | 0% |
| Stage I and Stage III & IV | 0.01% | 0.77% | 0.22% | 0% |
| Stage II and Stage III & IV | 0.01% | 0.4% | 0.05% | 0% |

Table 9. Results of feature selection using the data correlation method for the stage pairs (best results and average results of 50 runs).

Stage I and Stage II samples (60 features)
| | Accuracy | Sensitivity | Precision | F-measure |
| Best results | 0.7108 | 0.5124 | 0.6080 | 0.4527 |
| | 0.7108 | 0.5000 | 0.3554 | 0.4155 |
| | 0.7229 | 0.6444 | 0.6571 | 0.6494 |
| | 0.7229 | 0.5332 | 0.6958 | 0.4917 |
| Average of 50 runs | 0.5343 | 0.5363 | 0.5302 | 0.5068 |
| | 0.6252 | 0.5267 | 0.5125 | 0.5135 |
| | 0.5885 | 0.5131 | 0.5132 | 0.5111 |
| | 0.6774 | 0.4928 | 0.4656 | 0.4372 |

Stage I and Stage III & IV samples (230 features)
| | Accuracy | Sensitivity | Precision | F-measure |
| Best results | 0.6750 | 0.6912 | 0.6540 | 0.6484 |
| | 0.7125 | 0.6324 | 0.6364 | 0.6343 |
| | 0.7375 | 0.6638 | 0.6687 | 0.6661 |
| | 0.7625 | 0.5823 | 0.7800 | 0.5767 |
| Average of 50 runs | 0.5944 | 0.5953 | 0.5767 | 0.5602 |
| | 0.6554 | 0.5486 | 0.5428 | 0.5387 |
| | 0.6237 | 0.5350 | 0.5341 | 0.5323 |
| | 0.7134 | 0.5165 | 0.5664 | 0.5513 |

Stage II and Stage III & IV samples (241 features)
| | Accuracy | Sensitivity | Precision | F-measure |
| Best results | 0.6522 | 0.6477 | 0.6548 | 0.6462 |
| | 0.6304 | 0.6307 | 0.6304 | 0.6303 |
| | 0.5435 | 0.5417 | 0.5419 | 0.5415 |
| | 0.5652 | 0.5511 | 0.5888 | 0.5054 |
| Average of 50 runs | 0.5041 | 0.5071 | 0.5074 | 0.5000 |
| | 0.5088 | 0.5063 | 0.5046 | 0.4995 |
| | 0.5059 | 0.5045 | 0.5043 | 0.5013 |
| | 0.4944 | 0.4852 | 0.4825 | 0.4614 |

Based on the results obtained from the correlation method, the feature selection results of the approach proposed in this study (the eight metaheuristic methods) were equal to or better than those of alternative methods such as correlation-based feature selection, supporting the validity of the approach used in this study. To validate and verify the obtained biomarkers, enrichment analysis was performed. This pathway-based approach helps researchers gain mechanistic insight into previously generated gene lists. Our gene list comprised the target genes selected during the first and second steps, analyzed to better understand the function of these genes in the cell. Pathway enrichment analysis identifies biological pathways that are enriched in the gene list beyond what would be expected by chance. Therefore, if the output of the second step indicates that the genes identified at this stage have a specific function and are involved in a particular pathway, the result should be statistically significant, ensuring that the identified gene set is significantly correlated and not randomly selected. The procedure was carried out as follows: first, a list of pathways associated with lung cancer, as reported in papers from 2022 to 2023, was compiled (Table [165]10) (References are listed