Abstract Objective Breast cancer, a global concern predominantly impacting women, poses a significant threat when not identified early. While survival rates for breast cancer patients are typically favorable, the emergence of regional metastases markedly diminishes survival prospects. Detecting metastases and comprehending their molecular underpinnings are crucial for tailoring effective treatments and improving patient survival outcomes. Methods Various artificial intelligence methods and techniques were employed in this study to achieve accurate outcomes. Initially, the data was organized and underwent hold-out cross-validation, data cleaning, and normalization. Subsequently, feature selection was conducted using ANOVA and binary Particle Swarm Optimization (PSO). During the analysis phase, the discriminative power of the selected features was evaluated using machine learning classification algorithms. Finally, the selected features were considered, and the SHAP algorithm was utilized to identify the most significant features for enhancing the decoding of dominant molecular mechanisms in lymph node metastases. Results In this study, five main steps were followed for the analysis of mRNA expression data: reading, preprocessing, feature selection, classification, and SHAP algorithm. The RF classifier utilized the candidate mRNAs to differentiate between negative and positive categories with an accuracy of 61% and an AUC of 0.6. During the SHAP process, intriguing relationships between the selected mRNAs and positive/negative lymph node status were discovered. The results indicate that GDF5, BAHCC1, LCN2, FGF14-AS2, and IDH2 are among the top five most impactful mRNAs based on their SHAP values. Conclusion The prominent identified mRNAs including GDF5, BAHCC1, LCN2, FGF14-AS2, and IDH2, are implicated in lymph node metastasis. This study holds promise in elucidating a thorough insight into key candidate genes that could significantly impact the early detection and tailored therapeutic strategies for lymph node metastasis in patients with breast cancer. Introduction Breast cancer is a widespread and fatal form of cancer that affects women globally [[40]1]. In 2023, 297,790 new cases of breast cancer and 43,700 deaths were reported in the US [[41]2]. The survival rate for breast cancer patients is high, with 90% and 83% at 5 and 10 years, respectively [[42]3]. However, the presence of metastases in breast cancer patients leads to a notable decrease in survival rates. In cases where there are regional metastases, such as in the lymph nodes, the 5-year survival rate drops to 85%. If the metastases are distant, the rate decreases further to 26% [[43]4]. Identifying the presence of metastases is of utmost importance as it allows for appropriate treatment and ultimately enhances patient survival rates. When looking for the presence of metastases, the first step is to examine the regional lymph nodes. The presence of metastases in these lymph nodes is a poor prognostic factor only, while it is the main factor in predicting the presence of distant metastases [[44]5]. In the case of breast cancer, the most common method of evaluating the regional lymph node status is the sentinel lymph node procedure [[45]6, [46]7]. This procedure involves injecting a blue dye or radioactive tracer near the tumor. The first lymph node reached by the injected substance, known as the sentinel lymph node, is likely to contain metastatic cancer cells and should be removed. The removed lymph node is then sent for pathological processing and analysis by a pathologist. Evaluating lymph node status is a crucial task for pathologists, but it comes with a challenge. The large area of tissue that needs to be checked is extensive, and it can be hard to identify metastases that can be as small as single cells. In the case of sentinel lymph nodes, at least three sections at different levels through the lymph node have to be examined. At least ten lymph nodes of non-sentinel nodes must be examined [[47]8, [48]9]. This process can be tedious and time-consuming, and pathologists may miss small metastases [[49]10]. To address this, pathologists in the Netherlands performed a secondary examination using immune histochemical staining for cytokeratin if inspection of the H&E slide reveals no metastases. However, even in this secondary examination, metastases can still be missed [[50]11]. With the advent of personalized treatment, traditional prognostic biomarkers such as tumor size, tumor grade, and lymph node metastases are no longer sufficient for the effective management of early-diagnosed breast cancer patients [[51]12, [52]13]. As a result, extensive research has been conducted to identify and validate molecular biomarkers in recent years that can serve as prognostic and predictive indicators. These new approaches are commonly referred to as multi-parameter, multi-analyte, and multi-gene tests. Several of these tests, recommended by experts, are currently used in clinical practice. Some validated examinations include Oncotype DX, MammaPrint, and uPA/PAI-1. The Oncotype DX test is a widely used multigene signature test that helps predict the risk of breast cancer recurrence by evaluating the expression of 21 mRNA genes and then calculates a recurrence score based on the relative expression of these genes [[53]14, [54]15]. However, it should be noted that Oncotype DX has some limitations, such as a lack of validation for long-term follow-ups and ER-negative cases. Another molecular test called MammaPrint uses micro-array to evaluate the relative expression of genes associated with regulatory pathways of cancer, specifically 70 genes. MammaPrint is a validated test for predicting cancer recurrence and dividing patients into high-risk and low-risk groups [[55]16–[56]20]. In addition to these tests, there is another molecular test that evaluates protein levels. This test measures uPA and PAI-1 markers by extracting breast cancer tissues [[57]21]. Studies have shown that elevated levels of these proteins result in more severe outcomes in patients. These multigene signature tests are exorbitant in many countries. To set up a simple and inexpensive test to serve as a diagnostic and predictive biomarker test, considerable effort has been devoted. To discover a novel biomarker, databases could be considered a helpful tool as well. Recently, many researchers have been working on circulating tumor cells (CTCs), microRNAs, and DNA mutation testing (such as the measurement of ctDNA) to find new prognostic and predictive markers. These novel biomarkers before their clinical applications should be validated through clinical and analytical assessments to start our journey towards a personalized treatment for early-diagnosed patients with breast cancer, we need established prognostic biomarkers in combination with validated prognostic/predictive factors. In this study, we aimed to create a new roadmap for the development of robust biomarkers that can accurately detect the presence of lymph node metastases. The significance of this work lies in its ability to provide effective pathways for identifying patients who may have complex conditions where traditional clinical and imaging diagnostic methods fail to detect metastases despite the presence of metastases from a pathological perspective. Our research has introduced new perspectives and tools to assist pathologists and surgeons in screening and classifying such complex and ambiguous conditions and to facilitate decision-making when considering the possible removal of axillary lymph nodes. Material and methods Material The NCBI data portal ([58]https://www.ncbi.nlm.nih.gov/geo/) provides mRNA-seq data from the GEO repository. For this study, we used dataset number [59]GSE96058, which contains 30,865 gene expressions (mRNAs) for 3409 breast cancer patients measured using the [60]GPL11154 platform [[61]22, [62]23]. This dataset is a subset of the multi-center prospective cohort study Sweden Cancerome Analysis Network—Breast [SCAN-B], in which gene expression data is collected from multiple clinical centers. This platform applied the Illumina HiSeq 2000 technique, in which gene expression levels are reported using 54,715 Probes for 30,865 genes. In addition, clinical data of breast cancer patients were obtained from the same dataset. The breast cancer patients were divided into two groups based on their lymph node status: 2099 samples were in the negative lymph node group, while 1209 were in the positive lymph node group ([63]Table 1). Informed consent to participate was obtained from all patients by the Regional Ethical Review Board of Lund (diary numbers 2007/155, 2009/658, 2009/659, 2014/8), the county governmental biobank center, and the Swedish Data Inspection group (diary number 364–2010). Table 1. [64]GSE96058 information. [65]GSE96058 (Lymph Node Status) Positive Negative Not reported 1209 2099 110 [66]Open in a new tab The study was conducted by the principles of the Declaration of Helsinki (2013), and we received permission to access the research data file from the NCBI-GEO program through the National Cancer Institute in the United States. Since the NCBI-GEO data is publicly available, the local ethics committee waived the need for approval. The local ethics code is IR.NIMAD.REC.1400.025. Method The proposed approach for finding significant mRNAs involves five steps: reading, pre-processing, feature selection, classification, and applying the SHAP algorithm. [67]Fig 1 illustrates the complete process and provides additional details for each step. Fig 1. The overview of the proposed method. [68]Fig 1 [69]Open in a new tab Five main steps, including reading, preprocessing, feature selection, classification, and SHAP algorithm were applied to the mRNA expression data. 1) Required data was collected from the NCBI-GEO repository and organized during the reading step. 2) The pre-processing step includes two sub-steps, cross-validation and data normalization. 3) The feature-selection step contains two parts: the filter method based on ANOVA and the wrapper method based on Particle Swarm Optimization (PSO) for mRNA data, in which candidate mRNAs with more relevance to positive and negative Papillary lymph node groups were selected. 4) Multi-classifier models were utilized to evaluate the discrimination power of the selected mRNAs. 5) The SHAP algorithm was employed to discover the possible relationship between the selected mRNAs and positive and negative groups. In the reading step, the mRNA data was organized into a matrix with 3308 rows and 30,865 columns, representing the number of samples and features respectively. To achieve a more authentic error estimation, the hold-out Cross-Validation (CV) approach was used to separate the data into train, validation, and test proportions. The train, validation, and test sets were set to 70%, 10%, and 20%, respectively. Additionally, some feature columns with identical values for all samples of the training set were removed. Finally, to normalize the feature selection and classification steps, the z-score and min-max methods were employed. To reduce the number of irrelevant attributes, we implemented a two-part feature-selection process that consisted of a classifier-independent filter method and a wrapper method [[70]24]. During the filter phase, we used ANOVA [[71]25] to evaluate each feature individually and reduce the dimension of the mRNA data. Due to class imbalance in the training set, we used the SMOTE algorithm [[72]26] to reduce its effect on ANOVA. This also helped to reduce the computational cost during the subsequent wrapper step. From the training set, we selected the 200 top features based on their F-values. We used a method called Particle Swarm Optimization (PSO) [[73]27] to select the most important features. This method is based on swarm intelligence [[74]28] and requires a classifier, in our case Random Forest (RF), to evaluate the fitness function. To ensure accuracy, we chose the fitness function based on AUC and measured its value using a validation set. The algorithm parameters, including the number of population and iterations, were set to 35 and 100 respectively. After running the binary PSO method, we selected 109 significant mRNAs based on the output. Different machine learning models are not inherently better than others, as each model can outperform the others depending on the problem at hand. Therefore, it is crucial to employ various methods to address the issue, particularly in the classification domain. Subsequently, the best-performing model, free from bias, is selected based on its performance. To determine the most suitable model, various parameters are taken into consideration, with the AUC being one of the most valuable parameters according to machine learning literature. We used several supervised classifiers, such as Support Vector Machine (SVM) [[75]29], Naive Bayes (NB), K-Nearest Neighbor (KNN) [[76]30], and Random Forest (RF) [[77]31], to determine the differentiation power of the selected mRNAs. For each classifier, important evaluation metrics including accuracy and AUC were calculated. In the next step, we studied significant relationships using the SHapley Additive exPlanation (SHAP) algorithm [[78]32]. We extracted SHAP values regarding the effect of selected mRNAs in model predictions (positive/negative lymph node status). It is crucial to accurately understand the output of a prediction model. This helps build user trust, identify areas for improvement, and gain insights into the modelling process. Simple models, such as linear models, are often preferred in certain applications due to their ease of interpretation, even if they may not be as accurate as complex models. However, the increasing availability of big data has highlighted the trade-off between a model’s accuracy and interpretability. Various methods have been proposed to address this issue, but there is still a lack of understanding about how these methods compare and when to use one over the other. The SHAP algorithm is an effective tool for explaining the role of each input variable in model predictions. The SHAP technique employs simplified explanation models that yield close approximations to the original predictive model. These models help in explaining complex machine learning models. The detailed technical explanation for each of the stages mentioned in the methodology has been comprehensively covered in the original references. Additionally, the output results of each step have been