Abstract

Objective

   Breast cancer, a global concern predominantly impacting women, poses a
   significant threat when not identified early. While survival rates for
   breast cancer patients are typically favorable, the emergence of
   regional metastases markedly diminishes survival prospects. Detecting
   metastases and comprehending their molecular underpinnings are crucial
   for tailoring effective treatments and improving patient survival
   outcomes.

Methods

   Various artificial intelligence methods and techniques were employed in
   this study to achieve accurate outcomes. Initially, the data was
   organized and underwent hold-out cross-validation, data cleaning, and
   normalization. Subsequently, feature selection was conducted using
   ANOVA and binary Particle Swarm Optimization (PSO). During the analysis
   phase, the discriminative power of the selected features was evaluated
   using machine learning classification algorithms. Finally, the selected
   features were considered, and the SHAP algorithm was utilized to
   identify the most significant features for enhancing the decoding of
   dominant molecular mechanisms in lymph node metastases.

Results

   In this study, five main steps were followed for the analysis of mRNA
   expression data: reading, preprocessing, feature selection,
   classification, and SHAP algorithm. The RF classifier utilized the
   candidate mRNAs to differentiate between negative and positive
   categories with an accuracy of 61% and an AUC of 0.6. During the SHAP
   process, intriguing relationships between the selected mRNAs and
   positive/negative lymph node status were discovered. The results
   indicate that GDF5, BAHCC1, LCN2, FGF14-AS2, and IDH2 are among the top
   five most impactful mRNAs based on their SHAP values.

Conclusion

   The prominent identified mRNAs including GDF5, BAHCC1, LCN2, FGF14-AS2,
   and IDH2, are implicated in lymph node metastasis. This study holds
   promise in elucidating a thorough insight into key candidate genes that
   could significantly impact the early detection and tailored therapeutic
   strategies for lymph node metastasis in patients with breast cancer.

Introduction

   Breast cancer is a widespread and fatal form of cancer that affects
   women globally [[40]1]. In 2023, 297,790 new cases of breast cancer and
   43,700 deaths were reported in the US [[41]2]. The survival rate for
   breast cancer patients is high, with 90% and 83% at 5 and 10 years,
   respectively [[42]3]. However, the presence of metastases in breast
   cancer patients leads to a notable decrease in survival rates. In cases
   where there are regional metastases, such as in the lymph nodes, the
   5-year survival rate drops to 85%. If the metastases are distant, the
   rate decreases further to 26% [[43]4]. Identifying the presence of
   metastases is of utmost importance as it allows for appropriate
   treatment and ultimately enhances patient survival rates.

   When looking for the presence of metastases, the first step is to
   examine the regional lymph nodes. The presence of metastases in these
   lymph nodes is a poor prognostic factor only, while it is the main
   factor in predicting the presence of distant metastases [[44]5]. In the
   case of breast cancer, the most common method of evaluating the
   regional lymph node status is the sentinel lymph node procedure [[45]6,
   [46]7]. This procedure involves injecting a blue dye or radioactive
   tracer near the tumor. The first lymph node reached by the injected
   substance, known as the sentinel lymph node, is likely to contain
   metastatic cancer cells and should be removed. The removed lymph node
   is then sent for pathological processing and analysis by a pathologist.

   Evaluating lymph node status is a crucial task for pathologists, but it
   comes with a challenge. The large area of tissue that needs to be
   checked is extensive, and it can be hard to identify metastases that
   can be as small as single cells. In the case of sentinel lymph nodes,
   at least three sections at different levels through the lymph node have
   to be examined. At least ten lymph nodes of non-sentinel nodes must be
   examined [[47]8, [48]9]. This process can be tedious and
   time-consuming, and pathologists may miss small metastases [[49]10]. To
   address this, pathologists in the Netherlands performed a secondary
   examination using immune histochemical staining for cytokeratin if
   inspection of the H&E slide reveals no metastases. However, even in
   this secondary examination, metastases can still be missed [[50]11].

   With the advent of personalized treatment, traditional prognostic
   biomarkers such as tumor size, tumor grade, and lymph node metastases
   are no longer sufficient for the effective management of
   early-diagnosed breast cancer patients [[51]12, [52]13]. As a result,
   extensive research has been conducted to identify and validate
   molecular biomarkers in recent years that can serve as prognostic and
   predictive indicators. These new approaches are commonly referred to as
   multi-parameter, multi-analyte, and multi-gene tests. Several of these
   tests, recommended by experts, are currently used in clinical practice.
   Some validated examinations include Oncotype DX, MammaPrint, and
   uPA/PAI-1. The Oncotype DX test is a widely used multigene signature
   test that helps predict the risk of breast cancer recurrence by
   evaluating the expression of 21 mRNA genes and then calculates a
   recurrence score based on the relative expression of these genes
   [[53]14, [54]15]. However, it should be noted that Oncotype DX has some
   limitations, such as a lack of validation for long-term follow-ups and
   ER-negative cases. Another molecular test called MammaPrint uses
   micro-array to evaluate the relative expression of genes associated
   with regulatory pathways of cancer, specifically 70 genes. MammaPrint
   is a validated test for predicting cancer recurrence and dividing
   patients into high-risk and low-risk groups [[55]16–[56]20]. In
   addition to these tests, there is another molecular test that evaluates
   protein levels. This test measures uPA and PAI-1 markers by extracting
   breast cancer tissues [[57]21]. Studies have shown that elevated levels
   of these proteins result in more severe outcomes in patients.

   These multigene signature tests are exorbitant in many countries. To
   set up a simple and inexpensive test to serve as a diagnostic and
   predictive biomarker test, considerable effort has been devoted. To
   discover a novel biomarker, databases could be considered a helpful
   tool as well. Recently, many researchers have been working on
   circulating tumor cells (CTCs), microRNAs, and DNA mutation testing
   (such as the measurement of ctDNA) to find new prognostic and
   predictive markers. These novel biomarkers before their clinical
   applications should be validated through clinical and analytical
   assessments to start our journey towards a personalized treatment for
   early-diagnosed patients with breast cancer, we need established
   prognostic biomarkers in combination with validated
   prognostic/predictive factors.

   In this study, we aimed to create a new roadmap for the development of
   robust biomarkers that can accurately detect the presence of lymph node
   metastases. The significance of this work lies in its ability to
   provide effective pathways for identifying patients who may have
   complex conditions where traditional clinical and imaging diagnostic
   methods fail to detect metastases despite the presence of metastases
   from a pathological perspective. Our research has introduced new
   perspectives and tools to assist pathologists and surgeons in screening
   and classifying such complex and ambiguous conditions and to facilitate
   decision-making when considering the possible removal of axillary lymph
   nodes.

Material and methods

Material

   The NCBI data portal ([58]https://www.ncbi.nlm.nih.gov/geo/) provides
   mRNA-seq data from the GEO repository. For this study, we used dataset
   number [59]GSE96058, which contains 30,865 gene expressions (mRNAs) for
   3409 breast cancer patients measured using the [60]GPL11154 platform
   [[61]22, [62]23]. This dataset is a subset of the multi-center
   prospective cohort study Sweden Cancerome Analysis Network—Breast
   [SCAN-B], in which gene expression data is collected from multiple
   clinical centers. This platform applied the Illumina HiSeq 2000
   technique, in which gene expression levels are reported using 54,715
   Probes for 30,865 genes. In addition, clinical data of breast cancer
   patients were obtained from the same dataset. The breast cancer
   patients were divided into two groups based on their lymph node status:
   2099 samples were in the negative lymph node group, while 1209 were in
   the positive lymph node group ([63]Table 1). Informed consent to
   participate was obtained from all patients by the Regional Ethical
   Review Board of Lund (diary numbers 2007/155, 2009/658, 2009/659,
   2014/8), the county governmental biobank center, and the Swedish Data
   Inspection group (diary number 364–2010).

Table 1. [64]GSE96058 information.

   [65]GSE96058 (Lymph Node Status)
   Positive Negative  Not reported
     1209     2099         110
   [66]Open in a new tab

   The study was conducted by the principles of the Declaration of
   Helsinki (2013), and we received permission to access the research data
   file from the NCBI-GEO program through the National Cancer Institute in
   the United States. Since the NCBI-GEO data is publicly available, the
   local ethics committee waived the need for approval. The local ethics
   code is IR.NIMAD.REC.1400.025.

Method

   The proposed approach for finding significant mRNAs involves five
   steps: reading, pre-processing, feature selection, classification, and
   applying the SHAP algorithm. [67]Fig 1 illustrates the complete process
   and provides additional details for each step.

Fig 1. The overview of the proposed method.

   [68]Fig 1
   [69]Open in a new tab

   Five main steps, including reading, preprocessing, feature selection,
   classification, and SHAP algorithm were applied to the mRNA expression
   data. 1) Required data was collected from the NCBI-GEO repository and
   organized during the reading step. 2) The pre-processing step includes
   two sub-steps, cross-validation and data normalization. 3) The
   feature-selection step contains two parts: the filter method based on
   ANOVA and the wrapper method based on Particle Swarm Optimization (PSO)
   for mRNA data, in which candidate mRNAs with more relevance to positive
   and negative Papillary lymph node groups were selected. 4)
   Multi-classifier models were utilized to evaluate the discrimination
   power of the selected mRNAs. 5) The SHAP algorithm was employed to
   discover the possible relationship between the selected mRNAs and
   positive and negative groups.

   In the reading step, the mRNA data was organized into a matrix with
   3308 rows and 30,865 columns, representing the number of samples and
   features respectively. To achieve a more authentic error estimation,
   the hold-out Cross-Validation (CV) approach was used to separate the
   data into train, validation, and test proportions. The train,
   validation, and test sets were set to 70%, 10%, and 20%, respectively.
   Additionally, some feature columns with identical values for all
   samples of the training set were removed. Finally, to normalize the
   feature selection and classification steps, the z-score and min-max
   methods were employed.

   To reduce the number of irrelevant attributes, we implemented a
   two-part feature-selection process that consisted of a
   classifier-independent filter method and a wrapper method [[70]24].
   During the filter phase, we used ANOVA [[71]25] to evaluate each
   feature individually and reduce the dimension of the mRNA data. Due to
   class imbalance in the training set, we used the SMOTE algorithm
   [[72]26] to reduce its effect on ANOVA. This also helped to reduce the
   computational cost during the subsequent wrapper step. From the
   training set, we selected the 200 top features based on their F-values.

   We used a method called Particle Swarm Optimization (PSO) [[73]27] to
   select the most important features. This method is based on swarm
   intelligence [[74]28] and requires a classifier, in our case Random
   Forest (RF), to evaluate the fitness function. To ensure accuracy, we
   chose the fitness function based on AUC and measured its value using a
   validation set. The algorithm parameters, including the number of
   population and iterations, were set to 35 and 100 respectively. After
   running the binary PSO method, we selected 109 significant mRNAs based
   on the output.

   Different machine learning models are not inherently better than
   others, as each model can outperform the others depending on the
   problem at hand. Therefore, it is crucial to employ various methods to
   address the issue, particularly in the classification domain.
   Subsequently, the best-performing model, free from bias, is selected
   based on its performance. To determine the most suitable model, various
   parameters are taken into consideration, with the AUC being one of the
   most valuable parameters according to machine learning literature. We
   used several supervised classifiers, such as Support Vector Machine
   (SVM) [[75]29], Naive Bayes (NB), K-Nearest Neighbor (KNN) [[76]30],
   and Random Forest (RF) [[77]31], to determine the differentiation power
   of the selected mRNAs. For each classifier, important evaluation
   metrics including accuracy and AUC were calculated.

   In the next step, we studied significant relationships using the
   SHapley Additive exPlanation (SHAP) algorithm [[78]32]. We extracted
   SHAP values regarding the effect of selected mRNAs in model predictions
   (positive/negative lymph node status). It is crucial to accurately
   understand the output of a prediction model. This helps build user
   trust, identify areas for improvement, and gain insights into the
   modelling process. Simple models, such as linear models, are often
   preferred in certain applications due to their ease of interpretation,
   even if they may not be as accurate as complex models. However, the
   increasing availability of big data has highlighted the trade-off
   between a model’s accuracy and interpretability. Various methods have
   been proposed to address this issue, but there is still a lack of
   understanding about how these methods compare and when to use one over
   the other.

   The SHAP algorithm is an effective tool for explaining the role of each
   input variable in model predictions. The SHAP technique employs
   simplified explanation models that yield close approximations to the
   original predictive model. These models help in explaining complex
   machine learning models.

   The detailed technical explanation for each of the stages mentioned in
   the methodology has been comprehensively covered in the original
   references. Additionally, the output results of each step have been