Abstract Patients with locally advanced rectal cancer (LARC) show substantial individual variability and a pronounced imbalance in response distribution to neoadjuvant chemoradiotherapy (nCRT), posing significant challenges to treatment response prediction. This study aims to identify effective predictive biomarkers and develop an ensemble learning-based prediction model to assess the response of LARC patients to nCRT. A two-step feature selection method was developed to identify predictive biomarkers by deriving stable reversal gene pairs through within-sample relative expression orderings (REOs) from LARC patients undergoing nCRT. Preliminary screening utilized four methods—MDFS, Boruta, MCFS, and VSOLassoBag—to form a candidate feature set. Secondary screening ranked these features by permutation importance, applying Incremental Feature Selection (IFS) with an Extreme Gradient Boosting (XGBoost) classifier to determine final predictive gene pairs. The ensemble model BoostForest, combining boosting and bagging, served as the predictive framework, with SHAP employed for interpretability. Through two-step feature selection, the 32-gene pair signature (32-GPS) was established as the final predictive biomarker. In the test set, the model achieved an area under the precision-recall curve (AUPRC) of 0.983 and an accuracy of 0.988. In the validation cohort, the AUPRC was 0.785, with an accuracy of 0.898, indicating strong model performance. The study further demonstrated that BoostForest achieved superior overall performance compared to Random Forest, Support Vector Machine (SVM), and XGBoost. To evaluate the effectiveness of the 32-GPS, its performance was compared with two alternative feature sets: the lasso-gene pair signature (lasso-GPS), derived through lasso regression, and the 15-shared gene pair signature (15-SGPS), consisting of gene pairs identified by all four feature selection methods. The 32-GPS demonstrated superior performance in both comparisons. The two-step feature selection method identified robust predictive biomarkers, and BoostForest outperformed Random Forest, Support Vector Machine, and XGBoost in classification performance and predictive capability. Supplementary Information The online version contains supplementary material available at 10.1038/s41598-025-94337-y. Keywords: Feature selection, Locally advanced rectal cancer, Neoadjuvant chemoradiotherapy, Class imbalance problem, Ensemble learning Subject terms: Biomedical engineering, Predictive markers Introduction Neoadjuvant chemoradiotherapy (nCRT) prior to total mesorectal excision (TME) in locally advanced rectal cancer (LARC) patients reduces tumor volume, minimizes the risk of local recurrence, and improves the likelihood of successful resection, thereby enhancing prognosis^[38]1–[39]3. Additionally, patients who achieve pathological complete response (pCR) after nCRT may avoid surgery, thereby enhancing their quality of life^[40]4,[41]5. Despite the numerous benefits of nCRT, only about 15–27% of patients achieve pCR^[42]6. Such variability in response underscores the heterogeneity of tumor biology and the complexity of individual gene expression profiles. Current preclinical evaluation methods lack precision in predicting patient responses to nCRT, resulting in some patients enduring unnecessary side effects without deriving therapeutic benefits.
In the context of precision medicine, identifying new biomarkers to predict response to nCRT is imperative. Such biomarkers would enable more precise treatment decisions, reducing both under-treatment and over-treatment and ultimately optimizing therapeutic outcomes for patients with LARC^[43]7,[44]8. Gene expression analysis has demonstrated significant potential in predicting treatment responses^[45]9–[46]12. Specific gene expression profiles offer valuable insights into patient treatment responses and prognosis^[47]13,[48]14, while also facilitating the identification of risk factors that can inform the development of novel therapeutic targets^[49]15. Although some progress has been made in predicting the response of rectal cancer patients to nCRT^[50]16,[51]17, there remain several limitations and challenges. Firstly, batch effects in gene expression data and the issue of high-dimensional feature selection continue to be critical factors influencing the performance of predictive models. Batch effects can lead to data discrepancies between experiments, thereby affecting the reliability of analysis results and the generalizability of the models. Additionally, feature selection in high-dimensional data involves identifying the most predictive features from a large number of variables, which is crucial for improving prediction accuracy and reducing model overfitting. Currently, three main types of feature selection methods are widely used, each based on different principles, leading to variations in the features selected. Furthermore, the significant imbalance in the proportion of patients responding to nCRT presents a challenge for feature selection, complicating the prediction of pCR in LARC patients^[52]18. Among patients receiving nCRT, only a small fraction achieves pCR. This imbalanced sample distribution complicates the development of accurate predictive models. Although some studies have attempted to address the issue of imbalanced sample distribution through resampling techniques or synthetic sample generation^[53]19,[54]20, these methods may alter the original characteristics of the dataset or introduce biases. The generated samples may not fully represent the real clinical situations of patients, potentially leading to unstable model performance on new datasets. Zheng et al. demonstrated that relative rank methods, such as ROGER, provide a robust alternative to traditional normalization for compositional NGS data analysis^[55]21. Among these, gene pair-based within-sample relative expression orderings (REOs) offer an innovative strategy to mitigate batch effects between samples. REOs are an analysis method based on qualitative transcript features. Since REOs rely solely on the relative order of gene expression, they exhibit strong robustness against batch effects, effectively eliminating systematic biases arising from different experimental conditions^[56]22–[57]25. Additionally, ensemble learning methods, which combine multiple base learners, provide an effective solution to the issue of imbalanced sample distribution, particularly when the minority class consists of the positive samples that researchers focus on more heavily^[58]26,[59]27. Traditional machine learning models are often sensitive to the majority class, which can result in poor prediction performance for minority class samples. However, ensemble learning enhances the ability to identify minority class samples through the combined decisions of multiple models, thereby improving overall model performance. 
Methods such as Bagging and Boosting can reduce the influence of majority class samples by adjusting sample weights or employing resampling strategies, ultimately improving the balance of classification results. This study employs REOs to identify stable reversal gene pairs and adjusts the feature selection process with the goal of discovering new predictive biomarkers. By applying ensemble learning, the study aims to improve the prediction probability for pCR patients, potentially providing therapeutic guidance for LARC patients prior to surgery. Regarding feature selection, the study utilizes two rounds of feature screening: an initial screening followed by incremental feature selection. This approach is designed to reduce bias introduced by any single feature selection method, thereby increasing the likelihood of identifying relevant features and minimizing the risk of overfitting. Additionally, BoostForest is introduced as the predictive model^[60]28, aimed at mitigating the impact of sample distribution imbalance on prediction accuracy, ultimately improving the model's ability to predict nCRT response in LARC patients. Methods and materials Data pre-processing The workflow of this study is presented in Fig. [61]1. The training cohort and validation cohort used in this study were both sourced from the Gene Expression Omnibus (GEO) database. The dataset [62]GSE87211^[63]29 was profiled on the Agilent-026652 Whole Human Genome Microarray 4 × 44 K v2 ([64]GPL13497) platform and includes 363 samples. For this study, 106 LARC samples from the [65]GSE87211 cohort were selected for analysis. [66]GSE40492^[67]30 served as the validation cohort, consisting of 118 samples used for experimental validation. All samples from both [68]GSE87211 and [69]GSE40492 cohorts were included in the prognostic analysis. All samples were collected from cancer patients. The data in this study have not undergone standardization across samples. It has been verified that all expression values have been log2-transformed. For genes represented by multiple probes, the arithmetic mean of the probe values is used, with values expressed on the log2 scale. The inclusion and exclusion criteria for sample selection were as follows: (1) patients who did not receive neoadjuvant chemoradiotherapy (nCRT) were excluded; (2) patients with a T stage of 4 were excluded^[70]31; and (3) patients with incomplete disease-free survival (DFS) data were excluded from the survival analysis, because DFS is a critical indicator for evaluating the efficacy of neoadjuvant chemoradiotherapy and for assessing patient prognosis, and missing DFS data can compromise the reliability of prognostic information. The sample distributions for [71]GSE87211 and [72]GSE40492 are shown in Fig. [73]2a. Fig. 1. Overview of the research workflow diagram. Fig. 2. (a) Distribution of pCR and non-pCR sample proportions in [78]GSE87211 and [79]GSE40492. (b) Volcano plot of differential expression analysis. (c) Venn diagram showing the overlapping genes identified by the four feature selection methods. (d) Upset plot showing the overlapping genes identified by the four feature selection methods. (e) Visualization of the feature selection process in the secondary feature selection.
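Where a concrete form of this pre-processing is helpful, the sketch below shows one way to collapse multiple probes per gene by their arithmetic mean on the log2 scale. It is a minimal illustration assuming the expression matrix is held in a pandas DataFrame (probes × samples) with a probe-to-gene mapping; the function name and the log2 heuristic are ours rather than part of the original pipeline.

```python
import numpy as np
import pandas as pd

def collapse_probes_to_genes(expr_probe: pd.DataFrame, probe2gene: pd.Series) -> pd.DataFrame:
    """Average multiple probes mapping to the same gene symbol.

    expr_probe : probes x samples matrix of expression values (log2 scale expected).
    probe2gene : maps probe IDs (index of expr_probe) to gene symbols.
    """
    expr = expr_probe.copy()
    # Heuristic check that values look log2-transformed; microarray log2 values
    # rarely exceed ~20, whereas raw intensities are orders of magnitude larger.
    if expr.to_numpy().max() > 30:
        expr = np.log2(expr + 1)
    expr["gene"] = probe2gene.reindex(expr.index)
    expr = expr.dropna(subset=["gene"])
    # Arithmetic mean of probe values on the log2 scale, per gene symbol.
    return expr.groupby("gene").mean()
```

Because the downstream REO analysis uses only within-sample expression rankings, no cross-sample normalization is applied at this stage.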
Differential gene identification In the [80]GSE87211 dataset, the study participants were classified into the pCR and non-pCR groups based on the tumor regression grade (TRG) from the American Joint Committee on Cancer (AJCC). Patients with TRG0 were assigned to the pCR group, while those with TRG1, TRG2, and TRG3 were categorized as non-pCR. The [81]GSE40492 dataset used the Dworak criteria to assess the pathological response status of patients. Although the two evaluation standards slightly differ, the differences between them are minimal^[82]32. In this study, the limma algorithm was applied for differential expression analysis to identify differentially expressed genes (DEGs) between the pCR and non-pCR groups, with a false discovery rate (FDR) of less than 5% and |logFC| > 0.5. The detailed process of the limma algorithm is as follows: A design matrix was constructed based on the grouping information from the [83]GSE87211 dataset to represent the pCR and non-pCR groups. To perform the inter-group differential comparison, the makeContrasts function was used to create a contrast matrix, defining pCR - non-pCR as the comparison group. Next, we followed the standard workflow of the limma algorithm, initially fitting the data using the lmFit function, and then applying the contrast matrix to the fitted results with contrasts.fit. Finally, empirical Bayes adjustment was applied using eBayes to assess the significance of gene expression. It is important to note that the eBayes function does not enable trend adjustment by default. We assessed the consistency of differentially expressed genes (DEGs) between the [84]GSE87211 and [85]GSE40492 datasets using the limma algorithm. Consistency was defined as the proportion of overlapping DEGs with the same direction of differential expression. Under completely random conditions, the probability that a gene exhibits a consistent differential expression direction in two datasets is 0.5. The number of differentially expressed genes common to both datasets is denoted as n, and the number of genes with a consistent differential expression direction is denoted as s; consistency is therefore defined as s divided by n. The statistical significance of the consistency of differentially expressed genes is then evaluated using the binom.test function. Construction of gene pairs In this study, we utilized REOs to construct gene pairs across two groups of samples and identify stable reversal gene pairs, with the excellent performance of REOs validated in the study by Xue et al.^[86]7. First, we constructed gene pairs separately within each sample group. For each gene pair, we tested whether the expression of one gene is significantly higher than the other, under the hypothesis that the proportion of samples in which one gene's expression exceeds the other's is greater than 50%. Each comparison is treated as an independent trial with two possible outcomes—either gene A's expression is higher than gene B's, or vice versa. The binomial test yielded a p-value for each gene pair; gene pairs with p-values less than 0.05, and in which at least one gene was identified as differentially expressed, were retained, and stable reversal gene pairs were then identified from this subset. We applied the Benjamini-Hochberg (BH) method for multiple comparisons correction of p-values (Supplementary Data [87]1). For any two genes a and b within the expression matrix, their expression values are represented by G[a] and G[b], respectively, and can exhibit one of two relationships: G[a] > G[b] or G[b] > G[a]. Stable reversal gene pairs are defined as gene pairs whose expression relationship exhibits opposite patterns in the pCR and non-pCR groups. A gene pair is classified as a stable reversal gene pair when the proportion of pCR samples showing one expression relationship (e.g., G[a] > G[b]) exceeds 60% and the proportion of non-pCR samples showing the reverse relationship (G[b] > G[a]) exceeds 60%. As shown in the Stable Reversal Gene Pairs section of Fig. [88]1, there are four possible combinations of expression relationships in the pCR and non-pCR groups (pCR group, non-pCR group): ① Ga > Gb, Ga > Gb; ② Ga > Gb, Ga < Gb; ③ Ga < Gb, Ga > Gb; ④ Ga < Gb, Ga < Gb. Gene pairs meeting the criteria specified in the red and blue regions (i.e., ② and ③) are identified as stable reversal gene pairs. The algorithm is provided in Supplementary Algorithm [89]1.
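To make the screening procedure concrete, the sketch below outlines one possible implementation of the REO-based screen described above: for each candidate pair it counts how often Ga > Gb within a group, applies the binomial test against 0.5 with BH correction, and checks the 60% reversal criterion between the pCR and non-pCR groups. It is a simplified illustration; the function names, the pairing of candidate genes, and the exact filtering order are our assumptions rather than the authors' Supplementary Algorithm 1.

```python
from itertools import combinations
import pandas as pd
from scipy.stats import binomtest
from statsmodels.stats.multitest import multipletests

def stable_reversal_pairs(expr_pcr: pd.DataFrame, expr_non: pd.DataFrame,
                          candidate_genes, frac_threshold=0.6, alpha=0.05) -> pd.DataFrame:
    """Screen gene pairs for stable reversal between the pCR and non-pCR groups.

    expr_pcr / expr_non : genes x samples log2 expression matrices, one per group.
    A pair is kept when the dominant ordering is significant (binomial test
    against 0.5) in both groups and the majority ordering is opposite between them.
    """
    rows = []
    n_pcr, n_non = expr_pcr.shape[1], expr_non.shape[1]
    for a, b in combinations(candidate_genes, 2):
        k_pcr = int((expr_pcr.loc[a] > expr_pcr.loc[b]).sum())   # pCR samples with Ga > Gb
        k_non = int((expr_non.loc[a] > expr_non.loc[b]).sum())   # non-pCR samples with Ga > Gb
        # One-sided binomial test on the dominant direction within each group.
        p_pcr = binomtest(max(k_pcr, n_pcr - k_pcr), n_pcr, 0.5, alternative="greater").pvalue
        p_non = binomtest(max(k_non, n_non - k_non), n_non, 0.5, alternative="greater").pvalue
        f_pcr, f_non = k_pcr / n_pcr, k_non / n_non
        # Reversal: Ga > Gb dominates in one group while Gb > Ga dominates in the other.
        reversed_pair = (f_pcr > frac_threshold and 1 - f_non > frac_threshold) or \
                        (1 - f_pcr > frac_threshold and f_non > frac_threshold)
        rows.append({"gene_a": a, "gene_b": b, "frac_pcr": f_pcr, "frac_non": f_non,
                     "p_pcr": p_pcr, "p_non": p_non, "reversed": reversed_pair})
    res = pd.DataFrame(rows)
    # Benjamini-Hochberg correction across all tested pairs, within each group.
    res["fdr_pcr"] = multipletests(res["p_pcr"], method="fdr_bh")[1]
    res["fdr_non"] = multipletests(res["p_non"], method="fdr_bh")[1]
    return res[(res["p_pcr"] < alpha) & (res["p_non"] < alpha) & res["reversed"]]
```

In the data used here, this type of screen yielded 5,487 stable reversal gene pairs (see Results).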
Selection of gene pairs To select gene pairs capable of predicting the response of LARC patients to nCRT from the numerous stable reversal gene pairs, this study employed a two-step feature selection method. Preliminary screening was conducted using four distinct methods, namely MultiDimensional Feature Selection (MDFS)^[90]33, Boruta^[91]34, Monte Carlo Feature Selection (MCFS)^[92]35, and VSOLassoBag^[93]36, each offering unique advantages. To ensure comprehensive feature capture, the union of the selected features from all four methods was used as a new candidate gene set. MDFS is a filter method that accounts for the relationships between multiple features and decision variables. By calculating multidimensional information gain, MDFS can identify informative variables that are related to the target variable while simultaneously capturing interactions between variables, making it particularly effective in identifying gene pairs with potential interactions^[94]33. Based on the recommendations of the method developers, this study employs the more comprehensive MDFS-2D approach. Boruta is a feature selection algorithm based on random forests. Its fundamental principle involves evaluating the importance of each feature by comparing it with random shadow features^[95]34. In the context of nCRT response prediction, the high dimensionality of features necessitates the identification of key gene pairs from a large pool of redundant candidates. By iteratively calculating feature importance scores and performing global comparisons with shadow variables, Boruta effectively prioritizes gene pairs with superior predictive value. MCFS evaluates both the importance of individual features and their interactions, thereby facilitating the identification of complex dependencies; it simultaneously assesses the association between features and the target variable as well as the interactions among features^[96]35. In the prediction of nCRT response in LARC patients, potential interactions between features may not only have a significant impact on the performance of the predictive model but also substantially influence the outcomes of feature selection, thereby affecting the identification of key features. VSOLassoBag combines the LASSO algorithm with a bagging strategy, demonstrating strong performance in the analysis of high-dimensional, low-sample-size biological data. Liang et al.
have shown that VSOLassoBag is applicable in studies of colorectal cancer and can effectively mitigate issues related to multicollinearity, thereby avoiding the selection of highly correlated features^[97]36. LASSO inherently possesses sparsity, favoring the selection of a smaller number of features and typically selecting only one feature from a group of highly correlated features during the feature selection process. By integrating multiple LASSO models, VSOLassoBag further optimizes the feature selection process. For secondary screening, the study applied an Incremental Feature Selection (IFS) algorithm to identify the optimal predictive feature combination^[98]37. Permutation importance was used to rank the importance of gene pairs in the new candidate gene set^[99]38, and Extreme Gradient Boosting (XGBoost) was selected as the classifier^[100]39. In the feature selection process, the IFS algorithm starts with the highest-ranked feature and progressively adds features, in order of importance, to the current feature subset. The performance of each feature subset is then evaluated using the XGBoost classifier, continuing until all features have been evaluated, at which point the algorithm terminates. Through cross-validation (GridSearchCV) or post-training performance evaluation, it is determined whether the addition of a new feature to the current subset contributes to improving the model's performance. Permutation importance, a model-agnostic method, evaluates the importance of features by randomly shuffling the values of a given feature while holding others constant and assessing the resulting decline in model performance. A significant performance drop upon shuffling indicates high importance for the feature, whereas minimal impact suggests low importance. In secondary screening, two evaluation metrics were employed: the performance of the XGBoost model after each training iteration and an evaluation metric for selecting optimal hyperparameters through grid search. Considering data imbalance, this study used a combination of log-loss and AUC as evaluation metrics for the XGBoost model and selected the F1 score as the grid search metric. To enhance model performance on the minority class, the ratio of non-pCR to pCR samples was used to adjust the weight in the loss function. In this study, the class imbalance ratio was calculated as the number of non-pCR samples (class 0) divided by the number of pCR samples (class 1). This ratio was directly assigned to the scale_pos_weight parameter in XGBoost to balance the contributions of both classes during training. Specifically, adjusting scale_pos_weight helps prevent the model from being biased toward the majority class (non-pCR), thereby improving its ability to identify pCR cases.
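As an illustration of this secondary screening step, the sketch below ranks candidate gene pairs by permutation importance and then performs incremental feature selection with a class-weighted XGBoost classifier. It is a minimal sketch, not the exact pipeline: the XGBoost settings mirror those reported in the Results (learning_rate = 0.1, max_depth = 4, n_estimators = 150), while the cross-validation scheme, the number of permutation repeats, the assumption that X is a NumPy array, and the function names are ours.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def incremental_feature_selection(X, y, feature_names):
    """Rank features by permutation importance, then add them one at a time
    and keep the subset size that maximizes cross-validated F1.

    X : samples x candidate gene pairs (binary REO indicators); y : 1 = pCR, 0 = non-pCR.
    """
    # Weight the minority class: number of non-pCR (0) over number of pCR (1).
    spw = float(np.sum(y == 0)) / np.sum(y == 1)
    base = XGBClassifier(learning_rate=0.1, max_depth=4, n_estimators=150,
                         scale_pos_weight=spw, eval_metric="logloss")
    base.fit(X, y)
    # Permutation importance: shuffle one feature at a time and measure the
    # drop in F1; a large drop means the feature matters.
    perm = permutation_importance(base, X, y, scoring="f1", n_repeats=10, random_state=0)
    order = np.argsort(perm.importances_mean)[::-1]

    best_f1, best_k = -np.inf, 0
    for k in range(1, len(order) + 1):
        cols = order[:k]
        clf = XGBClassifier(learning_rate=0.1, max_depth=4, n_estimators=150,
                            scale_pos_weight=spw, eval_metric="logloss")
        f1 = cross_val_score(clf, X[:, cols], y, cv=5, scoring="f1").mean()
        if f1 > best_f1:
            best_f1, best_k = f1, k
    return [feature_names[i] for i in order[:best_k]], best_f1
```

In the actual study, the subset size at which the F1 score peaked (32 gene pairs) defined the final signature.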
Establishment of the predictive model In this study, BoostForest was employed as the final predictive model. BoostForest is an ensemble learning method that combines the principles of boosting and bagging. First, multiple distinct subsets are generated from the original training set using the bootstrap method, and a BoostTree model is trained on each subset. The final prediction probability is obtained by calculating the arithmetic mean of the predicted probabilities from all BoostTree models. BoostTree utilizes gradient boosting to decompose the original problem into multiple sub-regression tasks and applies logistic regression to solve each sub-regression task at each node. A parameter pool sampling strategy is employed to facilitate the hyperparameter tuning process, where all possible values for each parameter are stored in a parameter pool^[101]28. BoostTree then randomly selects a parameter combination from this pool. Given an input, BoostTree assigns it to a leaf node and calculates the final prediction by summing the outputs of the models at all nodes along the path from the root to the leaf node. The accumulated output is then transformed into a probability using the Sigmoid function. BoostTree effectively combines the structured representation capability of decision trees with the strong learning ability of gradient boosting. Zhao et al. validated the predictive performance of BoostForest across multiple datasets by comparing it with models such as RandomForest, XGBoost, and LightGBM^[102]28; the results indicated that BoostForest outperformed the other models across multiple datasets and demonstrated superior generalization ability. In our study, we employed grid search for hyperparameter optimization. The hyperparameters adjusted for the BoostForest model include: max_leafs, with possible values of None, 1, 3, 5, or 7; min_sample_leaf_list, with possible values of None, 1, 3, 5, 7, or 9; reg_alpha_list, with values of 0.01, 0.05, 0.1, 0.2, or 0.3; and n_estimators, with possible values of 50, 100, 200, or 300. The algorithm for this section is provided in Supplementary Algorithm [103]2. Pathway enrichment analysis Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed using KOBAS 3.0 (KEGG Orthology Based Annotation System)^[104]40. The results of the GO and KEGG enrichment analyses were visualized using R 4.4.0 software. A threshold of P.adj < 0.05 was applied to reduce the false discovery rate. Interpretability analysis To enhance the interpretability of the model, Shapley additive explanations (SHAP)^[105]41, a Python-based tool, was used for analysis on both the test set and the validation cohort. The primary function of SHAP analysis is to help understand the contribution of each feature to the model's prediction, thereby providing interpretability and aiding in the understanding of how the model makes decisions. The SHAP method quantifies the impact of each feature on the model's prediction. SHAP values demonstrate the direction and magnitude of each feature's effect on the prediction, assisting in identifying which features play a critical role in the model's decision-making process. This analysis helps reveal the decision-making logic of the model across different datasets, thereby improving its transparency and interpretability.
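Since BoostForest is not one of the tree ensembles that SHAP's TreeExplainer recognizes natively, one way to obtain the attributions described above is a model-agnostic explainer over the predicted pCR probability, as sketched below. The background summarization, plot choices, and function names are illustrative assumptions rather than the authors' exact procedure.

```python
import shap

def explain_predictions(model, X_train, X_test, feature_names):
    """Model-agnostic SHAP attributions for the positive (pCR) class.

    model must expose predict_proba; X_train provides the background sample.
    """
    # Summarize the background with k-means to keep KernelExplainer tractable.
    background = shap.kmeans(X_train, 10)
    explainer = shap.KernelExplainer(lambda x: model.predict_proba(x)[:, 1], background)
    shap_values = explainer.shap_values(X_test)
    # Bar plot of mean |SHAP| per feature and beeswarm of per-sample values.
    shap.summary_plot(shap_values, X_test, feature_names=feature_names, plot_type="bar")
    shap.summary_plot(shap_values, X_test, feature_names=feature_names)
    return shap_values
```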
Statistical analysis The statistical analysis in this study was performed using R 4.4.0 and Python 3.9.6. In R, the limma package was used to identify differentially expressed genes, while the MDFS, Boruta, rmcfs, and VSOLassoBag packages were utilized for preliminary screening. The survival package was employed for survival analysis. In this study, gene pairs obtained using the REOs algorithm were evaluated using the binomial test (p < 0.05), and the p-values were subsequently corrected for multiple comparisons using the Benjamini-Hochberg (BH) method. GO and KEGG enrichment analyses were performed using the online tool KOBAS 3.0, with terms having a p-value < 0.05 considered statistically significant. The predictive ability of features was assessed using the precision-recall (PR) curve and the area under the precision-recall curve (AUPRC). Survival analysis of the dataset was conducted using the Kaplan-Meier method, with statistical comparisons made using the log-rank test. In Python, modeling and evaluation were conducted using the scikit-learn and xgboost packages, and the shap package was used for model interpretability analysis. Statistical significance was defined as a p-value less than 0.05. We used TOPSIS analysis to evaluate multiple metrics and determine the relative performance of each model^[106]42. Results Differential analysis of pCR and non-pCR In this study, differential expression analysis was conducted for 21 pCR samples and 85 non-pCR samples selected from the training cohort [107]GSE87211. A total of 201 differentially expressed genes (DEGs) were identified. The identified DEGs were visualized using a volcano plot (Fig. [108]2b), with 86 genes upregulated and 115 genes downregulated. The consistency analysis of DEGs between the [109]GSE87211 and [110]GSE40492 datasets revealed a consistency rate of 0.966 (95% confidence interval: 0.932–0.986). A two-sided exact binomial test yielded a p-value of 1.52 × 10^–50, indicating statistically significant consistency. Feature selection A total of 5,487 stable reversal gene pairs were initially selected using the REOs algorithm. Subsequently, a two-step feature selection method was applied to identify the final predictive gene pairs. As shown in Fig. [111]2c-d, during the preliminary screening of the two-step feature selection, MDFS-2D selected 26 gene pairs, Boruta selected 66 gene pairs, rmcfs selected 66 gene pairs, and VSOLassoBag selected 81 gene pairs. A total of 15 gene pairs were common across all four methods, and the union of gene pairs selected by the four methods contained 127 gene pairs. During the secondary screening, the parameters of the XGBoost model were set as learning_rate = 0.1, max_depth = 4, and n_estimators = 150, which yielded the best performance of the classification model. The ratio of non-pCR to pCR sample numbers was calculated and assigned to the scale_pos_weight parameter. As shown in Fig. [112]2e, when the number of gene pairs reached 32, the F1 score peaked at 0.9818. The details of this 32-gene pair signature, hereafter referred to as the 32-GPS, are provided in Table [113]1.

Table 1. The specific description of the 32-gene pair signature.

ID  Gene pair (Ga > Gb)     ID  Gene pair (Ga > Gb)     ID  Gene pair (Ga > Gb)
1   CTNNBIP1 > ZNF544       2   MFSD2A > SLC39A7        3   VAV2 > PER2
4   ANKRD20A1 > ETNK2       5   ITGB8 > C2orf49         6   RNF213 > CFH
7   CENPM > CACNA1H         8   STAC3 > PBX3            9   CDKN2A > ATG2A
10  RPP38 > ABCF3           11  CHRNA5 > CFH            12  RPP38 > PEX16
13  CENPM > PRDM2           14  SPTBN1 > ZNF175         15  ITGB8 > ZNF658
16  FUT8 > ITFG1            17  CAMTA1 > UAP1L1         18  CENPM > BEND7
19  GDAP1 > PDGFB           20  CENPM > C2orf49         21  CEP76 > SMPD1
22  KATNAL2 > RNF185        23  KATNAL2 > ZNF415        24  ITGB8 > KAT5
25  FUT8 > ARRDC1           26  CYP2C9 > RSU1           27  TNS1 > PIGL
28  CBR3 > MUC17            29  ZNF180 > CFH            30  DONSON > CHID1
31  VAV2 > DNMT3B           32  ITGB8 > RABGAP1

Ga > Gb denotes the within-sample expression ordering associated with pathological complete response to neoadjuvant chemoradiation in locally advanced rectal cancer.

Predictive biomarkers for pathological complete response to nCRT The [115]GSE87211 dataset, consisting of 106 samples, was selected as the training cohort for this study. Internal validation was performed using the holdout method, with the training and test sets randomly split in a 7:3 ratio.
The training set contained 74 samples, while the test set included 32 samples, with both sets maintaining the same proportion of positive and negative samples as the original dataset. The optimal parameter settings for the BoostForest classifier were max_leafs = None, min_sample_leaf_list = 5, reg_alpha_list = 0.1, and n_estimators = 200, with bootstrap sampling applied with replacement, through which 200 subsets were used to train 200 BoostTrees. In the case of imbalanced data, where the majority of samples are negative, the model may still exhibit a high true positive rate (TPR) and low false positive rate (FPR), resulting in a high AUC value. However, this may not accurately reflect the model’s performance in identifying positive cases. The PR curve, which focuses more on the performance for the positive class, was used in this study instead of the ROC curve to evaluate the model’s performance, providing a more accurate assessment of its ability to recognize the positive class in imbalanced datasets. As shown in Fig. [116]3a, in the test set, the AUPRC of BoostForest was 0.983 (0.951, 1.000), with an accuracy of 0.988 (0.963, 1.000), recall of 0.95 (0.852, 1.000), specificity of 0.983 (0.951, 1.000), and an F1 score of 0.867 (0.667, 1.000). The F1 score remains lower than accuracy, which may be attributed to data imbalance. Even if the model performs well on the majority class, accuracy can still be high, whereas the F1 score is influenced by the trade-off between precision and recall. As shown in Fig. [117]3b, in the [118]GSE40492 dataset, BoostForest achieved an AUPRC of 0.785, accuracy of 0.898, recall of 0.727, specificity of 0.938, and an F1 score of 0.727. Fig. 3. Precision-recall curves of four predictive models in the test and validation cohort. (a) PR curves of four predictive models in the test set, with baseline representing the random forest model. (b) PR curves of four predictive models in the validation cohort. (c) AUPRC Performance of Four Prediction Models Using lasso gene pair signature (lasso-GPS) as Predictive Features in the [121]GSE40492 Dataset. (d) AUPRC Performance of Four Prediction Models Using 15-shared gene pair signature (15-SGPS) as Predictive Features in the [122]GSE40492 Dataset. To validate the importance of BoostForest in this study, random forest (RF) was added as the baseline, and Support Vector Machine (SVM) and XGBoost were included as comparison models. All four prediction models were evaluated under the same experimental conditions. As shown in Fig. [123]3a, the AUPRC of all four models was 0.983 in the test set, with no significant difference, which may be attributed to the relatively small sample size of the test set, making it difficult to distinguish model performance. In the [124]GSE40492 dataset, BoostForest achieved the highest AUPRC of 0.785. We observed that the AUPRC of BoostForest in the [125]GSE40492 dataset is lower than its performance in the test set. This discrepancy may be attributed to the relatively small sample size of the test set. Additionally, although the test and training sets were partitioned using the holdout method, both sets originate from the [126]GSE87211 dataset, which may have led to overfitting in the test set, thus affecting the model’s performance in the [127]GSE40492 dataset.
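For readers who want to reproduce this evaluation protocol, the sketch below shows a stratified 7:3 holdout split followed by PR-curve evaluation with scikit-learn. It is a schematic illustration: any classifier exposing predict_proba (here a stand-in for BoostForest) can be passed in, and average_precision_score is used as the AUPRC estimate, which may differ slightly from how the curve area was computed in the original analysis.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, f1_score, precision_recall_curve
from sklearn.model_selection import train_test_split

def evaluate_holdout(clf, X, y):
    """7:3 stratified holdout split and PR-curve evaluation on the test set."""
    # Stratification keeps the pCR / non-pCR proportions of the full cohort.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=42)
    clf.fit(X_tr, y_tr)
    prob = clf.predict_proba(X_te)[:, 1]
    precision, recall, _ = precision_recall_curve(y_te, prob)
    auprc = average_precision_score(y_te, prob)      # area under the PR curve
    f1 = f1_score(y_te, (prob >= 0.5).astype(int))   # 0.5 probability threshold
    plt.plot(recall, precision, label=f"AUPRC = {auprc:.3f}")
    plt.xlabel("Recall"); plt.ylabel("Precision"); plt.legend()
    return auprc, f1
```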
To ensure the reliability of the results, all algorithms were executed five times on each dataset, analyzed using TOPSIS, and statistically compared using the Mann-Whitney U test between BoostForest and the other models. The results, presented in Table [128]2, are reported as mean values. The best-performing results are highlighted in bold, and * indicates the statistically significant superiority of BoostForest. In the [129]GSE87211 dataset, BoostForest achieved an average TOPSIS ranking of 2.2, second only to SVM. In the [130]GSE40492 dataset, BoostForest ranked first in the TOPSIS analysis, with its F1 score and AUPRC significantly higher than those of the other models (p-value < 0.05). We found that although BoostForest had a lower recall than XGBoost in the [131]GSE40492 dataset, it still achieved a recall of 0.7, which is above the average level. Except for recall, BoostForest exhibited significantly higher performance in all other metrics compared to XGBoost, and its TOPSIS ranking was also higher than that of XGBoost. We found that the specificity of BoostForest was significantly lower than that of Baseline (p-value < 0.05). However, BoostForest exhibited significantly higher recall, F1 score, and AUPRC compared to Baseline (p-value < 0.05). Moreover, the improvements in recall and F1 score were more pronounced than the reduction in specificity. These results were obtained through statistical significance testing, ensuring that the observed differences are not due to random variation.

Table 2. Performance of four prediction models on the [132]GSE87211 and [133]GSE40492 datasets.

Test set
Metric        Baseline   SVM      XGBoost   BoostForest
Accuracy      0.9688     0.9880   0.9320*   0.9654
Recall        0.8400     0.9500   0.7000*   0.9400
Specificity   1.0000     1.0000   0.9830*   0.9966
F1_Score      0.8730     0.9670   0.7000*   0.8798
AUPRC         0.9898     0.9830   0.9830    0.9830
Topsis_Rank   2.80       1.00     4.00      2.20

GSE40492
Metric        Baseline   SVM      XGBoost   BoostForest
Accuracy      0.8746     0.8658   0.8140*   0.8882
Recall        0.4456*    0.5182*  0.7730    0.7000
Specificity   0.9728     0.9460   0.8230*   0.9312
F1_Score      0.5696*    0.5900*  0.6070*   0.7020
AUPRC         0.7510*    0.7600*  0.6120*   0.7710
Topsis_Rank   3.80       3.20     2.00      1.00

The best-performing results are highlighted in bold, and * indicates the statistically significant superiority of BoostForest. The term “Topsis_Rank” refers to the ranking obtained using TOPSIS analysis.
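The TOPSIS ranking reported in Table 2 can be computed as sketched below. This is a standard TOPSIS formulation assuming equal metric weights and treating all metrics as benefit-type criteria, since the exact weighting scheme used in the study is not specified.

```python
import numpy as np

def topsis_rank(scores: np.ndarray) -> np.ndarray:
    """Rank alternatives (rows = models, columns = benefit-type metrics) by
    TOPSIS closeness to the ideal solution; rank 1 = best.
    """
    # Vector-normalize each metric column, then apply equal weights.
    norm = scores / np.sqrt((scores ** 2).sum(axis=0, keepdims=True))
    weights = np.full(scores.shape[1], 1.0 / scores.shape[1])
    v = norm * weights
    ideal, anti = v.max(axis=0), v.min(axis=0)       # best / worst on each metric
    d_pos = np.sqrt(((v - ideal) ** 2).sum(axis=1))  # distance to the ideal point
    d_neg = np.sqrt(((v - anti) ** 2).sum(axis=1))   # distance to the anti-ideal point
    closeness = d_neg / (d_pos + d_neg)
    # Higher closeness -> better; convert to 1-based ranks.
    order = np.argsort(-closeness)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(closeness) + 1)
    return ranks

# Example: rows = [Baseline, SVM, XGBoost, BoostForest],
# columns = [Accuracy, Recall, Specificity, F1, AUPRC] from Table 2.
```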
In this study, lasso regression was also employed as an alternative feature selection method, and the features selected by lasso were compared with those selected by the two-step feature selection method, while all other conditions were kept constant. To enhance the robustness of the selected features, the 1-SE rule was applied to determine the regularization parameter (λ), resulting in the selection of 30 features, hereafter referred to as the lasso gene pair signature (lasso-GPS). Table [136]3 presents the performance of BoostForest on the two feature sets: in the test set, lasso-GPS exhibited lower accuracy and recall compared to 32-GPS, but demonstrated certain advantages in terms of AUPRC and specificity. However, in the validation cohort, except for specificity, 32-GPS outperformed lasso-GPS across all other evaluation metrics, particularly in predicting the minority class (pCR), for which the recall of 32-GPS was substantially improved. These results suggest that, compared to lasso-GPS, 32-GPS demonstrates superior robustness in feature selection. In [137]GSE40492, lasso-GPS was utilized as a predictive biomarker, and the AUPRC values of the various models are presented in Fig. [138]3c. The performance of BoostForest surpassed that of the other three models.

Table 3. Comparison of prediction results between lasso-GPS and 32-GPS.

Test set
            Accuracy   AUPRC    Recall   Specificity   F1_Score
lasso-GPS   0.975      1        0.8      1             0.867
32-GPS      0.988      0.983    0.95     0.983         0.867
Δ           0.013      -0.017   0.15     -0.017        0

GSE40492
            Accuracy   AUPRC    Recall   Specificity   F1_Score
lasso-GPS   0.864      0.774    0.318    0.99          0.467
32-GPS      0.898      0.785    0.727    0.938         0.727
Δ           0.034      0.011    0.409    -0.052        0.26

32-GPS refers to the 32-gene pair signature, lasso-GPS refers to the lasso gene pair signature, and Δ represents the improvement of 32-GPS over lasso-GPS.

To further validate the importance of secondary screening, the study used the 15-shared gene pair signature (15-SGPS), which consists of the gene pairs selected by all four feature selection methods during the preliminary screening phase, as the predictive features. The performance of BoostForest is shown in Table [141]4. The results indicate that, in the test set and the [142]GSE40492 dataset, except for an improvement in specificity, all other metrics showed a decline compared to the 32-GPS features, suggesting that using fewer features in the 15-SGPS has certain limitations in predictive performance. Moreover, the AUPRC performance of each prediction model using the 15-SGPS features in the [143]GSE40492 dataset is shown in Fig. [144]3d. All models were run five times on the [145]GSE40492 dataset, and the Mann-Whitney U test was used to statistically compare the performance of BoostForest with the other models. The results, presented in Supplementary Table [146]S1 (Supplementary File 1), show the mean AUPRC. In the [147]GSE40492 dataset, BoostForest achieved a mean AUPRC of 0.6404, which was significantly higher than that of Baseline, SVM, and XGBoost (p-value < 0.05). BoostForest still demonstrated the best performance, exhibiting strong robustness and superior predictive ability. This further confirms the key role of the 32-GPS in enhancing the overall performance of the model: compared to the 15-SGPS, the 32-GPS features provide more comprehensive information, significantly improving the model’s predictive performance.

Table 4. Comparison of prediction results between the 15-shared gene pair signature from the preliminary feature selection phase and the 32-gene pair signature.

Test set
            Accuracy   AUPRC    Recall   Specificity   F1_Score
15-SGPS     0.959      0.933    0.95     1             0.633
32-GPS      0.988      0.983    0.95     0.983         0.867
Δ           0.029      0.05     0        -0.017        0.234

GSE40492
            Accuracy   AUPRC    Recall   Specificity   F1_Score
15-SGPS     0.839      0.639    0.364    0.948         0.457
32-GPS      0.898      0.785    0.727    0.938         0.727
Δ           0.059      0.146    0.363    -0.01         0.27

32-GPS refers to the 32-gene pair signature, 15-SGPS refers to the 15-shared gene pair signature, and Δ represents the improvement of 32-GPS over 15-SGPS.

We conducted a comparative analysis of two class imbalance handling methods: SMOTE (Synthetic Minority Over-sampling Technique) and loss function weight adjustment. The results, presented in Table [150]5, demonstrate the performance of BoostForest under these two strategies. Specifically, in the [151]GSE87211 dataset, the AUPRC of the loss function weight adjustment method (32_GPS) reached 0.9830, exceeding that of SMOTE (9_GPS) at 0.9666. Similarly, in the [152]GSE40492 dataset, the AUPRC for the loss function weight adjustment method was 0.7710, significantly higher than that of SMOTE (AUPRC = 0.5632). Moreover, across multiple evaluation metrics, including accuracy, recall, specificity, and F1 score, the loss function weight adjustment method consistently outperformed SMOTE.
Notably, statistical analysis confirmed that these differences were statistically significant (p < 0.05, denoted by “*” in Table [153]5). These findings indicate that, in this study, the loss function weight adjustment method is more effective than SMOTE for addressing class imbalance.

Table 5. Mean performance of BoostForest under SMOTE and loss function weight adjustment for imbalance handling.

Test set
         Accuracy   Recall    Specificity   F1_Score   AUPRC
9_GPS    0.9158*    0.6900*   0.9766*       0.7668     0.9666
32_GPS   0.9654     0.9400    0.9966        0.8798     0.9830

GSE40492
         Accuracy   Recall    Specificity   F1_Score   AUPRC
9_GPS    0.7458*    0.7362    0.7480*       0.5192*    0.5632*
32_GPS   0.8882     0.7000    0.9312        0.7020     0.7710

9_GPS represents the features selected using the SMOTE technique, while 32_GPS represents the features selected through the loss function weight adjustment method. The best-performing results are highlighted in bold. * Indicates the statistically significant superiority of 32_GPS.

Additionally, during internal validation, we evaluated the model’s performance using both the bootstrap method and 10 × 10-fold stratified cross-validation. All algorithms were executed five times on each dataset, and statistical comparisons between BoostForest and the other models were conducted using the Mann-Whitney U test. The results, presented in Supplementary Table [156]S2 (Supplementary File 1), report the mean AUPRC. Under the bootstrap method, in the [157]GSE87211 dataset, BoostForest achieved an AUPRC of 0.972, with minimal differences compared to the other models and no statistically significant differences. In the [158]GSE40492 dataset, BoostForest obtained a mean AUPRC of 0.7536, which was significantly higher than those of Baseline, SVM, and XGBoost (p-value < 0.05). Under 10 × 10-fold stratified cross-validation, in the [159]GSE87211 dataset, BoostForest achieved an AUPRC of 0.9868, which was significantly lower than those of Baseline and SVM. In the [160]GSE40492 dataset, BoostForest achieved an AUPRC of 0.7692, which was significantly higher than that of XGBoost and also higher than SVM and Baseline, although the differences with SVM and Baseline were not statistically significant. Permutation importance is, however, sensitive to noisy data: if the dataset contains noise or outliers, permuting the values of these features may lead to abnormal fluctuations in model performance. To further evaluate the impact of different feature importance ranking methods, we conducted additional experiments comparing ANOVA, Random Forest (Mean Decrease Gini, MDG), and permutation-based importance. As shown in Supplementary Table [161]S3 (Supplementary File 1), the permutation-based method demonstrated superior performance in terms of accuracy, F1 score, and AUPRC. To further assess the impact of XGBoost in secondary screening, we conducted additional experiments comparing it with Random Forest and SVM. As presented in Supplementary Table [162]S4 (Supplementary File 1), feature selection based on XGBoost consistently achieved the best classification performance, attaining the highest accuracy, recall, and AUPRC in both the test set and the external validation dataset ([163]GSE40492). In contrast, Random Forest and SVM exhibited lower recall, particularly in the external validation dataset. Guo et al. employed forward feature selection and a voting method^[164]9, ultimately identifying 27_GPS. Xue et al.
applied forward feature selection and a voting method combined with Random Forest’s Mean Decrease Gini (MDG) for Variable Importance (VIM) ranking, resulting in 41_GPS^[165]7. We utilized 27_GPS and 41_GPS as predictive features and compared them with the 32_GPS identified in this study. As presented in Supplementary Table [166]S5 (Supplementary File 1), 32_GPS outperformed both 27_GPS and 41_GPS across all evaluation metrics. Furthermore, we applied forward feature selection combined with a voting method for feature selection and prediction, identifying a total of 10 gene pairs, and evaluated their performance. The results, detailed in Supplementary Table [167]S6 (Supplementary File 1), demonstrate that the proposed method in this study outperforms the forward feature selection and voting strategy. Pathway enrichment analysis of predictive biomarkers Among the 32-GPS, a total of 51 genes were identified. The GO enrichment analysis revealed that 72.54% of these genes were enriched in protein binding, 19.61% in extracellular exosomes, and 3.92% in pathways such as the positive regulation of protein dephosphorylation, as shown in Fig. [168]4a-b (P.adj < 0.05). Among the genes enriched in these pathways, the number of downregulated genes was greater than that of upregulated genes. In the KEGG enrichment analysis, as shown in Fig. [169]4c (P.adj < 0.05), the genes in the 32-GPS were primarily enriched in pathways such as Transcriptional Misregulation in Cancer, Metabolic Pathways, and Regulation of Actin Cytoskeleton. In addition, we provide detailed information on the GO and KEGG enrichment pathways in Supplementary Data [170]2 and Supplementary Data [171]3. Fig. 4. Pathway enrichment analysis of genes included in the 32-GPS. (a) Gene enrichment distribution from the GO enrichment analysis. (b) Specific distribution of gene enrichment in the GO analysis. (c) Gene enrichment distribution from the KEGG enrichment analysis. Validation of predictive gene pairs using survival analysis In this study, survival analysis of DFS was conducted on the patients from the [174]GSE87211 and [175]GSE40492 datasets. As shown in Fig. [176]5a, in [177]GSE87211, the pCR group demonstrated better survival probability than the non-pCR group, with a statistically significant difference (p-value = 0.015). Over the 100-month follow-up period, the survival probability of the pCR group remained at a high level, while the survival rate of the non-pCR group declined markedly over time. In [178]GSE40492 (Fig. [179]5b), the survival rate of the pCR group was also higher than that of the non-pCR group, although the difference was not statistically significant (p-value = 0.11). During the tuning process, we observed a trend in which the p-value of the survival curve tended to decrease as min_sample_leaf_list gradually decreased. The survival analysis results corresponding to min_sample_leaf_list values of 4, 3, and 2 are presented in Fig. [180]5c, d, and [181]5e, respectively. When min_sample_leaf_list was set to 1 (Fig. [182]5f), the p-value was 0.021. After performing multiple testing correction using Benjamini-Hochberg (BH) adjustment on the p-values obtained for different min_sample_leaf_list values, the adjusted p-value was 0.105. Although the initial analysis suggested that the DFS of the pCR group was superior to that of the non-pCR group, the adjusted p-value did not reach statistical significance after the correction for multiple comparisons. Fig. 5.
Survival analysis validation of the 32-GPS. (a) Survival analysis of the [185]GSE87211 dataset. (b) Survival analysis of the [186]GSE40492 dataset (with min_sample_leaf_list set to 5). (c–f) Survival analysis of the [187]GSE40492 dataset (with min_sample_leaf_list set to 4, 3, 2, and 1, respectively). Interpretation of BoostForest model output using SHAP To gain a deeper understanding of the contribution of each feature to the model’s predictions, this study employed the SHAP method for interpretability analysis; the results are presented in Fig. 6. As shown in Fig. [188]6a, the average absolute SHAP values for each feature in the test set are displayed on the left, and the SHAP value distribution plot is presented on the right. Features are ranked from top to bottom based on their importance, with higher absolute SHAP values indicating greater contribution to the model’s predictions. Among them, CTNNBIP1 > ZNF544, CHRNA5 > CFH, CDKN2A > ATG2A, and CENPM > CACNA1H have the most significant impact on the model’s predictions. As shown in Fig. [189]6b, positive SHAP values indicate that the feature contributes to increasing the probability of a positive response, while negative SHAP values suggest that the feature reduces the probability of a positive response. Figure [190]6c presents a heatmap of SHAP values, where red indicates a contribution to positive predictions, and blue signifies a contribution to negative predictions. Some features exhibited clear discriminative ability between pCR and non-pCR patients, with similar results observed in the [191]GSE40492 dataset (Supplementary Figure [192]S1). Fig. 6. Interpretability analysis of the predictive model. (a) A bar chart presents the importance of each feature, with the left section showing the mean absolute SHAP values and the right section displaying the SHAP value distribution. (b) A scatter plot shows the distribution of SHAP values. (c) A heatmap visualizes the contribution of each feature to the predictions. (d) Distribution of samples with consistent and inconsistent prediction results compared to true labels. In datasets [195]GSE87211 and [196]GSE40492, we performed a directional consistency validation for the 32-GPS, utilizing limma analysis to compute logFC. The results revealed that 25 out of 32 gene pairs exhibited consistent directionality, yielding a consistency rate of 0.78, which was statistically significant (p-value = 0.002), as shown in Supplementary Data [197]4. Analysis of samples with mismatched prediction results and true labels Subsequently, the distribution of samples with mismatched prediction results and true labels was analyzed. First, the 32-GPS score for each sample in the [198]GSE40492 dataset was calculated. As shown in Fig. [199]6d, samples with a score below 16 were predominantly predicted as non-pCR patients (using a prediction probability threshold of 0.5), while samples with a score above 21 were all predicted as pCR patients. The samples with mismatched predictions were mainly distributed between 16 and 21 points. For these challenging-to-predict samples, the discrepancy may arise from the higher weight assigned to the minority class (pCR) in the loss function during feature selection and model training.
This causes the model to be more sensitive to the minority class samples, leading the selected features to potentially reflect characteristics more associated with the minority class, which in turn increases the likelihood of misclassifying a small number of majority class (non-pCR) samples as minority class samples. Discussion In studies on predicting the response of LARC patients to nCRT, existing feature selection methods still face challenges in overcoming reproducibility issues, which in turn results in the inability to validate the predictive models across different datasets. This issue has attracted extensive research attention. For example, Guan et al. employed Weighted Gene Co-Expression Network Analysis (WGCNA) to identify CCT5 and ELF1, which successfully predicted treatment response in LARC patients^[200]43. Wang et al. used LASSO regression to identify six key indicators and construct the Systemic Inflammation-Nutritional Index (SINI)^[201]44. However, in these studies, single feature selection methods often struggle to capture the generalizability of features. Additionally, sample imbalance remains a limitation, as it affects predictive accuracy for positive samples. Although some studies have employed synthetic data to balance the distribution of positive and negative samples, this approach is relatively prone to overfitting^[202]19,[203]20. In this study, we addressed these limitations by using REOs algorithm to construct gene expression order pairs, reducing batch effect influences. A two-step selection process was applied to identify the 32-GPS feature set, effectively predicting the response of LARC patients to nCRT. We then used BoostForest as the predictive model to enhance accuracy, particularly for pCR patients. Experimental results demonstrated that the model achieved an accuracy of 0.898 and an AUPRC of 0.785 on the validation cohort, effectively distinguishing pCR from non-pCR patients. These findings provide a reliable basis for clinical decision-making, helping to mitigate under- or overtreatment in patient care. Filtering features that are significantly associated with treatment response enables the identification of potential biomarkers, which serve as a foundation for personalized treatment strategies. While methods such as Lasso regression, Random Forest, and Recursive Feature Elimination (RFE) are commonly used to predict treatment responses in cancer^[204]45,[205]46, reliance on a single method presents challenges. These methods can be sensitive to noise and outliers, potentially leading to inconsistent feature selection results across different data distributions. As a result, the generalizability and accuracy of selected feature sets may be compromised. In this study, a two-step feature selection process was implemented to address these challenges. The Preliminary screening involved applying four distinct feature selection methods, with the results integrated to form a candidate feature set. This approach reduces dependence on a single method, resulting in a more robust set of features and improving the identification of treatment-response-related features from multiple perspectives. In the Secondary screening, IFS combined with an XGBoost model was used for secondary selection, ultimately identifying 32 GPS features as the critical predictive feature set. To further enhance the reliability of the feature selection process and avoid convergence on a local optimum, permutation importance was applied to rank the candidate features. 
This adjustment improves both the efficiency and reliability of the selection process. Finally, to address the issue of imbalanced datasets, especially concerning the minority class (pCR), the loss function weights of the XGBoost model were adjusted to account for class imbalance, and the F1 score was used as the evaluation metric. This approach ensures more accurate predictions for the minority class. This study employed a combination of four feature selection methods to identify features, resulting in the identification of 127 gene pairs. Specifically, MDFS-2D, Boruta, MCFS, and VSOLassoBag identified 26, 66, 66, and 81 features, respectively. Certain differences were observed in the features selected by different methods. Single feature selection methods, which are typically based on specific statistical assumptions or algorithmic principles, may lead to the omission of important features. In contrast, combining multiple feature selection methods allows for multidimensional analysis, facilitating the identification of features that may be overlooked by individual methods. This approach enables the reassessment of the importance of these features during secondary screening, thereby reducing the likelihood of missing critical features. This has been well demonstrated in the study by Wei et al.^[206]20. Moreover, the question of how to scientifically, accurately, and efficiently combine feature selection methods remains an intriguing area of investigation, which we plan to explore further in future studies. While machine learning is widely applied in clinical diagnostics, achieving notable improvements in accuracy and efficiency^[207]47,[208]48, and models such as random forest (RF) and support vector machine (SVM) demonstrate high predictive accuracy in assessing responses to neoadjuvant chemoradiotherapy (nCRT) in patients with locally advanced rectal cancer (LARC)^[209]49,[210]50, results from the present study indicate that RF and SVM tend to favor the majority class (non-pCR), limiting their effectiveness for minority class predictions (pCR). To address this limitation, a more robust model, BoostForest, was introduced. With RF set as the baseline model and SVM and XGBoost included for comparison, all models demonstrated similar predictive results in the test set, while BoostForest achieved the highest AUPRC in the validation cohort, showing an overall performance advantage. Despite the strong predictive power of tree-based models, their complex decision paths and multi-level branching structures present a significant interpretability challenge in clinical practice. SHAP analysis was therefore applied to quantify each feature’s contribution to the prediction results, thereby improving model transparency. This approach enhances interpretability, allowing clinicians to understand the factors underlying the model’s predictions more clearly and fostering greater trust and applicability in practical diagnostics. In summary, the BoostForest model, used in conjunction with the 32-GPS system, demonstrates effectiveness in supporting clinicians’ treatment decision-making by aiding in the selection of personalized treatment strategies tailored to specific patient categories, which may reduce the adverse effects associated with overtreatment. In selecting BoostForest as the predictive model for response to neoadjuvant chemoradiotherapy, we considered several key aspects. 
BoostForest utilizes BoostTree as its base learner and introduces randomness into the model through both Bootstrap sampling and random selection of hyperparameters. Additionally, it enhances model diversity by randomly choosing split points during node splitting, which contributes to reducing overfitting. Previous studies, such as those by Zhao et al., have shown that BoostForest generally outperforms or at least performs comparably to established ensemble models like LightGBM and XGBoost, particularly in challenging datasets^[211]28. In our own research, we have demonstrated that BoostForest provides superior overall performance compared to the XGBoost model. The limitations of this study must be acknowledged. First, although the proposed method and model have been optimized and achieved certain effectiveness, the existing sample data are insufficient to comprehensively illustrate the advantages of our approach. This limited sample size restricts the performance and potential of the method and model. Due to the relatively small size of the validation cohort ([212]GSE40492), the dataset may only represent a subset of patients, lacking sufficient sample diversity. In future studies, we aim to validate the model using larger and more diverse cohorts. Second, although the proposed signature and model have shown potential in preliminary validation, their performance is currently insufficient to instill full confidence among clinicians, thus limiting their applicability and verification in clinical settings. Future research should include multi-center studies and clinical validation to ensure the model’s robustness and applicability, thereby enhancing its impact and reliability in clinical practice. Conclusions This study constructed gene pairs using REOs and identified the 32-GPS from numerous gene pairs by applying a two-step selection method. BoostForest was introduced as the predictive model to forecast the response of LARC patients to nCRT. In comparison to traditional classification models such as random forests, SVM, and XGBoost, BoostForest demonstrates superior performance in terms of accuracy and predictive capability. This workflow is expected to improve the accuracy of predicting pCR patients, particularly in scenarios with imbalanced class distributions. Electronic supplementary material Below is the link to the electronic supplementary material. [213]Supplementary Material 1^ (247.7KB, pdf) [214]Supplementary Material 2^ (293.5KB, xlsx) Author contributions CQ and YG: Conceptualization; CQ, SY, and YG: Methodology; CQ: Data analysis, original draft writing; CQ, RG, SY, HW and YG: Review and editing; YG: Funding acquisition; YC, FS, and CL: revised the article; All authors have read and agreed to the final version of the manuscript. Funding This study was supported by the National Natural Science Foundation of China (NSFC) under Grant 82060618, Grant 72261018, Jiangxi Province Key Laboratory of Multidimensional Intelligent Perception and Control (2024SSY03161), Doctoral Fund of First Affiliated Hospital of Gannan Medical University. Data availability The datasets used in this study are available in the online repository, the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/). Declarations Competing interests The authors declare no competing interests. Footnotes Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Changshun Qian and Shuxin Yang contributed equally to this work. 
Contributor Information Hui Wang, Email: huiwang@shsmu.edu.cn. You Guo, Email: gy@gmu.edu.cn. References