Graphical abstract

   graphic file with name fx1.jpg
   [39]Open in a new tab

Highlights

     * •
       Semi-supervised learning to derive clinically relevant biomarkers
     * •
       A new 10-gene signature refines chemotherapy decisions in stage
       II/III colon cancer
     * •
       The signature shows the strong prognostic value and potential for
       treatment benefits
     * •
       Validation using FFPE samples to ensure clinical applicability
     __________________________________________________________________

   Xu et al. develop and validate a 10-gene signature that refines
   chemotherapy decisions for stage II and III colon cancer, demonstrating
   its prognostic value and potential to predict treatment benefits using
   extensive dataset analysis and validation with FFPE samples.

Introduction

   Colon cancer, a disease characterized by its heterogeneity, is the
   fourth deadliest cancer worldwide, causing nearly 900,000 deaths each
   year.[40]^1 Colon cancer has been categorized as either localized to
   bowel (stage I to II), regionally spread to lymph nodes (stage III), or
   distantly spread to the liver, lung, peritoneum, brain, or distant
   lymph nodes (stage IV). The five-year survival rates for patients with
   stage I colon cancer are approximately 90% without chemotherapy but
   range from 10% to 50% for stage IV disease despite treatment with
   chemotherapy.[41]^2^,[42]^3 Five-year disease-free survival rates in
   patients with high-risk stage II colon cancer were reported to be
   80.7%–83.9%, and the three-year disease-free survival rate among
   patients with stage III colon cancer was from 73.4% to
   76.3%.[43]^4^,[44]^5 Generally, patients with stage II and III colon
   cancer can be considered as clinically relevant groups for risk
   stratification of disease recurrence.[45]^6^,[46]^7^,[47]^8 For some
   patients with high-risk stage II colon cancer, and the majority of
   patients with stage III colon cancer, adjuvant chemotherapy (ACT)
   consisting of fluoropyrimidine with or without oxaliplatin has been
   routinely used as an important treatment following surgery with the
   intent to cure.[48]^9

   However, the decision about ACT for stage II colon cancer has been
   challenging and controversial; the results from clinical trials are not
   supportive of ACT for all patients with stage II colon
   cancer.[49]^10^,[50]^11 Although ACT has been recommended for all
   patients with stage III colon cancer, only ∼20% of patients receiving
   such therapy will benefit from the treatment in terms of a survival
   advantage.[51]^12 Progress in treating this disease has been hampered
   by the inability to accurately determine which patients are most likely
   to benefit and which are least likely to benefit from the current
   systemic treatment options. It is therefore critical to discover better
   predictive biomarkers to identify the stage II and III patients who
   will most likely benefit from ACT. Also, it is important to avoid
   unnecessary toxicity for the rest of the patients who are not as likely
   to benefit from that therapy and/or to stimulate further research into
   more effective adjuvant therapy options.

   Several biological biomarkers, including microsatellite instability
   (MSI) and loss of heterozygosity at chromosome 18q (18q LOH), have
   historically been proposed for prognostic and predictive purposes in
   colon cancer.[52]^13^,[53]^14 However, only a minority of patients with
   early-stage colon cancer (less than 15%) are MSI-high (MSI-H), and the
   prognostic significance of 18q LOH has not been conclusively validated
   in clinical trials.[54]^15^,[55]^16 Moreover, the currently available
   commercial gene signature panels for colon or colorectal cancer (CRC),
   such as Oncotype DX colon cancer,[56]^17 ColoPrint,[57]^18
   Veridex,[58]^19 and GeneFx colon,[59]^20 have primarily been developed
   with a focus on prognosis. While these panels have undergone external
   validation, their results have been somewhat inconsistent, and there is
   a lack of concrete evidence supporting their predictive value
   specifically for ACT benefit.[60]^21^,[61]^22

   In the realm of colon cancer ACT benefit prediction, a few gene
   signatures have been proposed.[62]^23^,[63]^24^,[64]^25 However,
   compared to general prognostic signatures, these specific signatures
   targeting chemotherapy benefit prediction have not undergone extensive
   independent validation. This gap highlights a critical need for
   reliable ACT benefit biomarkers, not only for stage II and III colon
   cancer but also for practical implementation in clinical settings using
   formalin-fixed paraffin-embedded (FFPE) samples. Our study aims to
   address this gap by focusing on the development and validation of gene
   signatures tailored for predicting ACT benefits, thereby contributing
   to the refinement of personalized treatment strategies in colon cancer.

   To further improve the identification of prognosis/prediction
   signatures, we assembled a comprehensive colon cancer stage II and III
   gene expression dataset, which included 933 samples from six datasets.
   We then applied an innovative semi-supervised machine learning approach
   to identify an 18-gene prognosis signature and a 10-gene chemotherapy
   benefit prediction signature set ([65]Figure 1). The prognosis and
   chemotherapy-benefit prediction signatures were validated in five and
   two testing datasets, respectively, including targeted NanoString
   nCounter data we generated from FFPE samples. We showed that patients
   predicted to benefit from ACT based on the 10-gene signature had
   significantly better survival outcomes than those predicted not to
   benefit using both fresh-frozen and FFPE samples. Additionally, the
   10-gene signature displayed the potential to predict the utility of
   cancer immunotherapy in an adjuvant setting in selected patients.

Figure 1.

   [66]Figure 1
   [67]Open in a new tab

   The workflow chart

   Schematic of the study design for (A) the development and validation of
   the 18-hub-gene prognosis signature; and (B) the development and
   validation of the 10-gene chemotherapy-benefit signature.

Results

Semi-supervised gene network analysis identified an 18-hub-gene set

   We first selected [68]GSE39582
   , which was our only discovery dataset with survival outcomes and is
   one of the largest gene expression datasets for stage II and III colon
   cancer, to identify genes associated with survival outcome. A total of
   3,091 genes with significant nominal p values (p ≤ 0.05) were selected
   as candidates. Next, to further pare down a more focused gene set from
   these survival-associated genes, these survival-associated genes in
   [69]GSE39582 were combined with five gene expression datasets without
   survival outcome to construct the large gene network in stage II and
   III colon cancer. Specifically, we combined [70]GSE39582 with five
   additional stage II and III colon cancer expression datasets
   ([71]GSE27854, [72]GSE26906, [73]GSE21510, [74]GSE18088, and
   [75]GSE2109) to form a large gene expression dataset with 933 samples
   and 3,091 genes. The SPACE algorithm was then used to construct a gene
   network, which included 504 nodes and 737 edges ([76]Figure 2A). Since
   hub genes with more connections in the gene network are more
   biologically meaningful compared with the p value-ranked gene list of
   meta-analysis,[77]^26^,[78]^27 we selected the top 18 hub genes with
   more than 11 connections (degrees) in the hub gene list ([79]Table S2),
   which included THY1, BUB1, SDPR, NAP1L3, DEPDC1, STON1, DLGAP5, FBN1,
   ATAD2, HSD17B2, KIF23, CHAC2, FRMD6, MCM10, TPX2, AURKB, CYR61, and
   FAM84A genes ([80]Figure 2B). The description of gene functions and
   pathway analysis results are listed in [81]Table S3 and [82]Figure S1,
   respectively.

Figure 2.

   [83]Figure 2
   [84]Open in a new tab

   Identification of the 18-hub-gene signature and validation of the
   Random Survival Forests (RSF) prognosis model

   (A) The topology of the constructed survival-related gene network in
   colon cancer. A total of 504 genes (nodes) with at least 11 connections
   with other genes were identified as hub genes and labeled in enlarging.

   (B) Composition of the 18 survival-associated hub genes. Degree refers
   to the number of genes that the hub gene has a direct connection with
   based on the constructed gene network in Figure A. GISTIC
   ([85]http://portals.broadinstitute.org/tcga/home) records the
   information of somatic copy-number variation: “+” indicates the genes
   with significant amplification and “−” indicates significant deletion
   in CRC. MethHC ([86]http://methhc.mbc.nctu.edu.tw/php/index.php)
   records the information on DNA methylation: “+” indicates the genes
   with significant hypermethylation and “−” indicates significant
   hypomethylation in CRC. DisGeNET ([87]https://www.disgenet.org/home/)
   records the information on human gene-disease associations (GDAs): “+”
   indicates the genes related to CRC. The gene symbols of the ten genes
   with more than two pieces of evidence that occurred in CRC were
   highlighted in red.

   (C) Maximally selected log rank statistics were applied to the in-bag
   predicted value of the RSF prognosis model. The optimal cutoff point
   was determined as 14.38.

   (D–G) Validation of the 18-hub-gene signature in 4 independent
   datasets, including [88]GSE17538
   , [89]GSE33113, [90]GSE37892, and [91]GSE38832. The high- and low-risk
   groups were defined based on the 18-hub-gene signature, which was
   derived from the [92]GSE39582 cohort. The optimal cutoff point, which
   was determined by the maximally selected rank statistics (maxstat), was
   used to divide the patients into high- and low-risk groups. Red and
   dark blue lines indicate predicted high- and low-relapse risk groups.
   HR compares the RFS of the low- and high-relapse risk groups. p values
   were obtained by the log rank test.

Prognosis model construction and performance evaluation

   To determine if the discovered 18-hub-gene set could predict colon
   cancer prognosis, we constructed a prognosis prediction model using
   [93]GSE39582
   as our training dataset. The optimal cutoff point based on maximally
   selected rank statistics was determined as 14.38 for the high- and
   low-risk group ([94]Figure 2C).[95]^28 Next, we evaluated this
   prognostic model using four additional independent expression datasets
   with available survival information ([96]GSE17538, [97]GSE33113,
   [98]GSE37892, and [99]GSE38832). In each testing dataset, the samples
   were also divided into high- and low-relapse risk groups using the
   optimal cutoff point in [100]Figure 2C. The hazard ratio (HR) and log
   rank p value of predicted risk groups in the five datasets were 3.324
   (p = 0.005), 3.933 (p = 0.003), 11.543 (p < 0.0001), and 4.745 (p =
   0.032), respectively ([101]Figures 2D–2G). Furthermore, the univariable
   Cox proportional hazard regression analysis was performed and the
   results suggested that the 18-hub-gene set was significantly
   associated with relapse-free survival (RFS) outcome in [102]GSE17538
   (HR = 3.336, p = 0.008), [103]GSE33113 (HR = 3.967, p = 0.006),
   [104]GSE37892 (HR = 12.05, p = 3.79 × 10^−^9), and [105]GSE38832 (HR =
   4.795, p = 0.051) ([106]Table S4). Also, the multivariable Cox
   proportional hazard regression analysis with additional clinical
   variables including age, gender, and stage was completed, and the
   results showed that the 18-hub-gene set was an independent factor for
   RFS outcome in [107]GSE17538 (HR = 4.136, p = 0.003), [108]GSE33113
   (HR = 4.117, p = 0.005), [109]GSE37892 (HR = 8.238, p = 2.73 × 10^−^6),
   and [110]GSE38832 (HR = 4.577, p = 0.058) ([111]Figure S2A;
   [112]Table S4). These results demonstrated the 18-hub-gene set provides
   an effective gene signature for colon cancer prognosis.

18-Hub-gene prognostic model outperformed existing signatures

   In the construction of clinical prognostic models, the efficient
   selection of a gene signature set from a high-dimensional gene list is
   a crucial step. Typically, in a p value-based gene selection approach,
   the set of genes with the most significant p values is selected to
   build the prognostic prediction model. To compare the performances of
   the p value-based gene selection approaches with our network-based gene
   selection approach, we chose 18 genes with the most significant p
   values from the Cox regression analysis and re-constructed a prognosis
   model of RFS using the same five independent GEO datasets described
   earlier. We found that the prediction performance of the 18-hub genes
   obtained with our semi-supervised approach had better performance than
   those based on the 18 most significant genes. The median AUC of the
   18-hub-gene set and the 18-most-significant gene set for predicting
   5-year RFS was 0.732 and 0.591, respectively ([113]Figure 3A).
   Additionally, analysis using pairwise mutual information distance
   measuring association showed that the 18-hub-gene signature set
   generated by our semi-supervised approach had much lower redundant
   information than the 18-top-ranked gene set (two-sample
   Kolmogorov-Smirnov test, p value < 0.00001, [114]Figure 3A). Moreover,
   principal component analysis also demonstrated that the 18-hub-gene set
   captured more variation in patients with colon cancer than the
   18-top-ranked gene set ([115]Figure 3B).

Figure 3.

   [116]Figure 3
   [117]Open in a new tab

   Comparison of the 18-hub-gene set and 18-most-significant gene set

   (A) Boxplot of median AUC for predicting 5-year RFS in four GEO testing
   datasets ([118]GSE17538
   , [119]GSE33113, [120]GSE37892, and [121]GSE38832), and pairwise mutual
   information distance based on expression values in the [122]GSE39582
   dataset.

   (B) Expression variation across the samples in the [123]GSE39582
   dataset based on the principal component analysis.

   (C) Gene expression patterns of the two gene sets in the [124]GSE39582
   dataset.

   Using the expression dataset of 469 stage II and III colon cancer
   samples in [125]GSE39582
   , we next compared the expression patterns of two gene sets and found
   the average expression level of the 18-hub-gene group was significantly
   higher than the expression level of the 18-top-ranked group (t test, p
   value < 2.2 × 10^−^16), demonstrating the effectiveness of our approach
   for optimally identifying a more relevant colon cancer-related gene
   signature set for prediction analysis ([126]Figure 3C).

   Furthermore, we compared the prediction performance of the 18-hub-gene
   set with other established gene sets for colon cancer prognosis
   prediction, including the Oncotype DX,[127]^29 ColoGuidePro,[128]^30
   ColoFinder,[129]^31 the 8-gene signature of Peng et al.,[130]^32 the
   13-gene signature of Tian et al.,[131]^33 the 3-gene signature of Cheng
   et al.,[132]^34 the 15-gene signature of Rokavec et al.,[133]^35 and
   the 12-gene signature of Fang et al.[134]^36 For each prognostic
   signature generated by these alternative methods, we also calculated
   HR, log rank test p values, and 5-year AUC for the four GEO validation
   cohorts ([135]GSE17538
   , [136]GSE33113, [137]GSE37892, and [138]GSE38832). The results showed
   the 18-hub-gene set generated by our semi-supervised approach had the
   best performance in prognostic prediction ([139]Table S5).

Identification and validation of 5-FU-based ACT chemotherapy-benefit gene
signature

   5-Fluorouracil (5-FU)-based ACT refers to patients who have received
   5-FU treatment alone (either intravenous 5-FU or oral capecitabine) or
   in combination with other medications, including folinic acid and
   oxaliplatin. Among all available colon cancer datasets, two datasets
   ([140]GSE39582
   and [141]GSE17538) included 142 and 80 primary tumor samples from
   patients who subsequently received adjuvant 5-FU-based treatments. We
   next tried to determine a gene signature that can more accurately
   predict patients with colon cancer who are likely to benefit from ACT
   therapy and thus demonstrate improved survival relative to 5-FU-based
   ACT from the primary 18-hub-gene set. To this end, we identified
   potential cancer driver genes with either significant genomic
   aberration or DNA methylation alteration between CRC and normal samples
   using the GISTIC tool[142]^37 for The Cancer Genome Atlas copy number
   and MethHC database[143]^38 (see [144]Figure 1B) and manually
   interrogated the 18-hub-gene set with these two databases. To not miss
   some crucial genes that play an important role in colon cancer, we also
   verified the 18-hub genes in the DisGeNET database,[145]^39 which is
   the largest disease-associated gene database. Finally, the genes
   occurring more than two times in the GISTIC, MethHC, or DisGeNET
   databases ([146]Figure 2B) were selected and used to construct a
   10-gene chemotherapy-benefit signature set. An RSF chemotherapy-benefit
   model was re-constructed using these 10 genes, along with 142 samples
   with 5-FU-based ACT information in the [147]GSE39582 dataset.

   The prognostic performance of the 10-gene chemotherapy-benefit
   signature set was evaluated in the four GEO colon cancer cohorts
   ([148]GSE17538
   , [149]GSE33113, [150]GSE37892, and [151]GSE38832). The HRs and log
   rank test p values calculated to evaluate the prediction performance in
   four datasets were 2.97 (p = 0.002), 4.467 (p = 0.004), 5.963
   (p < 0.0001), and 2.642 (p = 0.122), respectively
   ([152]Figures S3A–S3D). The results suggested that the 10-gene
   signature set demonstrated a prognostic performance similar to the
   18-hub-gene prognosis signature set.

   We then assessed which patients demonstrated an association with
   survival outcomes relative to 5-FU-based ACT by the 10-gene
   chemotherapy-benefit prediction model. According to the maximally
   selected rank statistics (maxstat) provided in the “survminer” package,
   the samples were divided into ACT-benefit and non-benefit groups using
   the optimal cutoff point ([153]Figure 4A). The HR and p value validated
   by another colon cancer cohort, [154]GSE17538
   with ACT information, were HR = 3.884 and p = 0.002 ([155]Figure 4B),
   which demonstrated significant prediction of RFS outcome for the
   10-gene chemotherapy-benefit signature set in stage II and III colon
   cancer. Furthermore, the results of univariable Cox proportional hazard
   regression analysis suggested that the 10-gene chemotherapy-benefit
   signature set was significantly associated with RFS of patients with
   5-FU-based ACT in [156]GSE17538 (HR = 3.933, p = 0.004)
   ([157]Table S6); and the results of multivariable Cox proportional
   hazard regression analysis also showed that the 10-gene
   chemotherapy-benefit signature set was a significant independent factor
   for RFS of patients with 5-FU-based ACT in [158]GSE17538 (HR = 4.476,
   p = 0.003) ([159]Figure S2B; [160]Table S6). Furthermore, we conducted
   a control experiment in which we randomly selected 5–15 genes 1,000
   times and compared their prediction performance with the 10-gene set in
   [161]GSE17538. As shown in the results ([162]Figure 4C), the AUC of the
   10-gene set for predicting 5-year RFS was 0.756, significantly
   outperforming the random gene set (p = 0.009). These data demonstrate
   that the 10-gene chemotherapy-benefit signature set selected by the
   genomic event could more accurately predict patients with colon cancer
   who would more likely benefit from ACT.

Figure 4.

   [163]Figure 4
   [164]Open in a new tab

   Validation of the RSF chemotherapy-benefit model

   (A) Maximally selected log rank statistics were applied to the in-bag
   predicted value of the RSF chemotherapy-benefit model. The optimal
   cutoff point was determined as 8.88.

   (B) Validation of the 10-gene chemotherapy-benefit signature in another
   independent dataset, [165]GSE17538
   , which contains only samples with 5-FU-based ACT. The ACT-benefit and
   non-benefit groups were defined based on the 10-gene
   chemotherapy-benefit signature, which was derived from samples with
   5-FU-based ACT in the [166]GSE39582 cohort. The optimal cutpoint, which
   was determined by the maximally selected rank statistics (maxstat), was
   used to divide the patients into the ACT-benefit and non-benefit
   groups. Dark blue and red lines indicate predicted ACT-benefit and
   non-benefit groups. HR compares the RFS of the ACT-benefit and
   non-benefit group. p values were obtained by the log rank test.

   (C) The density plot illustrated the AUC values of 1,000 randomly
   selected gene sets for predicting 5-year RFS. The red dotted line
   represents an AUC of 0.756 for the 10-gene set in predicting 5-year
   RFS.

   (D and E) The predictive performance evaluation of the prognosis model
   and chemotherapy-benefit model in the VUMC data. The 18-gene prognosis
   model divided high- and low-relapse risk groups in stage II to III
   patients in (D), and the 10-gene chemotherapy-benefit model divided
   ACT-responsive and non-responsive groups in stage II to III patients in
   (E).

   (F) The mutation pattern of KRAS, BRAF, PIK3CA, PTEN, NRAS, and AKT1
   between VUMC ACT-benefit and non-benefit patients.

   (G) The boxplot of the 18-gene score and 10-gene score between the
   different locations of lesions in the VUMC cohort.

   (H) The bar plot of the 18-gene and 10-gene groupings according to the
   patients with MSI-H in the VUMC cohort.

Validation of 18- and 10-gene signatures using FFPE samples

   To verify the utility of our 18-gene prognosis model and 10-gene
   chemotherapy-benefit model, we collected 109 stage II/III colon cancer
   specimens and detected the expression of 18-hub-gene by the NanoString
   nCounter platform, and 56 of them had received 5-FU-based ACT
   ([167]Table 1; [168]Table S8). The 18-gene prognosis model was used to
   divide these 109 specimens into high- and low-relapse risk groups. The
   HR and log rank p values were 3.542 (p = 0.009) ([169]Figure 4D). The
   results of univariable (p = 0.014, HR = 3.559) ([170]Table S4) and
   multivariable (p = 0.0262, HR = 4.46) ([171]Figure S2A; [172]Table S4)
   Cox regression analyses suggested that the 18-gene prognosis model is
   an independent prognosis factor. Then the 10-gene chemotherapy-benefit
   prediction model was used to divide 56 stage II to III specimens with
   5-FU-based ACT into benefit and non-benefit groups. Those patients
   predicted to be in the non-benefit group showed worse RFS after
   5-FU-based ACT. The HR was 2.735 and log rank test p value was p =
   0.058 ([173]Figure 4E). The HR and p value of univariable and
   multivariable Cox regression analyses were 2.765 (p = 0.069) and 4.469
   (p = 0.074) ([174]Figure S2B; [175]Table S6). We also investigated the
   mutation pattern of KRAS, BRAF, PIK3CA, PTEN, NRAS, and AKT1 in the
   Vanderbilt University Medical Center (VUMC) analysis group of patients
   within the ACT-benefit and non-benefit groups. The results suggested
   that the mutation frequencies of KRAS and PIK3CA in the ACT-non-benefit
   group (40% and 20%) were higher than in the ACT-benefit group (9.8% and
   4.9%) ([176]Figure 4F). In addition, we analyzed the distribution of
   the location of lesions and MSI-H status according to the 18- and
   10-gene signature groupings in the VUMC cohort. The results suggested
   that the 18-gene score and 10-gene score show no significant difference
   between left- and right-sided lesions in the VUMC cohort; the p value
   (Wilcoxon rank-sum test) was 0.548 and 0.82, respectively
   ([177]Figure 4G). For MSI status, the VUMC cohort has 10 patients with
   MSI-H. Intriguingly, MSI-H status is equally distributed among the
   high- (n = 5) and low (n = 5)-risk groups according to the 18-gene
   signature ([178]Figure 4H). However, MSI-H status is distributed only
   according to the ACT-benefit grouping (n = 6) of the 10-gene signature
   ([179]Figure 4H). This observation suggests that the 10-gene signature
   may have implications for identifying patients who are more likely to
   benefit from ACT, specifically within the MSI-H subgroup.

Table 1.

   Patient and tumor characteristics of the different sets
   Characteristics Type Mean age (sd[180]^a, range), years Gender ^a(M/F)
   TNM stage
     __________________________________________________________________

   Treatment with 5-FU[181]^a Median follow-up[182]^a (sd, range), months
   Relapse[183]^a
     __________________________________________________________________

   0 I II III IV NA Yes No N/A
   [184]GSE39582 (n = 566)∗ mRNA, independent 66.9 (13.3, 22–97) 310/256 4
   33 264∗ 205∗ 60 0 142∗ 48 (40.4, 0–201) 139 322 8
   [185]GSE27854 (n = 115)∗ mRNA, independent – – 0 16 41∗ 35∗ 23 0 – – –
   – –
   [186]GSE26906 (n = 90)∗ mRNA, independent 66.5 (12.6, 31–94) 43/47 0 0
   90∗ 0∗ 0 0 – – – – –
   [187]GSE21510 (n = 123)∗ mRNA, independent – – 0 15 46∗ 39∗ 23 0 – – –
   – –
   [188]GSE18088 (n = 53)∗ mRNA, independent 65.4 (12.2, 34–85) 26/27 0 0
   53∗ 0∗ 0 0 – – – – –
   [189]GSE2109 (n = 276)∗ mRNA, independent – 138/137 2 38 87∗ 73∗ 36 40
   – – – – –
   [190]GSE17538 (n = 232)∗ mRNA, independent 64.7 (13.4, 23–94) 122/110 0
   27 72∗ 76∗ 56 0 80∗ 40.7 (26.6, 0.4–118.6) 34 111 3
   [191]GSE33113 (n = 90)∗ mRNA, independent 70.4 (13, 35–95) 42/48 0 0
   90∗ 0∗ 0 0 0 38.8 (26.5, 1.6–118.3) 18 71 1
   [192]GSE37892 (n = 130)∗ mRNA, independent 68.3 (12.7, 22–97) 69/61 0 0
   73∗ 57∗ 0 0 0 42.9 (23.4, 0.6–103.1) 37 93 0
   [193]GSE38832 (n = 122)∗ mRNA, independent – – 0 18 35∗ 39∗ 30 0 0 41.2
   (34.1, 0.3–111.4) 9 65 0
   VUMC (n = 136/144)∗ mRNA, independent 65.5 (11.5, 42–86) – 0 1 65∗ 44∗
   26 0 56∗ 49.9 (51.8, 0.1–241.6) 20 89 0
   [194]Open in a new tab

   N/A, not available; sd, standard deviation; ∗ the datasets used to
   derive and validate the signature.
   ^a

   Among patients with stage II to III colon cancer.

Drug sensitivity analysis for 5-FU-based ACT chemotherapy-benefit gene
signature

   The 10-gene chemotherapy-benefit signature was applied to predict the
   5-FU benefit of NCI-60 cell lines, which were collected and downloaded
   from CellMiner.[195]^40 We then assigned 60 cell lines into 14
   5-FU-sensitive and 46 5-FU-resistant groups by the 10-gene prediction
   model, which included 2 and 5 colon cell lines, respectively
   ([196]Figure 5A). Meanwhile, 1,886 5-FU-related NCI-60 studies were
   collected from CellMiner, The Wilcoxon rank-sum test for −log10(GI 50)
   values between 2 5-FU-resistant colon cell lines and 5 5-FU-sensitive
   colon cell lines was applied. Interestingly, we found the median of GI
   50 values in the 5 5-FU-sensitive colon cell lines was significantly
   higher than that in the 2 5-FU-resistant colon cell lines (p
   value < 2.2 × 10^−^16) ([197]Figure 5B), suggesting that 5
   ACT-beneficial colon cell lines were more sensitive to 5-FU. In
   summary, the 10-gene chemotherapy-benefit signature set has a good
   performance for predicting 5-FU sensitivity vs. resistance for colon
   cancer cell lines from the NCI-60 data.

Figure 5.

   [198]Figure 5
   [199]Open in a new tab

   The performance of the 10-gene chemotherapy-benefit signature set for
   predicting 5-FU resistance for 60 cancer cell lines from the NCI-60
   data

   (A) The heatmap plot of NCI-60 cell lines, which were clustered into 22
   resistant and 38 sensitive groups based on the drug activity Z scores
   of the 10-gene chemotherapy-benefit signature.

   (B) The violin plot of −log10(GI 50) values between resistant and
   sensitive groups. Abbreviations are as follows: BR, breast; CNS,
   central nervous system; CO, colon; LC, non-small cell lung; LE,
   leukemia; ME, melanoma; OV, ovarian; PR, prostate; and RE, renal.

Pathway enrichment analysis for 10-gene chemotherapy-benefit
signature-associated gene networks

   To better understand the biology underlying the 10-gene
   chemotherapy-benefit signature, we next performed pathway enrichment of
   the ten genes (THY1, BUB1, DEPDC1, STON1, ATAD2, HSD17B2, TPX2, AURKB,
   CYR61, and FAM84A), along with the 135 genes connected to them on the
   SPACE gene network ([200]Figure 2A). We tested 743 canonical pathways
   collected in MSigDB (Molecular Signature Database from Broad Institute)
   that contained genes overlapping with the 145 genes described earlier
   ([201]Table S7). The results showed a total of 147 pathways with
   nominal p values less than 0.05. Next, we computed a pathway enrichment
   score for each sample using ssGSEA (single-sample gene set enrichment
   analysis) and compared these scores for the ACT-benefit and non-benefit
   groups in the testing cohort [202]GSE17538
   ([203]Table S7).[204]^41 [205]Figure S4 shows a heatmap including
   pathways with significantly different ssGSEA scores between the two
   groups (adjusted p value less than 0.1). Among these significant
   pathways, 49 pathways showed high expression activities in the
   ACT-non-benefit group, and notably, many of them are tumor
   microenvironment-related pathways, such as
   REACTOME_INTEGRIN_CELL_SURFACE_INTERACTIONS,
   REACTOME_EXTRACELLULAR_MATRIX_ORGANIZATION, and others.

The potential predicted response to immune checkpoint blockade of 10-gene
chemotherapy-benefit signature

   A study explored the roles of fatty acid metabolism in CRC and
   developed a prognostic risk score model. The study utilized the Tumor
   Immune Dysfunction and Exclusion (TIDE) score alongside a SubMap
   algorithm to demonstrate that high-risk patients, as identified by the
   model, are more suitable for immunotherapy. This suggests that the TIDE
   score can help identify patients with CRC who might benefit from
   specific immunotherapies.[206]^42 Given our findings relative to the
   patients with MSI-H and the 10-gene signature, we investigated whether
   the 10-gene chemotherapy-benefit signature could also predict response
   to immune checkpoint blockade (ICB). In our study, we assessed the
   relationship between the TIDE score[207]^43 and the 10-gene
   chemotherapy-benefit score for each sample in our datasets. We then
   examined the Pearson correlation between these two scores. Our
   analysis, involving 442 tumor samples across four datasets
   ([208]GSE17538
   , [209]GSE33113, [210]GSE37892, and [211]GSE38832), revealed a
   significant yet moderate positive correlation between the TIDE score
   and the 10-gene chemotherapy-benefit score (r = 0.33, p = 4.97 ×
   10^−^13, [212]Figure 6A). However, this correlation varied across
   datasets. Specifically, in the [213]GSE37892 dataset, the correlation
   was relatively weak and not statistically significant (r = 0.13, p =
   0.13, [214]Figure 6D), while in the [215]GSE17538 (r = 0.42, p = 1.05 ×
   10^−^6, [216]Figure 6B), [217]GSE33113 (r = 0.42, p = 3.77 × 10^−^5,
   [218]Figure 6C), and [219]GSE38832 (r = 0.53, p = 9.2 × 10^−^7,
   [220]Figure 6E) datasets, a moderate and significant positive
   correlation was observed. Our results, therefore, underscore the need
   for further investigation, particularly in clinical settings where
   immunotherapy is administered, to better understand and validate the
   potential predictive role of our 10-gene signature in the context of
   ICB.

Figure 6.

   [221]Figure 6
   [222]Open in a new tab

   Association of the 10-gene chemotherapy-benefit signature with TIDE
   score

   The Pearson correlation between the TIDE score and the 10-gene
   chemotherapy-benefit score in the merged four datasets (A),
   [223]GSE17538
   (B), [224]GSE33113 (C), [225]GSE37892 (D), and [226]GSE38832 (E).

Discussion

   In our pursuit of personalized cancer therapy, identifying patients
   with stage II and III colon cancer most likely to benefit from
   5-FU-based ACT remains a crucial goal. However, developing gene
   signatures related to prognosis or ACT benefit poses challenges due to
   limited training datasets, variability in gene expression data across
   different platforms, and heterogeneity in patient stages, treatments,
   and tumor molecular characteristics. To overcome these challenges, we
   employed a semi-supervised machine learning approach, identifying genes
   significantly associated with survival and highly interconnected within
   gene networks. Utilizing a discovery dataset of 933 stage II and III
   colon cancer samples, the largest of its kind, we aimed to maximize the
   robustness of our model.

   Gene regulatory network modeling, which studies how genes interact with
   each other and work together as modules to perform biological
   functions, has been extensively applied to genomic data analysis and
   shown to be an effective approach for understanding complex biological
   systems.[227]^44^,[228]^45^,[229]^46^,[230]^47^,[231]^48 It also has
   been demonstrated that compared with the genes selected by
   meta-analysis p values, highly connected hub genes in the network have
   more biological functional importance and are more predictive of
   outcome.[232]^27 The hub genes have been used as effective biomarkers
   for the prognosis and prediction of survival in different cancer
   types.[233]^49^,[234]^50^,[235]^51^,[236]^52^,[237]^53^,[238]^54 To
   select relevant genes for the construction of a gene network, we first
   applied the Cox regression model to select 3,091 genes significantly
   associated with patient survival outcomes in a subset of the discovery
   cohort with available survival information (n = 566) and then
   constructed a gene network using the survival-associated genes with all
   933 samples in the discovery cohort.

   The 18-hub-gene selected from our semi-supervised gene network, in
   which each gene has more than 11 connections with its neighbors in the
   gene co-expression network, led to coordinated changes at the system
   level. We found that compared to the 18-top-ranked gene set, our
   18-hub-gene set has higher information content as well as more robust
   prognostic performances across different datasets. Further genomic and
   DNA methylation analysis also developed a 10-gene chemotherapy-benefit
   signature, which was successfully validated in another independent
   fresh-frozen tissues dataset we generated previously ([239]GSE17538).
   Moreover, we demonstrated both 18- and 10-gene signatures to have
   independent prognostic and predictive value beyond the clinical
   variables including tumor stage. These results suggested that the gene
   signatures we developed were not just prognostic for patients with
   colon cancer but, more importantly, were also predictive of patients’
   benefit from 5-FU-based chemotherapy in terms of RFS as in previous
   work.[240]^55^,[241]^56

   The pathway analyses of the 10-ACT benefit gene signature and its
   network revealed that for the ACT-non-benefit group, a series of the
   tumor microenvironment and cell surface interaction-related pathways
   such as integrin 5 and integrin 2 pathways are activated, which might
   provide a survival advantage for tumor cells against chemotherapy
   treatment.[242]^57^,[243]^58^,[244]^59 It is also well known that
   increased TGFβ signaling in the microenvironment is associated with
   adverse outcomes in patients with CRC.[245]^60^,[246]^61 On the other
   hand, the ACT-benefit group is enriched with cell cycle-related
   pathways. Thus, the 10-ACT benefit gene signature captures the distinct
   biological functions and activities associated with chemotherapy
   responses.

   ICB is an effective strategy for enhancing anti-tumor T cell activity
   and therapeutic effects. ICB therapy has resulted in unprecedented and
   durable responses in a sizable proportion of patients with diverse
   cancers such as melanoma, lung, kidney, bladder, and
   others.[247]^62^,[248]^63^,[249]^64 ICB drugs such as pembrolizumab and
   nivolumab, which are inhibitors of programmed cell death protein 1
   (PD-1), have been authorized by the Food and Drug Administration to
   treat patients with metastatic CRC whose tumors are MSI-H or deficient
   mismatch repair (dMMR). Recently PD-1 blockade was applied to patients
   with locally advanced rectal cancer with dMMR in a neoadjuvant setting,
   and all 12 patients achieved a striking clinical complete
   response.[250]^65 However, MSI-H/dMMR CRC only comprises around 15% of
   CRC, and patients with microsatellite stable (MSS)/mismatch
   repair-proficient (pMMR) CRC show poor responses to ICB treatments due
   to “immune-cold” tumor immune
   microenvironment.[251]^66^,[252]^67^,[253]^68^,[254]^69^,[255]^70

   Although overall patients with MSS/pMMR CRC lack strong responses to
   ICB therapy, the heterogeneity nature of CRC indicates that the
   different subgroups may have divergent treatment
   responses.[256]^71^,[257]^72^,[258]^73 A recent large-scale
   transcriptome analysis identified a subgroup of “immune-hot” MSS/pMMR
   rectal cancer with favorable outcomes to neoadjuvant therapy, which
   highlights the potential for effective response to ICB therapy.[259]^74
   On the other hand, there is a small yet notable fraction of patients
   with MSI-H/dMMR metastatic CRC who unfortunately show early disease
   progression after undergoing ICB treatment.[260]^75^,[261]^76
   Therefore, there is an urgent need to identify potential biomarkers for
   ICB response.

   TIDE score has been shown to capture tumor immune evasion by combining
   gene signatures estimating T cell dysfunction and exclusion to predict
   ICB response.[262]^43 However, TIDE score calculation may need a large
   number of genes and is focused on the state of T cells. We observed a
   significant, although moderate, correlation (r = 0.33, p = 4.97E−13,
   [263]Figure 6A) between the 10-gene signature and TIDE score using 442
   colon tumors. While this finding is intriguing, we acknowledge that it
   does not strongly indicate a predictive capability for checkpoint
   inhibition response. This association, therefore, should be interpreted
   with caution, and we do not assert it as a definitive predictor of
   immunotherapy benefit. Further investigation in populations receiving
   immunotherapy is required to explore the potential relevance of this
   correlation.

   One of the key bottlenecks for translating biomarkers discovered from
   the microarray/RNA sequencing transcriptomic data based on fresh-frozen
   tissue to routine clinical practice is the development of accurate and
   robust clinical tests, especially on FFPE samples. Previously, we
   systematically evaluated the gene expression pattern of colon cancer
   prognostic gene signature between Affymetrix microarray data on
   fresh-frozen tissue and NanoString nCounter data on matched FFPE
   samples and showed the moderate correlation between these two
   measurements, which indicated the feasibility of translating the
   discovery from fresh-frozen tissues to clinical tests using FFPE
   samples for colon cancer gene signatures.[264]^77^,[265]^78 In this
   study, we developed the clinical assay containing 18 genes using
   NanoString nCounter platform and tested the 18- and 10-gene signatures
   in VUMC stage II to III colon cancer samples that were sectioned from
   FFPE tissue blocks. The 18-gene prognosis signature performed
   exceptionally well in the FFPE validation cohort (log rank p value =
   0.009). Despite the relatively small sample size of the
   chemotherapy-benefit validation cohort (56 patients with 13 recurrent
   events), the 10-gene chemotherapy-benefit signature also achieved
   marginal significance (log rank p value = 0.0583), indicating a
   promising trend that distinguishes 5-FU-based ACT-beneficial and
   non-beneficial groups in terms of survival outcomes. These results
   demonstrated that our 10-gene signature set could provide useful
   information for identifying patients with colon cancer who are most
   likely to benefit from 5-FU-based ACT.

   The development of our custom NanoString panel represents a significant
   advancement in translating our gene signature from the research setting
   into clinical practice. This innovative approach enables the practical
   application of our 10-gene signature, potentially changing how patients
   with stage II and III colon cancer could be managed in terms of
   5-FU-based ACT. By providing a robust and clinically applicable tool,
   we anticipate that our signature could aid in stratifying patients more
   accurately, thereby enabling personalized treatment strategies. The
   panel’s utility in distinguishing patients likely to benefit from ACT
   could lead to more precise and effective treatments, reducing
   unnecessary exposure to chemotherapy and its associated side effects
   for those unlikely to benefit.

   Furthermore, the integration of our panel into clinical workflows
   necessitates a series of next steps to fully realize its potential.
   Prospective clinical trials are imperative to validate the predictive
   power of the 10-gene signature in a real-world setting and assess its
   impact on patient outcomes. These trials should also explore the
   utility of the signature in conjunction with other emerging diagnostic
   tools, like circulating tumor DNA, to further refine treatment
   strategies. Additionally, the broader applicability of the signature in
   various stages of colon cancer and in different treatment scenarios,
   such as immune checkpoint blockade therapy, should be investigated.
   This expanded research could uncover more comprehensive applications of
   the signature, thereby contributing to the overarching goal of
   precision oncology. The economic aspects of implementing this
   technology in routine clinical settings also warrant thorough
   evaluation to ensure its cost-effectiveness and accessibility.
   Ultimately, these steps will be crucial in transitioning our research
   findings from the bench to the bedside, offering new hope for optimized
   cancer treatment strategies.

   In our study, we also conducted a simulation analysis to assess the
   adequacy of our sample size for validating the 10-gene signature for
   prospective studies. Adhering to the methodological framework suggested
   by Riley et al.,[266]^79 we targeted a calibration slope with narrow
   confidence. Our simulations across varying sample sizes (N = 250 to
   3,000) demonstrated a consistent trend of decreasing SE with increasing
   sample size, underscoring the enhancement in the precision of our
   model’s calibration ([267]Table S9). Notably, the SE decreased from
   0.259 at N = 250 to 0.075 at N = 3,000, highlighting the improved
   precision in larger cohorts. This analysis not only substantiates the
   reliability of our gene signature in smaller cohorts but also suggests
   increased robustness in larger ones. These findings, pivotal in
   contextualizing the clinical applicability of our model, support the
   notion that while larger sample sizes offer enhanced precision, even
   smaller sample sizes in our study provide valuable insights,
   contributing significantly to the nuanced understanding of gene
   signature validations in oncological research. This alignment of our
   results with statistical principles and clinical relevance underlines
   the robustness of our study’s findings and reinforces its potential
   impact in guiding ACT decisions in colon cancer.

   Our 10-gene signature offers promise, especially with its strong
   prognostic value and hints at predicting the benefits of ACT. Pathway
   analysis of this signature has uncovered specific biological functions
   related to chemotherapy responses, highlighting its potential to deepen
   our understanding of treatment impacts. In summary, our research
   introduces a 10-gene signature that could play a significant role in
   predicting ACT benefits. Looking ahead, its value in clinical
   decision-making could greatly enhance how we tailor ACT and
   immunotherapy for patients with colon cancer. The next steps involve
   more extensive validation through prospective clinical trials to firmly
   establish its predictive power.

Limitations of the study

   Our study presents a 10-gene signature for predicting ACT benefit in
   stage II and III colon cancer, yet it is not without limitations.
   The primary challenge was our inability to benchmark this signature
   against others due to the unique focus on chemotherapy response, a
   contrast to the prevalent prognostic nature of existing signatures.
   This limitation, combined with the absence of comparable signatures and
   suitable datasets, restricts direct comparative analysis. In addition,
   future validation of the 10-gene signature is essential to confirm its
   clinical utility. This should involve prospective clinical trials and
   diverse datasets to ensure its efficacy across various patient groups.
   Exploring and benchmarking against alternative computational models
   could enrich the understanding of its relative performance.

   Moreover, it is important to note that our study was based solely on
   transcriptome data. While our approach provides valuable insights into
   gene expression profiles related to chemotherapy response, it does not
   capture the full spectrum of biological mechanisms that might influence
   treatment outcomes. The generation of multi-omics data from future
   clinical trials could offer a more comprehensive understanding of the
   biological landscape of colon cancer. Incorporating data from genomics,
   proteomics, metabolomics, and other omics layers will not only enhance
   our biological insight but also facilitate the discovery of biomarkers
   from different omics data, potentially leading to more robust and
   predictive biomarkers for chemotherapy benefit.

STAR★Methods

Key resources table

   REAGENT or RESOURCE SOURCE IDENTIFIER
   Software and algorithms
     __________________________________________________________________

   R (v3.6.3) R CRAN [268]https://cran.r-project.org/
   frma R package (v1.38.0) McCall et al. (2010)[269]^80
   [270]https://git.bioconductor.org/packages/frma
   sva R package (v3.34.0) Leek et al. (2012)[271]^81
   [272]https://git.bioconductor.org/packages/sva
   space R package (v 0.1-1.1) Peng et al. (2009)[273]^82
   [274]https://cran.r-project.org/src/contrib/Archive/space/
   randomForestSRC R package (v2.7.0) Ishwaran et al. (2008)[275]^83
   [276]https://cran.r-project.org/web/packages/randomForestSRC/index.html
   clusterProfiler R package (v3.14.3) Yu et al. (2012)[277]^84
   [278]https://git.bioconductor.org/packages/clusterProfiler
   GSVA R package (v1.34.0) Hanzelmann et al. (2013)[279]^85
   [280]https://git.bioconductor.org/packages/GSVA
   hgu133plus2.db R package (v3.2.3) Carlson (2016)[281]^86
   [282]https://bioconductor.org/packages/release/data/annotation/html/hgu
   133plus2.db.html
   survminer R package (v0.4.9) Alboukadel et al. (2021)[283]^87
   [284]https://cran.r-project.org/web/packages/survminer/index.html
   NanoStringNorm R package (v1.2.1) Waggott et al. (2012)[285]^88
   [286]https://cran.r-project.org/src/contrib/Archive/NanoStringNorm/
   GISTIC Mermel et al. (2011)[287]^37
   [288]https://portals.broadinstitute.org/tcga/home
   MethHC Huang et al. (2021)[289]^38
   [290]https://methhc.mbc.nctu.edu.tw/php/index.php
   DisGeNET Pinero et al. (2020)[291]^39
   [292]https://www.disgenet.org/home/
   CellMiner Reinhold et al. (2012)[293]^40
   [294]https://discover.nci.nih.gov/cellminer/
   TIDE Jiang et al. (2018)[295]^43 [296]https://tide.dfci.harvard.edu/
     __________________________________________________________________

   Deposited data
     __________________________________________________________________

   Microarray of colon cancer expression profiles Marisa et al.
   (2013)[297]^89 GEO: [298]GSE39582
   Microarray of colon cancer expression profiles Smith et al.
   (2010)[299]^90 [300]GSE17538
   Microarray of colon cancer expression profiles Tripathi et al.
   (2014)[301]^91 [302]GSE38832
   Microarray of colon cancer expression profiles De Sousa et al.
   (2011)[303]^92 [304]GSE33113
   Microarray of colon cancer expression profiles Laibe et al.
   (2012)[305]^93 [306]GSE37892
   Microarray of colon cancer expression profiles Kikuchi et al.
   (2012)[307]^94 [308]GSE27854
   Microarray of colon cancer expression profiles Birnbaum et al.
   (2011)[309]^95 [310]GSE26906
   Microarray of colon cancer expression profiles Tsukamoto et al.
   (2011)[311]^96 [312]GSE21510
   Microarray of colon cancer expression profiles Gröne et al.
   (2011)[313]^97 [314]GSE18088
   Microarray of colon cancer expression profiles Expression Project for
   Oncology ([315]http://intgen.org) [316]GSE2109
   The VUMC NanoString nCounter® data This study
   [317]https://github.com/PoShine/CRC-5-FU
   [318]Open in a new tab

Resource availability

Lead contact

   Further information and requests for resources and reagents should be
   directed to and will be fulfilled by the lead contact, X. Steven Chen
   (steven.chen@miami.edu).

Materials availability

   This study did not generate new unique reagents.

Data and code availability

     * •
       All public datasets used in this study are available in the Gene
       Expression Omnibus ([319]https://www.ncbi.nlm.nih.gov/geo/).
       NanoString nCounter® data is available in the Supplementary
       [320]Data S1.
     * •
       The R code for the computational refinement of gene signatures was
       deployed [321]https://github.com/PoShine/CRC-5-FU. The DOI at
       Zenodo was [322]https://doi.org/10.5281/zenodo.10748699.
     * •
       Any additional information required to reanalyze the data reported
       in this paper is available from the [323]lead contact upon request.

Experimental model and study participant details

Patients and ethics statements

   The FFPE samples of surgically resected tumor tissues were obtained
   from colon cancer patients undergoing surgery at Vanderbilt University
   Medical Center (Nashville, USA) from 1990 to 2013 after informed
   consent was provided. All cases were confirmed by histopathology. This
   research was approved by the Institutional Review Board (IRB) and
   Ethics Committee of Vanderbilt University Medical Center (Nashville,
   USA). Each patient provided written informed consent. The patient
   demographic information is available in the Supplementary [324]Data S1.

Method details

Patients and tumor samples

   To ensure the most comprehensive coverage of probesets in the human
   genome, as well as concordance in the microarray platform, currently
   available colon cancer expression profiles detected by "Affymetrix
   Human Genome U133 Plus 2.0 Array" platform ([325]GPL570), were
   collected and used for analysis. All gene expression data of stage I
   and IV patients were removed before signature derivation/verification.
   As a result, a total of 1375 stage II to III colon cancer frozen tissue
   specimens from 10 independent gene expression datasets were collected
   from the Gene Expression Omnibus (GEO) microarray data repository,
   including [326]GSE39582
   , [327]GSE17538, [328]GSE38832, [329]GSE33113, [330]GSE37892,
   [331]GSE27854, [332]GSE26906, [333]GSE21510, [334]GSE18088, and
   [335]GSE2109 ([336]Table 1). Among them, the first five datasets
   provide clinical information about relapse-free survival (RFS) time
   ([337]Table 1); and [338]GSE39582 and [339]GSE17538 contain the samples
   of patients who had received 5-fluorouracil-based (5-FU) adjuvant
   chemotherapy ([340]Table S8).

Data pre-processing

   First, we applied the Frozen Robust Multi-array Analysis (fRMA)
   algorithm to each microarray dataset for data normalization.[341]^80
   The probeset identifiers (IDs) were mapped to gene symbols using the R
   package "hgu133plus2.db". When multiple probesets correspond to one
   gene, the probeset with the most significant p-value associated with
   survival outcome was selected to represent the gene. For the discovery
   cohort, batch effects within [342]GSE39582
   and among six different GEO datasets ([343]GSE39582, [344]GSE27854,
   [345]GSE26906, [346]GSE21510, [347]GSE18088, and [348]GSE2109) were
   removed using the ComBat method as implemented in the R package
   'sva'.[349]^81

Construction of survival-related gene network and identification of hub genes

   Currently, for colon cancer, there are limited publicly available gene
   expression datasets with complete clinical annotation, such as survival
   outcomes. To maximally leverage information in gene expression datasets
   without survival information, we developed and validated our prognosis
   prediction models using a new semi-supervised gene-network based
   strategy. Within the largest stage II to III colon cancer collection,
   [350]GSE39582
   was the only dataset with RFS information. We first used [351]GSE39582
   to select genes significantly associated with RFS outcome based on the
   Cox proportional hazards model (p < 0.05). Next, we further filtered
   these genes using a network-based strategy. We hypothesized that hub
   genes, which have many interactions with other genes in the
   survival-related network, are more likely to be predictive of
   prognosis. To obtain a robust survival-related network, we combined all
   stage II and III colon cancer samples in [352]GSE39582, together with
   the other five expression datasets: [353]GSE27854, [354]GSE26906,
   [355]GSE21510, [356]GSE18088, and [357]GSE2109, all of which did not
   have RFS information, into an aggregated gene expression matrix. In the
   combined gene expression dataset, survival-associated genes were used
   to construct the survival-related gene network using the Sparse Partial
   Correlation Estimation (SPACE) algorithm,[358]^82 which was shown to be
   one of the best-performing gene network approaches to discover hub
   genes.[359]^98 Finally, we selected genes with more than 11 connections
   to their neighbor nodes in the network as hub genes.

Prognosis model construction and evaluation

   Given the selected set of hub genes, we next used the "randomForestSRC"
   R package to construct the random survival forest prognosis model,
   using [360]GSE39582
   which has RFS time as our training samples, and validated the
   prediction model in four additional independent datasets with available
   RFS time, including [361]GSE17538, [362]GSE33113, [363]GSE37892, and
   [364]GSE38832. The random survival forests approach is particularly
   advantageous for its ability to handle complex interactions between
   genes and its robustness in managing high-dimensional genomic data,
   making it an excellent tool for survival analysis without the need for
   assumptions on the data’s distribution.[365]^83^,[366]^99 The maximally
   selected rank statistic (maxstat) provided in the "survminer" package
   was used to determine the optimal cut-off point for discriminating
   high-risk vs. low-risk groups. The prediction performance of the RSF
   prognosis model was assessed using the hazard ratio (HR) and log rank
   test p-value. The detailed workflow for our proposed semi-supervised
   gene network-based prognosis model-building strategy is illustrated in
   [367]Figure 1A. The other existing colon cancer prognosis signatures,
   including OncotypeDX,[368]^29 ColoGuidePro,[369]^30 and
   ColoFinder,[370]^31 were used to compare the 18-gene prognosis model.

Determining a robust chemotherapy-benefit signature set from adjuvant
chemotherapy (ACT) samples

   To additionally identify a gene signature set for stage II to III colon
   cancer patients who are most likely to benefit from the 5-FU-based ACT,
   we further refined the 18-hub-gene set and investigated its ability to
   predict colon cancer outcomes following 5-FU-based ACT. More
   specifically, we identified potential cancer driver genes with either
   significant genomic aberration or DNA methylation alteration between
   colon cancer and normal samples using the GISTIC tool[371]^37 for TCGA
   copy number and MethHC database[372]^38 (see [373]Figure 1B). We also
   verified the 18-hub-genes in the DisGeNET database.[374]^39 An RSF
   chemotherapy-benefit model based on these 10 genes was then
   re-constructed using 142 samples with 5-FU-based ACT information in the
   [375]GSE39592
   dataset. Finally, we accessed the prediction performance of this
   chemotherapy-benefit signature using two datasets generated from our
   group-microarray data based on fresh frozen tissues in [376]GSE17538
   and NanoString nCounter data using FFPE samples from Vanderbilt
   University Medical Center (VUMC), which included expression data of
   colon cancer patients who received 5-FU-based ACT samples
   ([377]Figure 1B).

NanoString nCounter assay development

   To facilitate the clinical application of the gene signatures, we
   designed a custom nCounter assay (NanoString Technologies, Seattle, WA)
   for quantitative assessment of the expression of 25 gene elements using
   a multiple probe approach. The final codeset contained the 18-hub-gene
   set identified from the discovery cohort based on the microarray
   platform using fresh frozen tissue. Seven housekeeping genes derived
   from the least variant expression elements in pilot nCounter studies
   were also included. To provide a more robust measurement, three probes
   were designed for each of the 18 hub genes and 6 of the housekeeping
   genes. Only two probes were designed for RPL38 due to proximity issues
   between the probes. A full list of gene symbols and the probe design
   for these 25 genes can be found in [378]Table S1. This multiplexed
   assay is reported to detect expression of up to 800 transcripts at very
   low mRNA concentrations (0.1fM/1 copy per cell) even in RNA samples
   that are significantly degraded, as long as at least 20% of the sample
   has RNA fragments of 300 base pairs or longer. Therefore, the 144 CRC
   samples with 20%–76% (median value is 43%) of the RNA sample containing
   fragments of 300 base pairs or longer were hybridized to the custom
   nCounter assay.

Evaluation of the predictive performance in the VUMC NanoString nCounter data

   To analyze stage II to III colon cancer specimens using the NanoString
   nCounter platform, total RNAs were isolated from 144 FFPE preserved
   colon cancer specimens (with associated RFS prognosis information), of
   which 109 specimens were stage II to III colon cancer patients, 56 of
   which had received 5-fluorouracil-based (5-FU) adjuvant chemotherapy
   ([379]Table 1). We used the R package 'NanoStringNorm' to normalize the
   raw expression data, summarized the CodeCount (positive) and
   SampleContent (housekeeping) controls by the geometric mean, and used
   the 'mean' option for background correction. Normalized data were
   log2-transformed for further analyses. Batch effects were removed using
   the ComBat method as implemented in the R package 'sva'.[380]^81
   Finally, when multiple probesets were targeted to the same gene, the
   probeset with the maximum expression value was used to represent the
   gene. A total of 109 stage II to III samples were utilized to assess
   the RFS prognosis model constructed by the 18-hub-gene set, and 56
   samples that received 5-fluorouracil-based (5-FU) adjuvant chemotherapy
   were utilized to validate the chemotherapy-benefit 10-gene prediction
   model. Because of the difference in microarray platforms, when the
   NanoString data were utilized to evaluate the efficiency of our
   prediction model, for each gene, we computed Z score transformations
   for samples in the training set and samples in the testing set
   separately. The NanoString data is available in the Supplementary
   [381]Data S1.

Drug sensitivity analyses

   Gene transcript z scores data for the chemotherapy-benefit gene
   signature set and their corresponding -log10(GI 50) values treated by
   5-FU in NCI-60 cell lines were obtained from CellMiner database
   ([382]https://discover.nci.nih.gov/cellminer/).[383]^40 GI 50
   represents the concentration of a drug required to reduce cancer cell
   proliferation by 50%. Using Z score data from the chemotherapy-benefit
   gene signature set, complete-linkage hierarchical clustering was
   performed to categorize all cell lines into resistant and sensitive
   groups. The Wilcoxon rank-sum test was then carried out to evaluate the
   difference between resistant and sensitive cell lines by their median
   -log10(GI 50) values.

Pathway analysis

   The five pathway collections including BioCarta, KEGG, PID, Reactome,
   and WikiPathways from MSigDB C2 Canonical pathway database were used
   for pathway analysis. We used the enricher function in the Bioconductor
   package clusterProfiler for pathway enrichment analysis.[384]^84 The
   ssGSEA method was implemented in the GSVA R package and was applied to
   compare ACT benefit and non-benefit groups and heatmap visualization
   for samples from the [385]GSE17538
   dataset.[386]^85

Tumor immune dysfunction and exclusion (TIDE) analysis

   The TIDE score computed for each tumor sample can serve as a surrogate
   biomarker to predict response to immune checkpoint blockade, including
   anti-PD1 and anti-CTLA.[387]^43 We utilized the TIDE online tool
   ([388]http://tide.dfci.harvard.edu/) to calculate the TIDE score for
   each tumor sample in four datasets: [389]GSE17538
   , [390]GSE33113, [391]GSE37892, and [392]GSE38832. Then, we calculated
   a 10-gene chemotherapy-benefit score for each sample in the four
   datasets, irrespective of whether the sample had received 5FU
   chemotherapy. The TIDE scores and the 10-gene chemotherapy-benefit
   scores were further compared.

Quantification and statistical analysis

   The two-sample Kolmogorov-Smirnov test or Wilcoxon rank-sum test was
   appropriately used to compare groups. The Kaplan–Meier method was used
   to analyze the RFS of patients with different risk stratification. All
   analyses were performed using R software (version 3.6.3).

Acknowledgments