Abstract Background Escherichia coli (E. coli) is a part of normal gastrointestinal microbiota but it could also cause human gastrointestinal diseases. Understanding the mechanism of E. coli in the progression of gastrointestinal tumors can provide novel prevention and treatment strategies for gastrointestinal tumors. Methods The E. coli infection score was calculated by single sample GSEA (ssGSEA). Weighted correlation network analysis (WGCNA) and differentially expressed genes (DEGs) analysis were used to identify genes related to E. coli infection in gastrointestinal tumors. Hub genes were selected by machine learning methods to establish a diagnostic model. The diagnostic performance of the model was evaluated by the area under the receiver operating characteristic (ROC) curve (AUC) and validated in three external datasets. After determining the biomarkers, immune infiltration analysis and GSEA were further performed. The mRNA expressions of the biomarkers in stomach adenocarcinoma (STAD) cells and the invasion and migration of the tumor cells were detected by conducting in vitro experiments. Results The E. coli infection score was lower in tumor samples than in normal samples. Eight hub genes were selected from a total of 28 genes associated with E. coli-related dysbiosis in gastrointestinal tumors to establish an accurate diagnostic model. The AUC values of PRKCB and IL16 were all greater than 0.7 in three external datasets and the mRNA expression pattern was consistent with TCGA cohort, therefore PRKCB and IL16 were selected as the diagnostic biomarkers. PRKCB and IL16 exhibited significant positive correlations with most immune cells, and inflammation-related pathways were activated in the high expression groups of PRKCB and IL16. Moreover, IL16 was high-expressed but PRKCB was low-expressed in STAD cells, and silencing IL16 suppressed the invasion and migration of STAD cells. Conclusions Overall, we identified and validated 8 robust genes related to E. coli applying bioinformatics and machine learning algorithms, providing theoretical foundations for the relationship between E. coli-related dysbiosis and gastrointestinal tumors. Keywords: Gastrointestinal tumors, Escherichia coli, Biomarker, Machine learning, Diagnosis 1. Introduction Gastrointestinal tumors are the most common tumors with high morbidity and mortality rates worldwide [[31]1,[32]2]. According to the global cancer statistics for 2020, colorectal cancer (CRC) and stomach adenocarcinoma (STAD) showed the highest incidence among all gastrointestinal tumors worldwide [[33]3,[34]4]. The occurrence of gastrointestinal tumors are related to multiple risk factors, such as smoking, alcohol consumption, high saturated fat diets, Helicobacter pylori infection, and inflammatory bowel disease [[35]5,[36]6]. At present, the relationship between gastrointestinal microbiome and tumors has gradually attracted increasing research attention [[37]7]. The human gut is the home to roughly 30 trillion microorganisms [[38]8]. Alterations in the gut microbiota and its byproducts play a crucial role in the onset of various illnesses, including carcinomas, gastrointestinal issues, metabolic disorders, and neurodegenerative diseases [[39]9]. Growing studies indicated a notable connection between the development of CRC and gut microbiome maintained through several mechanisms, including DNA damage, inflammation, and related molecular pathways [[40]10,[41]11]. A specific strain of Escherichia coli (E. coli) contains a polyketide synthetase island that spans 54 kilobases. The island is responsible for the production of enzymes related to colibactin, a genotoxin composed of polyketide and peptide components [[42]12,[43]13]. E. coli is a major bacterial agent leading to foodborne illnesses worldwide [[44]14]. Different strains of pathogenic E. coli exhibit a range of virulence factors and genes that allow them to induce diseases in humans, for instance, adhesion factors such as type 1 pili, P fimbriae, and S fimbriae enhance the colonization of epithelial cells [[45]15]. Hence, a better understanding of the mechanism of E. coli during the progression of gastrointestinal tumors can improve the prevention and treatment for gastrointestinal tumors. Weighted gene co-expression network analysis (WGCNA) and machine learning have been increasingly used to discover potential biomarkers for cancers [[46]16,[47]17]. WGCNA sections critical modules and identifies disease-relevant genes by establishing scale-free networks through connecting gene expressions with clinical features [[48]18]. As a branch of artificial intelligence, machine learning focuses on using algorithms to create decision models to complete tasks [[49]19,[50]20]. In recent years, various machine learning algorithms such as randomForest and LASSO regression analysis have been extensively applied in screening biomarkers and constructing diagnostic models for malignant tumors [[51]21,[52]22]. For example, combining WGCNA with machine learning methods, Lv et al. found 6 biomarkers for the prediction, prevention, and treatment of liver hepatocellular carcinoma (LIHC) [[53]23]. However, identification of biomarkers related to E. coli infection for diagnosing gastrointestinal tumors based on biological network analysis and machine learning algorithms has not been reported. Alterations in complex systems, including the relationship between gut microbiota and its host, help understand the fundamental molecular processes. In gastrointestinal diseases, the composition of gut microbiota experiences shifts at the phylum level, a phenomenon known as dysbiosis [[54]24]. Therefore, this study set out to discover new diagnostic biomarkers associated with E. coli-related dysbiosis in gastrointestinal tumors. Firstly, the co-expressed gene modules were classified to identify module genes showing the highest correlation with the diseases. By intersecting the differentially expressed genes (DEGs) with the module genes, the DEGs associated with E. coli-related dysbiosis in gastrointestinal tumors were obtained. Then, hub genes were selected to establish a diagnostic model, which was further validated in external datasets, and the diagnosis biomarkers were determined. Additionally, we examined the correlation between the biomarkers and immune cells in tumor samples to analyze their potential role in the immune microenvironment of gastrointestinal tumors and possible immune regulatory mechanisms. Finally, the mRNA expressions of biomarkers in STAD cells were validated and the tumor cell invasion and migration were assessed by conducting in vitro assays. 2. Material and methods 2.1. Data collection and data preprocessing The standardized pan-cancer datasets of The Cancer Genome Atlas (TCGA) Pan-Cancer were downloaded from the University of California Santa Cruz (UCSC) database ([55]https://xenabrowser.net/). The TCGA Pan-Cancer dataset contained 1622 tumor samples and 144 normal samples. The Fragments Per Kilobase per Million (FPKM) sample data of Solid Tissue Normal and Primary Tumor of TCGA-ESCA, TCGA-COAD, TCGA-STAD, TCGA-READ, TCGA-PAAD, and TCGA-LIHC were retained. Subsequently, after removing samples without clinical follow-up data, those with a survival time longer than 0 were retained. Then, the TCGA datasets were merged and the batch effect was removed using the "removeBatchEffect" function of the Limma package ([56]Fig. S1) [[57]25]. [58]GSE66229 dataset (100 normal samples and 300 tumor samples), [59]GSE44076 dataset (148 normal samples and 98 tumor samples), and [60]GSE54236 dataset (80 normal samples and 81 tumor samples) were downloaded from the Gene Expression Omnibus (GEO) database ([61]https://www.ncbi.nlm.nih.gov/geo/) [[62]26]. For GEO datasets, the annotation information of the corresponding chip platform was downloaded. The probes were mapped to genes based on the annotation information while removing the probes matched to multiple genes. The average value of the multiple probes that matched one gene was calculated as the expression value of the gene. It should be noted that some datasets used in this study had a small sample size that may amplify individual differences and cause potential bias. For this reason, we are planning to include more cohorts for validation in our future studies to promote the generalizability of the current findings. 2.2. Calculation of E. coli score, StromalScore, ImmuneScore, ESTIMATEScore, and TumorPurity The E. coli infection gene set with 56 genes was downloaded from Kyoto Encyclopedia of Genes and Genomes (KEGG) database ([63]http://www.genome.ad.jp/kegg/). The E. coli score for all samples in TCGA cohort was calculated by ssGSEA [[64]27]. Meanwhile, the StromalScore, ImmuneScore, and ESTIMATEscore were computed by performing ssGSEA using ESTIMATE algorithm in the ‘estimated’ R package [[65]28], the TumorPurity was calculated based on the ESTIMATEScore [[66]29]. Finally, the correlation analysis between these scores in tumor samples was conducted. 2.3. Development of a gene co-expression network WGCNA package in R [[67]30] was employed to section co-expressed gene modules of tumor samples. Spearman method was used to analyze the module-trait relationships between the modules and E. coli score and the correlation coefficients were visualized by plotting a heatmap. Next, module-trait associations were determined based on the correlation between gene significance (GS) for E. coli score and module membership (MM) in the critical module with the highest correlation coefficient. Finally, gene ontology (GO) enrichment analysis in biological process (BP) term was performed using the genes in the critical module with the clusterProfiler package in R software [[68]31]. 2.4. Identification of differentially expressed genes (DEGs) Based on the mRNA expression profile data of TCGA, the DEGs analysis between tumor and normal samples were performed using Limma package in R software [[69]32] under the screening threshold of p.adj<0.05 and |log2FoldChange (FC)|>1. Then, the expressions of the top 50 DEGs were visualized by generating a heatmap [[70]33]. Simultaneously, the DEGs between the two types of samples were intersected with the critical module genes of |MM|≥0.8 to obtain the DEGs associated with E. coli-related dysbiosis in gastrointestinal tumors. 2.5. Screening key markers through machine learning methods Random forest is an integrated learning algorithm capable of processing high dimensional datasets without overfitting problems. It improves the accuracy and stability of a model by creating multiple decision trees and taking their votes [[71]34]. LASSO regression can automatically reduce the regression coefficients of irrelevant or less relevant genes to zero for feature selection. This is particularly important when processing biological data containing a large number of features, and it can effectively filter out the most valuable biomarkers [[72]35]. Hence, we combined these two methods to identify key genes with diagnostic and prognostic value. Firstly, a randomForest model was established using randomForest function in the R software package randomForest [[73]36]. The number of variables (mtry) used for binary trees in the specified nodes of the model and decision trees (ntree) included in the randomForest were determined. The mtry was ascertained according to the average misjudgment rate of the model based on Outofbag (OOB) samples, and the ntree was determined according to the correlation between the number of decision trees and model error. Next, the randomForest model was built with the determined variables. The top 10 markers were selected based on MeandecreaseAccuracy and MeandecreasGini, respectively. Then, the key disease markers were screened by LASSO regression analysis. The feature factors were identified with 10-fold cross validation using R software package glmnet (parameters: nfolds = 10 and family = 'binary') [[74]37]. The number of key disease markers was determined according to the λ value corresponding to the minimum binomial deviance. 2.6. Establishment and validation of diagnostic model The candidate hub genes with diagnostic value were identified from the intersection of the markers obtained by randomForest and LASSO regression methods. Then, receiver operating characteristic (ROC) analysis was applied to assess the diagnosis accuracy of each hub gene and the model. The area under the ROC curve (AUC) was calculated, with an AUC value > 0.7 indicating a high prediction accuracy of the model [[75]38]. The differential expressions of the hub genes between normal and tumor samples were displayed by boxplot. Moreover, the diagnostic value of the hub genes was validated in the external datasets of [76]GSE66229, [77]GSE44076, and [78]GSE54236. 2.7. GSEA and immune infiltration analysis Spearman method [[79]39] was employed to examine the correlation between the identified biomarkers and immune cells in tumor samples. Based on the median gene expression of the biomarkers, the TCGA samples were divided into low and high expression groups. Next, GSEA was used for biological function and pathway enrichment analysis in the background gene set (FDR<0.05). The "h.all.v2023.1.Hs.entrez.gmt" was a reference gene set obtained from the Molecular Signature Database (MsigDB, [80]https://www.gsea-msigdb.org/gsea/msigdb/) [[81]40]. 2.8. Cell culture and transfection DMEM medium comprising 10 % fetal bovine serum (CBP60512M) was used to culture normal human gastric epithelial cell line GES-1 (CBP60512), whereas STAD cell line AGS (CBP60476) were cultured in RPMI-1640 medium comprising 10 % fetal bovine serum (CBP60476M). The two types of cells were commercially purchased from Nanjing Cobioer Biosciences Co., Ltd. (Nanjing, China). All the cells were incubated at 37 °C in an incubator in 5 % CO[2]. Next, cell transfection was performed for silencing IL16. The small interfering (si) RNA targeting IL16 (si-IL16, target sequence: 5′-GAGGAGTTACTTTGTAGTTTTAA-3′) and negative control (si-NC) were synthesized by Sangon Biotech (Shanghai) Co., Ltd. The cells were transfected for 48 h (h) with the use of Lipofectamine 2000 transfection assay kit (Invitrogen, Thermo Fisher Scientific) following the description. 2.9. Quantitative reverse transcription polymerase chain reaction (qRT-PCR) The total RNAs of GES-1 and AGS cells were extracted by TriZol reagent (15596018CN, Invitrogen) and converted into cDNA applying Maxima First Strand cDNA Synthesis Kit (K1641, Invitrogen). Further, 1 μL cDNA was applied for qRT-PCR with SYBR Green Supermix (11744500, Invitrogen) following the specifications. The parameters of qRT-PCR were as follows: denaturation at 95 °C for 5 min (min), 40 cycles of 95 °C for 15 s (s), and annealing at 60 °C for 30s, then elongation at 75 °C for 30s. The primer sequences were presented in [82]Supplementary Table 1. The relative mRNA expressions of IL16 and PRKCB were quantified with the 2^−ΔΔCT method, with β-actin as a normalization control [[83]41]. 2.10. Transwell assay Transwell assay was performed to detect the invasion of AGS cells [[84]42]. First, 80 μL Matrigel (diluted at 1:8) was slowly added into the Transwell chamber (with 8 μm filters) and incubated at 37 °C for 1h to solidify the Matrigel. Then, the transfected AGS cell (1 × 10^5 cells/well) was transferred to the upper chamber with 200 μL serum-free medium, and 700 μL RPMI-1640 medium containing 10 % fetal bovine serum was added in the lower chamber. After culturing the cells for 24h, the AGS cells in the lower chamber were fixed with 5 % formaldehyde for 30min and stained with 1 % crystal violet for 15min. Finally, 10 regions were randomly selected to observe the numbers of AGS cells in the lower chamber under an Axiovert 5 digital microscope (ZEISS, Oberkochen, Germany). 2.11. Wound healing assay The migration of AGS cells was evaluated by wound-healing assay [[85]43]. In brief, the AGS cells were inoculated to a 6-well plate with 800 μL cell suspension in each well and cultured for 24h. Then, the cells were scratched to create a wound using an aseptic 100 μL pipette and then maintained in serum-free medium at 37 °C. Next, the representative images were captured at 0 and 48h using the Axiovert 5 digital microscope (ZEISS, Oberkochen, Germany) and the wound closure (%) of the AGS cells were calculated. 2.12. Statistical analysis R software (version 3.6.0) was employed in bioinformatics analyses. Differences between two groups of continuous variables were calculated using Wilcoxon rank-sum test. All the data were expressed as mean ± standard deviation (SD), and statistical analyses were conducted using SPSS (version, SPSS, Inc., Chicago, IL). Two-group differences were compared by student's t-test. P < 0.05 was considered as statistically significant. 3. Results 3.1. Gene modules closely correlated with E. coli infection in gastrointestinal tumors Firstly, the E. coli infection score of all TCGA samples was calculated using ssGSEA method. Comparison on the difference of E. coli score between primary tumor samples and normal samples showed that tumor samples had lower E. coli infection score compared to normal samples ([86]Fig. 1A). Specifically, the E. coli infection score was strongly positively related to StromalScore, ImmuneScore, and ESTIMATEScore but negatively related to TumorPurity in tumor samples ([87]Fig. 1B). Next, a total of 9 co-expressed gene modules in tumor samples was classified by WGCNA method, with the brown module exhibiting the highest correlation (r = 0.66) with E. coli score ([88]Fig. 1C). Thus, the brown module was selected for subsequent research. Next, the correlation analysis demonstrated a significant positive correlation between the GS for E. coli score and MM in the brown module (correlation coefficient = 0.89, [89]Fig. 1D). These results suggested that the genes in the brown module were most closely associated with E. coli-related dysbiosis in gastrointestinal tumors. Furthermore, GO enrichment analysis in BP term was also performed based on the genes in brown module. The results showed that the pathways related to the proliferation and activation of immune cells such as regulation of leucocyte activation and T cells activation were significantly enriched ([90]Fig. 1E). Fig. 1. [91]Fig. 1 [92]Open in a new tab Construction of co-expressed gene modules through WGCNA analysis. (A) The difference in E. coli score between tumor and normal samples. (B) The correlation of E. coli score, StromalScore, ImmuneScore, ESTIMATEScore, and TumorPurity in tumor samples. (C) The relationship between different modules and E. coli score. (D) Scatter diagram of the module membership (MM) in brown module versus gene significance for E. coli infection score. (E) GO enrichment results of brown module in the biological process. (For interpretation of the references to color in