Abstract

Background

   Escherichia coli (E. coli) is a part of normal gastrointestinal
   microbiota but it could also cause human gastrointestinal diseases.
   Understanding the mechanism of E. coli in the progression of
   gastrointestinal tumors can provide novel prevention and treatment
   strategies for gastrointestinal tumors.

Methods

   The E. coli infection score was calculated by single sample GSEA
   (ssGSEA). Weighted correlation network analysis (WGCNA) and
   differentially expressed genes (DEGs) analysis were used to identify
   genes related to E. coli infection in gastrointestinal tumors. Hub
   genes were selected by machine learning methods to establish a
   diagnostic model. The diagnostic performance of the model was evaluated
   by the area under the receiver operating characteristic (ROC) curve
   (AUC) and validated in three external datasets. After determining the
   biomarkers, immune infiltration analysis and GSEA were further
   performed. The mRNA expressions of the biomarkers in stomach
   adenocarcinoma (STAD) cells and the invasion and migration of the tumor
   cells were detected by conducting in vitro experiments.

Results

   The E. coli infection score was lower in tumor samples than in normal
   samples. Eight hub genes were selected from a total of 28 genes
   associated with E. coli-related dysbiosis in gastrointestinal tumors to
   establish an accurate diagnostic model. The AUC values of PRKCB and
   IL16 were all greater than 0.7 in three external datasets and the mRNA
   expression pattern was consistent with TCGA cohort, therefore PRKCB and
   IL16 were selected as the diagnostic biomarkers. PRKCB and IL16
   exhibited significant positive correlations with most immune cells, and
   inflammation-related pathways were activated in the high expression
   groups of PRKCB and IL16. Moreover, IL16 was high-expressed but PRKCB
   was low-expressed in STAD cells, and silencing IL16 suppressed the
   invasion and migration of STAD cells.

Conclusions

   Overall, we identified and validated 8 robust genes related to E. coli
   applying bioinformatics and machine learning algorithms, providing
   theoretical foundations for the relationship between E. coli-related
   dysbiosis and gastrointestinal tumors.

   Keywords: Gastrointestinal tumors, Escherichia coli, Biomarker, Machine
   learning, Diagnosis

1. Introduction

   Gastrointestinal tumors are the most common tumors with high morbidity
   and mortality rates worldwide [[31]1,[32]2]. According to the global
   cancer statistics for 2020, colorectal cancer (CRC) and stomach
   adenocarcinoma (STAD) showed the highest incidence among all
   gastrointestinal tumors worldwide [[33]3,[34]4]. The occurrence of
   gastrointestinal tumors are related to multiple risk factors, such as
   smoking, alcohol consumption, high saturated fat diets, Helicobacter
   pylori infection, and inflammatory bowel disease [[35]5,[36]6]. At
   present, the relationship between gastrointestinal microbiome and
   tumors has gradually attracted increasing research attention [[37]7].
   The human gut is the home to roughly 30 trillion microorganisms
   [[38]8]. Alterations in the gut microbiota and its byproducts play a
   crucial role in the onset of various illnesses, including carcinomas,
   gastrointestinal issues, metabolic disorders, and neurodegenerative
   diseases [[39]9]. Growing studies indicated a notable connection
   between the development of CRC and gut microbiome maintained through
   several mechanisms, including DNA damage, inflammation, and related
   molecular pathways [[40]10,[41]11]. A specific strain of Escherichia
   coli (E. coli) contains a polyketide synthetase island that spans 54
   kilobases. The island is responsible for the production of enzymes
   related to colibactin, a genotoxin composed of polyketide and peptide
   components [[42]12,[43]13]. E. coli is a major bacterial agent leading
   to foodborne illnesses worldwide [[44]14]. Different strains of
   pathogenic E. coli exhibit a range of virulence factors and genes that
   allow them to induce diseases in humans, for instance, adhesion factors
   such as type 1 pili, P fimbriae, and S fimbriae enhance the
   colonization of epithelial cells [[45]15]. Hence, a better
   understanding of the mechanism of E. coli during the progression of
   gastrointestinal tumors can improve the prevention and treatment for
   gastrointestinal tumors.

   Weighted gene co-expression network analysis (WGCNA) and machine
   learning have been increasingly used to discover potential biomarkers
   for cancers [[46]16,[47]17]. WGCNA sections critical modules and
   identifies disease-relevant genes by establishing scale-free networks
   through connecting gene expressions with clinical features [[48]18]. As
   a branch of artificial intelligence, machine learning focuses on using
   algorithms to create decision models to complete tasks [[49]19,[50]20].
   In recent years, various machine learning algorithms such as
   randomForest and LASSO regression analysis have been extensively
   applied in screening biomarkers and constructing diagnostic models for
   malignant tumors [[51]21,[52]22]. For example, combining WGCNA with
   machine learning methods, Lv et al. found 6 biomarkers for the
   prediction, prevention, and treatment of liver hepatocellular carcinoma
   (LIHC) [[53]23]. However, identification of biomarkers related to E.
   coli infection for diagnosing gastrointestinal tumors based on
   biological network analysis and machine learning algorithms has not
   been reported.

   Alterations in complex systems, including the relationship between gut
   microbiota and its host, help understand the fundamental molecular
   processes. In gastrointestinal diseases, the composition of gut
   microbiota experiences shifts at the phylum level, a phenomenon known
   as dysbiosis [[54]24]. Therefore, this study set out to discover new
   diagnostic biomarkers associated with E. coli-related dysbiosis in
   gastrointestinal tumors. Firstly, the co-expressed gene modules were
   classified to identify module genes showing the highest correlation
   with the diseases. By intersecting the differentially expressed genes
   (DEGs) with the module genes, the DEGs associated with E. coli-related
   dysbiosis in gastrointestinal tumors were obtained. Then, hub genes
   were selected to establish a diagnostic model, which was further
   validated in external datasets, and the diagnosis biomarkers were
   determined. Additionally, we examined the correlation between the
   biomarkers and immune cells in tumor samples to analyze their potential
   role in the immune microenvironment of gastrointestinal tumors and
   possible immune regulatory mechanisms. Finally, the mRNA expressions of
   biomarkers in STAD cells were validated and the tumor cell invasion and
   migration were assessed by conducting in vitro assays.

2. Material and methods

2.1. Data collection and data preprocessing

   The standardized pan-cancer datasets of The Cancer Genome Atlas (TCGA)
   Pan-Cancer were downloaded from the University of California Santa Cruz
   (UCSC) database ([55]https://xenabrowser.net/). The TCGA Pan-Cancer
   dataset contained 1622 tumor samples and 144 normal samples. The
   Fragments Per Kilobase per Million (FPKM) sample data of Solid Tissue
   Normal and Primary Tumor of TCGA-ESCA, TCGA-COAD, TCGA-STAD, TCGA-READ,
   TCGA-PAAD, and TCGA-LIHC were retained. Subsequently, after removing
   samples without clinical follow-up data, those with a survival time
   longer than 0 were retained. Then, the TCGA datasets were merged and
   the batch effect was removed using the "removeBatchEffect" function of
   the Limma package ([56]Fig. S1) [[57]25].

   [58]GSE66229 dataset (100 normal samples and 300 tumor samples),
   [59]GSE44076 dataset (148 normal samples and 98 tumor samples), and
   [60]GSE54236 dataset (80 normal samples and 81 tumor samples) were
   downloaded from the Gene Expression Omnibus (GEO) database
   ([61]https://www.ncbi.nlm.nih.gov/geo/) [[62]26]. For GEO datasets, the
   annotation information of the corresponding chip platform was
   downloaded. The probes were mapped to genes based on the annotation
   information while removing the probes matched to multiple genes. The
   average value of the multiple probes that matched one gene was
   calculated as the expression value of the gene. It should be noted that
   some datasets used in this study had a small sample size that may
   amplify individual differences and cause potential bias. For this
   reason, we are planning to include more cohorts for validation in our
   future studies to promote the generalizability of the current findings.

2.2. Calculation of E. coli score, StromalScore, ImmuneScore, ESTIMATEScore,
and TumorPurity

   The E. coli infection gene set with 56 genes was downloaded from Kyoto
   Encyclopedia of Genes and Genomes (KEGG) database
   ([63]http://www.genome.ad.jp/kegg/). The E. coli score for all samples
   in TCGA cohort was calculated by ssGSEA [[64]27]. Meanwhile, the
   StromalScore, ImmuneScore, and ESTIMATEscore were computed by
   performing ssGSEA using ESTIMATE algorithm in the ‘estimated’ R package
   [[65]28], the TumorPurity was calculated based on the ESTIMATEScore
   [[66]29]. Finally, the correlation analysis between these scores in
   tumor samples was conducted.

2.3. Development of a gene co-expression network

   WGCNA package in R [[67]30] was employed to section co-expressed gene
   modules of tumor samples. Spearman method was used to analyze the
   module-trait relationships between the modules and E. coli score and
   the correlation coefficients were visualized by plotting a heatmap.
   Next, module-trait associations were determined based on the
   correlation between gene significance (GS) for E. coli score and module
   membership (MM) in the critical module with the highest correlation
   coefficient. Finally, gene ontology (GO) enrichment analysis in
   biological process (BP) term was performed using the genes in the
   critical module with the clusterProfiler package in R software
   [[68]31].

2.4. Identification of differentially expressed genes (DEGs)

   Based on the mRNA expression profile data of TCGA, the DEGs analysis
   between tumor and normal samples were performed using Limma package in
   R software [[69]32] under the screening threshold of p.adj<0.05 and
   |log2FoldChange (FC)|>1. Then, the expressions of the top 50 DEGs were
   visualized by generating a heatmap [[70]33]. Simultaneously, the DEGs
   between the two types of samples were intersected with the critical
   module genes of |MM|≥0.8 to obtain the DEGs associated with E.
   coli-related dysbiosis in gastrointestinal tumors.

2.5. Screening key markers through machine learning methods

   Random forest is an integrated learning algorithm capable of processing
   high dimensional datasets without overfitting problems. It improves the
   accuracy and stability of a model by creating multiple decision trees
   and taking their votes [[71]34]. LASSO regression can automatically
   reduce the regression coefficients of irrelevant or less relevant genes
   to zero for feature selection. This is particularly important when
   processing biological data containing a large number of features, and
   it can effectively filter out the most valuable biomarkers [[72]35].
   Hence, we combined these two methods to identify key genes with
   diagnostic and prognostic value. Firstly, a randomForest model was
   established using randomForest function in the R software package
   randomForest [[73]36]. The number of variables (mtry) used for binary
   trees in the specified nodes of the model and decision trees (ntree)
   included in the randomForest were determined. The mtry was ascertained
   according to the average misjudgment rate of the model based on
   Outofbag (OOB) samples, and the ntree was determined according to the
   correlation between the number of decision trees and model error. Next,
   the randomForest model was built with the determined variables. The top
   10 markers were selected based on MeandecreaseAccuracy and
   MeandecreasGini, respectively. Then, the key disease markers were
   screened by LASSO regression analysis. The feature factors were
   identified with 10-fold cross validation using R software package
   glmnet (parameters: nfolds = 10 and family = 'binary') [[74]37]. The
   number of key disease markers was determined according to the λ value
   corresponding to the minimum binomial deviance.

2.6. Establishment and validation of diagnostic model

   The candidate hub genes with diagnostic value were identified from the
   intersection of the markers obtained by randomForest and LASSO
   regression methods. Then, receiver operating characteristic (ROC)
   analysis was applied to assess the diagnosis accuracy of each hub gene
   and the model. The area under the ROC curve (AUC) was calculated, with
   an AUC value > 0.7 indicating a high prediction accuracy of the model
   [[75]38]. The differential expressions of the hub genes between normal
   and tumor samples were displayed by boxplot. Moreover, the diagnostic
   value of the hub genes was validated in the external datasets of
   [76]GSE66229, [77]GSE44076, and [78]GSE54236.

2.7. GSEA and immune infiltration analysis

   Spearman method [[79]39] was employed to examine the correlation
   between the identified biomarkers and immune cells in tumor samples.
   Based on the median gene expression of the biomarkers, the TCGA samples
   were divided into low and high expression groups. Next, GSEA was used
   for biological function and pathway enrichment analysis in the
   background gene set (FDR<0.05). The "h.all.v2023.1.Hs.entrez.gmt" was a
   reference gene set obtained from the Molecular Signature Database
   (MsigDB, [80]https://www.gsea-msigdb.org/gsea/msigdb/) [[81]40].

2.8. Cell culture and transfection

   DMEM medium comprising 10 % fetal bovine serum (CBP60512M) was used to
   culture normal human gastric epithelial cell line GES-1 (CBP60512),
   whereas STAD cell line AGS (CBP60476) were cultured in RPMI-1640 medium
   comprising 10 % fetal bovine serum (CBP60476M). The two types of cells
   were commercially purchased from Nanjing Cobioer Biosciences Co., Ltd.
   (Nanjing, China). All the cells were incubated at 37 °C in an incubator
   in 5 % CO[2].

   Next, cell transfection was performed for silencing IL16. The small
   interfering (si) RNA targeting IL16 (si-IL16, target sequence:
   5′-GAGGAGTTACTTTGTAGTTTTAA-3′) and negative control (si-NC) were
   synthesized by Sangon Biotech (Shanghai) Co., Ltd. The cells were
   transfected for 48 h (h) with the use of Lipofectamine 2000
   transfection assay kit (Invitrogen, Thermo Fisher Scientific) following
   the description.

2.9. Quantitative reverse transcription polymerase chain reaction (qRT-PCR)

   The total RNAs of GES-1 and AGS cells were extracted by TriZol reagent
   (15596018CN, Invitrogen) and converted into cDNA applying Maxima First
   Strand cDNA Synthesis Kit (K1641, Invitrogen). Further, 1 μL cDNA was
   applied for qRT-PCR with SYBR Green Supermix (11744500, Invitrogen)
   following the specifications. The parameters of qRT-PCR were as
   follows: denaturation at 95 °C for 5 min (min), 40 cycles of 95 °C for
   15 s (s), and annealing at 60 °C for 30s, then elongation at 75 °C for
   30s. The primer sequences were presented in [82]Supplementary Table 1.
   The relative mRNA expressions of IL16 and PRKCB were quantified with
   the 2^−ΔΔCT method, with β-actin as a normalization control [[83]41].

2.10. Transwell assay

   Transwell assay was performed to detect the invasion of AGS cells
   [[84]42]. First, 80 μL Matrigel (diluted at 1:8) was slowly added into
   the Transwell chamber (with 8 μm filters) and incubated at 37 °C for 1h
   to solidify the Matrigel. Then, the transfected AGS cell
   (1 × 10^5 cells/well) was transferred to the upper chamber with 200 μL
   serum-free medium, and 700 μL RPMI-1640 medium containing 10 % fetal
   bovine serum was added in the lower chamber. After culturing the cells
   for 24h, the AGS cells in the lower chamber were fixed with 5 %
   formaldehyde for 30min and stained with 1 % crystal violet for 15min.
   Finally, 10 regions were randomly selected to observe the numbers of
   AGS cells in the lower chamber under an Axiovert 5 digital microscope
   (ZEISS, Oberkochen, Germany).

2.11. Wound healing assay

   The migration of AGS cells was evaluated by wound-healing assay
   [[85]43]. In brief, the AGS cells were inoculated to a 6-well plate
   with 800 μL cell suspension in each well and cultured for 24h. Then,
   the cells were scratched to create a wound using an aseptic 100 μL
   pipette and then maintained in serum-free medium at 37 °C. Next, the
   representative images were captured at 0 and 48h using the Axiovert 5
   digital microscope (ZEISS, Oberkochen, Germany) and the wound closure
   (%) of the AGS cells were calculated.

2.12. Statistical analysis

   R software (version 3.6.0) was employed in bioinformatics analyses.
   Differences between two groups of continuous variables were calculated
   using Wilcoxon rank-sum test. All the data were expressed as
   mean ± standard deviation (SD), and statistical analyses were conducted
   using SPSS (version, SPSS, Inc., Chicago, IL). Two-group differences
   were compared by student's t-test. P < 0.05 was considered as
   statistically significant.

3. Results

3.1. Gene modules closely correlated with E. coli infection in
gastrointestinal tumors

   Firstly, the E. coli infection score of all TCGA samples was calculated
   using ssGSEA method. Comparison on the difference of E. coli score
   between primary tumor samples and normal samples showed that tumor
   samples had lower E. coli infection score compared to normal samples
   ([86]Fig. 1A). Specifically, the E. coli infection score was strongly
   positively related to StromalScore, ImmuneScore, and ESTIMATEScore but
   negatively related to TumorPurity in tumor samples ([87]Fig. 1B). Next,
   a total of 9 co-expressed gene modules in tumor samples was classified
   by WGCNA method, with the brown module exhibiting the highest
   correlation (r = 0.66) with E. coli score ([88]Fig. 1C). Thus, the
   brown module was selected for subsequent research. Next, the
   correlation analysis demonstrated a significant positive correlation
   between the GS for E. coli score and MM in the brown module
   (correlation coefficient = 0.89, [89]Fig. 1D). These results suggested
   that the genes in the brown module were most closely associated with E.
   coli-related dysbiosis in gastrointestinal tumors. Furthermore, GO
   enrichment analysis in BP term was also performed based on the genes in
   brown module. The results showed that the pathways related to the
   proliferation and activation of immune cells such as regulation of
   leucocyte activation and T cells activation were significantly enriched
   ([90]Fig. 1E).

Fig. 1.

   [91]Fig. 1
   [92]Open in a new tab

   Construction of co-expressed gene modules through WGCNA analysis. (A)
   The difference in E. coli score between tumor and normal samples. (B)
   The correlation of E. coli score, StromalScore, ImmuneScore,
   ESTIMATEScore, and TumorPurity in tumor samples. (C) The relationship
   between different modules and E. coli score. (D) Scatter diagram of the
   module membership (MM) in brown module versus gene significance for E.
   coli infection score. (E) GO enrichment results of brown module in the
   biological process. (For interpretation of the references to color in