ABSTRACT Antimicrobial resistance (AMR) is becoming one of the largest threats to public health worldwide, with the opportunistic pathogen Escherichia coli playing a major role in the AMR global health crisis. Unravelling the complex interplay between drug resistance and metabolic rewiring is key to understand the ability of bacteria to adapt to new treatments and to the development of new effective solutions to combat resistant infections. We developed a computational pipeline that combines machine learning with genome-scale metabolic models (GSMs) to elucidate the systemic relationships between genetic determinants of resistance and metabolism beyond annotated drug resistance genes. Our approach was used to identify genetic determinants of 12 AMR profiles for the opportunistic pathogenic bacterium E. coli. Then, to interpret the large number of identified genetic determinants, we applied a constraint-based approach using the GSM to predict the effects of genetic changes on growth, metabolite yields, and reaction fluxes. Our computational platform leads to multiple results. First, our approach corroborates 225 known AMR-conferring genes, 35 of which are known for the specific antibiotic. Second, integration with the GSM predicted 20 top-ranked genetic determinants (including accA, metK, fabD, fabG, murG, lptG, mraY, folP, and glmM) essential for growth, while a further 17 top-ranked genetic determinants linked AMR to auxotrophic behavior. Third, clusters of AMR-conferring genes affecting similar metabolic processes are revealed, which strongly suggested that metabolic adaptations in cell wall, energy, iron and nucleotide metabolism are associated with AMR. The computational solution can be used to study other human and animal pathogens. IMPORTANCE Escherichia coli is a major public health concern given its increasing level of antibiotic resistance worldwide and extraordinary capacity to acquire and spread resistance via horizontal gene transfer with surrounding species and via mutations in its existing genome. E. coli also exhibits a large amount of metabolic pathway redundancy, which promotes resistance via metabolic adaptability. In this study, we developed a computational approach that integrates machine learning with metabolic modeling to understand the correlation between AMR and metabolic adaptation mechanisms in this model bacterium. Using our approach, we identified AMR genetic determinants associated with cell wall modifications for increased permeability, virulence factor manipulation of host immunity, reduction of oxidative stress toxicity, and changes to energy metabolism. Unravelling the complex interplay between antibiotic resistance and metabolic rewiring may open new opportunities to understand the ability of E. coli, and potentially of other human and animal pathogens, to adapt to new treatments. KEYWORDS: antimicrobial resistance, Escherichia coli, genome-scale metabolic model, machine learning INTRODUCTION Antimicrobial resistance is a major threat to global health. Worryingly, a growing number of pathogens exhibit an extraordinary capacity for acquiring new antibiotic resistance traits in the bacterial population worldwide ([43]1). New multidrug resistance mechanisms have emerged and spread globally, resulting in current treatments becoming less effective against common bacteria that cause severe and often deadly infections. Consequently, the development of new drugs and novel treatment strategies is urgently needed ([44]2, [45]3). The opportunistic pathogen Escherichia coli plays a major role in the antimicrobial resistance (AMR) global health crisis. First, the ability of E. coli to acquire resistance via single nucleotide polymorphisms (SNPs) in its existing genome ([46]4[47]–[48]7) and via acquisition of resistance genes through horizontal gene transfer (HGT) from surrounding species ([49]8[50]–[51]10) has resulted in increased levels of resistance to many antibiotic classes, including penicillins, carbapenems, cephalosporins, fluoroquinolones, aminoglycosides, and tetracyclines ([52]11[53]–[54]15). Second, the ease of its transmission from humans and environmental sources has resulted in alarming numbers of multidrug-resistant E. coli strains being reported worldwide ([55]16, [56]17). Third, the ease by which the bacteria can transfer genetic material via HGT, combined with the bacterium’s ability to colonize different environments, including the gut where it has particularly close interaction with many other species, allows E. coli to act as a reservoir of AMR genes for other opportunistic pathogens, while also acquiring further resistance ([57]18[58]–[59]21). For these reasons, the World Health Organization (WHO) has recently classified E. coli as a critical priority pathogen for which the development of a new treatment is high priority ([60]22). Recent advances in data generation and data mining, combined with machine learning (ML), have led to invaluable results in the identification of specific genomic markers which could be used to effectively predict resistant strains and to detect AMR genes ([61]23[62]–[63]30). Most of these methods work to identify known AMR mutations giving rise to phenotypic resistance. This has great potential for fast diagnostic evaluation of bacteria compared to laboratory methods. Furthermore, ML-based approaches offer further powerful opportunities compared to conventional methods, as they allow for the genome-wide identification of truly novel features (i.e., k-mers and SNPs) ranked on their strength of correlation with the resistance phenotype. Recently, several studies have used these approaches ([64]29, [65]30), which not only allow the identification of genes with known functional relationship with the resistance phenotype but also allow the identification of genes which have no prior association with a specific resistance phenotype. This creates a path for generating nonintuitive testable hypotheses about the association of antibiotic resistance to a wider repertoire of genes, including deletions and functional mutations altering metabolism, and therefore provides a significant advantage in comparison with the conventional use of annotated gene databases. Recent findings have shown the interconnectivity of antibiotic resistance with metabolism and emphasize the importance of considering this relationship in the design of new antibiotic regimens ([66]31[67]–[68]33). Through its ease of HGT, E. coli has been able to adopt a highly flexible carbon and energy metabolism for adaptation against stresses in niche environments ([69]34, [70]35). For this reason, the bacterium is an ideal organism for investigating the interplay between AMR and metabolic adaptation mechanisms. Connecting antimicrobial genes and specific mutations and alleles to metabolic phenotypes, however, still remains a significant challenge ([71]36, [72]37). Black-box ML predictions lack biological interpretation of the genetic determinants ([73]30), and therefore, previous methods have often not accounted for the characterization of new advantageous genetic variants occurring in targets beyond annotated drug resistance genes ([74]29, [75]38), therefore neglecting important metabolic adaptations that allow resistance and tolerance to antibiotic stress ([76]39[77]–[78]41). A genome-scale metabolic model (GSM) offers a way of mechanistically evaluating the genetic determinants identified using ML. A GSM is a computational model of metabolism, which includes all known biochemical reactions and their corresponding gene-protein-reaction (GPR) rules. The GPR rules provide important information linking genes to the reactions that are catalyzed by the enzymes they encode and provide a means of simulating the metabolic system-level behavior of the bacteria to perturbations in the gene. While GSMs have proven invaluable tools for predicting genotype-phenotype relationships ([79]42), they lack the power of machine learning algorithms ([80]30). Recent studies have therefore been developing new approaches that integrate the power of ML with GSMs to allow for a mechanistic interpretation of the genetic associations discovered by machine learning, which offers a significant advantage over ML approaches alone ([81]30, [82]43). In this study, we developed a computational solution integrating the discriminant power of ML with GSM models to reveal the systemic relationships connecting the genetic determinants of AMR to important metabolic evolutionary adaptations in E. coli. Using our approach, first we were able to accurately predict AMR resistant and susceptible phenotypes against 11 out of 12 different antibiotics, as well as identifying 225 (35 of which matched the specific antibiotic class reported in AMR-related databases) known AMR-conferring genes in 3,616 E. coli strains. Second, by elucidating the effects of genetic discriminants on bacterial growth, metabolite yields, and biochemical fluxes using the GSM, we were able to relate genetic determinants to a number of metabolic adaptation mechanisms, including reduced growth, alternative carbon source utilization, changes to energy metabolism, iron metabolism, nucleotide metabolism, and modifications to cell wall metabolism. RESULTS Framework of the computational pipeline that combines machine learning with genome-scale metabolic models. To identify the genomic features correlated with the selected AMR phenotypes and to interpret the systemic relationships between genetic determinants of resistance and metabolism, we developed a computational pipeline that combines ML with genome-scale metabolic models (see [83]Fig. S1 in the supplemental material). A set of unique E. coli genomes for which AMR testing and metadata were available from public databases was selected. To efficiently analyze the AMR phenotypic variability that is likely to arise from a combination of SNPs and changes in gene content, we used an integrated k-mer and SNP-based ML approach. A gradient boosting classifier (GBC) ([84]44, [85]45) was chosen as it is a powerful approach to quickly and efficiently scan entire genomes against selected phenotypes, allowing for the identification of arbitrary numbers of genomic features ranked on strength of correlation with the antimicrobial-resistant and -susceptible phenotype. The ML approach offers the opportunity to identify genes and/or mutations which, individually or in combinations, feature a strong correlation with resistance to antibiotics. A set of thresholds were applied to select only the top-ranked AMR genetic determinants strongly contributing to the performance of the ML classifier. The interconnectivity of antibiotic resistance, antimicrobial genes, and specific mutations and alleles to metabolic phenotypes, as well as the identification of new advantageous genetic variants occurring in targets beyond annotated drug resistance genes was determined using the GSM ([86]Fig. S1). Flux balance analysis (FBA), a constraint-based approach, was used to predict the effects of the genetic determinants on the metabolic network. Importantly, we considered the protein-coding regions only in the ML classifiers, and therefore, the genetic variants are potentially increasing or decreasing enzymatic activity, or in some cases completely block the function of the gene. Here, we evaluated the effect of each genetic determinant by blocking the flux through its corresponding enzyme and assessed the propagation of this “loss of function” through the entire metabolic network. Specifically, we used the GSM to predict the effect of each genetic determinant on bacterial growth, production of individual metabolites, and the feasible flux range through individual reactions. Changes to metabolic phenotype capabilities in each gene knockout model (i.e., reduction in growth rate, reduced metabolite production or reduction in flux span through a reaction) were assessed using the wild-type model of E. coli K-12 MG1655. FIG S1 Flow chart of the integrated machine learning/genome-scale metabolic modeling pipeline. The different sections of the pipeline are highlighted in different colors. The actions relating to the k-mers are shown in dark blue. The actions performed on the SNPs are shown in purple. The steps to build and annotate the pan-genome are shown in red. The machine learning performed on the k-mers and SNPs is shown in orange. Finally, the genome-scale metabolic modelling analysis is shown in green. Download [87]FIG S1, PDF file, 0.04 MB^ (44.9KB, pdf) . Copyright © 2021 Pearcy et al. This content is distributed under the terms of the [88]Creative Commons Attribution 4.0 International license. Genomic and metadata characteristics of the E. coli cohort. Our first goal was to characterize the genetic content and diversity of E. coli strains. We selected a set of 3,616 unique E. coli genomes for which AMR testing and metadata were available from the Pathosystems Resource Integration Centre (PATRIC) ([89]46). Importantly, the genome sequences of these strains were all listed in PATRIC as “good” quality assemblies, had less than 250 contigs, and were labeled as either “WGS” (for whole genome sequenced) or “Complete” as the genome status in PATRIC. The genomes have experimentally measured AMR phenotypes, which are annotated as either “susceptible” or “resistant.” These isolates included a wide variety of geographic locations (see [90]Fig. S2a) and AMR phenotypes for a diverse set of antibiotic classes, including penams (ampicillin), carbapenems (meropenem), monocyclic beta-lactam (aztreonam), cephalosporins (cefoxitin, cefepime, and cefuroxime), fluoroquinolones (ciprofloxacin and levofloxacin), aminoglycosides (gentamicin and tobramycin), diaminopyrimidines (trimethoprim), and tetracyclines (tetracycline). The number of resistant strains for each of the 12 individual antibiotics ranged between 427 (levofloxacin) and 2,600 (ciprofloxacin) of the 3,616 strains (see [91]Fig. S2b). FIG S2 The 3,616 unique E. coli strains were selected based on the availability of metadata provided in Pathosystems Resource Integration Centre (PATRIC). (a) Geographic locations of the E. coli strains. The color on the map scales logarithmically with the number of resistant strains found in that area. (b) Bar chart with 12 antibiotics on the x axis and the number of genomes resistant (red) or susceptible (blue) to each antibiotic on the y axis. The counts are stacked on top of each other. Download [92]FIG S2, PDF file, 0.1 MB^ (114.9KB, pdf) . Copyright © 2021 Pearcy et al. This content is distributed under the terms of the [93]Creative Commons Attribution 4.0 International license. Next, the pan-genome was extracted for the selected strains using the default parameters in Roary version 3.13.0 ([94]47), which classified the catalogue of annotated genes as either core (i.e., occurring in >99% of strains) or accessory (i.e., occurring in <99% of strains). k-mer-based genomic feature selection through a gradient boosting classifier model identifies AMR-conferring genes. The next goal of this analysis was to identify features in the genome sequence of each isolate which strongly correlated with resistance or susceptibility to each of the 12 antibiotics described above. To this aim, we implemented a gradient boosting classifier model for each antibiotic studied. DNA segments (k-mers) of 13 bp long were used as features in the classifiers, with the AMR phenotype used as the model labels (resistant or susceptible). For each classifier, 10,000 features were selected based on the chi-square test. We used the performance metrics accuracy, area under the receiver operator characteristic curve (AUC), precision, and recall to evaluate each model. A synthetic minority oversampling technique (SMOTE) was used to reduce the impact of unbalanced classes in the antimicrobial label groups and achieve robust classification results. The performance metrics were calculated as the mean of 50 simulations ([95]Fig. 1). The performance metrics for the 12 antibiotics ranged from 90% to 99% for accuracy, 75% to 98% for precision, 62% to 95% for recall, and 88% to 98% for AUC. All antibiotics except meropenem and cefuroxime achieved an AUC of >95%. Features were selected from the remaining 10 AMR classifiers based on this AUC threshold. FIG 1. [96]FIG 1 [97]Open in a new tab k-mer-based supervised machine learning prediction of antibiotic resistance signature profiles to 12 antibiotics in the E. coli cohort. Boxplots showing the prediction performance results of the gradient boosting classifier for the 50 iterations. The performance indicators (y axis) are accuracy, precision, recall, and AUC. Predictive models were generated to classify the resistance versus susceptibility profiles of 12 different antibiotics (x axis). The maximum importance in the 50 runs was captured for each k-mer. To identify important genes, the k-mers with a maximum importance score greater than 0 for each antibiotic model (as assigned by the GBC), were cross-referenced to the pan-genome of the 3,616 genomes. The identified k-mers, their corresponding genes, and maximum importance scores obtained by the GBC are shown in [98]Table S1 and [99]Fig. S3 in the supplemental material. When mapped to the CARD ([100]48) and MutationDB databases ([101]49), 84 unique AMR genes were identified in the top 10% of features (ranked according to the maximum weight found in the 50 runs), 25 of which had evidence of the AMR gene for the specific antibiotic class ([102]Table 1). TABLE 1. Known AMR genes identified by the k-mer-based AMR classifiers[103]^a Antibiotic Drug class Known AMR gene(s) to the antibiotic[104]^b Known AMR genes associated with other antibiotics[105]^b Ampicillin Beta-lactam TEM-1**, CTX-M-15, yicJ* sul1**, folP**, APH(3′′)-Ib, katE*, yadV*, arnC, fsr, nmpC, pepT, yeeJ, yhdJ Aztreonam Beta-lactam CTX-M-55* AAC(6′)-Ib-cr, acrD, catIII, nmpC, pitA, yicI, cpdB, yoaE, rapA, dinG, yeeJ, oppA, arnC Cefepime Beta-lactam CTX-M-1**, CTX-M-15, CTX-M-55 dfrA25*, AAC(6′)-Ib10*, AAC(3)-IId, catB3, AAC(6′)-Ib-cr, folA*, yadV*, citF, yeeJ, ftsI Cefoxitin Beta-lactam CMY-2*, ybiW*, betT, chiP, cra, envZ, htrE, lyxK, mdlA, yeeJ, yghA dfrA25, AAC(3)-IId, catIII, blc, yaiY, folA, putA, lpoA Ciprofloxacin Fluoroquinolone gyrA** OXA-1*, CTX-M-15*, arnC, nmpC, htrE, cpdB, arcA, flu Gentamicin Aminoglycoside AAC(3)-IId**, AAC(6′)-Ib7**, aadA13*, AAC(3)-IIe*, AAC(6′)-Ib9*, aadA7, ANT(2′′)-Ia floR, CTX-M-15, dfrA17, mphA, intS*, fliC*, arnC, yicJ Levofloxacin Fluoroquinolone gyrA** lacI*, yqiK, flu, arcA, fimC, phoE, ybiH, dadA Tetracycline Tetracycline tet(A)**, tet(B)**, mdfA APH(6)-Id, sul2, yeeJ, folP, csiD Tobramycin Aminoglycoside AAC(3)-IId**, AAC(6′)-Ib-cr**, AAC(3)-IIe, AAC(6′)-Ib7 catB3*, CTX-M-55, dfrA17, OXA-1, fliC*, pinR, ydfU, dnaQ Trimethoprim Diaminopyrimidine ANT(2′′)-Ia**, sul2*, aadA16*, aadA25*, APH(3′′)-Ib*, TEM-1, tet(A), APH(6)-Id, mphA, TEM-150, sul1, folP*, dosP, valS, nmpC, htrE, groL, putP [106]Open in a new tab ^a Genes in the top 10% features, ranked according to their maximum feature important assigned by the GBC classifier, are presented. ^b Symbols: **, gene was associated with feature in the top 10% features; *, gene was associated with feature in the top 50% features. FIG S3 Number of features or genes identified in each step of our novel pipeline for (a) the k-mer-based approach and (b) the SNP-based approach. Results are shown for the antibiotics with an AUC of  >0.95 only. Download [107]FIG S3, PDF file, 2 MB^ (2.1MB, pdf) . Copyright © 2021 Pearcy et al. This content is distributed under the terms of the [108]Creative Commons Attribution 4.0 International license. TABLE S1 The identified k-mers, their corresponding genes, and importance scores obtained by the GBC. Download [109]Table S1, XLS file, 0.4 MB^ (377KB, xls) . Copyright © 2021 Pearcy et al. This content is distributed under the terms of the [110]Creative Commons Attribution 4.0 International license. SNP-based machine learning approach uncovers additional and different AMR genetic determinants. Together with the k-mer-based approach, we also analyzed the contribution of SNPs to the acquisition of drug resistance phenotypes by using them as features in a ML approach to find correlations with resistance or susceptibility to specific antibiotics. The variant sites (SNPs) in the protein-coding genes of the core genome of the pan-genome were identified using the SNPsites tool ([111]www.github.com/sanger-pathogens/snp-sites) and used as the features in the GBC model for fitting AMR labels. A synthetic minority oversampling technique was applied to oversample data of the minority class, compensating for unbalanced classes. The performance metrics were calculated as the mean of 50 simulations ([112]Fig. 2). Performance metrics were calculated for each model as the mean of 50 simulations. The performance metrics for the 12 antibiotics ranged from 75% to 98% for accuracy, 75% to 99% for precision, 71% to 98% for recall and 75% to 98% for AUC. The best predicted antibiotic was ciprofloxacin with a mean accuracy of 98% (±1%), precision of 99% (±1%), recall of 97% (±1%), and AUC of 98% (±1%). This suggests that SNPs in ciprofloxacin may have significant implications for the evolution of resistance, which is consistent with the study of K. Bhatnagar and A. Wong ([113]50). The levofloxacin and meropenem models also achieved high performances, with an AUC of >95%. FIG 2. [114]FIG 2 [115]Open in a new tab SNP-based supervised machine learning prediction of antibiotic resistance signature profiles to 12 antibiotics in the E. coli cohort. Boxplots showing the prediction performance results of the gradient boosting classifier of the 50 iterations. The performance indicators (y axis) are accuracy, precision, recall, and AUC. Predictive models were generated to classify the resistance versus susceptibility profiles of 12 different antibiotics (x axis). The maximum importance in the 50 runs was captured for each SNP. To understand the relationship between AMR phenotype and genotype, we cross-referenced the SNPs that acted as predictors for AMR phenotype for each antibiotic to the pan-genome for each model data set and identified the corresponding genes. The identified SNPs, their corresponding genes, and the maximum importance scores obtained by the GBC are shown in [116]Table S2. Importantly, the SNP-based approach could identify additional AMR genes that were not identified by the k-mer-based approach. By comparisons with the CARD and MutationDB databases, we identified 146 unique AMR genes associated with at least one antibiotic ([117]Table 2) that were in the top 10% of features (ranked according to the maximum feature importance in the 50 runs). Out of these 146 genes, 8 had evidence in the database of the AMR gene for the specific antibiotic class ([118]Table 2). Note, however, that the MutationDB database does not include entries for AMR genes for the levofloxacin and meropenem antibiotics. TABLE 2. Known AMR genes identified by the SNP-based AMR classifiers[119]^a Antibiotic Known AMR genes to the antibiotic[120]^b Known AMR genes associated with other antibiotics[121]^b Ciprofloxacin gyrA**, parC**, parE*, typA*, hofN, valS, pnp, gyrB speB**, yegU*, ugpB*, ampH*, fhuB*, poxB*, gss*, hybB*, phoE, speC, bglX, ftnA, pphA, yjfF, yjaB, yjjV, hofQ, yidC, prmB, hisF, plaP, truC, gcvP, mltC, rstB, mtlD, folA, metH, rnd, waaA, upp, putP, yohK, aidB, yegQ, uvrB, trmH, ulaG, yqjG, cpxA, proC, uvrA, recJ, hflX, tamB, cysK, metC, nrdB, mutM, mpl, osmF, mrcA, dcd, ravA, pepD, yejA, ribC, cstA, yeiQ, nusA, hemA, yaiZ, hybF, mglA, ysaA, potA, hemY, yjjP, recG, yebY, aroC Levofloxacin parC**, gyrA**, hemF*, recG*, mysB*, metC*, tktA*, aceF*, yicR*, blgX*, fabD*, mutS*, chaA*, msyB*, rbsA*, gcvP, glnE, pcnB, mdtB, hisF, purT, menD, nikC, ftnA, frwB, yjiN, nadR, cyoB, fumC, mdtD, citG, glgX, valS, ldcC, yebQ, adiA Meropenem parC**, gyrA**, creC**, yrfF**, valS**, bglX**, fucI*, hisF*, parE*, plaP*, nikA*, pykF*, aidB*, yjjG*, gcvP*, yjfF*, dsbD*, lepA*, thrA*, hybB, yccS, mdtB, murC, yegR, ravA, yjjV, yjjK, mscM, menD, mutS, metF, mglA, yjcD, nuoL, nadR, rplL, dusB, yegU, sufB, nudI, ulaG, ccmD, rnr, tamB, pdxA, dld, asd, ychO, soxR, yebK, nrdB, argD, baeS, glgX, osmF, trmI, yegS, dnaX, yejH, waaC, fhuE, aroP, folA, ycbZ, rbbA, polA, recJ, speC [122]Open in a new tab ^a Genes in the top 10% features, ranked according to their maximum contribution to the classifier, are presented. ^b Symbols: **, gene was associated with feature in the top 10% features; *, gene was associated with feature in the top 50% features. TABLE S2 The identified SNPs, their corresponding genes, and importance scores obtained by the GBC. Download [123]Table S2, XLS file, 0.1 MB^ (138.5KB, xls) . Copyright © 2021 Pearcy et al. This content is distributed under the terms of the [124]Creative Commons Attribution 4.0 International license. The AMR-related signatures occur in targets beyond annotated drug resistance genes and are associated with a wide range of metabolic systems. To understand the systemic relationships connecting the identified AMR genetic signatures on a mechanistic level and to elucidate their mechanistic effects beyond genes encoding proteins targeted by drugs (i.e., positive selection in basal biosynthetic, regulation, and repair pathways), we integrated the genetic determinants with the GSM iML1515 ([125]51) of E. coli K-12 MG1655. We limited our GSM analysis to the top 10% ranked genetic determinants identified, for each antibiotic classifier with an AUC of >95%, by the k-mer and SNP ML-based methods. The ratio of metabolic genes to total genes corresponding to the top-ranked 10% of important features was considerably higher for the SNP-based models than the k-mer-based models ([126]Fig. 3a and [127]Fig. S3). The percentage of metabolic genes accounted for in iML1515 from the top features, for example, ranged from 43% (ciprofloxacin) to 48% (levofloxacin) in SNP-based AMR models. The percentage of metabolic genes identified by the k-mer-based approach and present in iML1515 were considerably lower, ranging from 5% (tobramycin) to 19% (levofloxacin). A large number of genes identified by the k-mer-based approach, however, were from the accessory genome, which currently lack many functional annotations, as shown in [128]Fig. S3 (see the decrease in cyan bars to yellow bars). Additionally, since the GSM is based on the K-12 strain, accessory genes missing from this reference genome will not be included in the analysis. Nevertheless, a total of 289 genes present in iML1515 were identified by combining the genes that were associated with the top-ranked 10% of genes in the two machine learning models, which motivates the integration with GSM analysis. FIG 3. [129]FIG 3 [130]Open in a new tab Number of metabolic genes occurring in the 11 AMR classifiers. (a) Bar chart showing proportions of metabolic genes compared to the entire set of genes found in each AMR model. The blue bars represent gene proportions from the k-mer AMR models, whereas the red bars represent gene proportions from the SNP AMR models (AUC > 95%). (b) Heatmap showing the Jaccard index comparing the gene sets between two antibiotic classes. (c) Pie chart showing the proportions of genes associated with 10 metabolic systems (outer ring presented using the “tab10” color theme in Matplotlib). The inner ring shows the proportion of genes from each antibiotic class associated with each metabolic system and is presented using the “Set3” color theme in Matplotlib. Note that genes contributing to multiple antibiotic classifications will contribute multiple times in the pie chart, and therefore, the total area of the pie chart does not amount to 289. (d) Heatmap showing the normalized number of genes associated with each metabolic system. Note that the number of genes was normalized via column standardization. Hierarchical clustering was applied to both rows (metabolic systems) and columns (antibiotic classes) using the single linkage method and Euclidean distance as the metric. Each panel shows the results for the top 10% of genes identified in each AMR classifier. Panels b, c, and d show the results for the 289 genes found by combining the genes that correspond to the features in the top 10% of the k-mer and SNP classifications. The contribution of genes from each AMR classifier ranged between 1 (tobramycin) and 123 (ciprofloxacin), with a small number of genes overlapping between antibiotic AMR models ([131]Fig. 3b and [132]Table S3). These 289 important metabolic genes were associated with a wide range of metabolic systems ([133]Fig. 3c), including transport metabolism, cofactor and prosthetic group metabolism, cell wall metabolism, alternative carbon metabolism, nucleotide metabolism, and amino acid metabolism were particularly prevalent across the diverse antibiotic classes ([134]Fig. 3c and [135]d). We performed gene pathway enrichment tests using the 40 metabolic subsystems included in iML1515 and also using the 352 gene-pathway annotation list downloaded from EcoCyc ([136]52) (see Materials and Methods). The significant pathways with a false discovery rate (FDR) of less than 1% are provided in [137]Table S3. TABLE S3 The GSM results for the 289 genes corresponding to the top 10% of k-mer and SNP-based AMR models. Download [138]Table S3, XLS file, 2.4 MB^ (2.5MB, xls) . Copyright © 2021 Pearcy et al. This content is distributed under the terms of the [139]Creative Commons Attribution 4.0 International license. We found genes enriched in amino acid metabolism (histidine and arginine), the pyrimidine salvage pathway, putrescine biosynthesis pathway, and transport metabolism. Importantly, histidine metabolism has been found to play an important role in stress resistance in E. coli ([140]53, [141]54), while putrescine, which is a polyamine, has been found to relieve the effects of oxidative stress in E. coli ([142]55). Additionally, changes to genes involved in the pyrimidine salvage pathway have been found linked to the production of important biofilm components in E. coli ([143]56), and therefore induce persistence ([144]57). Furthermore, transport reactions are known to play a role in multidrug resistance by restricting the uptake of the antibiotic to reduce the toxicity ([145]58, [146]59). Next, using the GSM, we investigated the system-level effect of each important gene on metabolism, beyond the pathways they are encoded for. To this aim, we blocked the flux through reactions associated with an important gene (gene knockout) and evaluated the metabolic processes that were affected. In doing so, we can infer potential metabolic adaptation mechanisms that can be linked to a change in gene function (i.e., downregulation, overexpression, or deletion). GSM knockout analysis reveals genes related to growth limitation, auxotrophic behavior, and alternative carbon source utilization. Next, to investigate further the metabolic processes involved in adaptation to antibiotic stresses, we considered the effects of the 289 genes on bacterial growth. The ability of bacteria to adjust their metabolism to slow down growth has for example been found to be advantageous for reducing the damage that occurs as a result of being the primary target of antibiotics ([147]60[148]–[149]62). Identifying those that are essential for growth, while also being highly important in the ML models, may therefore provide a novel opportunity for selecting targets with dual mechanism. To this aim, the GSM was used to simulate the behavior of E. coli with mutations in the 289 genes. Single gene deletions under rich environmental conditions were carried out in iML1515 to mimic the effect of a “loss of function” mutation on the entire system (see Materials and Methods). Importantly, we found a total of 20 gene knockouts that were lethal to the bacteria. These genes show a high level of agreement with in vivo gene essentiality results ([150]63), as shown in [151]Table 3. The lethal genes with the highest contribution (i.e., associated with the top 50 features) to the ML models, and therefore of greatest interest, included the following: accA and metK for ciprofloxacin, fabD and fabG for levofloxacin, murG, lptG, and mraY for meropenem, folP for ampicillin and trimethoprim, and glmM for gentamicin. These genes play essential roles in fatty acid elongation (fabD, fabG, and accA), peptidoglycan metabolism (murG, mraY, and glmM), lipolysaccharide biosynthesis (lptG), S-adenosyl-l-methionine metabolism (metK), and folate metabolism (folP) ([152]Fig. 4). Importantly, folP, lptG, fabG, and murG are already known AMR-conferring genes, as shown by [153]Tables 1 and [154]2. TABLE 3. In silico-predicted gene lethality from the top-ranked discriminant genes in k-mer-based and SNP-based classifiers listed for each antibiotic Antibiotic Essential genes (rich media)[155]^a Essential genes (glucose minimal medium only)[156]^a Ampicillin folP* Aztreonam asd, purL Cefepime pyrF Cefoxitin Ciprofloxacin murJ*, lptG, hemG, ribC, accA*, cysG, aroC, waaA, hemA, metK*, lptF purA, pheA, hisD, hisG, purL*, hisF, dapE, panD, purM, hisI*, ilvD, iscS, thiD, hisA, hisB*, hisH*, proC, purD Levofloxacin fabG*, fabD* hisF*, purL*, thiD* Gentamicin glmM*, cysG cysH Meropenem lptG*, mraY*, murG*, ispA cysJ, hisD*, pdxA, hisC*, asd, hisF*, metF, murC, iscS, hisA, hisB, hisH*, hisG Tetracyline folP, murB Tobramycin iscS Trimethoprim folP*, ftsI [157]Open in a new tab ^a Symbol: *, genes associated with top 50 ranked features of the antibiotic AMR model. Boldface genes have not been found essential in experimental studies. FIG 4. [158]FIG 4 [159]Open in a new tab An overview of the metabolic pathways involving potential gene targets for E. coli. The genes accA, lptG, fabD, fabG, murG, mraY, folP, glmM, and metK were all found to be essential in the GSM of E. coli, whereas knockout of the genes hisA and thiD all resulted in auxotrophic behavior. The genes fucK, fucI, nupG, speB, uxaA, uxaB, dgoD, uidB, and ttdB were all found to be essential to the growth on alternative carbon sources. Note that all genes presented corresponded to the top 50 features of the AMR models. Note that the antibiotic that each of these genes were found important to by the AMR models are provided. Abbreviations: 2-dehydro-3-deoxy-d-galactonate (2-DH3DGAL), 2-dehydro-3-deoxy-d-galactonate 6-phosphate (2-DH3DGAL-6P), fuculose 1-phosphate (fuculose-1P), dihydroxyacetone phosphate (DHAP), glyceraldehyde 3-phosphate (glyceraldehyde-3P), tagaturonate (TAG), altronate (ALTR), 2-dehydro-3-deoxy-d-galactonate 6-phosphate (2-DH3DGLUC-6P), 2-dehydro-3-deoxygluconate (2-DH3DGLUC), 1-O-methyl-beta-d-glucuronic acid (MG), oxalacetate (OXA), citrate (CIT), isocitrate (ICIT), alpha-ketoglutarate (AKG), succinyl-CoA (SUC-CoA), succinate (SUC), fumarate (FUM), malate (MAL), tetrahydrofolate (THF), glucose 6-phosphate (glucose-6P), fructose 6-phosphate (fructose-6P), guanosine-triphosphate (GTP), ribulose 5-phosphate (ribulose-5P), 5-phospho-alpha-d-ribose 1-diphosphate (PRPP), phosphoribosyl-ATP (PRBATP), phosphoribulosyl-formimino-5-aminoimidazole-4-carboxamide ribonucleotide phosphate (PRFAR), 5′- 5-aminoimidazole ribonucleotide (AIR), 4-amino-2-methyl-5-phosphomethylpyrimidine (4AMPM), 2-methyl-4-amino-5-hydroxymethylpyrimidine diphosphate (2MAHMP), thiamine phosphate (thiamine-P), phosphoribosylaminoimidazolecarboxamide formyltransferase(AICAR), d-erythro-imidazole-glycerol-phosphate (IGP), imidazole acetol-phosphate (IMIDAZOLE-ACETOL-P), histidinol-phosphatase (HISTIDINOL-P), glucosamine-6-phosphate (GlcN-6P), UDP N-acetylglucosamine (UDP-GlcNAc), UDP-N-acetylmuramyl-pentapeptide (UDP-MurNac-Pentapeptide), S-adenosyl-L-methionine (SAM), methionine (MET), homocysteine (HCYS). Next, we considered genes that were growth limiting when the bacteria were grown on minimal medium with glucose as the carbon source. We found an additional 26 genes that were essential under these conditions ([160]Table 3), which again showed high agreement with the in vivo results ([161]64). Under poor nutrient conditions of the host, changes in the function of these genes may contribute to slowing the growth rate, as before. However, if the environment is rich in nutrients, then a loss of function of these genes may have led to advantageous auxotrophic behavior. To test this hypothesis, we reran the knockout simulations for growth on glucose, while also allowing for individual metabolites to be utilized. Importantly, we found that 17 of these genes could be linked to auxotrophic behavior to amino acids, including cysteine (meropenem and gentamicin), histidine (levofloxacin, ciprofloxacin, and meropenem), phenylalanine (ciprofloxacin), and proline (ciprofloxacin) ([162]Table 4). Auxotrophy for the vitamins thiamine (levofloxacin, tobramycin, and meropenem) and pantothenate (ciprofloxacin) was also found. Auxotrophy to peptidoglycan precursors was also found for the antibiotics ciprofloxacin and meropenem, while purine and pyrimidine precursors were found for ciprofloxacin and cefepime. Importantly, auxotrophy for histidine and thiamine has previously been found to elevate fitness ([163]65). TABLE 4. In silico-predicted gene knockouts that lead to auxotrophy from the top-ranked discriminant genes in k-mer-based and SNP-based classifiers listed for each antibiotic Antibiotic Gene(s) leading to specific auxotrophy[164]^a Ampicillin Aztreonam Cefepime Pyrimidine compounds (pyrF) Cefoxitin Ciprofloxacin Phenylalanine (pheA), histidine (hisA, hisB*, hisD, hisF, hisG, hisI*, hisH*), pantothenate (panD), thiamine (iscS, thiD), proline (proC), nucleosides (purA), peptidoglycan precursors (dapE) Levofloxacin Histidine (hisF), thiamine (thiD*) Gentamicin Cysteine-derived compounds (cysH) Meropenem Histidine (hisA, hisB, hisD*, hisC*, hisF*, hisH*), S-methyl-l-methionine (metF), thiamine (iscS), pyridoxine (pdxA), cysteine (cysJ), peptidoglycan precursors (murC) Tetracycline Tobramycin Thiamine (iscS) Trimethoprim [165]Open in a new tab ^a Symbol: *, genes associated with the top 50 ranked features of the antibiotic AMR model. Additionally, gene modifications that affect the utilization of alternative carbon sources was also investigated. Alternative carbon source utilization has been found advantageous for pathogenic survival of bacteria, including E. coli, Salmonella, Vibrio cholerae, and Campylobacter jejuni ([166]66[167]–[168]68). To this aim, we used the GSM to test the effect of the 289 genes on the 297 different carbon sources in the iML1515 model. Single gene knockouts were repeated for each individual carbon source, under minimal medium conditions. We found 39 genes whose deletion blocked growth on a variety of alternative carbon sources ([169]Table 5). The carbon sources that were blocked by the genes with the highest importance (i.e., associated with the top 50 features) in the ML models included the following: fucose (cefoxitin and meropenem), galactonate (cefoxitin), tartrate (levofloxacin), agmatine (ciprofloxacin), galacturonate (ciprofloxacin and levofloxacin), methyl-beta-d-glucuronate (cefoxitin), and a variety of nucleosides (ciprofloxacin). TABLE 5. In silico-predicted essential genes on specific carbon sources from the top-ranked discriminant genes in k-mer-based and SNP-based classifiers listed for each antibiotic Antibiotic Lethal genes for growth on specific carbon sources important in AMR model[170]^a Ampicillin gatC, mhpB Aztreonam adiC, yihP, cpdB, garD, mngB, paaK Cefepime Cefoxitin garD, kgtP, fucK*, ulaC, putA, fecA, mngB, uidB*, dgoD* Ciprofloxacin malF, ulaG, nupG*, nanE, deoA, pepD, deoC, tonB, nanA, mtlD, xylA, uxaA*, putP, speB*, mngB, cpdB, lamB Levofloxacin adiC, ttdT, uxuB, uxuA*, ttdB* Gentamicin hcaB Meropenem manZ, adiC, ulaG, exuT, fucI* Tetracyline Tobramycin Trimethoprim putP, emrE [171]Open in a new tab ^a Symbol: *, genes associated with top 50 ranked features of the antibiotic AMR model. Flux balance analysis elucidates the effects of AMR-conferring genes on metabolite yields and reaction fluxes and suggests important metabolic adaptations in cell wall, energy metabolism, purine and pyrimidine metabolism and iron metabolism that increase antibiotic resistance. Next, the GSM was used to investigate whether the genetic determinants could be linked to additional metabolic adaptation mechanisms, beyond those affecting the growth rate and alternative carbon utilization. For this analysis, we examined the effect of each gene on metabolite reproducibility and reaction fluxes. More specifically, we simulated single gene knockouts as before, however this time, we captured the effect on metabolite yields and flux spans (i.e., the variation of possible flux values for a given reaction) for all metabolites and reactions in the iML1515 model. The output of this analysis is twofold: (i) to identify clusters of genes that have similar metabolic phenotypes and (ii) to elucidate the metabolic adaptations that are most important in providing bacteria with possible resistance to antibiotic stress. Genes that confer a similar phenotype could give rise to higher variation of strains, while providing similar advantages for resistance ([172]69). Determining the most important metabolic adjustments that provide resistance to antibiotic stress may inform the development of novel treatments. The genetic determinants that have the largest system-level impact, i.e., an increase or decrease in their functionality (modeled here via gene knockouts) disrupts the largest number of metabolite yields and/or reaction fluxes, could provide promising new targets. Maximum theoretical yields of metabolites affected by AMR-conferring genes. The lethality of each genetic determinant on all metabolites in the iML1515 model was determined using FBA. A gene knockout was considered lethal to the production of a specific metabolite if it results in blocking the biosynthesis of the metabolite (see also Materials and Methods). The results are represented as a bipartite graph of 98 genes and 508 metabolites. A gene is connected to a metabolite via an edge if its knockout results in preventing the metabolite’s production. Using the Clauset-Newman-Moore greedy modularity maximization algorithm, we clustered the genes and metabolites into groups of similar phenotypes ([173]Table S3). The largest six clusters are shown in [174]Fig. 5. The metabolites within each cluster are involved in a variety of metabolic processes, including cell wall metabolism, nucleotide metabolism, transport metabolism, alternative carbon metabolism, amino acid metabolism, and cofactor and prosthetic group metabolism ([175]Fig. 5b). To test which of these metabolic systems was significantly affected, we performed a pathway enrichment hypergeometric test on the metabolites in each cluster (see Materials and Methods). The most significant pathways associated with each cluster (FDR < 0.01) are shown in [176]Fig. S4a and b. FIG 5. [177]FIG 5 [178]Open in a new tab Effects of genetic determinants on metabolite yields. (a) Bipartite network with genes and metabolites as nodes. Labeled nodes represent genes, whereas unlabeled nodes represent metabolites. A gene and metabolite are connected by an edge if the deletion of the gene blocks the metabolite production. Genes and metabolites are highlighted according to the cluster they were assigned to via the Networkx modularity algorithm. The number of clusters in the figure was reduced by considering only those of size greater than 10. (b) Heatmap showing the metabolic systems associated with each of the six clusters. A gene was associated with a metabolic system, if at least one metabolite correlated with the system could no longer be produced after the gene was deleted. (c) Heatmap showing the antibiotics associated with each cluster. Note that genes occurring in multiple antibiotics were accounted for twice. Hierarchical clustering was applied to the rows of each heatmap (metabolic systems or antibiotic class) using the single linkage method and Euclidean distance as the metric. The gene counts have been normalized by the total number of genes in each cluster in each heatmap. FIG S4 (a) Network diagram showing the significantly enriched pathways in each cluster in [179]Fig. 5a. Genes are connected to a metabolic pathway, if the gene affected the production of at least one metabolite in the pathway. Note that the colors of the gene clusters are coordinated with the clusters in [180]Fig. 5a. (b) Heatmap showing the number of genes in each cluster that is connected, by at least one metabolite, to the significant metabolic pathway. (c) Heatmap showing the number of genes in each antibiotic ML model that affected at least one metabolite in the significant pathway. Download [181]FIG S4, TIF file, 2.7 MB^ (2.8MB, tif) . Copyright © 2021 Pearcy et al. This content is distributed under the terms of the [182]Creative Commons Attribution 4.0 International license. A number of clusters could be linked to cell wall metabolites ([183]Fig. S4 and [184]S5). First, all 12 genes in cluster 5 affect the production of metabolites involved in lipopolysaccharide (LPS) metabolism. LPS are important compounds on the outer membrane and therefore have been found to play an important role in virulence ([185]70, [186]71). Additionally, the genes murC, ftsI, dapE, glmM, murB, murG, mraY, and mpl in cluster 2 had a significant effect on the production of the metabolites involved in peptidoglycan (PG) metabolism. Peptidoglycan is a mesh-like structure that provides the strength and shape of the outer cell membrane, as well as providing protection against osmotic pressure. Modifications to PG can prevent the release of cell wall components, which initiate the host immune response ([187]72), while also protecting the cell against antibiotic uptake ([188]73). Similarly, changes to metabolites involved in fatty acid oxidation and phospholipids, specifically CDP-diaglycerol, whose production is affected by the six genes purM, purD, ilvD, panD, purA, and purL in cluster 1, may also provide protection against the immune response. The immune system, for instance, has been found to take advantage of the antimicrobial properties of long-chain fatty acids, which disrupt cell wall permeability when in excess in the extracellular environment ([189]74). Pathogens have been found capable of modifying the biophysical properties of the cell membrane via changes to fatty acid structure, to increase the resistance to these antimicrobial peptides produced by the immune system ([190]74). FIG S5 (a) Heatmap showing the different subsystems that are affected by the genes in the gene-metabolite network in [191]Fig. 5a. The numbers in the matrix correspond to the number of metabolites, per metabolic subsystem, that are blocked by the deletion of the corresponding gene. The genes are grouped along the x axis according to the six clusters shown in [192]Fig. 5a. Download [193]FIG S5, PDF file, 0.1 MB^ (55.7KB, pdf) . Copyright © 2021 Pearcy et al. This content is distributed under the terms of the [194]Creative Commons Attribution 4.0 International license. In addition to cell wall metabolism, the genes in cluster 1 and cluster 4 are associated with a large number of pathways involved in purine and pyrimidine metabolism. Purine and pyrimidine metabolism is involved in the generation of DNA and RNA production; therefore, changes to the genes in this cluster may be important in repairing DNA from reactive oxygen species (ROS) ([195]75). Importantly, metabolomics analysis showed purine metabolism pathways were highly enriched in multidrug-resistant E. coli strains ([196]76). The genes involved in purine metabolism in cluster 1, purL, purD, purM, and purA in particular also have a downstream effect on many other metabolic pathways, including nitrogen metabolism, ppGpp metabolism, and allantoin biosynthesis, all of which can be linked to the regulation of the stringent response ([197]77[198]–[199]79). Importantly, changes in ppGpp concentration play an important role in controlling cellular growth, and depletion of this metabolite has been found to trigger a dormant cell metabolic state, promoting antibiotic-tolerant persistence cells ([200]80, [201]81). E. coli cells starved of nitrogen have been found to have increased ppGpp, which again has been found to induce tolerance to ciprofloxacin ([202]78). Allantoin degradation has been found as an important adaptive response to recovery after nitrogen starvation ([203]77). Furthermore, these genes, as well as the folP gene, also affect metabolites involved in folate metabolism, tetrahydrofolate (THF) biosynthesis in particular. Importantly, point mutations in folP have been identified to prevent sulfonamides from inhibiting THF production ([204]82). We identified folP in the trimethoprim, tetracycline, and ampicillin ML models. Folate metabolism, including THF, however, are again important for nucleotide biosynthesis and have in fact been found important for persistence in E. coli cells exposed to ampicillin ([205]83). The production of coenzyme A (CoA) is also affected by these genes, as well as the ilvD and panD genes. CoA is an important cofactor in many metabolic processes, including fatty acid biosynthesis, which are used in LPS, and the tricarboxylic acid (TCA) cycle. The concentration of acetyl-CoA, an important derivative of CoA, has also been found to play a key role in assessing the cell metabolic state, which, in turn, determines the fate of either cell growth, survival, or death ([206]84). The production of metabolites relating to iron metabolism were affected by genes in both clusters 2 and 6. The four genes in cluster 6 affect metabolites involved in heme biosynthesis. The capability (or improved capability) for heme synthesis may provide pathogens with a competitive advantage for colonization, since heme is the largest source of iron for the cell. Excess heme, however, increases the level of ROS and therefore is extremely toxic to the cell, so the regulation of heme concentration is essential ([207]85). The genes menD, entS, aroC, dxr, pheA, menC, menE, ispA, and entD in cluster 2 all affect enterobactin biosynthesis, either directly or via the chorismate biosynthesis pathway. Enterobactin is an iron-scavenging siderophore and has been found important for pathogen virulence ([208]86[209]–[210]88). An important response of the immune system is to use nutrient immunity by limiting iron availability, which has an important function in energy metabolism and DNA replication ([211]89, [212]90). Changes to genes affecting iron metabolism may therefore enhance the resistance by improving their ability to scavenge iron from the environment. Additionally, however, genes menD, dxr, aroC, menE, ispA, and menC also affect metabolites involved in the electron transport chain (ETC). Importantly, previous work has found reduced respiration via the ETC resulted in mutant strains highly resistant against ampicillin and gentamicin ([213]91). The ETC reduces the proton motive force that is necessary for gentamicin uptake. Reduced flux through ETC, however, also reduces the growth rate, which as previously discussed enables multidrug level persistence ([214]60[215]–[216]62). The genes in cluster 3 are also affecting metabolites involved in electron carrier metabolism. The eight genes cysJ, metK, asd, gss, cysH, cydC, metF, and gshB for instance are all affecting metabolites involved in glutathionylspermidine (GSP) biosynthesis. Importantly, GSP can be recycled back to glutathione and spermidine. Glutathionine is an important antioxidant metabolite required for detoxifying ROS ([217]92, [218]93), while spermidine is a polyamine also found to provide protection against ROS exposure ([219]94). A subset of these genes, cysJ, metK, asd, cysH, and metF, are also affecting biotin production. Importantly, biotin has been identified as important for the virulence of enteropathogenic E. coli (EPEC) strains, due to its involvement in the regulation of the locus of enterocyte enfacement (LEE). The LEE system is essential to these pathogenic bacteria in order to attach and infect host epithelium cells ([220]95). Increased biotin concentrations have been shown to limit enterohemorrhagic E. coli (EHEC) infections in mice ([221]96). In general, the metabolic processes described here are affected by genes identified in the ML models for diverse antibiotic classes ([222]Fig. 5c and [223]Fig. S4c). This is not too surprising, however, since these processes are suggested to increase antibiotic resistance via protection from the immune response, oxidative stress, and/or the stringent response, which are multidrug adaptation mechanisms for enhancing fitness, persistence, and/or virulence ([224]97[225]–[226]99). Flux variability analysis identifies the biochemical reactions whose flux span was affected by AMR-conferring genes. Next, we investigated the system-level effect of the AMR-conferring genes on metabolic fluxes. Specifically, flux variability analysis (FVA) was used to identify the biochemical reactions whose flux span was affected by mutations in the genetic determinants. The results are represented as a bipartite graph of 145 genes and 861 affected reactions ([227]Table S3). A gene is connected to a reaction via an edge if its knockout results in reduced flux span through the reaction. As before, the Clauset-Newman-Moore greedy modularity maximization algorithm was used to cluster the genes and reactions into groups of similar phenotypes ([228]Table S3). The largest nine clusters are shown in [229]Fig. 6a, i.e., those with greater than 10 nodes (genes and metabolites). A variety of metabolic processes were enriched in the clusters ([230]Fig. 6b), similar to the gene-metabolite clusters. To test which of these metabolic systems was significantly being affected, we performed a pathway enrichment hypergeometric test on the reactions in each cluster. The most significant pathways associated with each cluster (FDR < 0.01) are shown in [231]Fig. S6a and b. FIG 6. [232]FIG 6 [233]Open in a new tab Effect of genetic determinants on reaction fluxes. (a) Bipartite network with genes and reactions as nodes. Labeled nodes represent the genes, whereas unlabeled nodes represent reactions. A gene and reaction are connected by an edge if the deletion of the gene reduces the reaction flux by at least 10%. Genes and reactions are highlighted according to the cluster they were assigned to via the Networkx modularity algorithm. Note that to reduce the initial size of the network, we only included clusters of size greater than 10. (b) Heatmap showing the metabolic systems associated with each of the nine clusters. A gene was associated with a metabolic system, if the flux span of at least one reaction correlated with the system was reduced after the gene was deleted. (c) Heatmap showing the antibiotics associated with each cluster. Genes occurring in multiple antibiotics were accounted for twice. Hierarchical clustering was applied to the rows of each heatmap (metabolic systems or antibiotic class) using the single linkage method and Euclidean distance as the metric. The gene counts have also been normalized by the total number of genes in each cluster in each heatmap. FIG S6 (a) Network diagram showing the significantly enriched pathways in each cluster in [234]Fig. 6a. Genes are connected to a metabolic pathway, if the gene affected the flux span of at least one reaction in the pathway. Note that the colors of the gene clusters are coordinated with the clusters in [235]Fig. 6a. (b) Heatmap showing the number of genes in each cluster that is connected, by at least one reaction, to the significant metabolic pathway. (c) Heatmap showing the number of genes in each antibiotic ML model that affected at least one reaction in the significant pathway. Download [236]FIG S6, TIF file, 2.7 MB^ (2.8MB, tif) . Copyright © 2021 Pearcy et al. This content is distributed under the terms of the [237]Creative Commons Attribution 4.0 International license. The gene-reaction network was clustered into similar groups of genes to the gene-metabolite network. Again, the clusters were enriched with metabolic processes, including cell wall metabolism (LPS, PG, fatty acids, and phospholipids), nucleotide metabolism (purine, pyrimidine, and folate metabolism), amino acid metabolism (histidine and methionine), and iron metabolism (heme). The main differences between networks involve the set of genes in cluster 1. Unlike before, this analysis reveals the genes gmhB, waaC, waaP, accA, lptG, lptF, waaA, hldD, fabG, lpxL, hldE, fabD, and glmM are affecting the biosynthesis of nucleotide sugars. These sugars are incorporated into the O-antigen region of LPS, which is located in immunodominant part of LPS ([238]100). Furthermore, the genes accA, nuoL, nuoN, fabG, tesA, and fabD are affecting fatty acid biosynthesis, as well as biotin biosynthesis. As discussed previously, both fatty acids and biotin metabolites can affect the host immune response and bacterial virulence. Furthermore, the FVA analysis also revealed that the genes in cluster 7 all affect iron transport, which, as previously discussed, may be important for scavenging iron from the host. Additionally, disruptions to the genes in cluster 2, specifically asd, gcvP, gcvT, serA, metF, cysH, cysJ, and serB, were found to affect amino acid metabolism (cysteine, serine, glycine, aspartate and/or methionine), all of which are involved in folate transformation of E. coli. As previously discussed, folate metabolism can affect persistence to antibiotic exposure. Alternatively, however, sulfur amino acid residues in proteins, including methionine and cysteine, are found to be extremely reactive with ROS; therefore, changes to the genes specifically affecting these amino acids may play a role in ROS detoxification ([239]101). Again, these metabolic pathways could be associated with a diverse set of antibiotic classes ([240]Fig. 5c and [241]Fig. S5c), suggesting the changes in these genes are linked to secondary multidrug adaptation mechanisms. DISCUSSION Machine learning provides powerful and robust means for predicting AMR phenotypes and their genetic determinants. ML methods have proven successful in identifying known AMR mechanisms ([242]23[243]–[244]29). Interpreting ML models, however, remains a challenge due to their complexity and large number of contributing features. Current approaches mostly consider only the genes with known AMR associations in AMR databases and neglect genetic determinants relating to important metabolic phenotypes, which are known to play a role in antibiotic resistance. Kavvas et al. ([245]30) have recently developed the first computational pipeline that combines machine learning with genome-scale metabolic models to enable biochemical interpretation of genetic determinants. In their pipeline, the effect of alleles on the flux solution space was used to successfully classify AMR phenotypes of Mycobacterium tuberculosis strains. In our work, we take an alternative two-step approach. First, a combination of a k-mer- and SNP-based machine learning approach is used to identify genetic determinants. Second, a genome-scale metabolic model is used to assess the effect of genetic determinants on metabolite producibility and biochemical fluxes to elucidate possible metabolic adaption mechanisms. Our approach produced AMR models of E. coli that achieved performance accuracies competitive with the current approaches. Moreover, we were able to reveal novel biomarkers based on the systemic effect the genetic determinants have on growth, metabolite yields, and metabolic fluxes. The competitiveness of our ML approaches is that two methods were applied in parallel: a k-mer-based approach and a SNP-based approach, and only the genes identified in either the top 10% of the k-mer or SNP classifiers with an AUC of >95% were used for the GSM. Notably, the k-mer-based approach outperformed the SNP-based approach for 8 of the 11 antibiotics (AUC > 0.95), specifically, 8% higher for aztreonam, 17% higher for ampicillin, 2% higher for cefepime, 5% higher for cefoxitin, 6% higher for gentamicin, 28% higher for trimethoprim, 10% higher for tobramycin, and 21% higher for tetracycline. A possible reason for this is due to the inclusion or exclusion of accessory genes in the two approaches. That is, the k-mer-based approach allows for discriminating between resistance and susceptibility according to both the core and accessory genome, whereas the SNP-based approach is restricted to the core genome only. The antibiotics that performed well only via the k-mer-based approach may therefore be highly dependent on acquired resistance genes, such as the highly discriminant beta-lactamases. The SNP-based approach, however, successfully predicted AMR resistance for the two fluoroquinolone antibiotics ciprofloxacin and levofloxacin, and for the beta-lactam antibiotic meropenem. Importantly, the SNP-based approach performed extremely well (AUC > 0.98) for ciprofloxacin, suggesting that the antibiotic induces mutations, which is consistent with the literature ([246]50). Importantly, the k-mer- and SNP-based approaches identified different known AMR genes, validating the advantage of combining the important features from both approaches. The combined approaches identified 225 known AMR genes corresponding to the top 10% of ranked features recognized as discriminant by the AMR classifiers. Out of these 225 genes, 35 matched the specific antibiotic class that has been reported in the databases. Importantly, a number of the genes identified by both the k-mer- and SNP-based models were associated with metabolic reactions. Using the GSM iML1515, we found a total of 289 metabolic genes from the top 10% of features from both the k-mer- and SNP-based models. The number of metabolic genes from the SNP-based models was considerably higher than the number of metabolic genes from the k-mer-based models. This is not too surprising, however, since the k-mer-based approach included the important accessory genes responsible for drug target modifications, drug efflux, and enzymatic inhibition. Metabolic-gene-specific mutations provide a secondary adaptation mechanism to reduce the antibiotic efficacy. Importantly, previous studies have also found that metabolic-gene-specific mutations are present in the core genes of E. coli ([247]102). The 289 total metabolic genes were significantly enriched in various metabolic pathways, including transport metabolism, nucleotide metabolism, and amino acid metabolism. To understand the mechanistic effects of these 289 genes, we used flux balance analysis to predict the system-level metabolic changes that result from genetic variants of the genes (i.e., mutations or absence). More specifically, we predicted metabolic phenotypes of genetic variants via gene knockouts and identified the metabolic processes that were being affected. Importantly, using our new ML-FBA integrated approach, we could reveal interesting links between genes and potential metabolic adaptation mechanisms that, importantly, were not identified using standard gene pathway enrichment analysis. Using the GSM, we found 20 genes essential for growth under rich environmental conditions. The essential genes with the highest importance in the ML models may be promising targets for generating dual-mechanism antibiotics. That is, the antibiotic targets pathways that would lead to inhibition of an essential metabolic process, while simultaneously reducing the ability of the pathogen to adapt. The most promising new candidates as targets included the following: accA and metK for ciprofloxacin; fabD and fabG for levofloxacin; murG, lptG, and mraY for meropenem; folP for ampicillin; and glmM for gentamicin. Modifications to these genes may result in slower growth, which has previously been found advantageous to pathogenic bacteria, including E. coli and Salmonella, for reducing the damage that occurs as a result of being the primary target of antibiotics ([248]60[249]–[250]62, [251]103). Alternatively, however, the genes accA, fabG, fabD, lptG, murG, and mraY affect biosynthesis of cell wall components and therefore may have had an effect on membrane properties for antibiotic uptake or manipulation of the host’s immune response ([252]73, [253]74). The gene folP, which is involved in folate metabolism, has previously been identified to prevent sulfonamide drugs from inhibiting folate metabolism ([254]82). Importantly, however, we identified folP in the trimethoprim, tetracycline, and ampicillin ML models. Folate metabolism, including tetrahydrofolate (THF), however, are again important for nucleotide biosynthesis and have in fact been found important for persistence in E. coli cells exposed to ampicillin ([255]83). Importantly, a number of additional genes affecting folate metabolism were also identified in the metabolite reproducibility analysis and flux variability analysis. Interestingly, we also found a number of gene knockouts which resulted in auxotrophic behavior to a number of amino acids, including histidine, cysteine, phenylalanine, and proline, as well as auxotrophy to the vitamins thiamine and pantothenate. The production of these metabolites are particularly energy intensive, and therefore, their acquisition from the host may provide pathogens with a competitive fitness advantage against commensal bacteria ([256]104). Alternatively, auxotrophy may have developed due to the critical role the metabolite plays in host-pathogen interactions. Using these genes as new drug targets has the disadvantage that the pathogen may be able to utilize exogeneous nutrients from the host environment. Additionally, we identified 39 genes whose knockout affected the growth of E. coli on alternative carbon sources. The genetic determinants with the highest importance in the ML models affected growth on various carbohydrates. Interestingly, a previous study found that various carbohydrates, including fucose, promote natural transformation of E. coli, therefore potentially contributing to the acquisition of antibiotic resistance and virulence ([257]105). Fucose is particularly interesting as it has also been found to positively regulate microbiome bacterial colonization and host immune activation ([258]106). Furthermore, clustering of genes according to metabolic phenotypes also revealed a strong link to cell wall metabolism adaptations. Genes were found to affect phospholipids, lipolysaccharides, fatty acid, and peptidoglycan metabolism, all of which can be associated with increased antibiotic tolerance via increased permeability of the membrane, as well as playing a role in virulence by manipulating the host immune response ([259]73). Pathogens have been found to modify the cell wall components, for instance, that are usually recognized by the host’s innate immune response ([260]74, [261]107). Changes to a number of genes that were affecting cofactor biosynthesis may also be involved in immune response manipulation, including the biosynthesis of biotin and iron. Increased biotin concentrations, for example, have been found to reduce the ability of EHEC to attach and infect host epithelium cells ([262]95, [263]96). Furthermore, genes affecting both enterobactin metabolism and heme metabolism were also found, both of which may improve resistance to nutrient immunity by increasing the pathogen’s ability to scavenge iron from the environment ([264]108, [265]109). Iron is important for many enzymes in bacteria, particularly those involved in oxidative phosphorylation and DNA synthesis; therefore, it is essential for the bacterium’s survival. Purine and pyrimidine metabolism was also enriched in the gene clusters. Modifications to these genes may limit the inhibitory effect of antibiotics that target DNA replication, such as ciprofloxacin and levofloxacin. Importantly, however, the genes encoding purine and pyrimidine biosynthesis enzymes have a large system-level effect involving many different metabolic processes. The genes purL, purD, purA, and purM in particular affect the production of DNA building blocks, which may be important for DNA repair against antibiotic-induced ROS ([266]75). Furthermore, these genes also affect ppGpp metabolism, which is important for regulating cellular growth and inducing antibiotic-induced persistence ([267]80, [268]81). Additionally, these genes also affect the production of important cofactors of energy metabolism, such as ATP, NAD, and NADPH, which are important for the electron transport chain (ETC). Other ETC metabolites, including ubiquinone, menaquinone, and flavin, were also being affected by the important genetic determinants. Changes in the flux through ETC may contribute to antibiotic resistance in a number of ways. Reduced ETC reduces the proton motive force (PMF) required for aminoglycoside uptake ([269]110), while also reducing the growth rate for increased persistence ([270]60[271]–[272]62). Furthermore, the ETC reactions are also responsible for ROS production. A related study that applied gene knockout simulations on an extended GSM of E. coli, which included specific ROS-producing reactions, identified genes associated with the ETC as ROS-inducing targets for improved antibiotic killing ([273]111). Further evidence to suggest adaptation to ROS was found by a number of additional genes, whose knockout was found to affect glutathionine, spermidine, methionine, or cysteine biosynthesis. Importantly, these metabolites have all previously been found to provide protection of E. coli cells by acting as antioxidants ([274]92, [275]93, [276]101). Importantly, the genetic determinants associated with the metabolic adaptation mechanisms described here were identified in the ML models for diverse antibiotic classes. Changes in these genes are therefore suggested to be contributing to secondary resistance mechanisms via a generic response against toxicity and stress, but it is nonetheless essential for their survival ([277]97[278]–[279]99). In summary, we have demonstrated that our new approach is capable of identifying several metabolic adaptation mechanisms, including reduced reactive oxidative stress toxicity, reduced proton motive force, increased colonization via utilization of alternative nutrients, increased persistence via reduced growth and host immunity defense mechanisms. These metabolic adjustments occur downstream of the initial drug target inhibition and are suggested here to play a role in antibiotic resistance. Targeting the most important genetic determinants with the highest effect on these secondary adaptation mechanisms while simultaneously targeting essential metabolic processes, however, may provide novel new treatments that increase antibiotic efficacy ([280]112). Our new approach can be applied to study genetic determinants of any pathogen of interest, providing a large cohort of AMR phenotypes are available and a genome-scale metabolic model exists for a reference genome. The second step of our approach depends only on the GSM, and therefore, precomputing the metabolic changes (e.g., effects on metabolite yields or metabolic fluxes) for the entire set of genes in the model is possible, which could be readily available for future AMR studies to draw insights on potential new AMR genes. Future efforts may precompute all of these deeper metabolic effects for each gene in a given GSM(s). Such future endeavors will offer the possibility to future AMR genome-wide association studies (GWAS) to readily draw insights from potential AMR gene metabolic effects as predicted by these methods without needing to set-up and solve all the GSM problems independently. While this information is useful, this new approach has many other areas of future development that could lead to deeper understanding of the metabolic changes that facilitate antibiotic resistance. The k-mer-based AMR models, for example, included the primary mechanisms of resistance, and while these strong genetic determinants provide highly accurate models and a means of validation, other resistance mechanisms may be diluted or “washed out” ([281]25). Additionally, many of the genes corresponding to the important k-mers in the AMR models had unknown functional annotation, therefore limiting the power of integration with the GSM. Furthermore, our approach was limited to protein-coding genes only, and therefore lacks the ability to identify important non-protein-coding regions, which have previously been found to confer resistance, such as eis and rrs ([282]113). Likewise, we are not considering synonymous changes in the protein that also have been related to resistance ([283]114). However, as also pointed out by Kavvas et al. ([284]29), these types of computational platforms are open to account for non-protein-coding genes and synonymous SNPs in future work. Additionally, using more advanced GSM frameworks, such as regulatory FBA ([285]115) and GEM-PRO ([286]116), for example, would allow us to investigate the effects of genetic determinants on metabolic phenotypes via changes to gene regulation and protein structure. The characterization of the AMR-associated SNPs, in respect to a reference genome such as E. coli K-12 MG1655, would allow us to link the specific amino acid substitutions or deletions to antibiotic resistance. One-dimensional (1D)−three-dimensional (3D) structure-function prediction analysis may then enable us to determine whether the SNPs result in a loss or gain of function, which is directly integrated as constraints into models such as GEM-PRO. The effects of the SNPs on the genes (i.e., loss of function or gain of function) is not determined in our approach and if considered would allow further insights into the biological interpretation. Finally, our approach was applied to a GSM model of the E. coli K-12 MG1655 strain, and therefore was limited to the genes in this genome. Developing a GSM of the pan-genome of the 3,616 strains used may reveal additional metabolic genes that are important for resistance. Furthermore, an extended version of iML1515 has been developed that includes ROS-specific reactions ([287]111). Applying the approach developed here to this model would therefore be useful future work for exploring the most important genetic determinants for improving antibiotic efficacy via ROS-associated cell death ([288]111). Taken together, our new pipeline was able to determine known AMR genes and suggest new ones that may weaken the pathogen’s resistance to antibiotics. Continued improvement to the approach by increased availability of AMR phenotype data, the further enhancement of ML tools, further development of GSM representations of pathogenic bacteria, and improved functional annotation of genes will provide a means to confidently predict the metabolic responses that facilitate AMR resistance. MATERIALS AND METHODS Data collection and antimicrobial susceptibility phenotypes. Resistance phenotypes and isolation country data for E. coli genomes were downloaded from the PATRIC database ([289]https://www.patricbrc.org/). We selected genomes that were annotated as either “susceptible” or “resistant” to a single antibiotic. These AMR phenotypes were derived from laboratory analyses only and included a mixture of both Clinical and Laboratory Standard Institute (CLSI) and European Committee on Antimicrobial Susceptibility Testing (EUCAST) AMR standards. The list of the laboratory method standards used to determine the AMR phenotypes is detailed for each isolate in the supplemental Excel file “Ecoli_genomes_metadata.xlsx,” which is provided on [290]https://github.com/tan0101/GSM_mSystems_2021. All the genome sequences of the isolates that were used in this study were listed in PATRIC as “good” quality assemblies. Isolates labeled “good” quality in PATRIC meet the criteria set by Parrello et al. ([291]117) that contamination is less than 10%, fine consistency is greater or equal to 87%, and the completeness of the sequence is greater or equal to 80%. We have also included only isolates that were labeled “WGS” (for whole genome sequenced) or “Complete” in the genome status in PATRIC, which removes any cases that are “plasmid-only.” Finally, we also included an additional filtering that removed any isolates with a contig number greater than 250, as previously done by Hyun et al. ([292]25). The 12 antibiotics chosen to be studied had at least 200 genomes annotated as “susceptible” or “resistant.” These antibiotics were ampicillin, aztreonam, cefepime, cefoxitin, cefuroxime, ciprofloxacin, gentamicin, levofloxacin, meropenem, tetracycline, tobramycin, and trimethoprim. These antibiotics encompass a range of classes, including beta-lactams, aminoglycosides, and carbapenems, as well as multiple generations of antibiotics. Genome assembly and annotation, in silico subtyping identification, pangenome construction, and core genome alignment. The genomes of all selected isolates were annotated with Prokka v1.13 ([293]118) using default parameters. All annotated files by the antibiotic model were taken as input for pan-genome analysis with core gene alignments through Roary v3.13.0 ([294]47). SNP sites 2.5.12 was then used for extracting the core gene variant sites from the core gene multiple alignment obtained for each different antibiotic. The variants were used as features for the machine learning classifiers ([295]119). Each core gene nucleotide sequence was further aligned, and single nucleotide variants were identified. The position of a SNP in a gene was selected as a feature in the machine learning if the nucleotide varied in more than 5% of strains (i.e., was constant in less than 95% of strains). Such variation could be determined only if including genes present in 100% of isolates (core genome) within the study population and by aligning the core genome (the genes present in all isolates). This is why we considered only the core genome of the pan-genome for this analysis. k-mer counting and dimensionality reduction. Lists of k-mers of length 13 which occurred in at least one of the genome files were generated for each antibiotic using all the genomes for each antibiotic with GenomeTester4 ([296]https://github.com/bioinfo-ut/GenomeTester4). These k-mers were then counted in each individual genome. All the counts were compiled into an n samples × n-k-mers + 1 matrix with the additional column for the resistance phenotype (0 for susceptible, 1 for resistant). Because the number of k-mers was on average over 2 million for each antibiotic, we performed an initial downsampling of the k-mers to be used as features using pairwise testing between any given k-mer and the resistance phenotype. We used the chi-square test to find the 10,000 features most associated with the resistance phenotype. SNP counting. The variant sites (SNPs) in the core genome alignment were extracted using a SNP sites tool ([297]www.github.com/sanger-pathogens/snp-sites) based on homologous gene groups produced by Roary ([298]https://github.com/sanger-pathogens/Roary). Each strain was processed according to its AMR phenotype for each antibiotic as follows. (i) First, the genome sequences were used to obtain the core gene sets, which are present in ≥99% of each set, ranging from 1,627 to 2,903 depending on the strain sets in each antibiotic. (ii) Each core gene nucleotide sequence was further aligned, and single nucleotide variants were identified. The position of a SNP in a gene was selected as a feature in the machine learning if the nucleotide varied in more than 5% of strains (i.e., was constant in less than 95% of strains). (iii) The data set of SNPs was assigned (1 for “A,” 2 for “G,” 3 for “T,” and 4 for “C”) for each allele of the strain as the matrix for machine learning. Machine learning. The k-mers and SNPs (features) were analyzed using the gradient boosting classifier (GBC) model in scikit-learn ([299]120) (v0.19.1) in Python (v3.6) using the default parameters. For both analyses, initially, the features were standardized by removing the mean and scaling to unit variance. The synthetic minority oversampling technique (SMOTE) ([300]121) was applied to oversample data of minority class, compensating for unbalanced classes. For the k-mer analysis, the data were split randomly using a fivefold stratified cross-validation, while for the SNP analysis, the data were split in 70% for training and 30% for testing. In both analyses, 50 iterations were carried out, and the following four performance metrics were recorded for each classifier, P and N indicating positive and negative cases, respectively, and T indicating true (correct) and F indicating false (wrong) predictions: * • Recall (true positive rate [TPR]) = TP/P * • Precision (positive predictive value [PPV]) = TP/(TP + FP) * • Accuracy (ACC) = (TP + TN)/(P + N) * • Area under the receiver operator characteristic curve (AUC) The mean of these 50 iterations was then used as the result statistic for the performance. Boxplots from the Seaborn ([301]122) package were used to show the final prediction metrics. While the model was being simulated, we captured the maximum importance of each k-mer or SNP, as well as the number of times each k-mer or SNP was assigned an importance greater than zero. The features were ranked using the maximum importance, that is, the maximum weight that the feature contributes to the GBC in the 50 runs. Features that had a maximum importance of zero were removed from the results. k-mer searching and SNP filtering. The GBC assigns an importance score between 0 and 1. This does not indicate which phenotype the k-mer is more associated with. To work out which phenotype the important k-mers were associated with, we compared the number of times a k-mer occurred in the susceptible condition to the number of times it occurred in the resistant condition. Next, we performed a chi-square test on these counts to determine whether there was a significant (P > 0.05) difference between the number of times a k-mer occurred in the susceptible or resistant condition. The k-mers found by the GBC were searched against the pan-genome of all our genomes using BLAST. The parameters for the search were as follows: E-value, 1,000; word size, 13 (same size as the k-mers); gap opening penalty, 5; and gap extension penalty, 2. The search hits were annotated by searching a gene transfer format (GTF) file corresponding to the pan-genome and by retrieving the information about each hit. Genome-scale metabolic model and flux balance analysis. The cobra toolbox in python was used for all simulations. The model iML1515 ([302]51) of E. coli K-12 MG1655 strain was downloaded from the BiGG database ([303]123) using the cameo python toolbox ([304]124). Flux balance analysis (FBA) and its variants were used to predict optimal flux distributions. FBA, based on linear programming, identifies the flux distribution that either minimizes or maximizes some objective function given a set of constraints ([305]125). All simulations assumed M9 minimal medium (unless stated otherwise), such that the sulfate, phosphate, and ammonium were allowed to freely enter the system. Oxygen uptake was constrained to have a maximum uptake of 18.5 mmol/g (dry cell weight [DCW])/h to mimic aerobic conditions ([306]126). A knockout model for each gene of interest was constructed by blocking all corresponding reactions to zero, given that the reaction is not catalyzed by an isozyme. Gene essentiality under various nutritional environments. FBA was used with maximization of growth rate as the objective function to predict the lethality of each gene of interest (i.e., those associated with the important features from the SNPs and k-mers). We considered the essentiality of a gene under both rich medium conditions and M9 minimal medium conditions. To mimic rich medium conditions, the model was constrained to allow all carbon sources into the system, with a fixed uptake rate of 1 mmol/gDCW/h. If a feasible solution exists, while maximizing the biomass equation as the objective function, then the knockout of the gene was not essential. To mimic M9 minimal medium conditions, the model was constrained so one individual carbon source had a maximum uptake of 10 mmol/gDCW/h. This simulation (minimal medium condition) was repeated for each carbon source in the model. The genes whose corresponding knockout model achieved a growth rate of 0.0001 h^−1 or less were considered essential. Calculating metabolite yields and construction of the gene-metabolite bipartite network. A drain reaction was added to iML1515 for each metabolite in the model (i.e., a reaction that consumes the metabolite of interest). The maximum theoretical yield of each metabolite was calculated by setting its corresponding drain reaction as the objective function, with glucose as the only carbon source in aerobic minimal M9 medium conditions. The simulations were carried out for the wild-type model and each gene knockout model. The maximum theoretical yield of metabolite i for the wild type (wt) and the jth gene knockout model is denoted as y[i,wt] and y[i,j], respectively. The networkx package ([307]127) in python was then used to construct a bipartite graph G = (U, V, E), such that the nodes, N, represent genes (U) and metabolites (V), and the edges, E, connect a gene in U to a metabolite in V. The ith metabolite is connected by an edge to the jth gene, if y[i,j] = 0, given that y[i,wt]>0. Networkx’s greedy modularity algorithm was applied to the network to assign genes and metabolites to clusters that were densely connected. The algorithm minimizes the number of interconnections between clusters, while maximizing the number of intraconnections. Cytoscape v3.7.1 was used for visualization of the clusters in the bipartite network ([308]128). Flux balance impact degree and construction of the gene-reaction bipartite network. We adapted the method by Zhou et al. ([309]129) to model the effects of genetic mutations on metabolic fluxes. Here, flux variability analysis (FVA) was applied to the wild-type model and each knockout model using the cobra toolbox in python ([310]130). FVA calculates the minimum and maximum flux through each reaction in the model, given a set of constraints, resulting in the range of possible fluxes for each reaction (flux span). FVA was simulated using glucose as the only carbon source in aerobic minimal M9 medium conditions. Note that reaction loops in the solution were not allowed. Similar to before with metabolite yields, a bipartite graph G = (U, V, E) was constructed using networkx, such that the nodes represent genes (U) and reactions (W), and the edges (E) connect a gene in U to a reaction in W. The ith gene is connected by an edge to the jth reaction, if the knockout of the ith gene reduces the flux span by at least 10% compared to the wild type. As before, networkx’s greedy modularity algorithm was applied to assign genes and metabolites to a cluster in order to identify groups of genes that have a similar impact on the metabolic fluxes. As with the metabolite yields, Cytoscape v3.7.1 was used to then visualize the clusters in the bipartite network. Gene pathway enrichment analysis. We identified metabolic pathways that were enriched in each cluster of the bipartite networks using hypergeometric enrichment tests using the scipy function hypergeom ([311]131). We considered a pathway as significantly enriched in a cluster if the false discovery rate (FDR) was less than 1% and used the Benjamini-Hochbery method for correction against multiple testing. We considered two sets of pathway lists for the enrichment. The first used the 40 subsystems as defined in the iML1515 GSM. A second list of pathways was downloaded from the BioCyc database using the SMART tables for E. coli ([312]52), which provided a more extensive list of specific metabolic pathways. Data availability. The accession numbers of the 3,616 E. coli genomes and the metadata available from PATRIC and used in this study, as well as the code used for the machine learning analysis and Jupyter notebooks for the genome-scale model analysis (including bipartite network generation and plotting of results and the Cytoscape networks), are available on [313]https://github.com/tan0101/GSM_mSystems_2021. FIG S7 (a) Heatmap showing the different subsystems that are affected by the genes in the gene-reaction network in [314]Fig. 6a. The numbers in the matrix correspond to the number of reactions, per metabolic subsystem, that are blocked by the deletion of the corresponding gene. The genes are grouped along the x axis according to the nine clusters shown in [315]Fig. 6a. Download [316]FIG S7, PDF file, 0.1 MB^ (58.4KB, pdf) . Copyright © 2021 Pearcy et al. This content is distributed under the terms of the [317]Creative Commons Attribution 4.0 International license. Supplementary Material Reviewer comments [318]reviewer-comments.pdf^ (777.3KB, pdf) ACKNOWLEDGMENTS