Abstract Infectious disease is the result of interactions between host and pathogen and can depend on genetic variations in both. We conduct a genome-to-genome study of paired human and Mycobacterium tuberculosis genomes from a cohort of 1556 tuberculosis patients in Lima, Peru. We identify an association between a human intronic variant (rs3130660, OR = 10.06, 95%CI: 4.87 − 20.77, P = 7.92 × 10^−8) in the FLOT1 gene and a subclavaluee of Mtb Lineage 2. In a human macrophage infection model, we observe hosts with the rs3130660-A allele exhibited stronger interferon gene signatures. The interacting strains have altered redox states due to a thioredoxin reductase mutation. We investigate this association in a 2020 cohort of 699 patients recruited during the COVID-19 pandemic. While the prevalence of the interacting strain almost doubled between 2010 and 2020, its infection is not associated with rs3130660 in this recent cohort. These findings suggest a complex interplay among host, pathogen, and environmental factors in tuberculosis dynamics. Subject terms: Disease genetics, Tuberculosis __________________________________________________________________ Here, the authors conduct a paired genome analysis of humans and Mycobacterium tuberculosis in a cohort from Lima, Peru. Their research reveals that genetic variations in both the host and pathogen significantly influence the progression to tuberculosis. Introduction Infectious diseases account for much of the burden of all diseases worldwide, but their host genetic architecture is less well understood than many other types of complex traits, despite having comparable genetic heritability^[60]1,[61]2. Infectious diseases occur only when pathogenic organisms infect a host, and thus they are unique in that the effect of host risk alleles may be influenced by genetic variation within the pathogen as well as by environmental factors. In tuberculosis (TB), the pathogenic organism Mycobacterium tuberculosis (Mtb) has co-existed with humanity for millennia and may be co-evolved. Currently Mtb is estimated to infect one-fourth of the population worldwide^[62]3 and in 2019 alone, ~1.4 million people succumbed to it^[63]4. Human genome-wide association studies (GWAS) performed by us and others have identified only a few confirmed disease alleles in TB. Most TB risk alleles appear to be unique to populations within restricted geographical locations^[64]5–[65]10. One possibility is that human genetic susceptibility is strain specific and host genetic risk therefore varies with the varying prevalence of different Mtb strains in different locales. If true, this interaction could manifest in a statistical association between host and Mtb genetic variations^[66]11,[67]12. Previous studies in TB have explored the possibility that human host alleles are associated with clinical phenotypes, such as disease severity and age of onset, in a Mtb lineage specific manner^[68]13,[69]14. A full variant-to-variant search among Mtb and human host genomes might reveal heterogeneity beyond established Mtb lineages and may implicate novel human risk loci. Scaling of sequencing and genotyping technologies now makes a pathogen-host genome-wide examination possible. Here, we sought to use a genome-to-genome (g2g) approach to examine both genomes comprehensively and find genetic determinants of host and pathogen interactions in progression to TB pulmonary disease. Results We hypothesized that host genetic variation can predispose certain individuals to a higher risk of disease from specific bacterial lineages or clades, including those that are not yet defined. If these differences in bacterial prevalence across human hosts are genetically driven, we should be able to identify an association in TB patients between host and bacterial genetic alleles. We tested this hypothesis in a cohort composed of 1556 TB patients in Lima, Peru, from whom we collected both human genotype and Mtb whole-genome sequences (WGS) (Fig. [70]1a, Supplementary Data [71]1). After quality control, we obtained 676,110 genotyped human variants with a minor allele frequency ≥1%. We focused our g2g analysis on 2298 out of 45,831 called Mtb variants with an allele frequency between 5% and 95%, since statistical power for rare Mtb alleles would be limited. Fig. 1. Human-to-Mtb genome-wide association study in 1556 tuberculosis patients. [72]Fig. 1 [73]Open in a new tab a Study design schematic. We obtained DNA from 1556 Peruvian individuals with TB disease and cultured pathogens to perform host genotyping and Mtb WGS. The genotype of each common Mtb variant was considered as the response variable (Y: 0 or 1), and the genotype of each host variant was the independent variable (X: 0, 1 or 2), resulting in one test per host SNP-Mtb SNP pair. b Grid plot summarizing the genome-to-genome analysis. The x-axis denotes position within the human genome with alternating colors (white and light gray) for each chromosome. The y-axis denotes position within the Mtb genome. Point colors represent the association p-value (-log[10](P)) from the mixed effect logistic regression. The most significant host-Mtb pair association is indicated. Six randomly chosen Mtb variants in tight linkage (Pearson r^2 > 0.8) with position 271640 are shown in light blue, indicating that the same human variant rs3130660 is significantly associated with multiple Mtb positions. c Manhattan plot of the GWAS analysis when treating genotypes of Mtb position 271640 as the outcome. The x-axis indicates genomic location, where as the y-axis shows the (-log[10](P)) from mixed effect logistic regression model (d) A maximum likelihood phylogenetic tree inferred from 13,981 variants of 1,555 Peruvian Mtb isolates (excluding one Lineage 1 sample for visualization purposes). Branch colors represent the inferred lineages. Filled squares on the right indicate the presence (red) or absence (gray) of the six Mtb variants identified in the g2g analysis and highlighted in (b) Source data for (a).-c are provided in the Source Data 1 file. Source data for (d) are provided in Source Data 2 file. We tested whether the co-occurrence of each pair of common bacterial and genotyped human variants was higher than expected by performing a mixed effects logistic regression. We considered the presence or absence of each common Mtb SNP as a binary trait and assumed an additive model correcting for human cryptic relatedness, population structure, age and sex. To reduce the computational burden, we only included 1267 (out of 2298) common Mtb SNPs that were not in near perfect linkage (Pearson r^2 < 0.99) in the association tests. Similar to performing GWAS on two correlated traits (e.g., blood pressure and cholesterol level), we allowed correlation in our outcome variables (Mtb SNPs) and did not correct for Mtb structure in our model. As a result, a single host variant can be associated with multiple Mtb variants. In total, we ran >850 million regression models between every common Mtb and host variant pair (Fig. [74]1b). We examined genomic inflation factors (λ[gc]) of each of 1267 GWAS and observed no inflation of test statistics (median λ[gc] = 1.00, Supplementary Fig. [75]1), suggesting that our model is robust to false-positive findings. Genome-to-genome study identifies human-Mtb genomic associations We identified an association between an intronic human variant, rs3130660, located in the 6p21 region on chromosome 6 and a phylogenetic marker (Position 271640) of a subclade of Lineage 2 (L2) Mtb (OR = 10.06, 95% CI: 4.87 − 20.77, P = 7.92 × 10^−8, Fig. [76]1c, Supplementary Data [77]2). Individuals with each rs3130660-A allele were 10 times more likely to be infected with Mtb strains carrying the interacting Mtb variant marker. To increase confidence in our reporting, we next performed a permutation analysis to estimate the likelihood of our observed P-value under the null. We found that the likelihood of obtaining results similar to the observed results is <1% (P < 5.91 × 10^−7 assuming a 5% false-discovery rate, Supplementary Fig. [78]2). The associated Mtb variant is in tight linkage with 78 of the 2298 common Mtb variants (Pearson r^2 > 0.8, Supplementary Data [79]3). To better understand these 79 Mtb variants, we constructed a maximum likelihood phylogenetic tree using WGS of 1556 collected isolates. In this analysis, we looked at 13,981 Mtb variants which included both the 2298 common variants along with 11,683 variants that passed stringent quality control. We classified the strains into well-defined Mtb lineages, including L2 and L4.1–8 using previously defined lineage defining markers^[80]15. We observed that previously defined lineages fell within distinct branches of the phylogenetic tree (Fig. [81]1d). We observed that all 79 Mtb SNPs were present in the same L2 subclade of the phylogenetic tree. We henceforth define the clade by the presence of the g2g Mtb variant (Position 271640) as the g2g-L2 clade. Among the 1556 Mtb isolates, 102 were g2g-L2 (6.56%). To test whether this result reflected social mixing patterns within the community or a true biological association, we repeated our analysis adjusting for year of TB diagnosis, Mtb population structure (as reflected by the top two principal components), none of which significantly changed the reported association (Supplementary Fig. [82]3). Although the prevalence of non-L2 Mtb strains remained consistent over the 2-year collection period, we observed an increase in the associated Mtb clade (g2g-L2) defined by the phylogenetic marker (Position 271640), from 5.4% to 8.9% (Supplementary Fig. [83]4a). However, this did not alter our association strength (Supplementary Fig. [84]3). We also observed that the associated Mtb clade remained consistent over geographical space (Supplementary Fig. [85]4b), reinforcing the conclusion that the reported signals are independent from these covariates. We next considered the possibility that the observed association between a host allele and the Mtb g2g-L2 subclade was driven by host ancestry (Supplementary Fig. [86]5) or Mtb lineage (Supplementary Fig. [87]6). We examined host alleles for associations to the two common Mtb lineages in Peru (L2 and L4), and found the strongest association was between the same host allele and the L2 lineage, but that this association was substantially weaker than with the g2g-L2 lineage (rs3130660, OR = 5.96, 95% CI: 2.97-11.97, P = 1.42 × 10^−6, Supplementary Fig. [88]7, Supplementary Data [89]4). A conditional analysis including g2g-L2 obviated this association (P = 0.97). In contrast, adding L2 as a covariate to our original model did not obviate the association with g2g-L2 (P = 4.21 × 10^−15). This result suggested that the observed association between rs3130660 and L2 was driven by the g2g-L2 subclade rather than L2 more broadly. Host variant associated with Mtb diversity colocalizes with expression of multiple genes in lung and other tissue types The host genetic variant (rs3130660) associated with g2g-L2 infection is intronic to the FLOT1 gene. FLOT1 is a lipid raft-associated scaffolding protein that plays a role in membrane trafficking and phagosome maturation^[90]16,[91]17. We investigated whether this SNP modulates the expression levels of nearby genes by performing an expression quantitative trait loci (eQTL) analysis. Using the Genotype Tissue Expression (GTEx release v8^[92]18) database in lung, we observed that rs3130660-A is associated with increased FLOT1 expression (P = 2.22× 10^−16, Supplementary Fig. [93]8), increased PPP1R18 expression (P = 3.75 × 10^−7) and multiple other genes in the region (Supplementary Fig. [94]9). Since our variant is within the MHC class-I region, to increase the resolution of our reported associated region, we performed HLA imputation using a multi-ancestry MHC reference panel^[95]19. After imputation, rs3130660 remained the strongest signal in the region (Supplementary Fig. [96]8). To understand whether the rs3130660 allele corresponded to the signal reported in eQTL studies, we applied a colocalization analysis using the coloc software^[97]20. Using 109 RNA sequencing datasets included in the eQTL catalog release 6^[98]21, we assessed colocalization signals between all 48 protein coding genes within a 700 kb window of rs3130660 (Supplementary Fig. [99]10, Supplementary Data [100]5). We note that there are differences in genetic ancestry between our study and reported eQTL studies, which were predominantly conducted among individuals of European ancestry, and this might have reduced our power to detect colocalizing signals across these datasets^[101]22. Despite this limitation, we found evidence of colocalization between the identified interacting host SNP and multiple genes within the MHC class-I region, such as FLOT1, IER3 and PPP1R18 eQTL association in 14 cell and tissue types (posterior probability >0.75), including FLOT1 in T-cell (posterior probability = 0.96), IER3 in whole blood (posterior probability = 0.90) and PPP1R18 in thyroid (posterior probability = 0.90). These results provided strong evidence that the strain-associated host variant regulates expression of multiple genes in the MHC class-I region. Interacting host allele and Mtb strain show distinct global transcriptomic effects in macrophages To experimentally test the plausibility of the identified host and Mtb genetic association, we assessed the responses of human macrophages from Peruvian donors with different rs3130660 genotypes to infection with g2g-L2 and nearest neighbor L2 Mtb strains, here called “non-g2g-L2”. To this end, we obtained macrophages from three randomly selected Peruvian donors from our cohort who carried the risk allele A (rs3130660-AT) and three donors who did not (rs313060-TT). We infected these monocyte-derived donor macrophages (MDMs) with three g2g-L2 strains and three non-g2g-L2 strains with similar in vitro growth characteristics, and measured macrophage transcriptional responses by RNA-sequencing (Supplementary Fig. [102]11), which we interpret as intermediate traits relevant to human infection outcomes consistent with other published studies^[103]23–[104]26. To explore the effect of g2g-L2 and non-g2g-L2 Mtb infection in rs3130660-AT and TT donors in macrophages, we first scored the data for “response to infection” using a gene model that includes the 20 most highly induced genes identified in an independent study of human MDM responses to infection with the reference Mtb strain^[105]27 (Supplementary Data [106]6). We integrated expression values of these 20 genes under each condition to test whether rs3130660-AT and TT donors exhibit different transcriptional responses upon g2g-L2 versus non-g2g-L2 infection. We saw quantitative transcriptional differences in responses impacted by host allele and Mtb strain with AT donors manifesting larger transcriptional responses than TT donors (P[t-test] = 7.00 × 10^−7). This response was blunted in g2g-L2 strain infection as compared to non-g2g-L2 infection in both host AT (P[t-test] = 0.0068) and TT backgrounds (P[t-test] = 0.0042, Fig. [107]2a). Fig. 2. Human monocyte-derived macrophage (hMDM) transcriptional response to g2g-L2 and non-g2g-L2 infection. [108]Fig. 2 [109]Open in a new tab a From n = 12 samples, the infection score is calculated based on gene expression levels of the top 20 infection-induced genes from an independent study. P-values are calculated by two-tailed pairwise student’s t-test. b, c show volcano plots of differentially expressed (DE) genes specific to bacterial strain or donor rs3130660 genotype, respectively. Significance was called by FDR-adjusted P-value < 0.01 and log[2](fold-change) >0.7. d Hierarchical clustering of gene expression based on the union DE gene sets from (b–e) Pathway analysis of genes in each cluster from (e). f, k the expression of genes at the cis-region of rs3130660, specifically IER3 (f), VARS2 (g), ZNRD1 (h), FLOT1 (i), HLA-E (j) and PPP1R18 (k). Each dot represents the average gene log[2](TPM + 1) of g2g-L2 (red) or non-g2g-L2 (blue) infection within individual AT or TT donors from n = 12 samples. The P-values are calculated by two-tailed pairwise Student’s t-test either between AT and TT donors or between g2g-L2 and non-g2g-L2 within donors with the same genotype. Source data are provided in the Source Data 1 file. To further understand how differences in host allele and Mtb strain alter the global transcriptional macrophage response to infection, we extracted differentially expressed genes (DEGs, >0.7 log[2](fold-change), FDR-adjusted P-value < 0.01, Fig. [110]2b, c, Supplementary Data [111]7). We performed hierarchical clustering based on 184 Mtb strain-specific response genes and 924 rs3130660 genotype-specific response genes. We identified four major expression clusters in response to Mtb infection (Fig. [112]2d) and performed pathway enrichment analysis to describe the biological processes govened by these gene expression clusters (Fig. [113]2e). We observed rs3130660-AT donors had higher expression of genes implicated in both Type 1 and Type 2 interferon pro-inflammatory signaling; MHC-I antigen processing and presentation, IL-1B signaling, cytosolic DNA sensing and zinc homeostasis. We found rs3130660-TT donors expressed higher levels of genes implicated in altered metabolism and protein disulfide bond formation including thioredoxin and glutathione related enzymes. These responses differed quantitatively by the infecting Mtb strain with g2g-L2 strains inducing higher levels of IL-1B signaling and altered zinc responses in both donor alleles while non-g2g-L2 strains induced higher expression levels of genes involved in interferon signaling and MHC-I antigen processing and presentation (Fig. [114]2f). Taken together, these data are consistent with a model in which rs3130660-AT donors mount a stronger transcriptional response to infection, perhaps because of more sensitive induction of innate responses but that response is skewed towards interferon after non-g2g-L2 infection, in contrast to IL-1B dominant responses after g2g-L2 infection. To assess if the rs3130660-A mutation alters transcriptional responses on chromosome 6 in addition to FLOT1, we next performed a directed analysis of the 30 genes on chromosome 6 previously implicated in the eQTL analyses. Among the 30 protein-coding genes that pass quality control flanking the rs3130660 region, we observed no statistical differences in gene expression between donors with AT and TT genotype in the absence of infection (no P[t-test] < 0.05, Supplementary Fig. [115]12). However, after infection, we observed 15 genes with significant differences between the two host genotype groups (P[t-test] < 0.05, Supplementary Fig. [116]13), including IER3, VARS2 and ZNRD1 involved in immune response, ER stress and zinc homeostasis (Fig. [117]2f–h). Among the 15 genes, 8 gene expression levels were further modified by the Mtb strains, including FLOT1, HLA-E and PPP1R18 (P[t-test] < 0.05, Fig. [118]2i–k). Taken together, these data reveal host-pathogen transcriptional interactions with highly consistent but directionally distinct quantitative effects of both the host allele and bacterial strain. To test for differential FLOT1 expression in the absence of host genotype variation, we evaluated FLOT1 expression in MDMs from three healthy non-Peruvian donors after Mtb infection (Supplementary Fig. [119]14). Using an expanded Mtb strains set, we found FLOT1 expression was significantly lower in the setting of g2g-L2 compared to non-g2g-L2 infection (two-way ANOVA, P-values = 0.0650, 0.0089, 0.0022 across three donors, Fig. [120]3a). We then profiled the immune response in these three donors using a panel of Nanostring probes for genes involved in metabolism and myeloid immune responses to infection. Using bacterial-specific DEGs identified in the Peruvian donor MDMs (Fig. [121]2b), we found a significant correlation between differential gene expression in the Peruvian and local donors when comparing g2g-L2 and non-g2g-L2 infection (P-value = 0.0045, Fig. [122]3b). Additionally, scoring the data for the response to infection using the same 20 highly induced genes (Fig. [123]2c), we again observed quantitative transcriptional differences in response to g2g-L2 and non-g2g-L2 strains in multiple donors (two-way ANOVA, P-value = 0.0021, 0.0185, 0.4037, Fig. [124]3d, Supplementary Data [125]6). We found that there was significantly higher induction of genes involved in Type I IFN signaling by non-g2g-L2 strains and dampening of this response by g2g strain infection in two of three donors (two-way ANOVA, P-value = 0.0021, 0.3119, 0.0126, Fig. [126]3e, Supplementary Data [127]8). Fig. 3. Boston donor human monocyte-derived macrophage (MDM) transcriptional response to g2g-L2 and non-g2g-L2 infection. [128]Fig. 3 [129]Open in a new tab a FLOT1 expression of MDM from three local anonymous donors after g2g-L2 or non-g2g-L2 infection (5 representative strains, n = 5 per donor). Data is presented as mean +/- SD. A two-way ANOVA (two-sided, Sidak’s multiple comparison test) was performed to determine statistical significance for each donor. b Pearson correlation (two-sided) between bacterial-specific DEGs (Fig. [130]2b) of the Peruvian donor MDMs (n = 12 samples) and their expression in healthy donor MDMs (n = 3 samples). Genes are colored according to their upregulation in the g2g-L2 (red) or non-g2g-L2 (blue) infected Peruvian MDMs. c Heatmap of the top 20 infection-induced genes in g2g-L2 and non-g2g-L2 infected Boston donor MDMs (n = 3). d Canonical Mtb infection gene module score and (e). interferon α/β signaling gene module score after infection with representative g2g-L2 or non-g2g-L2 strains in the local donor MDMs (5 representative strains, n = 5 per donor except donor 1 non-g2g-L2 n = 4). Data are presented as mean +/- SD. A two-way ANOVA (two-sided, Sidak’s multiple comparison test) was performed to determine statistical significance for each donor. Source data are provided in the Source Data 1 file. The Mtb g2g-L2 subclade displays altered redox metabolism Next, we sought to identify biochemical features distinguishing the g2g-L2 from non-g2g-L2 strains. Strains. However, the Mtb envelope contains 109 subclasses of lipids^[131]28 which include many known determinants of differential Mtb virulence^[132]29–[133]31. Therefore, we performed an unbiased whole cell lipidomics analysis from three g2g-L2 and eight non-g2g-L2 strains, which detected 28,209 distinct molecules. No differentially abundant lipid (P-value < 0.05 after adjustment by the Benjamini-Hochberg method) was found with respect to g2g-L2 status (Supplementary Data [134]9), which is most consistent with equivalent lipid compositions among the set of L2 isolates. We next developed a high throughput imaging-based phenotyping platform to agnostically identify intrinsic features that distinguish Mtb strains, which we interpret as intermediate traits for more complex biological processes^[135]32 (Fig. [136]4a, Supplementary Fig. [137]15). We assessed seven functional phenotypes including cell morphology (length, width and area), total cellular lipid content, chromosomal DNA content with DAPI staining and growth dynamics inferred by pulsing with fluorescently tagged D-amino acids (NADA) which are incorporated into nascent peptidoglycan. Redox state is inferred from autofluorescence at 420 nm; signal is generated by the oxidized form of F[420], a flavin derived cofactor derived cofactor named because of its 420 nm absorption peak at its oxidized state^[138]33,[139]34. We phenotyped 23 g2g-L2 and 11 non-g2g-L2 strains and found a significantly higher autofluorescence signal in g2g-L2 compared to non-g2g-L2 strains (P[Wilcox-test] = 2.0 × 10^-4, Fig. [140]4a, b), indicative of a shift in the redox state of F[420] towards a more oxidized state^[141]35. Consistent with this interpretation, the g2g-L2 strains were less sensitive to growth inhibition caused by pretomanid, an antibiotic pro-drug that requires the reduced form of F[420] to be activated^[142]36 (P[Wilcox-test] = 0.032, Supplementary Fig. [143]16). The F[420] cofactor pools are linked to the redox state of the other major redox cofactor pools, and the NAD + /NADH ratio was also significantly higher in g2g-L2 strains, indicating a broader change in cellular redox balance (P[t-test] = 0.0106, Fig. [144]4c). Further, g2g-L2 strains were more significantly resistant than control L2 strains to redox stress induced by menadione, which causes redox cycling and has been shown to lead to decreased NAD + /NADH ratio in Mtb^[145]37 (two-way ANOVA, P-value = 0.0165, Fig. [146]4d) but do not differ in susceptibility to an exogenous oxidant (H2O2) (two-way ANOVA, P-value = 0.433, Fig. [147]4e). Fig. 4. Functional characterization of g2g-L2 Mtb strains. [148]Fig. 4 [149]Open in a new tab a Violin plots for each phenotype measured by high-throughput microscopy, showing the distribution of each feature between the g2g-L2 and non-g2g-L2 strains. Samples were assayed at minimum in duplicate. The line inside each plot indicate indicates the median. P-values obtained by a Wilcoxon test. The red dotted line indicates the Bonferroni corrected significance threshold after multiple testing (-log[10](0.05/7)). b Representative images of autofluorescence signals in two representative g2g-L2 strains and two non-g2g-L2 strains. Scale bar: 5 µm. Images are representative of two independent experiments and the remainder of the g2g-L2 and non-g2g-L2 strains. c Total NAD was extracted from five g2g-L2 and five non-g2g-L2 Mtb strains and the NAD + /NADH ratio was determined. Each point represents the average of two independent replicates per strain (n = 5). Data are presented as mean +/- SD. A two-tailed unpaired t-test was used to determine statistical significance between the groups. Mtb mid-log phase cultures were treated with (d) 50 uM menadione for 24 h or (e) 25 mM H2O2 for 4 h, and surviving CFUs were determined by plating. A total of 10 Mtb strains were used (five g2g-L2 and five non-g2g-L2), with two independent replicates per strain. Data are presented as mean +/- SD. A two-way ANOVA (two-sided, Sidak’s multiple comparison test) was used to determine statistical significance between the groups. TRFS-green was incubated with (f) four g2g-L2 and four non-g2g-L2 Mtb strains (n = 5 per strain) or with (g) M.smegmatis strains constitutively expressing either the g2g or non-g2g-L2 variant Rv3913-3914 operon (n = 5 per strain). Fluorescence intensity was measured over time, the mean AUC for each strain was quantified and a two-tailed unpaired t-test was used to determine statistical significance. Data are presented as mean +/- SD. h Total NAD was extracted from a wildtype M.smegmatis strain, or M.smegmatis constructs constitutively expressing either the g2g or non-g2g-L2 variant Rv3913-3914 operon, and the NAD + /NADH ratio was determined (n = 3). Data are presented as mean +/- SD. A one-way ANOVA (two-sided, Tukey’s multiple comparison test) was used to determine statistical significance between groups. Source data are provided in the Source Data 1 file. We sought to identify putative Mtb genetic variants that might be driving this functional phenotype. There are 49 nonsynonymous changes among 79 Mtb SNPs defining g2g-L2 (Supplementary Data [150]3). Many of these occur in genes that belong to redox-related pathways (Supplementary Fig. [151]17). Among these genes, the most biologically likely candidate mediator of altered redox state was trxB2, encoding thioredoxin reductase, which isomerizes protein disulfide bonds using NADP+ as a cofactor. In g2g-L2 strains, trxB2 has a Thr2Asn mutation strongly predicted to alter protein function (SIFT score = 0.01). We used a published fluorescent substrate-based reporter probe^[152]38,[153]39 to assess thioredoxin reductase activity in a panel of four g2g-L2 and four non-g2g-L2 strains and found that the g2g-L2 strains have significantly higher activity (P[Wilcox-test] = 0.0024, Fig. [154]4f). We sought to investigate whether the Thr2Asn variant in trxB2 was sufficient to generate this increase in activity. Ideally, we would seek to validate the effect of the Thr2Asn in Mtb, but the construction of an isogenic allelic variant in an essential gene in a clinical strain of Mtb has not yet been successfully accomplished. As an alternative, we expressed the trxB2 operon containing the Thr2Asn or wildtype alleles of trxB2 in the model mycobacterium, M. smegmatis, a model organism used to dissect the essential cell biology of mycobacteria including Mtb. Consistent with the Mtb data, expression of the trxB2 Thr2Asn operon resulted in significantly more TrxB2 activity than the wildtype operon (P[Wilcox-test] = 0.0097, Fig. [155]4g). The strain expressing the trxB2 Thr2Asn operon also had a significant shift in the NAD + /NADH ratio toward the oxidized state as compared to the wildtype operon, consistent with the Mtb g2g-L2 phenotypes and the model that the trxB2 Thr2Asn variant is a gain of function mutation (P[t-test] = 0.0139, Fig. [156]4h). Mtb g2g-L2 subclade is recently expanded in Peru and transmits differently than other L2 strains Our data suggest that the g2g-L2 subclade evolved distinct biologic features that interact with host cells. To further understand the effects of these strain differences in the broader context of Peruvian L2 strains. To this end, we used 178 L2 isolates from this study and 77 previously collected L2 strains from the same population to reconstruct a maximum likelihood phylogenetic tree^[157]40,[158]41. We found three major clades: clade A consisting of the g2g-L2 isolates, and clades B and C (Fig. [159]5a). We merged the whole genome sequences of these L2 clade isolates with 1000 global L2 isolates (Supplementary Data [160]10); the L2 strains from Peru formed subclades that were largely separated from other global L2 isolates (Supplementary Fig. [161]18). This pattern of restriction suggests that Mtb L2 strains were introduced to Peru and then diversified locally. Fig. 5. Phylogenetic structure of the g2g-L2 clade associated with the human alleles. [162]Fig. 5 [163]Open in a new tab a A phylogenetic tree of L2 constructed with 255 L2 Peruvian isolates. The g2g-L2 clade identified via the g2g analysis is highlighted in red (Clade-A), two other Peruvian subclades of L2 are highlighted in blue (Clade-B) and green (Clade-C) respectively. *ybp years before present. Estimated emerging time (median value, with 95% highest posterior distribution in bracket) for the ancestor strains and cluster rate when using 6-SNP distance of each marked L2 clade (Clade-A, B and C) are listed. b Histogram of pairwise minimum SNP distance to the closest neighbors within the marked clades. c Comparison of transmission cluster rate between the three marked clades when using 6-SNP and 12-SNP distance as the threshold. d The percentage of g2g-L2 strains among all co-circulating strains from the 2010 cohort and the 2020 cohort. e The percentage of g2g-L2 strains among L2 strains from the 2010 cohort and the 2020 cohort. P-values shown in (c) and (d) were obtained by two-sided Fisher’s exact test. Source data for (a). are provided in the Source Data 3 file. Source data for (b). -d are provided in the Source Data 1 file. We next performed a Bayesian coalescent analysis to estimate the emerging times of these three L2 subclades^[164]42. These results suggest that the g2g-L2 (clade A) was introduced to Peru more recently than the other two L2 subclades B and C (62.1 versus 138.7 and 112.3 years ago, Fig. [165]5a). We expect that locally diversified Mtb strains which were introduced into a human population earlier would have achieved larger population sizes^[166]43,[167]44. Instead, we found that g2g-L2 strains have a larger than expected population size. Among 255 participants infected with L2 in our cohort, 150 of 255 (58.8%) of them were infected by g2g-L2 strains (clade A), which is much higher than the number of infections caused by clade B or C (18.0% and 6.7%). Thus, the g2g-L2 strains may have undergone a recent, rapid expansion, consistent with increased transmissibility. To further assess transmissibility, we performed transmission cluster analysis. Two TB cases were considered to belong to the same transmission cluster if they were separated by <6 SNPs or the less stringent 12-SNP cut-off. By both criteria, the g2g-L2 cluster rate is much higher than that of other L2 strains collected from the same region (Fig. [168]5c). These data suggest that the g2g-L2 clade is associated with a higher rate of recent transmission in the Peruvian population. Temporal trajectory of host and pathogen interactions Finally, we sought to evaluate the population behavior of g2g-L2 and the interaction between g2g-L2 and rs3130660 in an independent patient cohort. We initiated a second cohort in 2020, coincident with the COVID-19 surge in Peru. Among the 699 TB cases, 88 (12.6%) were caused by g2g-L2, which represents nearly double the prevalence from 10 years, which was 6.6% (Fig. [169]5d, Supplementary Fig. [170]19). Most of g2g-L2 success came at the expense of other L2 strains–by 2020 g2g-L2 had nearly completely replaced non-g2g-L2 strains in Lima (Fig. [171]5e). These findings are consistent with the population expansion and increased transmissibility of g2g-L2 suggested by our analysis of the 2010 cohort. Given the initial effect size of OR = 10.06 for the association between rs3130660 and g2g-L2 strain, we estimated that in this cohort we would have had 76% power to reproduce this association with one-tailed p < 0.05. After genotyping these samples, we no longer identified an increased association between individuals carrying the rs3130660-A allele and TB disease caused by g2g-L2 strains (P-value = 0.49, OR = 0.57, 95% CI: 0.12 - 2.80, Supplementary Data [172]11). We considered several reasons for this null association. It is possible that the initially reported statistical association was a false-positive. The relatively small sample size in our study may have introduced bias into the estimates, affecting the accuracy of the logistic model. This bias could have arisen from unrepresentative data or chance variability. Alternatively, it is possible that the host association is not temporally stable for example because of interaction with an environmental factor. Given the strong interactive effects of g2g-L2 and rs3130660 on Type 1 interferon signaling, we specifically considered the possibility that COVID infection might modify TB risk in an Mtb strain and/or host dependent fashion. We did not have sufficient representation of AT individuals in the current cohort to assess COVID outcomes as a function of host genotype but we were able to assess COVID outcomes associated with Mtb strains. We observed a trend for patients with a self-reported history of COVID to be less likely to have g2g-L2 infection (8.6%) than patients with no self-reported COVID history (13.1%, P-value = 0.121, Fisher’s exact test). These data suggest that g2g-L2 is a highly transmissible strain distinguished most notably in its host interactions by altered interferon signaling. However, our data further suggest that dynamic host genetic and environmental factors impact the TB disease manifestations of this strain. Discussion Previous work has shown that Mtb lineages are often restricted to particular geographical regions while other lineages are globally distributed, suggesting that Mtb can adapt to diverse natural environments and host populations^[173]12,[174]45. In this work, we examined the hypothesis that genomic interactions between host and pathogen alleles can modify the risk of TB disease by performing a paired genome analysis using host and bacterial samples from a large cohort of TB patients in Lima, Peru and a smaller, independent cohort of TB patients from the same region 10 years later. Our analysis of the 2010 cohort indicated that a host variant in the FLOT1 gene was preferentially associated with infection with a highly transmissible and functionally distinct subclade of L2 Mtb, g2g-L2 that emerged in Peru in the 1950’s. Multiple lines of evidence supported both the biologic features conferred by the associated host allele and the g2g-L2 strains as well as the biologic association between the host FLOT1 variant and response to g2g-L2 Mtb strains. FLOT1 is a lipid-raft associated membrane protein that has been previously implicated in host-pathogen immunity including control of Mycobacterium marinum^[175]46,[176]47. FLOT1-dependent microdomains are present on the phagolysosome where they act as platforms for the assembly of NADPH oxidase complexes and vATPase^[177]17. This function contributes to antifungal immunity, and FLOT1 alleles are associated with invasive aspergillosis^[178]47. FLOT1 may play a similar role in Mtb-macrophage interactions. This protein also contributes to other signaling and cell migration pathways^[179]48–[180]51, raising the possibility that altered FLOT1 expression could have additional effects on the host immune response to Mtb. Finally, FLOT1 may not be the only effector of the g2g-L2 associated human risk allele. The associated intronic variant is in linkage with and predicted to alter the expression of other immune related genes and several of these, including PPP1R18, an adapter protein of RAPK signaling, and HLA-E, which can present Mtb peptides^[181]52, are also differentially expressed after Mtb infection. Consistent with this hypothesis, macrophages from donors with different genetic backgrounds, referred to by rs3130660-AT and TT, differed in their responses to Mtb infection and these differences were modified by the infecting strain genotype. Donors with the AT genotype exhibited a robust response skewed towards Type 1 IFN related genes. Interestingly, g2g-L2 strain infection dampened the IFN response in all donors, instead promoting an IL-1B-dominated response. This shift in transcriptional response might be attributed to the known negative regulatory relationship between Type 1 IFNs and IL-1B^[182]53,[183]54. Further, TT donor macrophages were distinguished by their genes implicated in redox balance. This was striking to us because, using unbiased phenotypic profiling, we identified altered redox balance, and resistance to reductive stress as defining characteristics of g2g-L2 strains in vitro due to a gain of function mutation in thioredoxin reductase, trxB2. These data suggest that g2g-L2 strains may be both inducing a distinct immunometabolic response and adapting to the immunometabolic environment that they induce. There could be other g2g-L2 mutations contributing to the altered macrophage responses. For example, eccCb1, a core component of the ESX1 secretion system which is required for Mtb to induce the Type 1 IFN response in infected macrophages, carries a S74F mutation (SIFT = 0.01) which could abrogate ESX1 function. This would be consistent with the reduced expression of the Type 1 IFN gene signature in macrophages after g2g-L2 function although loss of ESX1 function otherwise would be counter-dogmatic as it is thought to be required for virulence in humans. Given the experimental data supporting the distinct biologic features of the rs310660 host variant and g2g-L2 Mtb, as well as the association between the two, certain outcomes from the 2020 cohort were unexpected. Whereas the proportion of rs310660-AT donors was unchanged while the prevalence of g2g-L2 had nearly doubled, essentially replacing the other L2 strains. The apparent expansion of g2g-L2 between 2010 and 2020 was consistent with our phylogenetic analysis of the 2010 cohort. However, we had not anticipated a nearly complete L2 strain replacement or an L2/L4 difference in fitness. Strain replacement is commonly observed in the context of pathogens like Streptococcus pneumoniae or SARS-CoV2 that promote strain specific immunity. Mtb infection has been shown to provide protection against subsequent Mtb infection, and Mtb strain-specific differences in innate and adaptive immunity have been described but strain-specific immune exclusion has not been previously characterized. In this context, we did not find an association between g2g-L2 and rs3130660 in the 2020 cohort. We consider it possible that the initial association was a false-positive. Alternatively, there may be temporal variables which we do not have the capacity to resolve; for example, the g2g-L2 and rs3130660 association might alter the rate of progression to disease rather than total disease susceptibility, such that for a given period transmission rs3130660 variant hosts might manifest early while wildtype hosts might manifest years later. Finally, we highlight that both host and pathogen variants impact Type 1 IFN responses, and our data suggests a trend towards interaction with symptomatic COVID, an association that we will be able to test with greater power when the full cohort is analyzed. Both of these models suggest that genetic associations between variants in Mtb and human hosts contribute to the transmission dynamics of tuberculosis but that we should expect these interactions to be modified by contextual variables. Further analysis is required to fully understand how the host genes and g2g-L2 strain are mechanistically linked, and to establish their clinical relevance. Our results open the possibility that other Mtb strain-specific host alleles are present and may explain genetic differences driving TB susceptibility across the globe. Methods Ethics statement We recruited 1632 subjects from a large catchment area of Lima, Peru that included 20 urban districts and ~3.3 million residents to donate a blood sample for use in our study. We obtained written informed consent from all the participants. The study protocol was approved by the Institutional Review Board of Harvard School of Public Health and by the Research Ethics Committee of the National Institute of Health of Peru. Participant enrollment and follow-up The study design and methods were previously described in detail^[184]40,[185]55. Briefly, over the 4-year period from 2009-2012, we identified patients ≥15 years of age who had received a diagnosis of pulmonary TB at any of 106 participating health centers. We confirmed the microbiological status of their disease with either a positive sputum smear or mycobacterial culture. We also recorded the index patients’ baseline smear status, HIV status, and drug-resistance profiles. Index cases with HIV infection or infected with multiple Mtb strains were excluded from the analyses. The sex of each participant was assigned based on genotyping and cross-checked with self-report data. Host genotyping, quality control and imputation DNA samples were genotyped using a customized genotyping array (LIMAArray) based on whole-exome sequencing data from 116 active TB cases to optimize the capture of genome-wide genetic variation in Peruvian individuals^[186]10 in the 2010 cohort. DNA samples were genotyped using the Illumina Global Screening Assay with customized contents in the 2020 cohort. In both cohorts, we excluded samples that were missing >5% of the genotype data, had an excess of heterozygous genotypes, and/or duplicated with identity-by-state >0.9. We also excluded variants with a call rate <95%, with duplicated position markers, minor allele frequency <1%, and marked deviation from Hardy-Weinberg equilibrium (excluded if P < 10^-5). We imputed eight classical HLA genes HLA-A, -B, -C, -DQA1, -DQB1, -DRB1, -DPA1, and -DPB1, amino acids and intergenic variants using HLA-TAPAS using a multi-ancestry reference panel which contains data from 21,546 unrelated individuals^[187]56 in the 2010 cohort of 1556 individuals. Mycobacterium tuberculosis variant calling and phylogeny construction Mtb DNA was extracted from cryopreserved cultures of sputum samples. Each sample was thawed and subcultured on LJ agar and a big loop of colonies were lysed with lysozyme and proteinase K to obtain DNA using cetyltrimethylammonium bromide (CTAB)/Chloroform extraction and ethanol precipitation. Samples were sequenced on an Illumina HiSeq 2500 or 4000 sequencer yielding paired-end reads of length 125 bp which were aligned to the reference assembly, H37Rv [188]NC_000962.3 (GenBank accession [189]CP003248). We called variants using Pilon (v1.22)^[190]57. Genome coverage was assessed using Samtools (v1.10). We excluded isolates with evidence of mixed infection using the barcode method^[191]15. We assigned a variant call as missing if the valid depth of coverage at a specific variant was <12 reads, if the mean read mapping quality at the site did not reach 10, or if none of the alternative alleles account for at least 85% of the valid coverage. The phylogenetic tree was constructed based on the WGS Mtb alignment. Variants occurring in genes with repetitive elements such as transposases, proline-glutamate (PE) or proline-proline-glutamate (PPE)^[192]58 were excluded to avoid any inaccuracies in the alignment. After applying these filters, 13,981 Mtb variants were used to conduct a genetic distance matrix. We built a maximum likelihood phylogenetic tree of 1602 Mtb isolates that was inferred using IQ-TREE v2^[193]59 with bootstrap support from 1000 replications. The best-fit nucleotide substitution model was GTR + F as determined by the ModelFinder function. To characterize the Peruvian L2 Mtb strains in the context of global strains, we randomly selected 1000 whole-genome sequenced L2 strains from published studies in 112 bioprojects to represent the major phylogenetic structures of L2. We used these global strains together with 255 L2 Peruvian strains to reconstruct the global L2 phylogenetic tree. We estimated divergence times via BEAST v1.8.0^[194]42, using an uncorrelated lognormal relaxed clock that allows for tree branches to evolve at different rates. The XML input file was modified to specify the number of invariant sites in the Mtb genomes. For the Mtb substitution rate, we used a normal distribution with a mean of 4.6 × 10^-8 substitution per genome per site per year (3.0 × 10^-8 to 6.2 × 10^-8, 95% highest polar density interval), which was calibrated by ancient DNA samples^[195]60,[196]61. An uncorrelated log-normal distribution was used for the substitution rate and a constant population size for the tree priors. We ran three chains of 5 × 10^-7 generations and sampled every 10,000 generations to ensure independent convergence of the chains; we discarded the first 10% as a burn-in. Convergence was assessed using Tracer (v1.7.0)^[197]62, ensuring all relevant parameters reached an effective sample size >100. Transmission cluster analysis To define transmission clusters of the collected L2 Mtb isolates, we applied two SNP thresholds (6 and 12) to separate a patient isolate from that of at least one other patient in the cluster. The 6-SNP threshold was chosen based on the range of SNP distances between paired isolates of the same strain obtained at different times from relapsed TB patients^[198]63. The 12-SNP threshold was a previously defined upper limit of genomic relatedness noted within human hosts and between epidemiologically related human hosts^[199]64. Genome-to-genome association analysis To systematically test for associations between human and Mtb genomic variants at the genome-wide level, we performed logistic mixed regression modeling implemented in SAIGE^[200]65 version 1.1.6. This model was specifically designed to account for the binary nature of our outcome variable (absent/present of Mtb variant) and the unbalanced case-control ratio in our sample. We assumed an additive genetic model for each host and bacterial variant. For each bacterial variant j (j = 1,…, 1267) and host variant i (i = 1,…, 676,110), we test [MATH: yj=xiβji+G< /mi>RM+ωjcj, :MATH] 1 where y[j] is an N-vector of binary labels indicating the absence or presence of a bacterial variant in the host samples i (n = 1556); x[i] is an N-vector of host genotypes; β[ji] is the additive effect of the host variant i on the bacterial variant j; c[j] is the j^th column of an N × C matrix of covariates (age and sex) including a column of 1 s; and ω[j] is a c-vector of the corresponding coefficients including the intercept. We used genetic relatedness matrix (GRM) as a random effect to correct for cryptic relatedness and population stratification between collected host samples. The GRM was obtained from an LD-pruned (r^2 < 0.2) genotype with minor allele frequency >1%. Permutation strategy To determine an appropriate empirical genome-wide significance threshold for the g2g analysis accounting for population structure both in the Mtb and human genome, we conducted 200 simulations for each g2g analysis using a permutation procedure. To preserve the phenotype structure defined by the Mtb variants, for each association analysis, we randomly permuted the presence/absence status for each of the 1267 Mtb variants within the 1556 individuals included in the study. In total, we ran 1267 × 200 = 253,400 association tests to obtain an empirical genome-wide significance threshold. We measured the distributions of the minimum p-values of the variants (P[min]) for each association of the permutation. We defined an empirical genome-wide significance threshold, -log[10](P[sig]), as the 95th percentile (1-ɑ) of -log[10](P[min]). Colocalization analysis using eQTLs We integrated our GWAS results with cis-eQTL data using a Bayesian method (coloc v3.2-1)^[201]66. We evaluated whether the GWAS and eQTL associations best fit a model in which the associations are due to a single shared variant that is summarized by the posterior probability, as opposed to different regulating variants. We used gene expression datasets from the eQTL catalog release 6^[202]21. We tested pairwise colocalization between 109 bulk RNA-sequencing expression datasets in 69 distinct cell and tissue types and GWAS from the most significant genome-to-genome pair (using Mtb position 271640 as phenotype). We used GWAS and all variant-gene cis-eQTL associations tested in each tissue, including non-significant associations of genes within a 700 kb window around the top association host SNP (rs3130660). A posterior probability of colocalization >0.75 was considered to be strong evidence for shared causality. Bacterial strains and culture conditions All 23 g2g-L2 and 11 non-g2g-L2 Mtb strains were drug sensitive and obtained from HIV negative donors. They were grown shaking at 37 °C. Cultures were grown in 7H9 media (Middlebrook 7H9 salts with 0.2% glycerol, 10% OADC [oleic acid, albumin, dextrose, catalase]). Cell density (OD600) was determined by a spectrophotometer. For lipidomic analysis, liquid cultures in tween-free 7H9 medium supplemented with albumin, dextrose and sodium chloride were transferred to nitrocellulose filter-paper discs using vacuum filtration and grown on solid agar media^[203]67. The bacteria from the filter disc were dislodged into 2 ml of methanol and transferred to 15 ml conical glass tube. Then 1 ml of chloroform was added and incubated for 30 min for inactivation of the bacteria, using the extracted lipids for MS-based lipidomics. Mammalian cell culture Mammalian cells were cultured in RPMI 1640 with 10% fetal bovine serum (FBS), 10 mM HEPES, and 2mM L-glutamine. Primary human monocytes were isolated from peripheral blood mononuclear cells (PBMCs) from three rs3130660-AT and three rs3130660-TT donors from Peru. PBMCs were also obtained by Ficoll gradient centrifugation of randomly selected healthy donor leukaphereses (Research Blood Components) or buffy coat blood (Massachusetts General Hospital). Monocytes were isolated by CD14-positive selection (Stemcell Technologies) and matured in 50 ng/mL human recombinant macrophage colony-stimulating factor (M-CSF) for 6 days. RNA extraction To isolate RNA from Mtb-infected hMDMs, hMDMs were lysed in Buffer RLT (Qiagen) + 1% β-mercaptoethanol after 24 h of infection. RNA was purified using the Zymo Direct-Zol kit according to the manufacturer’s instructions, with off-column DNase treatment and subsequent repurification using the Zymo Clean & Concentrator kit according to manufacturer’s instructions. cDNA generation and qPCR cDNA was generated using the SuperScript IV First Strand Synthesis System (ThermoFisher). Gene expression was quantified with iTaq Universal SYBR green supermix (Bio-Rad) on an Applied Biosystems ViiA 7 system. FLOT1 expression was normalized to that of GADPH. All qPCR primers used are listed in Supplementary Data [204]12. Nanostring assay and analysis RNA extracted from local donor MDMs was used as input in a Nanostring assay with a custom-designed probe set and analyzed with nSolver version 4.0 (Nanostring Technologies). Target sequences are listed in Supplementary Data [205]13. Data were normalized against internal positive controls and housekeeping genes (ABCF1, COG7, GUSB, MRPS5, POLR2A, SAP130, SDHA, TLK2) to correct for technical variation. Bulk RNA-sequencing gene expression quantification For RNA-sequencing, the single-end raw reads were filtered by Trimmomatic version 0.39 to remove the adapters and low-quality bases. We used STAR version 2.6.0c and RSEM version 1.2.29 to quantify gene expression using human genome primary assembly (GRCh38) and its basic gene annotation from gencode human release 43 (GRCh38.p13). The pipeline generates expected counts and transcripts per million (TPM) for each gene. We used log-transformed, log[2](TPM + 1), as our main expression measure, which accounts for library size and gene size. We considered as expressed genes those with a log[2](TPM + 1) > 1 in at least half of the samples. Expected counts were used for DESeq in R for PCA analysis. The specific parameters for Trimmomatic are: –Truseq3-SE.fa:2:30:10 --LEADING:3 --TRAILING:3 --SLIDINGWINDOW:4:1 --MINLEN:36.The specific parameters for STAR are: --outSAMunmapped Within --outFilterType BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1 --runThreadN 12 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --outSAMheaderHD \@HD VN:1.4 SO:unsorted. The specific parameters for RSEM are: rsem-calculate-expression --star --star-output-genome-bam --num-threads 12 --star-gzipped-read-file. Differential expression analyses We used linear mixed models using lme4::lmer() function in R in the differential expression and expression association analysis. To call significant caes, we used a likelihood ratio test between two nested models using anova() function in R, and an FDR-adjusted P-value at 0.01 and an absolute log[2] fold-change greater or equal to 0.7. To search for genes that are differentially expressed upon g2g and non-g2g L2 infection, we tested the following model (2): [MATH: H0:E~Infection+1dono< /mi>rH1 :E~G2G+Infection+1dono< /mi>r :MATH] 2 Finally, to nominate genes that are differentially expressed with different genetic background, we tested the following model (3): [MATH: H0:E~G2G+1stra< /mi>inH< mrow>1:E~G+Infection+1stra< /mi>in :MATH] 3 where E is the expression level for each gene, Infection indicates if the sample is infected or uninfected, G2G represents if the sample is infected with g2g-L2 (1) or not (0), G represents the genotype of the infected samples (AT = 1; TT = 0). We included donor as a random effect in the infection and strain-specific model and strain as a random effect when testing for donor-specific expression changes. Microscopy imaging and analysis Mtb cultures were grown to OD600 of ~1.0, then fixed with 2% paraformaldehyde (PFA) for 1 h. The fluorescent D-amino acid NADA (3-[(7-Nitro-2,1,3-benzoxadiazol-4-yl)amino]-D alanine hydrochloride) was added at a final concentration of 1 mM 16 h prior to fixation. All samples were seeded onto molded 1.8% agarose in phosphate buffered saline (PBS) with DAPI (2.5 μg/mL) and Nile red (0.1 μg/mL). Samples were incubated at 37 °C for 1 h prior to imaging. Samples were imaged with a Plan Apo 100× 1.45 NA objective using a Nikon Ti-E inverted, widefield microscope equipped with a Nikon Perfect Focus system with a Piezo Z drive motor, Andor Zyla sCMOS camera, and NIS Elements (v4.5). Semi-automated imaging was carried out using a customized Nikon JOBS script to locate imaging fields of interest, 24 images were taken for each strain. Cell segmentation and quantification was performed using our previously published pipeline, MOMIA^[206]32. Mycobacterial lipidomics Whole cell lipid profiles were generated using chloroform-methanol extraction of cells from strains grown in biological triplicate using a previously published liquid chromatography and mass spectrometry protocol^[207]28, with samples profiled using both positive and negative mode. The order of extraction and mass spectrometry was randomized, with mass and retention time values assigned using the R package xcms^[208]68. Ion peaks with a < 10 ppm match to the exact mass of a theoretical lipid were assigned a unique identifier to group peaks across the data set. The median of triplicate samples was used for differential abundance analysis using the R package limma^[209]69. Antibiotic susceptibility determination Pretomanid (PA-824), isoniazid (INH), and rifampicin (RIF) MICs were determined using an OD-based growth assay. 96-well plates containing 1.5-fold serial dilutions of PA-824, INH or RIF were prepared as before. All Mtb strains were grown to mid-log phase and diluted to OD600 = 0.003 into each well of the prepared plates. Plates were incubated at 37 °C with shaking in sealed plates. Growth was determined by repeated OD600 measurements over a 2-week period. We assessed cumulative bacterial growth inhibition over the course of the experiment by calculating the area under the curve (AUC) of the OD values^[210]70. Data are representative of two independent replicates per strain. Using the final OD600 measurement of each strain for each drug concentration, the data was fitted to the Gompertz equation to determine the MIC^[211]71. NAD/NADH metabolite extraction and measurement Mtb strains were grown in 7H9 until mid-log phase, then pelleted by centrifugation, washed twice with PBS, and resuspended in an extraction buffer from an NAD/NADH Quantitation Kit (Sigma Aldrich). Bacterial suspensions were transferred to tubes containing Lysing Matrix B (MP Biomedicals) and lysed via bead beating. The resulting lysate was used to quantify NAD and NADH following the kit’s instructions. Oxidative stress resistance assay Mtb cultures were grown in 7H9 to mid-log phase. Cultures were treated with menadione (50 uM or 100 uM), or with H2O2 (25 mM or 50 mM), or were left untreated as a control. After 4 h (H2O2) or 24 h (H2O2 and menadione), all Mtb cultures were serially diluted and plated on 7H11 agar plates. Plates were incubated at 37 °C for 2–3 weeks and the number of colonies were counted. Pathway analysis In the host pathway analysis, we first identified significant DE human genes upon bacteria or host genetic backgrounds as shown in Model 2 and 3 in the method section describing “differential expression analysis”. To enrich pathway genes and reduce the risk of false positives, we took the union set of these DE genes with a more stringent threshold of FDR-adjusted p-value < 0.01 and absolute log[2] fold-change >0.7. We categorized samples in four groups according to the bacteria and host genotype. For each of the groups, we calculated the z-score for the average expression levels of the selected DE genes. We used function cluster::pam() in R to partition the z-score profile into 4 clusters around medoids based on Euclidean distance. For genes within each cluster, we applied DAVID Functional Annotation Clustering function^[212]72,[213]73 to enrich pathways from annotated databases of “KEGG Pathway”, “Reactome pathway” and “Wikipathways”, with significant thresholds of FDR-adjusted p-value < 0.01. In the Mtb pathway analysis, we first annotate the Mtb genome using Variant Effect Predictor^[214]74 (v.105). Among 79 highly correlated (Pearson r^2 > 0.8) common Mtb variants that are associated with the human genome, 49 were annotated as nonsynonymous (missense) variants. We defined Mtb genes that have one of these 49 missense variants as g2g-L2 genes. We used the DAVID^[215]75 Functional Annotation Clustering function to group g2g-L2 genes into a similar functional network. We used PANTHER^[216]76 to test for functional overlaps with known pathways. We calculated a Fisher’s exact p-value by constructing a 2 × 2 contingency table with g2g-L2 genes (n = 48) and all other Mtb genes with a common missense mutation in the g2g analysis (n = 898) overlapping the flavin adenine dinucleotide binding pathway included in the gene ontology category (GO:0050660). TRFS-green assay Mtb or M. smegmatis strains were grown in 7H9 until mid-log phase. Cultures were washed twice with PBS, then resuspended in PBS containing 10 uM TRFS-green (MedChemExpress) and added to a 96 well plate. The fluorescent signal was read using a BioTek Synergy H1 (Mtb strains) or a TECAN Spark (M.smegmatis strains) microplate reader, at excitation 438 nm and emission 538 nm, immediately after probe addition and at various time points thereafter. Between readings, the plates were kept at 37 °C. Generation of M. smegmatis strains expressing the Mtb TrxB2_TrxC operon The Mtb operon containing Rv3913 (trxB2) and Rv3914 (trxC) was expressed in M.smegmatis mc2155 using the integrating plasmid pCT94 under control of the constitutive promoter pUV15. The Rv3913-3914 operon was amplified from representative g2g-L2 and non-g2g-L2 genomic DNA using primers listed in Supplementary Data [217]12. A subsequent round of PCR was used to add on overlapping vector handles. All PCR was performed using Q5 High-Fidelity DNA polymerase (NEB). Expression vectors were constructed via Gibson assembly (NEB) of the PCR fragments ligated into NdeI/HindIII digested pCT94. The g2g and non-g2g constructs were transformed into DH5a cells and selected on LB plates with 50 ug/ml kanamycin. CT94-g2g-trxB2trxC and CT94-nong2g-trxB2trxC were electroporated into mc 2 155 and selected on 7H10 plates with 20 ug/ml kanamycin. Successful transformation was confirmed by Sanger Sequencing (Azenta). Reporting summary Further information on research design is available in the [218]Nature Portfolio Reporting Summary linked to this article. Supplementary information [219]Supplementary Information^ (5.4MB, pdf) [220]41467_2024_54741_MOESM2_ESM.pdf^ (68.3KB, pdf) Description of Additional Supplementary Files [221]Supplementary Data 1-13^ (5.6MB, xlsx) [222]Reporting Summary^ (3.1MB, pdf) [223]Transparent Peer Review file^ (11.7MB, xlsx) Source data [224]Source Dataset 1^ (82.9KB, txt) [225]Source Dataset 2^ (14.8KB, txt) [226]Source Dataset 3^ (16.7KB, pdf) Acknowledgements