Abstract

   Infectious disease is the result of interactions between host and
   pathogen and can depend on genetic variations in both. We conduct a
   genome-to-genome study of paired human and Mycobacterium tuberculosis
   genomes from a cohort of 1556 tuberculosis patients in Lima, Peru. We
   identify an association between a human intronic variant (rs3130660,
   OR = 10.06, 95%CI: 4.87 − 20.77, P = 7.92 × 10^−8) in the FLOT1 gene
   and a subclavaluee of Mtb Lineage 2. In a human macrophage infection
   model, we observe hosts with the rs3130660-A allele exhibited stronger
   interferon gene signatures. The interacting strains have altered redox
   states due to a thioredoxin reductase mutation. We investigate this
   association in a 2020 cohort of 699 patients recruited during the
   COVID-19 pandemic. While the prevalence of the interacting strain
   almost doubled between 2010 and 2020, its infection is not associated
   with rs3130660 in this recent cohort. These findings suggest a complex
   interplay among host, pathogen, and environmental factors in
   tuberculosis dynamics.

   Subject terms: Disease genetics, Tuberculosis
     __________________________________________________________________

   Here, the authors conduct a paired genome analysis of humans and
   Mycobacterium tuberculosis in a cohort from Lima, Peru. Their research
   reveals that genetic variations in both the host and pathogen
   significantly influence the progression to tuberculosis.

Introduction

   Infectious diseases account for much of the burden of all diseases
   worldwide, but their host genetic architecture is less well understood
   than many other types of complex traits, despite having comparable
   genetic heritability^[60]1,[61]2. Infectious diseases occur only when
   pathogenic organisms infect a host, and thus they are unique in that
   the effect of host risk alleles may be influenced by genetic variation
   within the pathogen as well as by environmental factors. In
   tuberculosis (TB), the pathogenic organism Mycobacterium tuberculosis
   (Mtb) has co-existed with humanity for millennia and may be co-evolved.
   Currently Mtb is estimated to infect one-fourth of the population
   worldwide^[62]3 and in 2019 alone, ~1.4 million people succumbed to
   it^[63]4.

   Human genome-wide association studies (GWAS) performed by us and others
   have identified only a few confirmed disease alleles in TB. Most TB
   risk alleles appear to be unique to populations within restricted
   geographical locations^[64]5–[65]10. One possibility is that human
   genetic susceptibility is strain specific and host genetic risk
   therefore varies with the varying prevalence of different Mtb strains
   in different locales. If true, this interaction could manifest in a
   statistical association between host and Mtb genetic
   variations^[66]11,[67]12. Previous studies in TB have explored the
   possibility that human host alleles are associated with clinical
   phenotypes, such as disease severity and age of onset, in a Mtb lineage
   specific manner^[68]13,[69]14. A full variant-to-variant search among
   Mtb and human host genomes might reveal heterogeneity beyond
   established Mtb lineages and may implicate novel human risk loci.
   Scaling of sequencing and genotyping technologies now makes a
   pathogen-host genome-wide examination possible. Here, we sought to use
   a genome-to-genome (g2g) approach to examine both genomes
   comprehensively and find genetic determinants of host and pathogen
   interactions in progression to TB pulmonary disease.

Results

   We hypothesized that host genetic variation can predispose certain
   individuals to a higher risk of disease from specific bacterial
   lineages or clades, including those that are not yet defined. If these
   differences in bacterial prevalence across human hosts are genetically
   driven, we should be able to identify an association in TB patients
   between host and bacterial genetic alleles.

   We tested this hypothesis in a cohort composed of 1556 TB patients in
   Lima, Peru, from whom we collected both human genotype and Mtb
   whole-genome sequences (WGS) (Fig. [70]1a, Supplementary Data [71]1).
   After quality control, we obtained 676,110 genotyped human variants
   with a minor allele frequency ≥1%. We focused our g2g analysis on 2298
   out of 45,831 called Mtb variants with an allele frequency between 5%
   and 95%, since statistical power for rare Mtb alleles would be limited.

Fig. 1. Human-to-Mtb genome-wide association study in 1556 tuberculosis
patients.

   [72]Fig. 1
   [73]Open in a new tab

   a Study design schematic. We obtained DNA from 1556 Peruvian
   individuals with TB disease and cultured pathogens to perform host
   genotyping and Mtb WGS. The genotype of each common Mtb variant was
   considered as the response variable (Y: 0 or 1), and the genotype of
   each host variant was the independent variable (X: 0, 1 or 2),
   resulting in one test per host SNP-Mtb SNP pair. b Grid plot
   summarizing the genome-to-genome analysis. The x-axis denotes position
   within the human genome with alternating colors (white and light gray)
   for each chromosome. The y-axis denotes position within the Mtb genome.
   Point colors represent the association p-value (-log[10](P)) from the
   mixed effect logistic regression. The most significant host-Mtb pair
   association is indicated. Six randomly chosen Mtb variants in tight
   linkage (Pearson r^2 > 0.8) with position 271640 are shown in light
   blue, indicating that the same human variant rs3130660 is significantly
   associated with multiple Mtb positions. c Manhattan plot of the GWAS
   analysis when treating genotypes of Mtb position 271640 as the outcome.
   The x-axis indicates genomic location, where as the y-axis shows the
   (-log[10](P)) from mixed effect logistic regression model (d) A maximum
   likelihood phylogenetic tree inferred from 13,981 variants of 1,555
   Peruvian Mtb isolates (excluding one Lineage 1 sample for visualization
   purposes). Branch colors represent the inferred lineages. Filled
   squares on the right indicate the presence (red) or absence (gray) of
   the six Mtb variants identified in the g2g analysis and highlighted in
   (b) Source data for (a).-c are provided in the Source Data 1 file.
   Source data for (d) are provided in Source Data 2 file.

   We tested whether the co-occurrence of each pair of common bacterial
   and genotyped human variants was higher than expected by performing a
   mixed effects logistic regression. We considered the presence or
   absence of each common Mtb SNP as a binary trait and assumed an
   additive model correcting for human cryptic relatedness, population
   structure, age and sex. To reduce the computational burden, we only
   included 1267 (out of 2298) common Mtb SNPs that were not in near
   perfect linkage (Pearson r^2 < 0.99) in the association tests. Similar
   to performing GWAS on two correlated traits (e.g., blood pressure and
   cholesterol level), we allowed correlation in our outcome variables
   (Mtb SNPs) and did not correct for Mtb structure in our model. As a
   result, a single host variant can be associated with multiple Mtb
   variants. In total, we ran >850 million regression models between every
   common Mtb and host variant pair (Fig. [74]1b). We examined genomic
   inflation factors (λ[gc]) of each of 1267 GWAS and observed no
   inflation of test statistics (median λ[gc] = 1.00, Supplementary
   Fig. [75]1), suggesting that our model is robust to false-positive
   findings.

Genome-to-genome study identifies human-Mtb genomic associations

   We identified an association between an intronic human variant,
   rs3130660, located in the 6p21 region on chromosome 6 and a
   phylogenetic marker (Position 271640) of a subclade of Lineage 2 (L2)
   Mtb (OR = 10.06, 95% CI: 4.87 − 20.77, P = 7.92 × 10^−8, Fig. [76]1c,
   Supplementary Data [77]2). Individuals with each rs3130660-A allele
   were 10 times more likely to be infected with Mtb strains carrying the
   interacting Mtb variant marker. To increase confidence in our
   reporting, we next performed a permutation analysis to estimate the
   likelihood of our observed P-value under the null. We found that the
   likelihood of obtaining results similar to the observed results is <1%
   (P < 5.91 × 10^−7 assuming a 5% false-discovery rate, Supplementary
   Fig. [78]2).

   The associated Mtb variant is in tight linkage with 78 of the 2298
   common Mtb variants (Pearson r^2 > 0.8, Supplementary Data [79]3). To
   better understand these 79 Mtb variants, we constructed a maximum
   likelihood phylogenetic tree using WGS of 1556 collected isolates. In
   this analysis, we looked at 13,981 Mtb variants which included both the
   2298 common variants along with 11,683 variants that passed stringent
   quality control. We classified the strains into well-defined Mtb
   lineages, including L2 and L4.1–8 using previously defined lineage
   defining markers^[80]15. We observed that previously defined lineages
   fell within distinct branches of the phylogenetic tree (Fig. [81]1d).
   We observed that all 79 Mtb SNPs were present in the same L2 subclade
   of the phylogenetic tree. We henceforth define the clade by the
   presence of the g2g Mtb variant (Position 271640) as the g2g-L2 clade.
   Among the 1556 Mtb isolates, 102 were g2g-L2 (6.56%).

   To test whether this result reflected social mixing patterns within the
   community or a true biological association, we repeated our analysis
   adjusting for year of TB diagnosis, Mtb population structure (as
   reflected by the top two principal components), none of which
   significantly changed the reported association (Supplementary
   Fig. [82]3). Although the prevalence of non-L2 Mtb strains remained
   consistent over the 2-year collection period, we observed an increase
   in the associated Mtb clade (g2g-L2) defined by the phylogenetic marker
   (Position 271640), from 5.4% to 8.9% (Supplementary Fig. [83]4a).
   However, this did not alter our association strength (Supplementary
   Fig. [84]3). We also observed that the associated Mtb clade remained
   consistent over geographical space (Supplementary Fig. [85]4b),
   reinforcing the conclusion that the reported signals are independent
   from these covariates. We next considered the possibility that the
   observed association between a host allele and the Mtb g2g-L2 subclade
   was driven by host ancestry (Supplementary Fig. [86]5) or Mtb lineage
   (Supplementary Fig. [87]6). We examined host alleles for associations
   to the two common Mtb lineages in Peru (L2 and L4), and found the
   strongest association was between the same host allele and the L2
   lineage, but that this association was substantially weaker than with
   the g2g-L2 lineage (rs3130660, OR = 5.96, 95% CI: 2.97-11.97,
   P = 1.42 × 10^−6, Supplementary Fig. [88]7, Supplementary Data [89]4).
   A conditional analysis including g2g-L2 obviated this association
   (P = 0.97). In contrast, adding L2 as a covariate to our original model
   did not obviate the association with g2g-L2 (P = 4.21 × 10^−15). This
   result suggested that the observed association between rs3130660 and L2
   was driven by the g2g-L2 subclade rather than L2 more broadly.

Host variant associated with Mtb diversity colocalizes with expression of
multiple genes in lung and other tissue types

   The host genetic variant (rs3130660) associated with g2g-L2 infection
   is intronic to the FLOT1 gene. FLOT1 is a lipid raft-associated
   scaffolding protein that plays a role in membrane trafficking and
   phagosome maturation^[90]16,[91]17. We investigated whether this SNP
   modulates the expression levels of nearby genes by performing an
   expression quantitative trait loci (eQTL) analysis. Using the Genotype
   Tissue Expression (GTEx release v8^[92]18) database in lung, we
   observed that rs3130660-A is associated with increased FLOT1 expression
   (P = 2.22× 10^−16, Supplementary Fig. [93]8), increased PPP1R18
   expression (P = 3.75 × 10^−7) and multiple other genes in the region
   (Supplementary Fig. [94]9). Since our variant is within the MHC class-I
   region, to increase the resolution of our reported associated region,
   we performed HLA imputation using a multi-ancestry MHC reference
   panel^[95]19. After imputation, rs3130660 remained the strongest signal
   in the region (Supplementary Fig. [96]8). To understand whether the
   rs3130660 allele corresponded to the signal reported in eQTL studies,
   we applied a colocalization analysis using the coloc software^[97]20.
   Using 109 RNA sequencing datasets included in the eQTL catalog release
   6^[98]21, we assessed colocalization signals between all 48 protein
   coding genes within a 700 kb window of rs3130660 (Supplementary
   Fig. [99]10, Supplementary Data [100]5). We note that there are
   differences in genetic ancestry between our study and reported eQTL
   studies, which were predominantly conducted among individuals of
   European ancestry, and this might have reduced our power to detect
   colocalizing signals across these datasets^[101]22. Despite this
   limitation, we found evidence of colocalization between the identified
   interacting host SNP and multiple genes within the MHC class-I region,
   such as FLOT1, IER3 and PPP1R18 eQTL association in 14 cell and tissue
   types (posterior probability >0.75), including FLOT1 in T-cell
   (posterior probability = 0.96), IER3 in whole blood (posterior
   probability = 0.90) and PPP1R18 in thyroid (posterior
   probability = 0.90). These results provided strong evidence that the
   strain-associated host variant regulates expression of multiple genes
   in the MHC class-I region.

Interacting host allele and Mtb strain show distinct global transcriptomic
effects in macrophages

   To experimentally test the plausibility of the identified host and Mtb
   genetic association, we assessed the responses of human macrophages
   from Peruvian donors with different rs3130660 genotypes to infection
   with g2g-L2 and nearest neighbor L2 Mtb strains, here called
   “non-g2g-L2”. To this end, we obtained macrophages from three randomly
   selected Peruvian donors from our cohort who carried the risk allele A
   (rs3130660-AT) and three donors who did not (rs313060-TT). We infected
   these monocyte-derived donor macrophages (MDMs) with three g2g-L2
   strains and three non-g2g-L2 strains with similar in vitro growth
   characteristics, and measured macrophage transcriptional responses by
   RNA-sequencing (Supplementary Fig. [102]11), which we interpret as
   intermediate traits relevant to human infection outcomes consistent
   with other published studies^[103]23–[104]26.

   To explore the effect of g2g-L2 and non-g2g-L2 Mtb infection in
   rs3130660-AT and TT donors in macrophages, we first scored the data for
   “response to infection” using a gene model that includes the 20 most
   highly induced genes identified in an independent study of human MDM
   responses to infection with the reference Mtb strain^[105]27
   (Supplementary Data [106]6). We integrated expression values of these
   20 genes under each condition to test whether rs3130660-AT and TT
   donors exhibit different transcriptional responses upon g2g-L2 versus
   non-g2g-L2 infection. We saw quantitative transcriptional differences
   in responses impacted by host allele and Mtb strain with AT donors
   manifesting larger transcriptional responses than TT donors
   (P[t-test] = 7.00 × 10^−7). This response was blunted in g2g-L2 strain
   infection as compared to non-g2g-L2 infection in both host AT
   (P[t-test] = 0.0068) and TT backgrounds (P[t-test] = 0.0042,
   Fig. [107]2a).

Fig. 2. Human monocyte-derived macrophage (hMDM) transcriptional response to
g2g-L2 and non-g2g-L2 infection.

   [108]Fig. 2
   [109]Open in a new tab

   a From n = 12 samples, the infection score is calculated based on gene
   expression levels of the top 20 infection-induced genes from an
   independent study. P-values are calculated by two-tailed pairwise
   student’s t-test. b, c show volcano plots of differentially expressed
   (DE) genes specific to bacterial strain or donor rs3130660 genotype,
   respectively. Significance was called by FDR-adjusted P-value < 0.01
   and log[2](fold-change) >0.7. d Hierarchical clustering of gene
   expression based on the union DE gene sets from (b–e) Pathway analysis
   of genes in each cluster from (e). f, k the expression of genes at the
   cis-region of rs3130660, specifically IER3 (f), VARS2 (g), ZNRD1 (h),
   FLOT1 (i), HLA-E (j) and PPP1R18 (k). Each dot represents the average
   gene log[2](TPM + 1) of g2g-L2 (red) or non-g2g-L2 (blue) infection
   within individual AT or TT donors from n = 12 samples. The P-values are
   calculated by two-tailed pairwise Student’s t-test either between AT
   and TT donors or between g2g-L2 and non-g2g-L2 within donors with the
   same genotype. Source data are provided in the Source Data 1 file.

   To further understand how differences in host allele and Mtb strain
   alter the global transcriptional macrophage response to infection, we
   extracted differentially expressed genes (DEGs, >0.7
   log[2](fold-change), FDR-adjusted P-value < 0.01, Fig. [110]2b, c,
   Supplementary Data [111]7). We performed hierarchical clustering based
   on 184 Mtb strain-specific response genes and 924 rs3130660
   genotype-specific response genes. We identified four major expression
   clusters in response to Mtb infection (Fig. [112]2d) and performed
   pathway enrichment analysis to describe the biological processes
   govened by these gene expression clusters (Fig. [113]2e). We observed
   rs3130660-AT donors had higher expression of genes implicated in both
   Type 1 and Type 2 interferon pro-inflammatory signaling; MHC-I antigen
   processing and presentation, IL-1B signaling, cytosolic DNA sensing and
   zinc homeostasis. We found rs3130660-TT donors expressed higher levels
   of genes implicated in altered metabolism and protein disulfide bond
   formation including thioredoxin and glutathione related enzymes. These
   responses differed quantitatively by the infecting Mtb strain with
   g2g-L2 strains inducing higher levels of IL-1B signaling and altered
   zinc responses in both donor alleles while non-g2g-L2 strains induced
   higher expression levels of genes involved in interferon signaling and
   MHC-I antigen processing and presentation (Fig. [114]2f). Taken
   together, these data are consistent with a model in which rs3130660-AT
   donors mount a stronger transcriptional response to infection, perhaps
   because of more sensitive induction of innate responses but that
   response is skewed towards interferon after non-g2g-L2 infection, in
   contrast to IL-1B dominant responses after g2g-L2 infection.

   To assess if the rs3130660-A mutation alters transcriptional responses
   on chromosome 6 in addition to FLOT1, we next performed a directed
   analysis of the 30 genes on chromosome 6 previously implicated in the
   eQTL analyses. Among the 30 protein-coding genes that pass quality
   control flanking the rs3130660 region, we observed no statistical
   differences in gene expression between donors with AT and TT genotype
   in the absence of infection (no P[t-test] < 0.05, Supplementary
   Fig. [115]12). However, after infection, we observed 15 genes with
   significant differences between the two host genotype groups
   (P[t-test] < 0.05, Supplementary Fig. [116]13), including IER3, VARS2
   and ZNRD1 involved in immune response, ER stress and zinc homeostasis
   (Fig. [117]2f–h). Among the 15 genes, 8 gene expression levels were
   further modified by the Mtb strains, including FLOT1, HLA-E and PPP1R18
   (P[t-test] < 0.05, Fig. [118]2i–k). Taken together, these data reveal
   host-pathogen transcriptional interactions with highly consistent but
   directionally distinct quantitative effects of both the host allele and
   bacterial strain.

   To test for differential FLOT1 expression in the absence of host
   genotype variation, we evaluated FLOT1 expression in MDMs from three
   healthy non-Peruvian donors after Mtb infection (Supplementary
   Fig. [119]14). Using an expanded Mtb strains set, we found FLOT1
   expression was significantly lower in the setting of g2g-L2 compared to
   non-g2g-L2 infection (two-way ANOVA, P-values = 0.0650, 0.0089, 0.0022
   across three donors, Fig. [120]3a). We then profiled the immune
   response in these three donors using a panel of Nanostring probes for
   genes involved in metabolism and myeloid immune responses to infection.
   Using bacterial-specific DEGs identified in the Peruvian donor MDMs
   (Fig. [121]2b), we found a significant correlation between differential
   gene expression in the Peruvian and local donors when comparing g2g-L2
   and non-g2g-L2 infection (P-value = 0.0045, Fig. [122]3b).
   Additionally, scoring the data for the response to infection using the
   same 20 highly induced genes (Fig. [123]2c), we again observed
   quantitative transcriptional differences in response to g2g-L2 and
   non-g2g-L2 strains in multiple donors (two-way ANOVA, P-value = 0.0021,
   0.0185, 0.4037, Fig. [124]3d, Supplementary Data [125]6). We found that
   there was significantly higher induction of genes involved in Type I
   IFN signaling by non-g2g-L2 strains and dampening of this response by
   g2g strain infection in two of three donors (two-way ANOVA,
   P-value = 0.0021, 0.3119, 0.0126, Fig. [126]3e, Supplementary
   Data [127]8).

Fig. 3. Boston donor human monocyte-derived macrophage (MDM) transcriptional
response to g2g-L2 and non-g2g-L2 infection.

   [128]Fig. 3
   [129]Open in a new tab

   a FLOT1 expression of MDM from three local anonymous donors after
   g2g-L2 or non-g2g-L2 infection (5 representative strains, n = 5 per
   donor). Data is presented as mean +/- SD. A two-way ANOVA (two-sided,
   Sidak’s multiple comparison test) was performed to determine
   statistical significance for each donor. b Pearson correlation
   (two-sided) between bacterial-specific DEGs (Fig. [130]2b) of the
   Peruvian donor MDMs (n = 12 samples) and their expression in healthy
   donor MDMs (n = 3 samples). Genes are colored according to their
   upregulation in the g2g-L2 (red) or non-g2g-L2 (blue) infected Peruvian
   MDMs. c Heatmap of the top 20 infection-induced genes in g2g-L2 and
   non-g2g-L2 infected Boston donor MDMs (n = 3). d Canonical Mtb
   infection gene module score and (e). interferon α/β signaling gene
   module score after infection with representative g2g-L2 or non-g2g-L2
   strains in the local donor MDMs (5 representative strains, n = 5 per
   donor except donor 1 non-g2g-L2 n = 4). Data are presented as
   mean +/- SD. A two-way ANOVA (two-sided, Sidak’s multiple comparison
   test) was performed to determine statistical significance for each
   donor. Source data are provided in the Source Data 1 file.

The Mtb g2g-L2 subclade displays altered redox metabolism

   Next, we sought to identify biochemical features distinguishing the
   g2g-L2 from non-g2g-L2 strains. Strains. However, the Mtb envelope
   contains 109 subclasses of lipids^[131]28 which include many known
   determinants of differential Mtb virulence^[132]29–[133]31. Therefore,
   we performed an unbiased whole cell lipidomics analysis from three
   g2g-L2 and eight non-g2g-L2 strains, which detected 28,209 distinct
   molecules. No differentially abundant lipid (P-value < 0.05 after
   adjustment by the Benjamini-Hochberg method) was found with respect to
   g2g-L2 status (Supplementary Data [134]9), which is most consistent
   with equivalent lipid compositions among the set of L2 isolates.

   We next developed a high throughput imaging-based phenotyping platform
   to agnostically identify intrinsic features that distinguish Mtb
   strains, which we interpret as intermediate traits for more complex
   biological processes^[135]32 (Fig. [136]4a, Supplementary
   Fig. [137]15). We assessed seven functional phenotypes including cell
   morphology (length, width and area), total cellular lipid content,
   chromosomal DNA content with DAPI staining and growth dynamics inferred
   by pulsing with fluorescently tagged D-amino acids (NADA) which are
   incorporated into nascent peptidoglycan. Redox state is inferred from
   autofluorescence at 420 nm; signal is generated by the oxidized form of
   F[420], a flavin derived cofactor derived cofactor named because of its
   420 nm absorption peak at its oxidized state^[138]33,[139]34. We
   phenotyped 23 g2g-L2 and 11 non-g2g-L2 strains and found a
   significantly higher autofluorescence signal in g2g-L2 compared to
   non-g2g-L2 strains (P[Wilcox-test] = 2.0 × 10^-4, Fig. [140]4a, b),
   indicative of a shift in the redox state of F[420] towards a more
   oxidized state^[141]35. Consistent with this interpretation, the g2g-L2
   strains were less sensitive to growth inhibition caused by pretomanid,
   an antibiotic pro-drug that requires the reduced form of F[420] to be
   activated^[142]36 (P[Wilcox-test] = 0.032, Supplementary Fig. [143]16).
   The F[420] cofactor pools are linked to the redox state of the other
   major redox cofactor pools, and the NAD + /NADH ratio was also
   significantly higher in g2g-L2 strains, indicating a broader change in
   cellular redox balance (P[t-test] = 0.0106, Fig. [144]4c). Further,
   g2g-L2 strains were more significantly resistant than control L2
   strains to redox stress induced by menadione, which causes redox
   cycling and has been shown to lead to decreased NAD + /NADH ratio in
   Mtb^[145]37 (two-way ANOVA, P-value = 0.0165, Fig. [146]4d) but do not
   differ in susceptibility to an exogenous oxidant (H2O2) (two-way ANOVA,
   P-value = 0.433, Fig. [147]4e).

Fig. 4. Functional characterization of g2g-L2 Mtb strains.

   [148]Fig. 4
   [149]Open in a new tab

   a Violin plots for each phenotype measured by high-throughput
   microscopy, showing the distribution of each feature between the g2g-L2
   and non-g2g-L2 strains. Samples were assayed at minimum in duplicate.
   The line inside each plot indicate indicates the median. P-values
   obtained by a Wilcoxon test. The red dotted line indicates the
   Bonferroni corrected significance threshold after multiple testing
   (-log[10](0.05/7)). b Representative images of autofluorescence signals
   in two representative g2g-L2 strains and two non-g2g-L2 strains. Scale
   bar: 5 µm. Images are representative of two independent experiments and
   the remainder of the g2g-L2 and non-g2g-L2 strains. c Total NAD was
   extracted from five g2g-L2 and five non-g2g-L2 Mtb strains and the
   NAD + /NADH ratio was determined. Each point represents the average of
   two independent replicates per strain (n = 5). Data are presented as
   mean +/- SD. A two-tailed unpaired t-test was used to determine
   statistical significance between the groups. Mtb mid-log phase cultures
   were treated with (d) 50 uM menadione for 24 h or (e) 25 mM H2O2 for
   4 h, and surviving CFUs were determined by plating. A total of 10 Mtb
   strains were used (five g2g-L2 and five non-g2g-L2), with two
   independent replicates per strain. Data are presented as mean +/- SD. A
   two-way ANOVA (two-sided, Sidak’s multiple comparison test) was used to
   determine statistical significance between the groups. TRFS-green was
   incubated with (f) four g2g-L2 and four non-g2g-L2 Mtb strains (n = 5
   per strain) or with (g) M.smegmatis strains constitutively expressing
   either the g2g or non-g2g-L2 variant Rv3913-3914 operon (n = 5 per
   strain). Fluorescence intensity was measured over time, the mean AUC
   for each strain was quantified and a two-tailed unpaired t-test was
   used to determine statistical significance. Data are presented as
   mean +/- SD. h Total NAD was extracted from a wildtype M.smegmatis
   strain, or M.smegmatis constructs constitutively expressing either the
   g2g or non-g2g-L2 variant Rv3913-3914 operon, and the NAD + /NADH ratio
   was determined (n = 3). Data are presented as mean +/- SD. A one-way
   ANOVA (two-sided, Tukey’s multiple comparison test) was used to
   determine statistical significance between groups. Source data are
   provided in the Source Data 1 file.

   We sought to identify putative Mtb genetic variants that might be
   driving this functional phenotype. There are 49 nonsynonymous changes
   among 79 Mtb SNPs defining g2g-L2 (Supplementary Data [150]3). Many of
   these occur in genes that belong to redox-related pathways
   (Supplementary Fig. [151]17). Among these genes, the most biologically
   likely candidate mediator of altered redox state was trxB2, encoding
   thioredoxin reductase, which isomerizes protein disulfide bonds using
   NADP+ as a cofactor. In g2g-L2 strains, trxB2 has a Thr2Asn mutation
   strongly predicted to alter protein function (SIFT score = 0.01). We
   used a published fluorescent substrate-based reporter
   probe^[152]38,[153]39 to assess thioredoxin reductase activity in a
   panel of four g2g-L2 and four non-g2g-L2 strains and found that the
   g2g-L2 strains have significantly higher activity
   (P[Wilcox-test] = 0.0024, Fig. [154]4f). We sought to investigate
   whether the Thr2Asn variant in trxB2 was sufficient to generate this
   increase in activity. Ideally, we would seek to validate the effect of
   the Thr2Asn in Mtb, but the construction of an isogenic allelic variant
   in an essential gene in a clinical strain of Mtb has not yet been
   successfully accomplished. As an alternative, we expressed the trxB2
   operon containing the Thr2Asn or wildtype alleles of trxB2 in the model
   mycobacterium, M. smegmatis, a model organism used to dissect the
   essential cell biology of mycobacteria including Mtb. Consistent with
   the Mtb data, expression of the trxB2 Thr2Asn operon resulted in
   significantly more TrxB2 activity than the wildtype operon
   (P[Wilcox-test] = 0.0097, Fig. [155]4g). The strain expressing the
   trxB2 Thr2Asn operon also had a significant shift in the NAD + /NADH
   ratio toward the oxidized state as compared to the wildtype operon,
   consistent with the Mtb g2g-L2 phenotypes and the model that the trxB2
   Thr2Asn variant is a gain of function mutation (P[t-test] = 0.0139,
   Fig. [156]4h).

Mtb g2g-L2 subclade is recently expanded in Peru and transmits differently
than other L2 strains

   Our data suggest that the g2g-L2 subclade evolved distinct biologic
   features that interact with host cells. To further understand the
   effects of these strain differences in the broader context of Peruvian
   L2 strains. To this end, we used 178 L2 isolates from this study and 77
   previously collected L2 strains from the same population to reconstruct
   a maximum likelihood phylogenetic tree^[157]40,[158]41. We found three
   major clades: clade A consisting of the g2g-L2 isolates, and clades B
   and C (Fig. [159]5a). We merged the whole genome sequences of these L2
   clade isolates with 1000 global L2 isolates (Supplementary
   Data [160]10); the L2 strains from Peru formed subclades that were
   largely separated from other global L2 isolates (Supplementary
   Fig. [161]18). This pattern of restriction suggests that Mtb L2 strains
   were introduced to Peru and then diversified locally.

Fig. 5. Phylogenetic structure of the g2g-L2 clade associated with the human
alleles.

   [162]Fig. 5
   [163]Open in a new tab

   a A phylogenetic tree of L2 constructed with 255 L2 Peruvian isolates.
   The g2g-L2 clade identified via the g2g analysis is highlighted in red
   (Clade-A), two other Peruvian subclades of L2 are highlighted in blue
   (Clade-B) and green (Clade-C) respectively. *ybp years before present.
   Estimated emerging time (median value, with 95% highest posterior
   distribution in bracket) for the ancestor strains and cluster rate when
   using 6-SNP distance of each marked L2 clade (Clade-A, B and C) are
   listed. b Histogram of pairwise minimum SNP distance to the closest
   neighbors within the marked clades. c Comparison of transmission
   cluster rate between the three marked clades when using 6-SNP and
   12-SNP distance as the threshold. d The percentage of g2g-L2 strains
   among all co-circulating strains from the 2010 cohort and the 2020
   cohort. e The percentage of g2g-L2 strains among L2 strains from the
   2010 cohort and the 2020 cohort. P-values shown in (c) and (d) were
   obtained by two-sided Fisher’s exact test. Source data for (a). are
   provided in the Source Data 3 file. Source data for (b). -d are
   provided in the Source Data 1 file.

   We next performed a Bayesian coalescent analysis to estimate the
   emerging times of these three L2 subclades^[164]42. These results
   suggest that the g2g-L2 (clade A) was introduced to Peru more recently
   than the other two L2 subclades B and C (62.1 versus 138.7 and
   112.3 years ago, Fig. [165]5a). We expect that locally diversified Mtb
   strains which were introduced into a human population earlier would
   have achieved larger population sizes^[166]43,[167]44. Instead, we
   found that g2g-L2 strains have a larger than expected population size.
   Among 255 participants infected with L2 in our cohort, 150 of 255
   (58.8%) of them were infected by g2g-L2 strains (clade A), which is
   much higher than the number of infections caused by clade B or C (18.0%
   and 6.7%). Thus, the g2g-L2 strains may have undergone a recent, rapid
   expansion, consistent with increased transmissibility. To further
   assess transmissibility, we performed transmission cluster analysis.
   Two TB cases were considered to belong to the same transmission cluster
   if they were separated by <6 SNPs or the less stringent 12-SNP cut-off.
   By both criteria, the g2g-L2 cluster rate is much higher than that of
   other L2 strains collected from the same region (Fig. [168]5c). These
   data suggest that the g2g-L2 clade is associated with a higher rate of
   recent transmission in the Peruvian population.

Temporal trajectory of host and pathogen interactions

   Finally, we sought to evaluate the population behavior of g2g-L2 and
   the interaction between g2g-L2 and rs3130660 in an independent patient
   cohort. We initiated a second cohort in 2020, coincident with the
   COVID-19 surge in Peru. Among the 699 TB cases, 88 (12.6%) were caused
   by g2g-L2, which represents nearly double the prevalence from 10 years,
   which was 6.6% (Fig. [169]5d, Supplementary Fig. [170]19). Most of
   g2g-L2 success came at the expense of other L2 strains–by 2020 g2g-L2
   had nearly completely replaced non-g2g-L2 strains in Lima
   (Fig. [171]5e). These findings are consistent with the population
   expansion and increased transmissibility of g2g-L2 suggested by our
   analysis of the 2010 cohort.

   Given the initial effect size of OR = 10.06 for the association between
   rs3130660 and g2g-L2 strain, we estimated that in this cohort we would
   have had 76% power to reproduce this association with one-tailed
   p < 0.05. After genotyping these samples, we no longer identified an
   increased association between individuals carrying the rs3130660-A
   allele and TB disease caused by g2g-L2 strains (P-value = 0.49,
   OR = 0.57, 95% CI: 0.12 - 2.80, Supplementary Data [172]11). We
   considered several reasons for this null association. It is possible
   that the initially reported statistical association was a
   false-positive. The relatively small sample size in our study may have
   introduced bias into the estimates, affecting the accuracy of the
   logistic model. This bias could have arisen from unrepresentative data
   or chance variability. Alternatively, it is possible that the host
   association is not temporally stable for example because of interaction
   with an environmental factor. Given the strong interactive effects of
   g2g-L2 and rs3130660 on Type 1 interferon signaling, we specifically
   considered the possibility that COVID infection might modify TB risk in
   an Mtb strain and/or host dependent fashion. We did not have sufficient
   representation of AT individuals in the current cohort to assess COVID
   outcomes as a function of host genotype but we were able to assess
   COVID outcomes associated with Mtb strains. We observed a trend for
   patients with a self-reported history of COVID to be less likely to
   have g2g-L2 infection (8.6%) than patients with no self-reported COVID
   history (13.1%, P-value = 0.121, Fisher’s exact test).

   These data suggest that g2g-L2 is a highly transmissible strain
   distinguished most notably in its host interactions by altered
   interferon signaling. However, our data further suggest that dynamic
   host genetic and environmental factors impact the TB disease
   manifestations of this strain.

Discussion

   Previous work has shown that Mtb lineages are often restricted to
   particular geographical regions while other lineages are globally
   distributed, suggesting that Mtb can adapt to diverse natural
   environments and host populations^[173]12,[174]45. In this work, we
   examined the hypothesis that genomic interactions between host and
   pathogen alleles can modify the risk of TB disease by performing a
   paired genome analysis using host and bacterial samples from a large
   cohort of TB patients in Lima, Peru and a smaller, independent cohort
   of TB patients from the same region 10 years later.

   Our analysis of the 2010 cohort indicated that a host variant in the
   FLOT1 gene was preferentially associated with infection with a highly
   transmissible and functionally distinct subclade of L2 Mtb, g2g-L2 that
   emerged in Peru in the 1950’s. Multiple lines of evidence supported
   both the biologic features conferred by the associated host allele and
   the g2g-L2 strains as well as the biologic association between the host
   FLOT1 variant and response to g2g-L2 Mtb strains.

   FLOT1 is a lipid-raft associated membrane protein that has been
   previously implicated in host-pathogen immunity including control of
   Mycobacterium marinum^[175]46,[176]47. FLOT1-dependent microdomains are
   present on the phagolysosome where they act as platforms for the
   assembly of NADPH oxidase complexes and vATPase^[177]17. This function
   contributes to antifungal immunity, and FLOT1 alleles are associated
   with invasive aspergillosis^[178]47. FLOT1 may play a similar role in
   Mtb-macrophage interactions. This protein also contributes to other
   signaling and cell migration pathways^[179]48–[180]51, raising the
   possibility that altered FLOT1 expression could have additional effects
   on the host immune response to Mtb. Finally, FLOT1 may not be the only
   effector of the g2g-L2 associated human risk allele. The associated
   intronic variant is in linkage with and predicted to alter the
   expression of other immune related genes and several of these,
   including PPP1R18, an adapter protein of RAPK signaling, and HLA-E,
   which can present Mtb peptides^[181]52, are also differentially
   expressed after Mtb infection.

   Consistent with this hypothesis, macrophages from donors with different
   genetic backgrounds, referred to by rs3130660-AT and TT, differed in
   their responses to Mtb infection and these differences were modified by
   the infecting strain genotype. Donors with the AT genotype exhibited a
   robust response skewed towards Type 1 IFN related genes. Interestingly,
   g2g-L2 strain infection dampened the IFN response in all donors,
   instead promoting an IL-1B-dominated response. This shift in
   transcriptional response might be attributed to the known negative
   regulatory relationship between Type 1 IFNs and IL-1B^[182]53,[183]54.
   Further, TT donor macrophages were distinguished by their genes
   implicated in redox balance. This was striking to us because, using
   unbiased phenotypic profiling, we identified altered redox balance, and
   resistance to reductive stress as defining characteristics of g2g-L2
   strains in vitro due to a gain of function mutation in thioredoxin
   reductase, trxB2. These data suggest that g2g-L2 strains may be both
   inducing a distinct immunometabolic response and adapting to the
   immunometabolic environment that they induce. There could be other
   g2g-L2 mutations contributing to the altered macrophage responses. For
   example, eccCb1, a core component of the ESX1 secretion system which is
   required for Mtb to induce the Type 1 IFN response in infected
   macrophages, carries a S74F mutation (SIFT = 0.01) which could abrogate
   ESX1 function. This would be consistent with the reduced expression of
   the Type 1 IFN gene signature in macrophages after g2g-L2 function
   although loss of ESX1 function otherwise would be counter-dogmatic as
   it is thought to be required for virulence in humans.

   Given the experimental data supporting the distinct biologic features
   of the rs310660 host variant and g2g-L2 Mtb, as well as the association
   between the two, certain outcomes from the 2020 cohort were unexpected.
   Whereas the proportion of rs310660-AT donors was unchanged while the
   prevalence of g2g-L2 had nearly doubled, essentially replacing the
   other L2 strains. The apparent expansion of g2g-L2 between 2010 and
   2020 was consistent with our phylogenetic analysis of the 2010 cohort.
   However, we had not anticipated a nearly complete L2 strain replacement
   or an L2/L4 difference in fitness. Strain replacement is commonly
   observed in the context of pathogens like Streptococcus pneumoniae or
   SARS-CoV2 that promote strain specific immunity. Mtb infection has been
   shown to provide protection against subsequent Mtb infection, and Mtb
   strain-specific differences in innate and adaptive immunity have been
   described but strain-specific immune exclusion has not been previously
   characterized.

   In this context, we did not find an association between g2g-L2 and
   rs3130660 in the 2020 cohort. We consider it possible that the initial
   association was a false-positive. Alternatively, there may be temporal
   variables which we do not have the capacity to resolve; for example,
   the g2g-L2 and rs3130660 association might alter the rate of
   progression to disease rather than total disease susceptibility, such
   that for a given period transmission rs3130660 variant hosts might
   manifest early while wildtype hosts might manifest years later.
   Finally, we highlight that both host and pathogen variants impact Type
   1 IFN responses, and our data suggests a trend towards interaction with
   symptomatic COVID, an association that we will be able to test with
   greater power when the full cohort is analyzed. Both of these models
   suggest that genetic associations between variants in Mtb and human
   hosts contribute to the transmission dynamics of tuberculosis but that
   we should expect these interactions to be modified by contextual
   variables. Further analysis is required to fully understand how the
   host genes and g2g-L2 strain are mechanistically linked, and to
   establish their clinical relevance. Our results open the possibility
   that other Mtb strain-specific host alleles are present and may explain
   genetic differences driving TB susceptibility across the globe.

Methods

Ethics statement

   We recruited 1632 subjects from a large catchment area of Lima, Peru
   that included 20 urban districts and ~3.3 million residents to donate a
   blood sample for use in our study. We obtained written informed consent
   from all the participants. The study protocol was approved by the
   Institutional Review Board of Harvard School of Public Health and by
   the Research Ethics Committee of the National Institute of Health of
   Peru.

Participant enrollment and follow-up

   The study design and methods were previously described in
   detail^[184]40,[185]55. Briefly, over the 4-year period from 2009-2012,
   we identified patients ≥15 years of age who had received a diagnosis of
   pulmonary TB at any of 106 participating health centers. We confirmed
   the microbiological status of their disease with either a positive
   sputum smear or mycobacterial culture. We also recorded the index
   patients’ baseline smear status, HIV status, and drug-resistance
   profiles. Index cases with HIV infection or infected with multiple Mtb
   strains were excluded from the analyses. The sex of each participant
   was assigned based on genotyping and cross-checked with self-report
   data.

Host genotyping, quality control and imputation

   DNA samples were genotyped using a customized genotyping array
   (LIMAArray) based on whole-exome sequencing data from 116 active TB
   cases to optimize the capture of genome-wide genetic variation in
   Peruvian individuals^[186]10 in the 2010 cohort. DNA samples were
   genotyped using the Illumina Global Screening Assay with customized
   contents in the 2020 cohort. In both cohorts, we excluded samples that
   were missing >5% of the genotype data, had an excess of heterozygous
   genotypes, and/or duplicated with identity-by-state >0.9. We also
   excluded variants with a call rate <95%, with duplicated position
   markers, minor allele frequency <1%, and marked deviation from
   Hardy-Weinberg equilibrium (excluded if P < 10^-5).

   We imputed eight classical HLA genes HLA-A, -B, -C, -DQA1, -DQB1,
   -DRB1, -DPA1, and -DPB1, amino acids and intergenic variants using
   HLA-TAPAS using a multi-ancestry reference panel which contains data
   from 21,546 unrelated individuals^[187]56 in the 2010 cohort of 1556
   individuals.

Mycobacterium tuberculosis variant calling and phylogeny construction

   Mtb DNA was extracted from cryopreserved cultures of sputum samples.
   Each sample was thawed and subcultured on LJ agar and a big loop of
   colonies were lysed with lysozyme and proteinase K to obtain DNA using
   cetyltrimethylammonium bromide (CTAB)/Chloroform extraction and ethanol
   precipitation. Samples were sequenced on an Illumina HiSeq 2500 or 4000
   sequencer yielding paired-end reads of length 125 bp which were aligned
   to the reference assembly, H37Rv [188]NC_000962.3 (GenBank accession
   [189]CP003248). We called variants using Pilon (v1.22)^[190]57. Genome
   coverage was assessed using Samtools (v1.10). We excluded isolates with
   evidence of mixed infection using the barcode method^[191]15. We
   assigned a variant call as missing if the valid depth of coverage at a
   specific variant was <12 reads, if the mean read mapping quality at the
   site did not reach 10, or if none of the alternative alleles account
   for at least 85% of the valid coverage.

   The phylogenetic tree was constructed based on the WGS Mtb alignment.
   Variants occurring in genes with repetitive elements such as
   transposases, proline-glutamate (PE) or proline-proline-glutamate
   (PPE)^[192]58 were excluded to avoid any inaccuracies in the alignment.
   After applying these filters, 13,981 Mtb variants were used to conduct
   a genetic distance matrix. We built a maximum likelihood phylogenetic
   tree of 1602 Mtb isolates that was inferred using IQ-TREE v2^[193]59
   with bootstrap support from 1000 replications. The best-fit nucleotide
   substitution model was GTR + F as determined by the ModelFinder
   function.

   To characterize the Peruvian L2 Mtb strains in the context of global
   strains, we randomly selected 1000 whole-genome sequenced L2 strains
   from published studies in 112 bioprojects to represent the major
   phylogenetic structures of L2. We used these global strains together
   with 255 L2 Peruvian strains to reconstruct the global L2 phylogenetic
   tree. We estimated divergence times via BEAST v1.8.0^[194]42, using an
   uncorrelated lognormal relaxed clock that allows for tree branches to
   evolve at different rates. The XML input file was modified to specify
   the number of invariant sites in the Mtb genomes. For the Mtb
   substitution rate, we used a normal distribution with a mean of
   4.6 × 10^-8 substitution per genome per site per year (3.0 × 10^-8 to
   6.2 × 10^-8, 95% highest polar density interval), which was calibrated
   by ancient DNA samples^[195]60,[196]61. An uncorrelated log-normal
   distribution was used for the substitution rate and a constant
   population size for the tree priors. We ran three chains of 5 × 10^-7
   generations and sampled every 10,000 generations to ensure independent
   convergence of the chains; we discarded the first 10% as a burn-in.
   Convergence was assessed using Tracer (v1.7.0)^[197]62, ensuring all
   relevant parameters reached an effective sample size >100.

Transmission cluster analysis

   To define transmission clusters of the collected L2 Mtb isolates, we
   applied two SNP thresholds (6 and 12) to separate a patient isolate
   from that of at least one other patient in the cluster. The 6-SNP
   threshold was chosen based on the range of SNP distances between paired
   isolates of the same strain obtained at different times from relapsed
   TB patients^[198]63. The 12-SNP threshold was a previously defined
   upper limit of genomic relatedness noted within human hosts and between
   epidemiologically related human hosts^[199]64.

Genome-to-genome association analysis

   To systematically test for associations between human and Mtb genomic
   variants at the genome-wide level, we performed logistic mixed
   regression modeling implemented in SAIGE^[200]65 version 1.1.6. This
   model was specifically designed to account for the binary nature of our
   outcome variable (absent/present of Mtb variant) and the unbalanced
   case-control ratio in our sample. We assumed an additive genetic model
   for each host and bacterial variant. For each bacterial variant j
   (j = 1,…, 1267) and host variant i (i = 1,…, 676,110), we test
   [MATH:
   <msub><mrow><mi>y</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>=</mo><m
   sub><mrow><mi>x</mi></mrow><mrow><mi>i</mi></mrow></msub><msub><mrow><m
   i>β</mi></mrow><mrow><mi>j</mi><mi>i</mi></mrow></msub><mo>+</mo><mi>G<
   /mi><mi>R</mi><mi>M</mi><mo>+</mo><msub><mrow><mi>ω</mi></mrow><mrow><m
   i>j</mi></mrow></msub><msub><mrow><mi>c</mi></mrow><mrow><mi>j</mi></mr
   ow></msub><mo>,</mo> :MATH]
   1

   where y[j] is an N-vector of binary labels indicating the absence or
   presence of a bacterial variant in the host samples i (n = 1556); x[i]
   is an N-vector of host genotypes; β[ji] is the additive effect of the
   host variant i on the bacterial variant j; c[j] is the j^th column of
   an N × C matrix of covariates (age and sex) including a column of 1 s;
   and ω[j] is a c-vector of the corresponding coefficients including the
   intercept. We used genetic relatedness matrix (GRM) as a random effect
   to correct for cryptic relatedness and population stratification
   between collected host samples. The GRM was obtained from an LD-pruned
   (r^2 < 0.2) genotype with minor allele frequency >1%.

Permutation strategy

   To determine an appropriate empirical genome-wide significance
   threshold for the g2g analysis accounting for population structure both
   in the Mtb and human genome, we conducted 200 simulations for each g2g
   analysis using a permutation procedure. To preserve the phenotype
   structure defined by the Mtb variants, for each association analysis,
   we randomly permuted the presence/absence status for each of the 1267
   Mtb variants within the 1556 individuals included in the study. In
   total, we ran 1267 × 200 = 253,400 association tests to obtain an
   empirical genome-wide significance threshold.

   We measured the distributions of the minimum p-values of the variants
   (P[min]) for each association of the permutation. We defined an
   empirical genome-wide significance threshold, -log[10](P[sig]), as the
   95th percentile (1-ɑ) of -log[10](P[min]).

Colocalization analysis using eQTLs

   We integrated our GWAS results with cis-eQTL data using a Bayesian
   method (coloc v3.2-1)^[201]66. We evaluated whether the GWAS and eQTL
   associations best fit a model in which the associations are due to a
   single shared variant that is summarized by the posterior probability,
   as opposed to different regulating variants. We used gene expression
   datasets from the eQTL catalog release 6^[202]21. We tested pairwise
   colocalization between 109 bulk RNA-sequencing expression datasets in
   69 distinct cell and tissue types and GWAS from the most significant
   genome-to-genome pair (using Mtb position 271640 as phenotype). We used
   GWAS and all variant-gene cis-eQTL associations tested in each tissue,
   including non-significant associations of genes within a 700 kb window
   around the top association host SNP (rs3130660). A posterior
   probability of colocalization >0.75 was considered to be strong
   evidence for shared causality.

Bacterial strains and culture conditions

   All 23 g2g-L2 and 11 non-g2g-L2 Mtb strains were drug sensitive and
   obtained from HIV negative donors. They were grown shaking at 37 °C.
   Cultures were grown in 7H9 media (Middlebrook 7H9 salts with 0.2%
   glycerol, 10% OADC [oleic acid, albumin, dextrose, catalase]). Cell
   density (OD600) was determined by a spectrophotometer. For lipidomic
   analysis, liquid cultures in tween-free 7H9 medium supplemented with
   albumin, dextrose and sodium chloride were transferred to
   nitrocellulose filter-paper discs using vacuum filtration and grown on
   solid agar media^[203]67. The bacteria from the filter disc were
   dislodged into 2 ml of methanol and transferred to 15 ml conical glass
   tube. Then 1 ml of chloroform was added and incubated for 30 min for
   inactivation of the bacteria, using the extracted lipids for MS-based
   lipidomics.

Mammalian cell culture

   Mammalian cells were cultured in RPMI 1640 with 10% fetal bovine serum
   (FBS), 10 mM HEPES, and 2mM L-glutamine. Primary human monocytes were
   isolated from peripheral blood mononuclear cells (PBMCs) from three
   rs3130660-AT and three rs3130660-TT donors from Peru. PBMCs were also
   obtained by Ficoll gradient centrifugation of randomly selected healthy
   donor leukaphereses (Research Blood Components) or buffy coat blood
   (Massachusetts General Hospital). Monocytes were isolated by
   CD14-positive selection (Stemcell Technologies) and matured in 50 ng/mL
   human recombinant macrophage colony-stimulating factor (M-CSF) for
   6 days.

RNA extraction

   To isolate RNA from Mtb-infected hMDMs, hMDMs were lysed in Buffer RLT
   (Qiagen) + 1% β-mercaptoethanol after 24 h of infection. RNA was
   purified using the Zymo Direct-Zol kit according to the manufacturer’s
   instructions, with off-column DNase treatment and subsequent
   repurification using the Zymo Clean & Concentrator kit according to
   manufacturer’s instructions.

cDNA generation and qPCR

   cDNA was generated using the SuperScript IV First Strand Synthesis
   System (ThermoFisher). Gene expression was quantified with iTaq
   Universal SYBR green supermix (Bio-Rad) on an Applied Biosystems ViiA 7
   system. FLOT1 expression was normalized to that of GADPH. All qPCR
   primers used are listed in Supplementary Data [204]12.

Nanostring assay and analysis

   RNA extracted from local donor MDMs was used as input in a Nanostring
   assay with a custom-designed probe set and analyzed with nSolver
   version 4.0 (Nanostring Technologies). Target sequences are listed in
   Supplementary Data [205]13. Data were normalized against internal
   positive controls and housekeeping genes (ABCF1, COG7, GUSB, MRPS5,
   POLR2A, SAP130, SDHA, TLK2) to correct for technical variation.

Bulk RNA-sequencing gene expression quantification

   For RNA-sequencing, the single-end raw reads were filtered by
   Trimmomatic version 0.39 to remove the adapters and low-quality bases.
   We used STAR version 2.6.0c and RSEM version 1.2.29 to quantify gene
   expression using human genome primary assembly (GRCh38) and its basic
   gene annotation from gencode human release 43 (GRCh38.p13). The
   pipeline generates expected counts and transcripts per million (TPM)
   for each gene. We used log-transformed, log[2](TPM + 1), as our main
   expression measure, which accounts for library size and gene size. We
   considered as expressed genes those with a log[2](TPM + 1) > 1 in at
   least half of the samples. Expected counts were used for DESeq in R for
   PCA analysis.

   The specific parameters for Trimmomatic are: –Truseq3-SE.fa:2:30:10
   --LEADING:3 --TRAILING:3 --SLIDINGWINDOW:4:1 --MINLEN:36.The specific
   parameters for STAR are: --outSAMunmapped Within --outFilterType
   BySJout --outSAMattributes NH HI AS NM MD --outFilterMultimapNmax 20
   --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.04
   --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000
   --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --sjdbScore 1
   --runThreadN 12 --genomeLoad NoSharedMemory --outSAMtype BAM Unsorted
   --quantMode TranscriptomeSAM --outSAMheaderHD \@HD VN:1.4 SO:unsorted.

   The specific parameters for RSEM are: rsem-calculate-expression --star
   --star-output-genome-bam --num-threads 12 --star-gzipped-read-file.

Differential expression analyses

   We used linear mixed models using lme4::lmer() function in R in the
   differential expression and expression association analysis. To call
   significant caes, we used a likelihood ratio test between two nested
   models using anova() function in R, and an FDR-adjusted P-value at 0.01
   and an absolute log[2] fold-change greater or equal to 0.7.

   To search for genes that are differentially expressed upon g2g and
   non-g2g L2 infection, we tested the following model (2):
   [MATH:
   <msub><mrow><mi>H</mi></mrow><mrow><mn>0</mn></mrow></msub><mo>:</mo><m
   space
   width="0.25em"></mspace><mi>E</mi><mo>~</mo><mi>I</mi><mi>n</mi><mi>f</
   mi><mi>e</mi><mi>c</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mo>+</m
   o><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>∣</mo><mi>d</mi><mi>o</mi><mi>n</mi><mi>o<
   /mi><mi>r</mi></mrow></mfenced><msub><mrow><mi>H</mi></mrow><mrow><mn>1
   </mn></mrow></msub><mo>:</mo><mspace
   width="0.25em"></mspace><mi>E</mi><mo>~</mo><mi>G</mi><mn>2</mn><mi>G</
   mi><mo>+</mo><mi>I</mi><mi>n</mi><mi>f</mi><mi>e</mi><mi>c</mi><mi>t</m
   i><mi>i</mi><mi>o</mi><mi>n</mi><mo>+</mo><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>∣</mo><mi>d</mi><mi>o</mi><mi>n</mi><mi>o<
   /mi><mi>r</mi></mrow></mfenced> :MATH]
   2

   Finally, to nominate genes that are differentially expressed with
   different genetic background, we tested the following model (3):
   [MATH:
   <msub><mrow><mi>H</mi></mrow><mrow><mn>0</mn></mrow></msub><mo>:</mo><m
   space
   width="0.25em"></mspace><mi>E</mi><mo>~</mo><mi>G</mi><mn>2</mn><mi>G</
   mi><mo>+</mo><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>∣</mo><mi>s</mi><mi>t</mi><mi>r</mi><mi>a<
   /mi><mi>i</mi><mi>n</mi></mrow></mfenced><msub><mrow><mi>H</mi></mrow><
   mrow><mn>1</mn></mrow></msub><mo>:</mo><mspace
   width="0.25em"></mspace><mi>E</mi><mo>~</mo><mi>G</mi><mo>+</mo><mi>I</
   mi><mi>n</mi><mi>f</mi><mi>e</mi><mi>c</mi><mi>t</mi><mi>i</mi><mi>o</m
   i><mi>n</mi><mo>+</mo><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>∣</mo><mi>s</mi><mi>t</mi><mi>r</mi><mi>a<
   /mi><mi>i</mi><mi>n</mi></mrow></mfenced> :MATH]
   3

   where E is the expression level for each gene, Infection indicates if
   the sample is infected or uninfected, G2G represents if the sample is
   infected with g2g-L2 (1) or not (0), G represents the genotype of the
   infected samples (AT = 1; TT = 0). We included donor as a random effect
   in the infection and strain-specific model and strain as a random
   effect when testing for donor-specific expression changes.

Microscopy imaging and analysis

   Mtb cultures were grown to OD600 of ~1.0, then fixed with 2%
   paraformaldehyde (PFA) for 1 h. The fluorescent D-amino acid NADA
   (3-[(7-Nitro-2,1,3-benzoxadiazol-4-yl)amino]-D alanine hydrochloride)
   was added at a final concentration of 1 mM 16 h prior to fixation. All
   samples were seeded onto molded 1.8% agarose in phosphate buffered
   saline (PBS) with DAPI (2.5 μg/mL) and Nile red (0.1 μg/mL). Samples
   were incubated at 37 °C for 1 h prior to imaging.

   Samples were imaged with a Plan Apo 100× 1.45 NA objective using a
   Nikon Ti-E inverted, widefield microscope equipped with a Nikon Perfect
   Focus system with a Piezo Z drive motor, Andor Zyla sCMOS camera, and
   NIS Elements (v4.5). Semi-automated imaging was carried out using a
   customized Nikon JOBS script to locate imaging fields of interest, 24
   images were taken for each strain. Cell segmentation and quantification
   was performed using our previously published pipeline, MOMIA^[206]32.

Mycobacterial lipidomics

   Whole cell lipid profiles were generated using chloroform-methanol
   extraction of cells from strains grown in biological triplicate using a
   previously published liquid chromatography and mass spectrometry
   protocol^[207]28, with samples profiled using both positive and
   negative mode. The order of extraction and mass spectrometry was
   randomized, with mass and retention time values assigned using the R
   package xcms^[208]68. Ion peaks with a < 10 ppm match to the exact mass
   of a theoretical lipid were assigned a unique identifier to group peaks
   across the data set. The median of triplicate samples was used for
   differential abundance analysis using the R package limma^[209]69.

Antibiotic susceptibility determination

   Pretomanid (PA-824), isoniazid (INH), and rifampicin (RIF) MICs were
   determined using an OD-based growth assay. 96-well plates containing
   1.5-fold serial dilutions of PA-824, INH or RIF were prepared as
   before. All Mtb strains were grown to mid-log phase and diluted to
   OD600 = 0.003 into each well of the prepared plates. Plates were
   incubated at 37 °C with shaking in sealed plates. Growth was determined
   by repeated OD600 measurements over a 2-week period. We assessed
   cumulative bacterial growth inhibition over the course of the
   experiment by calculating the area under the curve (AUC) of the OD
   values^[210]70. Data are representative of two independent replicates
   per strain. Using the final OD600 measurement of each strain for each
   drug concentration, the data was fitted to the Gompertz equation to
   determine the MIC^[211]71.

NAD/NADH metabolite extraction and measurement

   Mtb strains were grown in 7H9 until mid-log phase, then pelleted by
   centrifugation, washed twice with PBS, and resuspended in an extraction
   buffer from an NAD/NADH Quantitation Kit (Sigma Aldrich). Bacterial
   suspensions were transferred to tubes containing Lysing Matrix B (MP
   Biomedicals) and lysed via bead beating. The resulting lysate was used
   to quantify NAD and NADH following the kit’s instructions.

Oxidative stress resistance assay

   Mtb cultures were grown in 7H9 to mid-log phase. Cultures were treated
   with menadione (50 uM or 100 uM), or with H2O2 (25 mM or 50 mM), or
   were left untreated as a control. After 4 h (H2O2) or 24 h (H2O2 and
   menadione), all Mtb cultures were serially diluted and plated on 7H11
   agar plates. Plates were incubated at 37 °C for 2–3 weeks and the
   number of colonies were counted.

Pathway analysis

   In the host pathway analysis, we first identified significant DE human
   genes upon bacteria or host genetic backgrounds as shown in Model 2 and
   3 in the method section describing “differential expression analysis”.
   To enrich pathway genes and reduce the risk of false positives, we took
   the union set of these DE genes with a more stringent threshold of
   FDR-adjusted p-value < 0.01 and absolute log[2] fold-change >0.7. We
   categorized samples in four groups according to the bacteria and host
   genotype. For each of the groups, we calculated the z-score for the
   average expression levels of the selected DE genes. We used function
   cluster::pam() in R to partition the z-score profile into 4 clusters
   around medoids based on Euclidean distance. For genes within each
   cluster, we applied DAVID Functional Annotation Clustering
   function^[212]72,[213]73 to enrich pathways from annotated databases of
   “KEGG Pathway”, “Reactome pathway” and “Wikipathways”, with significant
   thresholds of FDR-adjusted p-value < 0.01.

   In the Mtb pathway analysis, we first annotate the Mtb genome using
   Variant Effect Predictor^[214]74 (v.105). Among 79 highly correlated
   (Pearson r^2 > 0.8) common Mtb variants that are associated with the
   human genome, 49 were annotated as nonsynonymous (missense) variants.
   We defined Mtb genes that have one of these 49 missense variants as
   g2g-L2 genes. We used the DAVID^[215]75 Functional Annotation
   Clustering function to group g2g-L2 genes into a similar functional
   network. We used PANTHER^[216]76 to test for functional overlaps with
   known pathways. We calculated a Fisher’s exact p-value by constructing
   a 2 × 2 contingency table with g2g-L2 genes (n = 48) and all other Mtb
   genes with a common missense mutation in the g2g analysis (n = 898)
   overlapping the flavin adenine dinucleotide binding pathway included in
   the gene ontology category (GO:0050660).

TRFS-green assay

   Mtb or M. smegmatis strains were grown in 7H9 until mid-log phase.
   Cultures were washed twice with PBS, then resuspended in PBS containing
   10 uM TRFS-green (MedChemExpress) and added to a 96 well plate. The
   fluorescent signal was read using a BioTek Synergy H1 (Mtb strains) or
   a TECAN Spark (M.smegmatis strains) microplate reader, at excitation
   438 nm and emission 538 nm, immediately after probe addition and at
   various time points thereafter. Between readings, the plates were kept
   at 37 °C.

Generation of M. smegmatis strains expressing the Mtb TrxB2_TrxC operon

   The Mtb operon containing Rv3913 (trxB2) and Rv3914 (trxC) was
   expressed in M.smegmatis mc2155 using the integrating plasmid pCT94
   under control of the constitutive promoter pUV15. The Rv3913-3914
   operon was amplified from representative g2g-L2 and non-g2g-L2 genomic
   DNA using primers listed in Supplementary Data [217]12. A subsequent
   round of PCR was used to add on overlapping vector handles. All PCR was
   performed using Q5 High-Fidelity DNA polymerase (NEB). Expression
   vectors were constructed via Gibson assembly (NEB) of the PCR fragments
   ligated into NdeI/HindIII digested pCT94. The g2g and non-g2g
   constructs were transformed into DH5a cells and selected on LB plates
   with 50 ug/ml kanamycin. CT94-g2g-trxB2trxC and CT94-nong2g-trxB2trxC
   were electroporated into mc 2 155 and selected on 7H10 plates with
   20 ug/ml kanamycin. Successful transformation was confirmed by Sanger
   Sequencing (Azenta).

Reporting summary

   Further information on research design is available in the [218]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [219]Supplementary Information^ (5.4MB, pdf)
   [220]41467_2024_54741_MOESM2_ESM.pdf^ (68.3KB, pdf)

   Description of Additional Supplementary Files
   [221]Supplementary Data 1-13^ (5.6MB, xlsx)
   [222]Reporting Summary^ (3.1MB, pdf)
   [223]Transparent Peer Review file^ (11.7MB, xlsx)

Source data

   [224]Source Dataset 1^ (82.9KB, txt)
   [225]Source Dataset 2^ (14.8KB, txt)
   [226]Source Dataset 3^ (16.7KB, pdf)

Acknowledgements