Abstract

   We systemically identified tuberculosis (TB)-related DNA methylation
   biomarkers and further constructed classifiers for TB diagnosis.
   TB-related DNA methylation datasets were searched through October 3,
   2020. Limma and DMRcate were employed to identify differentially
   methylated probes (DMPs) and regions (DMRs). Machine learning methods
   were used to construct classifiers. The performance of the classifiers
   was evaluated in discovery datasets and a prospective independent
   cohort. Eighty-nine DMPs and 24 DMRs were identified based on 67 TB
   patients and 45 healthy controls from 4 datasets. Nine and three DMRs
   were selected by elastic net regression and logistic regression,
   respectively. Among the selected DMRs, two regions (chr3:
   195635643–195636243 and chr6: 29691631–29692475) were differentially
   methylated in the independent cohort (p = 4.19 × 10^−5 and 0.024,
   respectively). Among the ten classifiers, the 3-DMR logistic regression
   classifier exhibited the strongest performance. The sensitivity,
   specificity, and area under the curve were, respectively, 79.1%, 84.4%,
   and 0.888 in the discovery datasets and 64.5%, 90.3%, and 0.838 in the
   independent cohort. The differential diagnostic ability of this
   classifier was also assessed. Collectively, these data showed that DNA
   methylation might be a promising TB diagnostic biomarker. The 3-DMR
   logistic regression classifier is a potential clinical tool for TB
   diagnosis, and further validation is needed.

   Keywords: tuberculosis, DNA methylation, biomarker, classifier,
   molecular diagnosis

Graphical abstract

   graphic file with name fx1.jpg
   [37]Open in a new tab
     __________________________________________________________________

   We systemically identified TB-related DNA methylation biomarkers and
   constructed diagnostic classifiers. A 3-DMR logistic regression
   classifier showed excellent performance in discovery datasets and a
   prospective cohort. DNA methylation might be a promising TB-related
   biomarker and the 3-DMR logistic regression classifier is a potential
   clinical tool for TB diagnosis.

Introduction

   The complex nature of Mycobacterium tuberculosis (MTB) has greatly
   contributed to the continuous effects of tuberculosis (TB), the world’s
   top infectious killer, on human populations for thousands of
   years.[38]^1 Globally, in 2019, approximately 10 million people fell
   ill with TB, and 1.4 million people died of TB.[39]^2 Luckily, TB is a
   preventable and curable illness, and the key to thwarting the TB
   epidemic is early diagnosis.[40]^3 The detection power of etiological
   tests (Xpert MTB/RIF, culture, etc.) largely depends on sample
   quality.[41]^4 Traditional host immune response assays (interferon-γ
   [IFN-γ] release assays, tuberculin tests, etc.) work mainly by
   measuring the products of the host immune response. Most of these
   products are downstream molecules in various biological pathways,
   suggesting that these molecules are regulated by various factors.[42]^5
   In addition to the limited performance of the available detection
   methods, certain features of TB also increase the difficulty of its
   clinical diagnosis. For example, incipient TB is characterized by an
   indeterminate period of asymptomatic infection or absence of typical
   clinical symptoms and imaging features, which limit the effectiveness
   of TB diagnosis.[43]^6 Therefore, new and more powerful detection tools
   are urgently needed. Current evidence has indicated that the detection
   of non-sputum blood biomarkers of TB is a preferred approach for
   clinical diagnosis and progression surveillance.[44]^7^,[45]^8

   TB is an infectious disease whose pathogenesis involves dynamic
   interactions among the host, MTB, and the environment. The circulatory
   system serves as a site for cellular communication and dynamic exchange
   of various cellular factors and chemokines, and thus peripheral blood
   has been considered the preferred sample type for studying and
   diagnosing infectious diseases.[46]^9 To date, peripheral blood
   biomarkers of TB at the genomic, transcriptomic, epigenetic, and
   proteomic levels have received enormous attention, and numerous markers
   have been identified.[47]^7^,[48]9, [49]10, [50]11 Of note, epigenetics
   can bridge the gaps between the host, MTB, and the
   environment.[51]^12^,[52]^13 Therefore, epigenetic biomarkers in the
   peripheral blood may have great potential in TB diagnosis and
   progression surveillance.

   DNA methylation, the most widely studied epigenetic modification,
   refers to the formation of 5-methylcytosine through the transfer of a
   methyl group to the carbon-5 position of the cytosine base by DNA
   methyltransferase enzymes.[53]^14 DNA methylation is responsible for
   regulating gene expression, chromatin structure, and alternative
   splicing.[54]^15^,[55]^16 In human somatic cells, methylation occurs in
   approximately 90% of the cytosines in CpG sites; most importantly, this
   ratio varies among different tissue types, cell types, and disease
   states.[56]^15 As the advantages of DNA methylation include critical
   roles in biological processes, disease/tissue/cell-type-specific
   patterns, and inherent stability, specific DNA methylation profiles
   have been mapped for many diseases, such as breast cancer, adiposity,
   and others.[57]^17^,[58]^18 Some pathogens, such as human
   immunodeficiency virus and hepatitis B virus, have been verified to
   alter the DNA methylation landscape of the host;[59]^19^,[60]^20 MTB
   has also been reported to have such activity. DiNardo et al.[61]^21
   found that the methylation level of mycobacterial immunity-related
   genes was upregulated, leading to a downregulation of the host immune
   response against MTB. These findings indicated that DNA methylation is
   important in TB development and may serve as a promising biomarker.
   However, systematic work to explore the value of DNA methylation in TB
   has thus far been less common.

   Toward this end, we (1) summarized available datasets to systematically
   describe the DNA methylation characteristics of TB and identify
   potential biomarkers; (2) applied machine learning methods to construct
   TB diagnostic classifiers by triaging a parsimonious list of the most
   promising targets; and (3) performed validation studies in an
   independent cohort to ensure that these molecules are robust as
   TB-related biomarkers. This study aimed to establish TB-related
   methylation profiles and facilitate TB diagnosis, as well as decipher
   the underlying connection between TB and DNA methylation.

Results

   The flow chart of the study protocol is shown in [62]Figure 1.

Figure 1.

   [63]Figure 1
   [64]Open in a new tab

   The flow chart of the study protocol

   Abbreviations: TB, tuberculosis; DMPs, differentially methylated
   probes; DMRs, differentially methylated regions; HC, healthy control.

Different DNA methylation patterns in TB patients and healthy controls

   Altogether, 4 datasets, including 67 TB patients and 45 healthy
   controls (HCs), met the inclusion criteria ([65]Table 1). A total of
   363,416 probes were retained for analysis after filtering and adjusting
   for batch effects. In total, 89 differentially methylated probes (DMPs)
   were found between TB patients and HCs ([66]Figure 2A). Mapping of the
   89 DMPs onto genomic features showed that the most probes targeted
   intronic features (29%, 26/89), followed by exons (21%, 19/89) and
   intergenic features (20%, 18/89) ([67]Figure 2C). Among the 89 DMPs,
   68.5% (61/89) sites were hypermethylated and 31.5% (28/89) were
   hypomethylated in TB patients compared with HCs ([68]Figure 2E).

Table 1.

   The detailed information of included methylation arrays
   GEO ID Platform Characteristics of included subjects
     __________________________________________________________________

   Sample numbers
     __________________________________________________________________

   Sample type
   Race Mean age (years) Sex (male/female, %) HIV status TB HC
   [69]GSE118469 [70]GPL13534 Asian >18 100 0 15 6 PBMC
   [71]GSE104287 [72]GPL13534 Caucasian >18 62.5 0 32 16 PBMC and NK cell
   [73]GSE72338 [74]GPL13534 African >18 63.2 0 17[75]^a 20 monocyte and
   neutrophil
   [76]GSE107917 [77]GPL23976 Asian N/A N/A N/A 3 3 whole blood
   [78]Open in a new tab

   GEO, Gene Expression Omnibus; HIV, human immunodeficiency virus; TB,
   tuberculosis; HC, healthy control; PBMC, peripheral blood mononuclear
   cell; N/A, not applicable; NK, natural killer.
   ^a

   One sample ([79]GSM1860484) was not included due to the filter
   conditions used in raw data procession.

Figure 2.

   [80]Figure 2
   [81]Open in a new tab

   Differently methylated probes and differentially methylated regions
   between tuberculosis patients and healthy controls

   (A and B) The Manhattan plot of all methylated probes and regions,
   respectively. The ordinate and abscissa represented the −log[10] p
   value and chromosomes, respectively. The different colors above the
   abscissa show the probe numbers in corresponding chromosomes and the
   horizontal line indicates the threshold value of p. (C and D) The types
   of differentially methylated probe-related genes and differentially
   methylated region-related genes, respectively. ncRNA, non-coding RNA.
   (E and F) The heatmap of differentially methylated probes and
   differentially methylated regions between tuberculosis patients and
   healthy controls, respectively. The method of correlation was used for
   row clustering.

   Further analysis was performed to identify ethnicity-specific and
   sample type-specific DMPs ([82]Figure S1).

   We then shifted focus to the methylation regions with the strongest
   prospective ability to regulate gene expression. Twenty-seven
   differentially methylated regions (DMRs) covering 310 CpG sites were
   identified ([83]Figure 2B). Three regions were excluded because only
   one CpG site was present in each region. Of the 24 DMRs, genomic
   annotation showed that 42% (10/24) were located in exon regions, and
   25% (6/24) were located in 5′ UTRs ([84]Figure 2D). The locations of
   DMRs exhibited a chromosomal bias, and nearly half of the DMRs (10/24)
   were distributed on chromosome 6. Of all DMRs, only one (chr3:
   50336343–50337494) mapped to two genes (HYAL3 and NAT6). According to
   the mean fold change (FC) of the β value, TB patients showed
   significant hypermethylation compared with HCs in 22 regions and
   hypomethylation in 2 regions ([85]Table 2; [86]Figure 2F).

Table 2.

   Identified top different methylation regions between TB patients and
   HCs
   Chromosome Genetic position[87]^a
     __________________________________________________________________

   Gene symbol CpG site number Mean beta fold change Adjusted p value
   Start End
   Chr1 160068509 160068681 IGSF8 6 1.22 × 10^−2 1.06 × 10^−8
   Chr3[88]^b^,[89]^c 195635643 195636243 TNK2-AS1 2 9.36 × 10^−3 1.17 ×
   10^−9
   Chr3 50336343 50337494 HYAL3; NAT6 17 7.01 × 10^−4 4.13 × 10^−16
   Chr6 33160067 33160976 COL11A2 19 4.60 × 10^−3 3.96 × 10^−18
   Chr6[90]^c 30458519 30458601 HLA-E 3 −3.76 × 10^−2 5.74 × 10^−9
   Chr6 44190729 44191600 SLC29A1 10 6.24 × 10^−4 1.88 × 10^−11
   Chr6[91]^c 33244976 33246390 B3GALT4 37 1.19 × 10^−2 7.05 × 10^−20
   Chr6[92]^c 31627090 31627313 C6orf47 6 1.04 × 10^−2 6.81 × 10^−9
   Chr6[93]^c 33283789 33284168 ZBTB22 10 8.26 × 10^−3 3.96 × 10^−9
   Chr6 29600108 29600468 GABBR1 10 2.66 × 10^−3 3.30 × 10^−9
   Chr6[94]^c 31937968 31938372 DXO 6 4.98 × 10^−3 7.91 × 10^−10
   Chr6[95]^b^,[96]^c 29691631 29692475 HLA-F 21 −1.10 × 10^−2 2.75 ×
   10^−12
   Chr6 30711586 30712559 IER3 23 5.67 × 10^−5 3.03 × 10^−11
   Chr8 144328914 144329279 ZFP41 7 4.18 × 10^−3 2.05 × 10^−9
   Chr11[97]^b^,[98]^c 65315205 65315625 LTBP3 4 2.28 × 10^−2 3.63 × 10^−9
   Chr11 2019930 2020560 H19 15 1.65 × 10^−2 3.12 × 10^−10
   Chr11 66034896 66035392 KLC2 14 3.86 × 10^−3 1.98 × 10^−9
   Chr12 133065912 133066762 FBRSL1 18 5.40 × 10^−3 1.22 × 10^−15
   Chr15[99]^c 75743753 75744225 SIN3A 8 6.00 × 10^−3 8.06 × 10^−11
   Chr16 11350112 11350371 SOCS1 6 6.65 × 10^−3 1.28 × 10^−10
   Chr17 7210796 7211307 EIF5A 6 6.65 × 10^−3 8.61 × 10^−12
   Chr19 55850629 55851365 KMT5C 12 3.31 × 10^−3 2.55 × 10^−13
   Chr20 57426538 57427973 GNAS 38 1.59 × 10^−2 2.88 × 10^−21
   Chr22 38851318 38852154 KCNJ4 9 1.06 × 10^−2 1.26 × 10^−16
   [100]Open in a new tab
   ^a

   The positions were obtained according to hg19, GRCh37 (Genome Reference
   Consortium Human Reference 37).
   ^b

   These three different methylation regions were selected by logistic
   regression into models.
   ^c

   These nine different methylation regions were selected by elastic net
   regression into models.

   Principal-component analysis was conducted to explore the difference in
   24 DMRs among ethnicity-specific and sample type-specific subgroups
   ([101]Figure S2).

Function enrichment of DMRs

   Gene ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes
   and Genomes (KEGG) pathway enrichment analysis were performed for 25
   DMR-associated genes. GO analysis indicated that DMR-associated genes
   contributed to many immune-related biological functions, such as immune
   cell activation and regulation, cellular response to IFN-γ, and
   cytotoxicity ([102]Figure S3A). KEGG analysis suggested that these 25
   genes mainly participated in antigen processing and presentation and
   pathogen infection-related pathways (cytomegalovirus, papillomavirus,
   etc.) ([103]Figure S3B). A protein-protein interaction (PPI) network
   was built to visualize the interactions among the proteins encoded by
   these 25 genes. HLA-F, ZBTB22, SIN3A, and GABBR1 were considered hub
   genes in the constructed network ([104]Figure S3C).

Construction of diagnostic classifiers based on the validated DMRs

   Variables were selected by logistic regression and elastic net
   regression. A total of six machine learning methods were applied to
   construct classifiers based on the selected variables.

   Through binary univariate and multivariate logistic regression, three
   DMRs (chr11: 65315205–65315625, chr3: 195635643–195636243, and chr6:
   29691631–29692475) were finally selected as classifiers ([105]Figures
   3A and 3B). For these three DMRs, the logistic regression classifier
   yielded a sensitivity of 79.1%, specificity of 84.4%, and area under
   the curve (AUC) of 0.888 (95% confidence interval [CI]: 0.831–0.945)
   ([106]Figures 3C and 3D). This classifier also showed a net benefit in
   performance regardless of the risk threshold selected ([107]Figure 3E).
   The highest AUC was 0.999 (95% CI: 0.997–1.000) for the random tree
   classifier, followed by that for the extreme gradient boosting
   (XGBoost) (AUC = 0.972; 95% CI: 0.948–0.995) and k-nearest neighbor
   (KNN) (AUC = 0.945; 95% CI: 0.908–0.982) classifiers. The lowest AUC
   was observed for the support vector machine (SVM) classifier (AUC =
   0.833; 95% CI: 0.758–0.907).

Figure 3.

   [108]Figure 3
   [109]Open in a new tab

   The included variables, nomogram, and performance of the 3-DMR logistic
   regression classifier

   (A) Customizing visualizations of included variables in 3-DMR logistic
   regression classifiers. The upper two plots are the genomic coordinates
   of the targeted differentially methylated regions, followed by a line
   chart that shows the mean beta value of all probes in the corresponding
   region among tuberculosis patients (red line) and healthy controls
   (green line). Next, the genomic annotations, including CpG island
   locations and DNAseI hypersensitive sites, were plotted. The data of
   CpG island locations and DNAseI hypersensitive sites were obtained from
   Wu et al.[110]^22 and UCSC Genome Browser. Finally, RefSeq tracks were
   added. (B) Nomogram of the 3-DMR logistic regression classifier. For
   simplicity, region-related genes were employed to represent the
   corresponding differentially methylated regions. LTBP3, TNK2-AS1, and
   HLA-F represent the regions of chr11: 65315205–65315625, chr3:
   195635643–195636243, and chr6: 29691631–29692475, respectively. (C)
   Calibration curve of the 3-DMR logistic regression classifier. (D)
   Receiver operating characteristic of the 3-DMR logistic regression
   classifier. (E) Decision curve analysis curve of the 3-DMR logistic
   regression classifier. The bold red curve shows the benefit net of this
   classifier at different risk thresholds, while the curves on both sides
   represent its 95% confidence interval.

   To avoid overfitting, the largest λ at which the mean squared error was
   within 1 standard error of the minimum (λ = 0.068) was used in the
   process of variable selection by elastic net regression
   ([111]Figure 4A). In addition to the above three DMRs, six DMRs (chr15:
   75743753–75744225, chr6: 30458519–30458601, chr6: 33244976–33246390,
   chr6: 31627090–31627313, chr6: 33283789–33284168, and chr6:
   31937968–31938372) ([112]Figure 4B) were selected by elastic net
   regression. For these nine DMRs, the elastic net regression classifier
   reached a sensitivity of 82.1%, specificity of 86.7%, and AUC of 0.918
   (95% CI: 0.871–0.966) ([113]Figure 4C). The highest AUC was found for
   the random tree classifier (AUC = 1.000, 95% CI: 1.000–1.000), followed
   by the XGBoost (AUC = 0.997; 95% CI: 0.990–1.000) and KNN (AUC = 0.919,
   95% CI: 0.869–0.969) classifiers.

Figure 4.

   [114]Figure 4
   [115]Open in a new tab

   Optimal hyperparameter selection, included variables and receiver
   operating characteristics of the 9-DMR elastic net regression
   classifier

   (A) The cross-validation plot for the optimal λ selection. The red dots
   represent the target mean squared error corresponding to each λ. The
   left dotted line shows the value of λ when MSE was at its minimum,
   while the right dotted line shows the maximum value of λ when MSE was
   within 1 standard error of the minimum. (B) Customizing visualizations
   of included variables in 9-DMR elastic net regression classifier. The
   upper two plots are the genomic coordinates of the targeted
   differentially methylated regions, followed by the line chart which
   shows the mean beta value of all probes in the corresponding region
   among tuberculosis patients (red line) and healthy controls (green
   line). Next, the genomic annotations, including CpG island locations
   and DNAseI hypersensitive sites, were plotted. The data of CpG island
   locations and DNAseI hypersensitive sites were obtained by the
   publication of Wu et al.[116]^22 and UCSC Genome Browser. Finally,
   RefSeq tracks were added. (C) The receiver operating characteristics of
   the 9-DMR elastic net regression classifier.

   The optimal hyperparameters used in each classifier are provided in
   [117]Table S2.

Validation of DMRs by region-specific multiple sequencing

   The methylation levels of the above nine DMRs were further tested.
   Altogether, 62 samples from 31 TB patients and 31 HCs were collected in
   a prospective clinical cohort.

   The regions of chr3: 195635643–195636243 and chr6: 29691631–29692475
   were found to be differentially methylated (p = 4.19 × 10^−5 and 0.024,
   respectively) ([118]Figure 5A). However, no meaningful findings were
   observed in the other seven regions (original data are provided in
   [119]Data S1).

Figure 5.

   [120]Figure 5
   [121]Open in a new tab

   The methylation levels of three differentially methylated regions and
   the performance of the 3-DMR logistic regression classifier in an
   independent cohort

   (A) A violin plot of the methylation levels of three differentially
   methylated regions in an independent cohort. TB, tuberculosis; HC,
   healthy control. (B) The receiver operating characteristics of the
   3-DMR logistic regression classifier in an independent cohort.

   Given that the data are generated in different ways (array and
   sequencing), the cutoff value of the classifier based on microarray
   datasets might not be suitable for sequencing data; therefore, new
   thresholds generated by sequencing data were used. Among all the
   classifiers constructed by different DMRs and modeling methods, the
   3-DMR logistic regression classifier exhibited the highest AUC (0.838;
   95% CI: 0.737–0.938). The specificity of this classifier increased to
   90.3%, while the sensitivity declined slightly but still reached 64.5%
   ([122]Table 3; [123]Figure 5B).

Table 3.

   The performance of built classifiers
   3-DMR classifier
     __________________________________________________________________

   9-DMR classifier
     __________________________________________________________________

   Variable selection method
     __________________________________________________________________

   Logistic regression
     __________________________________________________________________

   Elastic net regression
     __________________________________________________________________

   Modeling method Logistic regression Support vector machine K-nearest
   neighbor Random tree XGBoost Elastic net regression Support vector
   machine K-nearest neighbor Random tree XGBoost
   Discovery set AUC (95% CI) 0.888 (0.831–0.945) 0.833 (0.758–0.907)
   0.945 (0.908–0.982) 0.999 (0.997–1.000) 0.972 (0.948–0.995) 0.918
   (0.871–0.966) 0.849 (0.780–0.918) 0.919 (0.869–0.969) 1.000
   (1.000–1.000) 0.997 (0.990–1.000)
   Sensitivity 79.1% 76.1% 65.7% 100% 92.5% 82.1% 91.0% 86.6% 100% 97.0%
   Specificity 84.4% 77.8% 100% 93.3% 88.9% 86.7% 66.7% 86.7% 100% 100%
   Validation set AUC (95% CI) 0.838 (0.737–0.938) 0.778 (0.663–0.894)
   0.500 (0.355–0.645) 0.750 (0.625–0.875) 0.772 (0.652–0.891) 0.536
   (0.389–0.638) 0.506 (0.357–0.654) 0.516 (0.371–0.662) 0.512
   (0.367–0.657) 0.532 (0.389–0.675)
   Sensitivity 64.5% 61.3% – 93.5% 80.6% 61.3% 77.4% 71% 58.1% 87.1%
   Specificity 90.3% 87.1% – 54.8% 64.5% 58.1% 45.2% 38.7% 45.2% 22.6%
   [124]Open in a new tab

   DMR, different methylation region; XGBoost, extreme gradient boosting;
   AUC, area under curve; CI, confidence interval.

   To facilitate the application of these findings to clinical practice,
   an online classifier based on the 3-DMRs logistic regression was
   designed at [125]https://mengyuan.shinyapps.io/TB_DNAmethylation/. For
   simplicity, in this online tool, region-related genes were employed to
   represent the corresponding DMRs. LTBP3, TNK2-AS1, and HLA-F represent
   the regions of chr11: 65315205–65315625, chr3: 195635643–195636243, and
   chr6: 29691631–29692475, respectively.

Further evaluation of the 3-DMR logistic regression classifier in different
situations

   Considering the serious consequences of TB spread, 35 TB patients and
   32 participants injected with Bacillus Calmette-Guerin (BCG) were
   combined into the same group to decrease the likelihood of missed
   diagnosis. However, differential diagnosis of TB patients and BCG
   recipients is needed. Although the 3-DMR logistic regression classifier
   exhibited moderate AUC (0.689; 95% CI: 0.563–0.816), it could
   distinguish TB patients from BCG participants with a specificity of
   82.9%.

   The differential diagnosis performance of the 3-DMR logistic regression
   classifier was also evaluated. The details of disease controls (DCs)
   are shown in [126]Table S3. The 3-DMR logistic regression classifier
   showed a strong ability to distinguish TB from malaria (sensitivity,
   61.2%; specificity, 87.5%; AUC = 0.778; 95% CI: 0.598–0.959) and
   systemic inflammatory response syndrome (sensitivity, 100%;
   specificity, 94.0%; AUC = 0.955; 95% CI: 0.907–1.000). The sensitivity,
   specificity, and AUC were 100%, 92.5%, and 0.965 (95% CI: 0.929–1.000)
   when using this classifier to differentiate sepsis patients (GEO:
   [127]GSE138074) from TB patients, while they were 100%, 100%, and 1.000
   (95% CI: 1.000–1.000) when distinguishing sepsis patients (GEO:
   [128]GSE58651 or GEO: [129]GSE155952) from TB patients. However, this
   classifier failed to efficiently differentiate TB patients from
   patients with subclinical parasitemia (sensitivity, 100%; specificity,
   41.8%; AUC = 0.631; 95% CI: 0.417–0.844).

   The differential diagnosis ability of the 3-DMR logistic regression
   classifier is shown in [130]Figure 6.

Figure 6.

   [131]Figure 6
   [132]Open in a new tab

   Receiver operating characteristic curves of the 3-DMR logistic
   regression classifier when distinguishing tuberculosis patients from
   other disease controls

   (A) The receiver operating characteristic curve of the 3-DMR logistic
   regression classifier when distinguishing tuberculosis patients from
   malaria cases. (B) The receiver operating characteristic curve of the
   3-DMR logistic regression classifier when distinguishing cases with
   systemic inflammatory response syndrome from tuberculosis patients. (C)
   The receiver operating characteristic curve of the 3-DMR logistic
   regression classifier when distinguishing sepsis patients (GEO:
   [133]GSE138074) from tuberculosis patients. (D) The receiver operating
   characteristic curve of the 3-DMR logistic regression classifier when
   distinguishing sepsis patients (GEO: [134]GSE58651) from tuberculosis
   patients. (E) The receiver operating characteristic curve of the 3-DMR
   logistic regression classifier when distinguishing sepsis patients
   (GEO: [135]GSE155952) from tuberculosis patients. (F) The receiver
   operating characteristic curve of the 3-DMR logistic regression
   classifier when distinguishing tuberculosis patients from patients with
   subclinical parasitemia.

Discussion

   This work represents a comprehensive analysis of TB-related DNA
   methylation biomarkers. By integrating the available datasets, we
   identified TB-related targets (89 DMPs and 24 DMRs). With respect to
   the likelihood and degree of influencing gene expression, 24 DMRs were
   treated as subsequent candidates. Logistic regression and elastic net
   regression were used to select potential biomarkers for further
   validation in an independent cohort. Based on the selected candidates,
   six different methods were applied to construct classifiers, and the
   3-DMR logistic regression classifier outperformed the others. The good
   performance of this classifier in both discovery and validation cohorts
   emphasize the tremendous potential of DNA methylation as a TB
   diagnostic biomarker.

   Over the past few decades, clinicians have expressed a preference for
   obtaining the most direct evidence possible to diagnose diseases.
   Therefore, tissues or body fluid from lesion sites are considered ideal
   sample types; and, certainly, analyses based on these samples are
   regarded as reference methods. However, a high level of invasiveness,
   challenging techniques, and limited applicable populations currently
   constrain the use of such samples for disease diagnosis. In this
   context, peripheral blood has emerged as a focus for researchers. The
   peripheral blood is the environment in which various cytokines and
   immune cells interact in infectious diseases. Therefore, the
   circulatory system can be treated as a source of markers for the host
   immune response[136]^23 and offers information about when and how
   disease progresses. The exploration of diagnostic biomarkers in the
   peripheral blood has encompassed genomics, transcriptomics, and
   epigenetics studies. However, genomics and transcriptomics do not
   provide an appropriate balance between molecular stability and flexible
   monitoring of disease progression. DNA methylation perfectly offsets
   the above two limitations. Thanks to technological developments, many
   choices for detection methods are available, which also contributes
   greatly to the clinical application of DNA methylation analysis. These
   advantages have prompted the deep exploration of DNA methylation in
   infectious diseases.

   To clarify how peripheral blood biomarkers are regulated and determine
   their functions in targeted disease, it is essential to demonstrate
   their roles as biomarkers of disease progression. The association
   between DNA methylation and TB was reported as early as 1980.[137]^24
   Although the question of whether the change in host genomic methylation
   is caused by MTB itself or secondary to inflammation remains
   unanswered, some progress has been made to decode the functions of
   these changes in DNA methylation. By summarizing and comparing the
   characteristics of different methylation sites before and after host
   infection with MTB, scholars found that significantly changed
   candidates were mainly located in promoter regions. Methylation in the
   promoter region can lead to the failure of gene transcription
   initiation and gene silencing the by hampering transcription factor
   (TF) binding to specific motifs.[138]^25^,[139]^26 Changes in gene
   activity and expression will further affect a series of signaling
   pathways and biological functions, including immune cell regulation,
   cytokine regulation, IFN-γ signaling pathways, and other TB-related
   immune activities,[140]^21^,[141]^27 consistent with the results of
   this study. As mentioned above, methylation is able to interfere with
   TF binding, while TFs are also capable of regulating methylation
   status. TF occupancy theory suggests that TFs can modulate gene
   methylation levels by competing with DNA methyltransferase binding at
   promoter sequences.[142]^26 Pacis et al.[143]^28 cultured MTB-infected
   human dendritic cells and collected time-dependent data on DNA
   methylation, gene expression, and chromatin accessibility patterns. The
   resulting data verified that immune-related TFs could regulate
   methylation levels by binding to cis-acting elements. Notably, not all
   TFs are negatively regulated by DNA methylation, and positive
   correlations between them have also been reported.[144]^29
   Collectively, the existing data show that DNA methylation has crucial
   roles in the pathogenesis and development of TB. Mapping TB-specific
   methylation and further selecting promising sites may open a new
   horizon for TB elimination to some extent.

   The understanding of how DNA methylation functions increases its
   potential as a promising biomarker. Over time, a number of TB-specific
   DNA methylation biomarkers, from the global DNA methylation level to
   detailed methylated sites or regions, have been reported. Maruthai
   et al.[145]^30 took the global DNA methylation level as a TB-specific
   biomarker and reported an AUC of 0.81. Their group also focused on the
   VDR promoter methylation level and the median DNA methylation level of
   Alu repetitive elements and found that these two indicators exhibited
   good performance for diagnosing pediatric TB (AUC = 0.977 and 0.969,
   respectively).[146]^31^,[147]^32 However, these results were obtained
   from relatively small cohorts (76 and 68 children). The team of Das and
   Chen reported some differentially methylated sites in TLR2, SNX26, and
   other genes but did not analyze the diagnostic values of these
   sites.[148]33, [149]34, [150]35 Wang et al.[151]^36 demonstrated
   meaningfully methylated sites in key genes in the vitamin D metabolic
   pathway and further assessed the diagnostic capacity of these genes by
   calculating the cumulative methylation level of CpG sites on these
   genes via four different methods. The AUCs of candidate genes varied
   from 0.578 to 0.794. Esterhuyse et al.[152]^9 incorporated different
   CpG sites by machine learning to distinguish active TB from latent TB
   infection and obtained a model with an AUC of 0.74. However, these
   methylation data were generated from only monocytes and granulocytes,
   which might restrict the applicability of the model. The high AUCs of
   these candidate loci underscores the potential of DNA methylation as a
   TB diagnostic biomarker. However, these biomarkers were identified in
   relatively small cohorts or specific cell types. Moreover, current
   studies tend to use a single CpG site or CpG sites in a single gene as
   a biomarker, which implies limited diagnostic power and
   generalizability. Comprehensive and systematic work is particularly
   necessary to avoid the above problems and further develop the
   diagnostic potential of DNA methylation. Therefore, we first combined
   all available datasets to properly expand the sample size, cover more
   ethnic populations, and target more sample types (whole blood,
   peripheral blood mononuclear cells, etc.). Through rigorous data
   analysis, the high population coverage and many different sample types
   ensure the wide applicability of biomarkers. Compared with CpG sites,
   methylated regions are more likely to regulate gene expression to a
   greater extent and thus were chosen as biomarkers in this work. In
   addition, each single biomarker has its own limitations, which will be
   particularly magnified in complex and changeable clinical settings.
   Constructing models to incorporate different biomarkers can resolve
   these questions effectively and maximize the underrated power of
   TB-specific DNA methylation biomarkers. We tested various combinations
   of variable selection methods and modeling methods to build different
   classifiers, which offered us a better opportunity to select a
   high-performing classifier. Such efforts were finally realized in the
   3-DMR logistic regression classifier, which exhibited superior
   performance in both the discovery and validation cohorts. Fewer
   included variables, excellent capacity, strong generalizability, and
   availability as a convenient online tool enable this classifier to be
   used in real clinical applications.

   Understanding why this 3-DMR logistic regression classifier possesses
   such capacity and differential diagnostic ability is crucial to its
   optimal utilization. The regions of chr3: 195635643–195636243 and chr6:
   29691631–29692475 occupied a dominant position in this classifier. The
   region of chr3: 195635643–195636243 is at the position of TNK2-AS1.
   TNK2-AS1 is closely related to cell proliferation, invasion and
   apoptosis in cancer.[153]^37^,[154]^38 TNK2-AS1 acts as a microRNA
   sponge and, to date, miR-4319 and miR-150-5p have been reported as
   targets of TNK2-AS1.[155]^38^,[156]^39 Our previous study suggested
   that miR-150-5p was a promising biomarker for TB diagnosis,[157]^40 and
   mounting evidence demonstrates that miR-150-5p contributes greatly to
   immune cell differentiation and the capacity of effector CD8+ T cells
   to kill infected cells.[158]^41 Herein, we hypothesized that the
   interaction between MTB and the host influences TNK2-AS1 expression by
   regulating its methylation level, while altered TNK2-AS1 expression
   triggers a series of downstream pathways through a ceRNA mechanism.

   The region of chr6: 29691631–29692475 is located within the promoter
   region of HLA-F. In the late 1980s, Geraghty et al.[159]^42 first
   reported HLA-F as a nonclassical class I antigen in the human leukocyte
   antigen (HLA) family. Due to their prominent role in antigen
   presentation and immune regulation, many members of the HLA family have
   received ongoing attention from researchers,[160]^43 but HLA-F has been
   less well studied. However, the current evidence suggests that HLA-F is
   an underappreciated player in immune regulation.[161]^44 Changes in
   HLA-F at the genome, transcription, or epigenetic level can fuel a
   series of immune alterations and are thus involved in cancer,
   infection, autoimmunity, and other disorders.[162]^45 In our work, TB
   patients exhibited decreased methylation levels in the promoter region
   of HLA-F, also indicating elevated HLA-F mRNA expression. Increased
   HLA-F mRNA expression had positive impacts on its interaction with
   immune receptors (KIR3DL1, KIR3DS1, etc.), cytokine production (CCL4,
   IFN-γ, etc.), and antigen presentation, which have already been
   verified in other infectious diseases. Lunemann et al.[163]^46
   documented that through binding to KIR3DS1, increased HLA-F promoted
   natural killer cell cytolysis and cytokine production and thus
   suppressed hepatitis C virus replication. This was also observed in
   HIV-1 infection.[164]^47 Based on the above, we speculated that the
   host responded to MTB infection by downregulating the methylation level
   of the promoter region in HLA-F, elevating HLA-F mRNA expression and
   then promoting immune cell activation to kill the bacteria. The CpG
   sites in the HLA-F promoter region may serve as promising targets to
   enhance host immunity and achieve precise MTB clearance.

   Aside from its diagnostic ability, generalizability, detection time,
   cost, and feasibility in economically poor areas must also be
   considered in the transition of this classifier to clinical practice.
   In terms of generalizability, data generated from different sample
   types and ethnicities were included to increase the generalizability of
   the classifiers. The limited sample size in subgroups of different
   ethnicities restricted the reliability of related findings, whereas
   relevant results were presented in this paper to provide potential
   clues. For sample type subgroups, principal-component analysis
   indicated that 24 regions in 5 subgroups exhibited similar methylation
   patterns. This suggests that the 3-DMR logistic regression classifier
   has the potential to be used in these five sample types. The excellent
   performance of the 3-DMR logistic regression classifier in whole blood
   samples from an independent cohort supports the above speculation.
   Although different blood cells have their own epigenetic patterns, they
   are all found in the circulation and thus may be stimulated by the same
   factors and communicate frequently with each other, leading to
   similarities in their epigenetic patterns.[165]^48^,[166]^49 In terms
   of cost and detection time, the use of whole blood as samples allows a
   shorter detection time and lowers costs because it does not require
   complex cell sorting, gradient density centrifugation, or other sample
   processes. However, sample processing accounts for a limited part of
   the total detection time and costs, while detection assays greatly
   influence detection time and cost. Assays based on bisulfite
   conversion, restriction enzymes, and affinity enrichment are currently
   applied to analyze specific methylated regions and, among these,
   bisulfite conversion is the most widely used.[167]^50 Polymerase chain
   reaction (PCR), microarrays, and sequencing are utilized to read
   methylation information after bisulfite conversion.[168]^51 PCR is
   suitable for primary health facilities due to its low cost, short
   detection time, and simple operation; however, limited sensitivity and
   poor ability for methylated region detection should also be taken
   seriously. Microarrays and sequencing overcome the above shortcomings
   of PCR but at the expense of high costs and rigorous laboratory
   conditions. When choosing different detection methods, it should be
   considered that these methods had different sensitivities and thus
   different captured efficiencies of CpG sites in targeted regions, which
   was also observed in this study. Unfortunately, we have not proposed a
   standardized process to address this problem. In addition, the
   differential diagnostic ability of the targeted classifier was
   evaluated in relatively small populations, and more reliable evidence
   is needed. Herein, much work remains to be done before applying a novel
   diagnostic pattern in clinical practice.

   Collectively, the available evidence suggests that DNA methylation
   might be a kind of biomarker for TB diagnosis. Among all classifiers
   tested here, the 3-DMR logistic regression classifier presented
   excellent performance in both the discovery and validation datasets.
   This classifier might provide insights into how DNA methylation
   biomarkers could fit into future TB diagnosis and allow TB patients to
   be shielded from disease progress by timely diagnosis. Further decoding
   the profile of DNA methylation could offer crucial hints related to
   TB development, a research direction that requires an increased
   emphasis on TB.

Materials and methods

Data preparation

   DNA methylation array datasets were searched in the NCBI Gene
   Expression Omnibus database (GEO) and European Bioinformatics Institute
   ArrayExpress from their inception to October 3, 2020. The search terms
   used were (“tuberculosis” OR “TB”) AND (“methylation” OR “methylate”),
   with restrictions on species (Homo sapiens) and sample type (peripheral
   blood or its components). Considering the influence of complex body
   interactions on methylation patterns,[169]^52 datasets generated based
   on in vitro infection cell models, such as GEO: [170]GSE83379, were
   excluded. Datasets without raw IDAT or CEL files, such as GEO:
   [171]GSE50835, were also excluded. The search strategy for DCs is
   described in [172]Data S1.

   Two authors (Lyu and Zhou) were responsible for reviewing the eligible
   datasets and extracting the design information of each included
   dataset. The extracted content included the demographic characteristics
   of the subjects (age, sex, race, etc.), number and type of samples,
   detection platform used, etc. Any discrepancy was resolved by
   discussion.

Raw data processing

   To decrease the impact of using different probe filtering conditions
   and data processing methods, the raw files for the included datasets
   were downloaded and processed in the same way. The Minfi v.1.30.0
   package[173]^53 was applied to handle the data. The p value of the
   probe was calculated to assess the reliability of the probe signal, and
   a probe was filtered out if its p value was more than 0.01 in any
   sample.[174]^53 Samples with poor quality, such as GEO:
   [175]GSM1860483, were excluded according to the mean p values of all
   probes for each sample. In addition, probes with single-nucleotide
   polymorphisms at CpG sites or on sex chromosomes were excluded.
   Preprocessing, background noise reduction, and normalization were also
   performed.

DMP and DMR analysis

   The batch effects among different datasets were adjusted by the sva
   v.3.32.1 package.[176]^54 Both beta and M values were calculated to
   assess the methylation level of each CpG site in each sample. Based on
   the M value matrix,[177]^55 the Limma package v.3.40.6[178]^56 was used
   to assess DMPs with the standard of |FC| > 1.5 and p < 0.05.

   For DMRs, the DMRcate package v.1.20.0[179]^57 was employed to identify
   methylated regions and perform differential methylation analysis.
   Methylated regions were identified based on a Gaussian kernel, and DMRs
   were found by tunable kernel smoothing of the differential methylation
   signal.[180]^57 The region methylation level was assessed on the basis
   of the mean beta value of all CpG sites in the corresponding region.
   Based on 363,416 probes, methylated regions with a false discovery
   rate < 0.05 were regarded as meaningful regions. The differential
   analysis was also performed in different ethnicities and sample types.
   DMR annotation was carried out by an online tool, wANNOVAR:
   [181]http://wannovar.wglab.org.

Functional analysis

   GO enrichment analysis and KEGG pathway enrichment analysis were
   performed for DMR-related genes. A PPI network was constructed to
   identify the hub genes among DMR-associated genes by an online tool,
   STRING: [182]https://string-db.org/. The node score suggested the
   importance of this node in the whole PPI network. The interactions of
   the combined score were set at 0.4.

Diagnostic classifier development

   Incorporating DMRs into classifiers might open the door to new methods
   for clinical diagnosis of TB. Logistic regression and elastic net
   regression were used to select classifier variables and construct
   classifiers. All DMRs underwent binary univariate and multivariable
   logistic regression. DMRs with p < 0.05 in the final multivariable
   logistic regression were incorporated. For elastic net regression, the
   key parameter λ was selected by K-fold cross-validation. In addition to
   the above two methods, SVM, KNN, random tree, and XGBoost were also
   used to construct classifiers, and optimal hyperparameters of these
   modeling methods were chosen by grid search or cross-validation. Grid
   search was realized by GridSearchCV, which evaluates and compares all
   scores by adjusting a series of parameter values and outputs the
   optimal values of the parameters that generated the highest score.
   Cross-validation divides all data into K parts, calculates the average
   verification accuracy, and selects the optimal parameter when the
   verification accuracy reached the best value. The Youden index was
   calculated to determine suitable cutoff values. Sensitivity,
   specificity, and AUC were used to assess the diagnostic capacity of the
   classifiers. Receiver operating characteristic curves were plotted to
   visually present classifier performance.

Validation cohort recruitment and sample preparation

   A total of 62 Chinese participants admitted to West China Hospital of
   Sichuan University between January 2019 and October 2020 were enrolled.
   TB patients met the following criteria: (1) diagnosed with TB according
   to the Diagnostic Criteria for Tuberculosis (WS 288-2008);[183]^58 (2)
   age older than 18 years; and (3) free of other lung diseases (lung
   cancer, chronic obstructive pulmonary disease, etc.), liver diseases
   (hepatitis, hepatocellular carcinoma, etc.), metabolic diseases
   (diabetes mellitus, hyperlipidemia, etc.), and autoimmune diseases.
   Pregnant women were excluded unless specifically indicated. HCs with
   negative results on TB-related examinations and no history of TB were
   recruited. The TB and HC groups were age and sex matched. All
   participants signed written informed consents.

   EDTA-treated whole blood (3.0 mL) was collected from each subject, and
   genomic DNA was extracted (Kuang Yuan Diagnostics Q1001, China)
   according to the manufacturer’s protocol. DNA purity was measured by
   spectrometry at 260/280 nm. The extracted DNA was stored at −80°C.

   The protocol of this study was approved by the Clinical Trials and
   Biomedical Ethics Committee of West China Hospital, Sichuan University
   (registration number in the Chinese Clinical Trial Registry:
   ChiCTR1900028670).

Region-specific multiple sequencing and data analysis

   To verify DMRs found in array analysis and test classifier performance
   and generalization, region-specific multiple sequencing was performed.
   Ultrasound was used to fragment the genomic double-stranded DNA into
   pieces of 300 bp. After purification with 2× magnetic beads, DNA was
   bisulfite-treated with an EZ DNA Methylation-Gold Kit (Zymo Research
   D5005, USA). Bisulfite-converted DNA samples further underwent
   phosphorylation modification by T4 polynucleotide kinase (Thermo
   Scientific EK0031, USA) and the addition of 5′ adapters.
   Bisulfite-specific PCR was conducted to amplify regions of interest
   (primers used in this step are listed in [184]Table S1) using EpiTaq HS
   (TaKaRa R110A, Japan). Then, Taq (TaKaRa R001WZ, Japan) was used for a
   second PCR to add sample barcodes and sequencing adapters. PCR
   amplicons were visualized by gel electrophoresis, purified by 1.2×
   magnetic beads and sequenced on an Illumina Sequencer.

   Raw data were subjected to quality control and read trimming. Bismark
   was applied to align trimmed reads to the reference genome sequence and
   calculate the percentage of 5-methylated cytosine (5-mC). According to
   DMRfinder, the methylation degree of each region was assessed by the
   following formula: methylation fraction of a region = sum of methylated
   counts at CpG sites within a region/sum of methylated and unmethylated
   counts at CpG sites within a region. Differences between TB patients
   and HCs were evaluated by Limma package v.3.40.6.

Acknowledgments