Abstract We systemically identified tuberculosis (TB)-related DNA methylation biomarkers and further constructed classifiers for TB diagnosis. TB-related DNA methylation datasets were searched through October 3, 2020. Limma and DMRcate were employed to identify differentially methylated probes (DMPs) and regions (DMRs). Machine learning methods were used to construct classifiers. The performance of the classifiers was evaluated in discovery datasets and a prospective independent cohort. Eighty-nine DMPs and 24 DMRs were identified based on 67 TB patients and 45 healthy controls from 4 datasets. Nine and three DMRs were selected by elastic net regression and logistic regression, respectively. Among the selected DMRs, two regions (chr3: 195635643–195636243 and chr6: 29691631–29692475) were differentially methylated in the independent cohort (p = 4.19 × 10^−5 and 0.024, respectively). Among the ten classifiers, the 3-DMR logistic regression classifier exhibited the strongest performance. The sensitivity, specificity, and area under the curve were, respectively, 79.1%, 84.4%, and 0.888 in the discovery datasets and 64.5%, 90.3%, and 0.838 in the independent cohort. The differential diagnostic ability of this classifier was also assessed. Collectively, these data showed that DNA methylation might be a promising TB diagnostic biomarker. The 3-DMR logistic regression classifier is a potential clinical tool for TB diagnosis, and further validation is needed. Keywords: tuberculosis, DNA methylation, biomarker, classifier, molecular diagnosis Graphical abstract graphic file with name fx1.jpg [37]Open in a new tab __________________________________________________________________ We systemically identified TB-related DNA methylation biomarkers and constructed diagnostic classifiers. A 3-DMR logistic regression classifier showed excellent performance in discovery datasets and a prospective cohort. DNA methylation might be a promising TB-related biomarker and the 3-DMR logistic regression classifier is a potential clinical tool for TB diagnosis. Introduction The complex nature of Mycobacterium tuberculosis (MTB) has greatly contributed to the continuous effects of tuberculosis (TB), the world’s top infectious killer, on human populations for thousands of years.[38]^1 Globally, in 2019, approximately 10 million people fell ill with TB, and 1.4 million people died of TB.[39]^2 Luckily, TB is a preventable and curable illness, and the key to thwarting the TB epidemic is early diagnosis.[40]^3 The detection power of etiological tests (Xpert MTB/RIF, culture, etc.) largely depends on sample quality.[41]^4 Traditional host immune response assays (interferon-γ [IFN-γ] release assays, tuberculin tests, etc.) work mainly by measuring the products of the host immune response. Most of these products are downstream molecules in various biological pathways, suggesting that these molecules are regulated by various factors.[42]^5 In addition to the limited performance of the available detection methods, certain features of TB also increase the difficulty of its clinical diagnosis. For example, incipient TB is characterized by an indeterminate period of asymptomatic infection or absence of typical clinical symptoms and imaging features, which limit the effectiveness of TB diagnosis.[43]^6 Therefore, new and more powerful detection tools are urgently needed. Current evidence has indicated that the detection of non-sputum blood biomarkers of TB is a preferred approach for clinical diagnosis and progression surveillance.[44]^7^,[45]^8 TB is an infectious disease whose pathogenesis involves dynamic interactions among the host, MTB, and the environment. The circulatory system serves as a site for cellular communication and dynamic exchange of various cellular factors and chemokines, and thus peripheral blood has been considered the preferred sample type for studying and diagnosing infectious diseases.[46]^9 To date, peripheral blood biomarkers of TB at the genomic, transcriptomic, epigenetic, and proteomic levels have received enormous attention, and numerous markers have been identified.[47]^7^,[48]9, [49]10, [50]11 Of note, epigenetics can bridge the gaps between the host, MTB, and the environment.[51]^12^,[52]^13 Therefore, epigenetic biomarkers in the peripheral blood may have great potential in TB diagnosis and progression surveillance. DNA methylation, the most widely studied epigenetic modification, refers to the formation of 5-methylcytosine through the transfer of a methyl group to the carbon-5 position of the cytosine base by DNA methyltransferase enzymes.[53]^14 DNA methylation is responsible for regulating gene expression, chromatin structure, and alternative splicing.[54]^15^,[55]^16 In human somatic cells, methylation occurs in approximately 90% of the cytosines in CpG sites; most importantly, this ratio varies among different tissue types, cell types, and disease states.[56]^15 As the advantages of DNA methylation include critical roles in biological processes, disease/tissue/cell-type-specific patterns, and inherent stability, specific DNA methylation profiles have been mapped for many diseases, such as breast cancer, adiposity, and others.[57]^17^,[58]^18 Some pathogens, such as human immunodeficiency virus and hepatitis B virus, have been verified to alter the DNA methylation landscape of the host;[59]^19^,[60]^20 MTB has also been reported to have such activity. DiNardo et al.[61]^21 found that the methylation level of mycobacterial immunity-related genes was upregulated, leading to a downregulation of the host immune response against MTB. These findings indicated that DNA methylation is important in TB development and may serve as a promising biomarker. However, systematic work to explore the value of DNA methylation in TB has thus far been less common. Toward this end, we (1) summarized available datasets to systematically describe the DNA methylation characteristics of TB and identify potential biomarkers; (2) applied machine learning methods to construct TB diagnostic classifiers by triaging a parsimonious list of the most promising targets; and (3) performed validation studies in an independent cohort to ensure that these molecules are robust as TB-related biomarkers. This study aimed to establish TB-related methylation profiles and facilitate TB diagnosis, as well as decipher the underlying connection between TB and DNA methylation. Results The flow chart of the study protocol is shown in [62]Figure 1. Figure 1. [63]Figure 1 [64]Open in a new tab The flow chart of the study protocol Abbreviations: TB, tuberculosis; DMPs, differentially methylated probes; DMRs, differentially methylated regions; HC, healthy control. Different DNA methylation patterns in TB patients and healthy controls Altogether, 4 datasets, including 67 TB patients and 45 healthy controls (HCs), met the inclusion criteria ([65]Table 1). A total of 363,416 probes were retained for analysis after filtering and adjusting for batch effects. In total, 89 differentially methylated probes (DMPs) were found between TB patients and HCs ([66]Figure 2A). Mapping of the 89 DMPs onto genomic features showed that the most probes targeted intronic features (29%, 26/89), followed by exons (21%, 19/89) and intergenic features (20%, 18/89) ([67]Figure 2C). Among the 89 DMPs, 68.5% (61/89) sites were hypermethylated and 31.5% (28/89) were hypomethylated in TB patients compared with HCs ([68]Figure 2E). Table 1. The detailed information of included methylation arrays GEO ID Platform Characteristics of included subjects __________________________________________________________________ Sample numbers __________________________________________________________________ Sample type Race Mean age (years) Sex (male/female, %) HIV status TB HC [69]GSE118469 [70]GPL13534 Asian >18 100 0 15 6 PBMC [71]GSE104287 [72]GPL13534 Caucasian >18 62.5 0 32 16 PBMC and NK cell [73]GSE72338 [74]GPL13534 African >18 63.2 0 17[75]^a 20 monocyte and neutrophil [76]GSE107917 [77]GPL23976 Asian N/A N/A N/A 3 3 whole blood [78]Open in a new tab GEO, Gene Expression Omnibus; HIV, human immunodeficiency virus; TB, tuberculosis; HC, healthy control; PBMC, peripheral blood mononuclear cell; N/A, not applicable; NK, natural killer. ^a One sample ([79]GSM1860484) was not included due to the filter conditions used in raw data procession. Figure 2. [80]Figure 2 [81]Open in a new tab Differently methylated probes and differentially methylated regions between tuberculosis patients and healthy controls (A and B) The Manhattan plot of all methylated probes and regions, respectively. The ordinate and abscissa represented the −log[10] p value and chromosomes, respectively. The different colors above the abscissa show the probe numbers in corresponding chromosomes and the horizontal line indicates the threshold value of p. (C and D) The types of differentially methylated probe-related genes and differentially methylated region-related genes, respectively. ncRNA, non-coding RNA. (E and F) The heatmap of differentially methylated probes and differentially methylated regions between tuberculosis patients and healthy controls, respectively. The method of correlation was used for row clustering. Further analysis was performed to identify ethnicity-specific and sample type-specific DMPs ([82]Figure S1). We then shifted focus to the methylation regions with the strongest prospective ability to regulate gene expression. Twenty-seven differentially methylated regions (DMRs) covering 310 CpG sites were identified ([83]Figure 2B). Three regions were excluded because only one CpG site was present in each region. Of the 24 DMRs, genomic annotation showed that 42% (10/24) were located in exon regions, and 25% (6/24) were located in 5′ UTRs ([84]Figure 2D). The locations of DMRs exhibited a chromosomal bias, and nearly half of the DMRs (10/24) were distributed on chromosome 6. Of all DMRs, only one (chr3: 50336343–50337494) mapped to two genes (HYAL3 and NAT6). According to the mean fold change (FC) of the β value, TB patients showed significant hypermethylation compared with HCs in 22 regions and hypomethylation in 2 regions ([85]Table 2; [86]Figure 2F). Table 2. Identified top different methylation regions between TB patients and HCs Chromosome Genetic position[87]^a __________________________________________________________________ Gene symbol CpG site number Mean beta fold change Adjusted p value Start End Chr1 160068509 160068681 IGSF8 6 1.22 × 10^−2 1.06 × 10^−8 Chr3[88]^b^,[89]^c 195635643 195636243 TNK2-AS1 2 9.36 × 10^−3 1.17 × 10^−9 Chr3 50336343 50337494 HYAL3; NAT6 17 7.01 × 10^−4 4.13 × 10^−16 Chr6 33160067 33160976 COL11A2 19 4.60 × 10^−3 3.96 × 10^−18 Chr6[90]^c 30458519 30458601 HLA-E 3 −3.76 × 10^−2 5.74 × 10^−9 Chr6 44190729 44191600 SLC29A1 10 6.24 × 10^−4 1.88 × 10^−11 Chr6[91]^c 33244976 33246390 B3GALT4 37 1.19 × 10^−2 7.05 × 10^−20 Chr6[92]^c 31627090 31627313 C6orf47 6 1.04 × 10^−2 6.81 × 10^−9 Chr6[93]^c 33283789 33284168 ZBTB22 10 8.26 × 10^−3 3.96 × 10^−9 Chr6 29600108 29600468 GABBR1 10 2.66 × 10^−3 3.30 × 10^−9 Chr6[94]^c 31937968 31938372 DXO 6 4.98 × 10^−3 7.91 × 10^−10 Chr6[95]^b^,[96]^c 29691631 29692475 HLA-F 21 −1.10 × 10^−2 2.75 × 10^−12 Chr6 30711586 30712559 IER3 23 5.67 × 10^−5 3.03 × 10^−11 Chr8 144328914 144329279 ZFP41 7 4.18 × 10^−3 2.05 × 10^−9 Chr11[97]^b^,[98]^c 65315205 65315625 LTBP3 4 2.28 × 10^−2 3.63 × 10^−9 Chr11 2019930 2020560 H19 15 1.65 × 10^−2 3.12 × 10^−10 Chr11 66034896 66035392 KLC2 14 3.86 × 10^−3 1.98 × 10^−9 Chr12 133065912 133066762 FBRSL1 18 5.40 × 10^−3 1.22 × 10^−15 Chr15[99]^c 75743753 75744225 SIN3A 8 6.00 × 10^−3 8.06 × 10^−11 Chr16 11350112 11350371 SOCS1 6 6.65 × 10^−3 1.28 × 10^−10 Chr17 7210796 7211307 EIF5A 6 6.65 × 10^−3 8.61 × 10^−12 Chr19 55850629 55851365 KMT5C 12 3.31 × 10^−3 2.55 × 10^−13 Chr20 57426538 57427973 GNAS 38 1.59 × 10^−2 2.88 × 10^−21 Chr22 38851318 38852154 KCNJ4 9 1.06 × 10^−2 1.26 × 10^−16 [100]Open in a new tab ^a The positions were obtained according to hg19, GRCh37 (Genome Reference Consortium Human Reference 37). ^b These three different methylation regions were selected by logistic regression into models. ^c These nine different methylation regions were selected by elastic net regression into models. Principal-component analysis was conducted to explore the difference in 24 DMRs among ethnicity-specific and sample type-specific subgroups ([101]Figure S2). Function enrichment of DMRs Gene ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis were performed for 25 DMR-associated genes. GO analysis indicated that DMR-associated genes contributed to many immune-related biological functions, such as immune cell activation and regulation, cellular response to IFN-γ, and cytotoxicity ([102]Figure S3A). KEGG analysis suggested that these 25 genes mainly participated in antigen processing and presentation and pathogen infection-related pathways (cytomegalovirus, papillomavirus, etc.) ([103]Figure S3B). A protein-protein interaction (PPI) network was built to visualize the interactions among the proteins encoded by these 25 genes. HLA-F, ZBTB22, SIN3A, and GABBR1 were considered hub genes in the constructed network ([104]Figure S3C). Construction of diagnostic classifiers based on the validated DMRs Variables were selected by logistic regression and elastic net regression. A total of six machine learning methods were applied to construct classifiers based on the selected variables. Through binary univariate and multivariate logistic regression, three DMRs (chr11: 65315205–65315625, chr3: 195635643–195636243, and chr6: 29691631–29692475) were finally selected as classifiers ([105]Figures 3A and 3B). For these three DMRs, the logistic regression classifier yielded a sensitivity of 79.1%, specificity of 84.4%, and area under the curve (AUC) of 0.888 (95% confidence interval [CI]: 0.831–0.945) ([106]Figures 3C and 3D). This classifier also showed a net benefit in performance regardless of the risk threshold selected ([107]Figure 3E). The highest AUC was 0.999 (95% CI: 0.997–1.000) for the random tree classifier, followed by that for the extreme gradient boosting (XGBoost) (AUC = 0.972; 95% CI: 0.948–0.995) and k-nearest neighbor (KNN) (AUC = 0.945; 95% CI: 0.908–0.982) classifiers. The lowest AUC was observed for the support vector machine (SVM) classifier (AUC = 0.833; 95% CI: 0.758–0.907). Figure 3. [108]Figure 3 [109]Open in a new tab The included variables, nomogram, and performance of the 3-DMR logistic regression classifier (A) Customizing visualizations of included variables in 3-DMR logistic regression classifiers. The upper two plots are the genomic coordinates of the targeted differentially methylated regions, followed by a line chart that shows the mean beta value of all probes in the corresponding region among tuberculosis patients (red line) and healthy controls (green line). Next, the genomic annotations, including CpG island locations and DNAseI hypersensitive sites, were plotted. The data of CpG island locations and DNAseI hypersensitive sites were obtained from Wu et al.[110]^22 and UCSC Genome Browser. Finally, RefSeq tracks were added. (B) Nomogram of the 3-DMR logistic regression classifier. For simplicity, region-related genes were employed to represent the corresponding differentially methylated regions. LTBP3, TNK2-AS1, and HLA-F represent the regions of chr11: 65315205–65315625, chr3: 195635643–195636243, and chr6: 29691631–29692475, respectively. (C) Calibration curve of the 3-DMR logistic regression classifier. (D) Receiver operating characteristic of the 3-DMR logistic regression classifier. (E) Decision curve analysis curve of the 3-DMR logistic regression classifier. The bold red curve shows the benefit net of this classifier at different risk thresholds, while the curves on both sides represent its 95% confidence interval. To avoid overfitting, the largest λ at which the mean squared error was within 1 standard error of the minimum (λ = 0.068) was used in the process of variable selection by elastic net regression ([111]Figure 4A). In addition to the above three DMRs, six DMRs (chr15: 75743753–75744225, chr6: 30458519–30458601, chr6: 33244976–33246390, chr6: 31627090–31627313, chr6: 33283789–33284168, and chr6: 31937968–31938372) ([112]Figure 4B) were selected by elastic net regression. For these nine DMRs, the elastic net regression classifier reached a sensitivity of 82.1%, specificity of 86.7%, and AUC of 0.918 (95% CI: 0.871–0.966) ([113]Figure 4C). The highest AUC was found for the random tree classifier (AUC = 1.000, 95% CI: 1.000–1.000), followed by the XGBoost (AUC = 0.997; 95% CI: 0.990–1.000) and KNN (AUC = 0.919, 95% CI: 0.869–0.969) classifiers. Figure 4. [114]Figure 4 [115]Open in a new tab Optimal hyperparameter selection, included variables and receiver operating characteristics of the 9-DMR elastic net regression classifier (A) The cross-validation plot for the optimal λ selection. The red dots represent the target mean squared error corresponding to each λ. The left dotted line shows the value of λ when MSE was at its minimum, while the right dotted line shows the maximum value of λ when MSE was within 1 standard error of the minimum. (B) Customizing visualizations of included variables in 9-DMR elastic net regression classifier. The upper two plots are the genomic coordinates of the targeted differentially methylated regions, followed by the line chart which shows the mean beta value of all probes in the corresponding region among tuberculosis patients (red line) and healthy controls (green line). Next, the genomic annotations, including CpG island locations and DNAseI hypersensitive sites, were plotted. The data of CpG island locations and DNAseI hypersensitive sites were obtained by the publication of Wu et al.[116]^22 and UCSC Genome Browser. Finally, RefSeq tracks were added. (C) The receiver operating characteristics of the 9-DMR elastic net regression classifier. The optimal hyperparameters used in each classifier are provided in [117]Table S2. Validation of DMRs by region-specific multiple sequencing The methylation levels of the above nine DMRs were further tested. Altogether, 62 samples from 31 TB patients and 31 HCs were collected in a prospective clinical cohort. The regions of chr3: 195635643–195636243 and chr6: 29691631–29692475 were found to be differentially methylated (p = 4.19 × 10^−5 and 0.024, respectively) ([118]Figure 5A). However, no meaningful findings were observed in the other seven regions (original data are provided in [119]Data S1). Figure 5. [120]Figure 5 [121]Open in a new tab The methylation levels of three differentially methylated regions and the performance of the 3-DMR logistic regression classifier in an independent cohort (A) A violin plot of the methylation levels of three differentially methylated regions in an independent cohort. TB, tuberculosis; HC, healthy control. (B) The receiver operating characteristics of the 3-DMR logistic regression classifier in an independent cohort. Given that the data are generated in different ways (array and sequencing), the cutoff value of the classifier based on microarray datasets might not be suitable for sequencing data; therefore, new thresholds generated by sequencing data were used. Among all the classifiers constructed by different DMRs and modeling methods, the 3-DMR logistic regression classifier exhibited the highest AUC (0.838; 95% CI: 0.737–0.938). The specificity of this classifier increased to 90.3%, while the sensitivity declined slightly but still reached 64.5% ([122]Table 3; [123]Figure 5B). Table 3. The performance of built classifiers 3-DMR classifier __________________________________________________________________ 9-DMR classifier __________________________________________________________________ Variable selection method __________________________________________________________________ Logistic regression __________________________________________________________________ Elastic net regression __________________________________________________________________ Modeling method Logistic regression Support vector machine K-nearest neighbor Random tree XGBoost Elastic net regression Support vector machine K-nearest neighbor Random tree XGBoost Discovery set AUC (95% CI) 0.888 (0.831–0.945) 0.833 (0.758–0.907) 0.945 (0.908–0.982) 0.999 (0.997–1.000) 0.972 (0.948–0.995) 0.918 (0.871–0.966) 0.849 (0.780–0.918) 0.919 (0.869–0.969) 1.000 (1.000–1.000) 0.997 (0.990–1.000) Sensitivity 79.1% 76.1% 65.7% 100% 92.5% 82.1% 91.0% 86.6% 100% 97.0% Specificity 84.4% 77.8% 100% 93.3% 88.9% 86.7% 66.7% 86.7% 100% 100% Validation set AUC (95% CI) 0.838 (0.737–0.938) 0.778 (0.663–0.894) 0.500 (0.355–0.645) 0.750 (0.625–0.875) 0.772 (0.652–0.891) 0.536 (0.389–0.638) 0.506 (0.357–0.654) 0.516 (0.371–0.662) 0.512 (0.367–0.657) 0.532 (0.389–0.675) Sensitivity 64.5% 61.3% – 93.5% 80.6% 61.3% 77.4% 71% 58.1% 87.1% Specificity 90.3% 87.1% – 54.8% 64.5% 58.1% 45.2% 38.7% 45.2% 22.6% [124]Open in a new tab DMR, different methylation region; XGBoost, extreme gradient boosting; AUC, area under curve; CI, confidence interval. To facilitate the application of these findings to clinical practice, an online classifier based on the 3-DMRs logistic regression was designed at [125]https://mengyuan.shinyapps.io/TB_DNAmethylation/. For simplicity, in this online tool, region-related genes were employed to represent the corresponding DMRs. LTBP3, TNK2-AS1, and HLA-F represent the regions of chr11: 65315205–65315625, chr3: 195635643–195636243, and chr6: 29691631–29692475, respectively. Further evaluation of the 3-DMR logistic regression classifier in different situations Considering the serious consequences of TB spread, 35 TB patients and 32 participants injected with Bacillus Calmette-Guerin (BCG) were combined into the same group to decrease the likelihood of missed diagnosis. However, differential diagnosis of TB patients and BCG recipients is needed. Although the 3-DMR logistic regression classifier exhibited moderate AUC (0.689; 95% CI: 0.563–0.816), it could distinguish TB patients from BCG participants with a specificity of 82.9%. The differential diagnosis performance of the 3-DMR logistic regression classifier was also evaluated. The details of disease controls (DCs) are shown in [126]Table S3. The 3-DMR logistic regression classifier showed a strong ability to distinguish TB from malaria (sensitivity, 61.2%; specificity, 87.5%; AUC = 0.778; 95% CI: 0.598–0.959) and systemic inflammatory response syndrome (sensitivity, 100%; specificity, 94.0%; AUC = 0.955; 95% CI: 0.907–1.000). The sensitivity, specificity, and AUC were 100%, 92.5%, and 0.965 (95% CI: 0.929–1.000) when using this classifier to differentiate sepsis patients (GEO: [127]GSE138074) from TB patients, while they were 100%, 100%, and 1.000 (95% CI: 1.000–1.000) when distinguishing sepsis patients (GEO: [128]GSE58651 or GEO: [129]GSE155952) from TB patients. However, this classifier failed to efficiently differentiate TB patients from patients with subclinical parasitemia (sensitivity, 100%; specificity, 41.8%; AUC = 0.631; 95% CI: 0.417–0.844). The differential diagnosis ability of the 3-DMR logistic regression classifier is shown in [130]Figure 6. Figure 6. [131]Figure 6 [132]Open in a new tab Receiver operating characteristic curves of the 3-DMR logistic regression classifier when distinguishing tuberculosis patients from other disease controls (A) The receiver operating characteristic curve of the 3-DMR logistic regression classifier when distinguishing tuberculosis patients from malaria cases. (B) The receiver operating characteristic curve of the 3-DMR logistic regression classifier when distinguishing cases with systemic inflammatory response syndrome from tuberculosis patients. (C) The receiver operating characteristic curve of the 3-DMR logistic regression classifier when distinguishing sepsis patients (GEO: [133]GSE138074) from tuberculosis patients. (D) The receiver operating characteristic curve of the 3-DMR logistic regression classifier when distinguishing sepsis patients (GEO: [134]GSE58651) from tuberculosis patients. (E) The receiver operating characteristic curve of the 3-DMR logistic regression classifier when distinguishing sepsis patients (GEO: [135]GSE155952) from tuberculosis patients. (F) The receiver operating characteristic curve of the 3-DMR logistic regression classifier when distinguishing tuberculosis patients from patients with subclinical parasitemia. Discussion This work represents a comprehensive analysis of TB-related DNA methylation biomarkers. By integrating the available datasets, we identified TB-related targets (89 DMPs and 24 DMRs). With respect to the likelihood and degree of influencing gene expression, 24 DMRs were treated as subsequent candidates. Logistic regression and elastic net regression were used to select potential biomarkers for further validation in an independent cohort. Based on the selected candidates, six different methods were applied to construct classifiers, and the 3-DMR logistic regression classifier outperformed the others. The good performance of this classifier in both discovery and validation cohorts emphasize the tremendous potential of DNA methylation as a TB diagnostic biomarker. Over the past few decades, clinicians have expressed a preference for obtaining the most direct evidence possible to diagnose diseases. Therefore, tissues or body fluid from lesion sites are considered ideal sample types; and, certainly, analyses based on these samples are regarded as reference methods. However, a high level of invasiveness, challenging techniques, and limited applicable populations currently constrain the use of such samples for disease diagnosis. In this context, peripheral blood has emerged as a focus for researchers. The peripheral blood is the environment in which various cytokines and immune cells interact in infectious diseases. Therefore, the circulatory system can be treated as a source of markers for the host immune response[136]^23 and offers information about when and how disease progresses. The exploration of diagnostic biomarkers in the peripheral blood has encompassed genomics, transcriptomics, and epigenetics studies. However, genomics and transcriptomics do not provide an appropriate balance between molecular stability and flexible monitoring of disease progression. DNA methylation perfectly offsets the above two limitations. Thanks to technological developments, many choices for detection methods are available, which also contributes greatly to the clinical application of DNA methylation analysis. These advantages have prompted the deep exploration of DNA methylation in infectious diseases. To clarify how peripheral blood biomarkers are regulated and determine their functions in targeted disease, it is essential to demonstrate their roles as biomarkers of disease progression. The association between DNA methylation and TB was reported as early as 1980.[137]^24 Although the question of whether the change in host genomic methylation is caused by MTB itself or secondary to inflammation remains unanswered, some progress has been made to decode the functions of these changes in DNA methylation. By summarizing and comparing the characteristics of different methylation sites before and after host infection with MTB, scholars found that significantly changed candidates were mainly located in promoter regions. Methylation in the promoter region can lead to the failure of gene transcription initiation and gene silencing the by hampering transcription factor (TF) binding to specific motifs.[138]^25^,[139]^26 Changes in gene activity and expression will further affect a series of signaling pathways and biological functions, including immune cell regulation, cytokine regulation, IFN-γ signaling pathways, and other TB-related immune activities,[140]^21^,[141]^27 consistent with the results of this study. As mentioned above, methylation is able to interfere with TF binding, while TFs are also capable of regulating methylation status. TF occupancy theory suggests that TFs can modulate gene methylation levels by competing with DNA methyltransferase binding at promoter sequences.[142]^26 Pacis et al.[143]^28 cultured MTB-infected human dendritic cells and collected time-dependent data on DNA methylation, gene expression, and chromatin accessibility patterns. The resulting data verified that immune-related TFs could regulate methylation levels by binding to cis-acting elements. Notably, not all TFs are negatively regulated by DNA methylation, and positive correlations between them have also been reported.[144]^29 Collectively, the existing data show that DNA methylation has crucial roles in the pathogenesis and development of TB. Mapping TB-specific methylation and further selecting promising sites may open a new horizon for TB elimination to some extent. The understanding of how DNA methylation functions increases its potential as a promising biomarker. Over time, a number of TB-specific DNA methylation biomarkers, from the global DNA methylation level to detailed methylated sites or regions, have been reported. Maruthai et al.[145]^30 took the global DNA methylation level as a TB-specific biomarker and reported an AUC of 0.81. Their group also focused on the VDR promoter methylation level and the median DNA methylation level of Alu repetitive elements and found that these two indicators exhibited good performance for diagnosing pediatric TB (AUC = 0.977 and 0.969, respectively).[146]^31^,[147]^32 However, these results were obtained from relatively small cohorts (76 and 68 children). The team of Das and Chen reported some differentially methylated sites in TLR2, SNX26, and other genes but did not analyze the diagnostic values of these sites.[148]33, [149]34, [150]35 Wang et al.[151]^36 demonstrated meaningfully methylated sites in key genes in the vitamin D metabolic pathway and further assessed the diagnostic capacity of these genes by calculating the cumulative methylation level of CpG sites on these genes via four different methods. The AUCs of candidate genes varied from 0.578 to 0.794. Esterhuyse et al.[152]^9 incorporated different CpG sites by machine learning to distinguish active TB from latent TB infection and obtained a model with an AUC of 0.74. However, these methylation data were generated from only monocytes and granulocytes, which might restrict the applicability of the model. The high AUCs of these candidate loci underscores the potential of DNA methylation as a TB diagnostic biomarker. However, these biomarkers were identified in relatively small cohorts or specific cell types. Moreover, current studies tend to use a single CpG site or CpG sites in a single gene as a biomarker, which implies limited diagnostic power and generalizability. Comprehensive and systematic work is particularly necessary to avoid the above problems and further develop the diagnostic potential of DNA methylation. Therefore, we first combined all available datasets to properly expand the sample size, cover more ethnic populations, and target more sample types (whole blood, peripheral blood mononuclear cells, etc.). Through rigorous data analysis, the high population coverage and many different sample types ensure the wide applicability of biomarkers. Compared with CpG sites, methylated regions are more likely to regulate gene expression to a greater extent and thus were chosen as biomarkers in this work. In addition, each single biomarker has its own limitations, which will be particularly magnified in complex and changeable clinical settings. Constructing models to incorporate different biomarkers can resolve these questions effectively and maximize the underrated power of TB-specific DNA methylation biomarkers. We tested various combinations of variable selection methods and modeling methods to build different classifiers, which offered us a better opportunity to select a high-performing classifier. Such efforts were finally realized in the 3-DMR logistic regression classifier, which exhibited superior performance in both the discovery and validation cohorts. Fewer included variables, excellent capacity, strong generalizability, and availability as a convenient online tool enable this classifier to be used in real clinical applications. Understanding why this 3-DMR logistic regression classifier possesses such capacity and differential diagnostic ability is crucial to its optimal utilization. The regions of chr3: 195635643–195636243 and chr6: 29691631–29692475 occupied a dominant position in this classifier. The region of chr3: 195635643–195636243 is at the position of TNK2-AS1. TNK2-AS1 is closely related to cell proliferation, invasion and apoptosis in cancer.[153]^37^,[154]^38 TNK2-AS1 acts as a microRNA sponge and, to date, miR-4319 and miR-150-5p have been reported as targets of TNK2-AS1.[155]^38^,[156]^39 Our previous study suggested that miR-150-5p was a promising biomarker for TB diagnosis,[157]^40 and mounting evidence demonstrates that miR-150-5p contributes greatly to immune cell differentiation and the capacity of effector CD8+ T cells to kill infected cells.[158]^41 Herein, we hypothesized that the interaction between MTB and the host influences TNK2-AS1 expression by regulating its methylation level, while altered TNK2-AS1 expression triggers a series of downstream pathways through a ceRNA mechanism. The region of chr6: 29691631–29692475 is located within the promoter region of HLA-F. In the late 1980s, Geraghty et al.[159]^42 first reported HLA-F as a nonclassical class I antigen in the human leukocyte antigen (HLA) family. Due to their prominent role in antigen presentation and immune regulation, many members of the HLA family have received ongoing attention from researchers,[160]^43 but HLA-F has been less well studied. However, the current evidence suggests that HLA-F is an underappreciated player in immune regulation.[161]^44 Changes in HLA-F at the genome, transcription, or epigenetic level can fuel a series of immune alterations and are thus involved in cancer, infection, autoimmunity, and other disorders.[162]^45 In our work, TB patients exhibited decreased methylation levels in the promoter region of HLA-F, also indicating elevated HLA-F mRNA expression. Increased HLA-F mRNA expression had positive impacts on its interaction with immune receptors (KIR3DL1, KIR3DS1, etc.), cytokine production (CCL4, IFN-γ, etc.), and antigen presentation, which have already been verified in other infectious diseases. Lunemann et al.[163]^46 documented that through binding to KIR3DS1, increased HLA-F promoted natural killer cell cytolysis and cytokine production and thus suppressed hepatitis C virus replication. This was also observed in HIV-1 infection.[164]^47 Based on the above, we speculated that the host responded to MTB infection by downregulating the methylation level of the promoter region in HLA-F, elevating HLA-F mRNA expression and then promoting immune cell activation to kill the bacteria. The CpG sites in the HLA-F promoter region may serve as promising targets to enhance host immunity and achieve precise MTB clearance. Aside from its diagnostic ability, generalizability, detection time, cost, and feasibility in economically poor areas must also be considered in the transition of this classifier to clinical practice. In terms of generalizability, data generated from different sample types and ethnicities were included to increase the generalizability of the classifiers. The limited sample size in subgroups of different ethnicities restricted the reliability of related findings, whereas relevant results were presented in this paper to provide potential clues. For sample type subgroups, principal-component analysis indicated that 24 regions in 5 subgroups exhibited similar methylation patterns. This suggests that the 3-DMR logistic regression classifier has the potential to be used in these five sample types. The excellent performance of the 3-DMR logistic regression classifier in whole blood samples from an independent cohort supports the above speculation. Although different blood cells have their own epigenetic patterns, they are all found in the circulation and thus may be stimulated by the same factors and communicate frequently with each other, leading to similarities in their epigenetic patterns.[165]^48^,[166]^49 In terms of cost and detection time, the use of whole blood as samples allows a shorter detection time and lowers costs because it does not require complex cell sorting, gradient density centrifugation, or other sample processes. However, sample processing accounts for a limited part of the total detection time and costs, while detection assays greatly influence detection time and cost. Assays based on bisulfite conversion, restriction enzymes, and affinity enrichment are currently applied to analyze specific methylated regions and, among these, bisulfite conversion is the most widely used.[167]^50 Polymerase chain reaction (PCR), microarrays, and sequencing are utilized to read methylation information after bisulfite conversion.[168]^51 PCR is suitable for primary health facilities due to its low cost, short detection time, and simple operation; however, limited sensitivity and poor ability for methylated region detection should also be taken seriously. Microarrays and sequencing overcome the above shortcomings of PCR but at the expense of high costs and rigorous laboratory conditions. When choosing different detection methods, it should be considered that these methods had different sensitivities and thus different captured efficiencies of CpG sites in targeted regions, which was also observed in this study. Unfortunately, we have not proposed a standardized process to address this problem. In addition, the differential diagnostic ability of the targeted classifier was evaluated in relatively small populations, and more reliable evidence is needed. Herein, much work remains to be done before applying a novel diagnostic pattern in clinical practice. Collectively, the available evidence suggests that DNA methylation might be a kind of biomarker for TB diagnosis. Among all classifiers tested here, the 3-DMR logistic regression classifier presented excellent performance in both the discovery and validation datasets. This classifier might provide insights into how DNA methylation biomarkers could fit into future TB diagnosis and allow TB patients to be shielded from disease progress by timely diagnosis. Further decoding the profile of DNA methylation could offer crucial hints related to TB development, a research direction that requires an increased emphasis on TB. Materials and methods Data preparation DNA methylation array datasets were searched in the NCBI Gene Expression Omnibus database (GEO) and European Bioinformatics Institute ArrayExpress from their inception to October 3, 2020. The search terms used were (“tuberculosis” OR “TB”) AND (“methylation” OR “methylate”), with restrictions on species (Homo sapiens) and sample type (peripheral blood or its components). Considering the influence of complex body interactions on methylation patterns,[169]^52 datasets generated based on in vitro infection cell models, such as GEO: [170]GSE83379, were excluded. Datasets without raw IDAT or CEL files, such as GEO: [171]GSE50835, were also excluded. The search strategy for DCs is described in [172]Data S1. Two authors (Lyu and Zhou) were responsible for reviewing the eligible datasets and extracting the design information of each included dataset. The extracted content included the demographic characteristics of the subjects (age, sex, race, etc.), number and type of samples, detection platform used, etc. Any discrepancy was resolved by discussion. Raw data processing To decrease the impact of using different probe filtering conditions and data processing methods, the raw files for the included datasets were downloaded and processed in the same way. The Minfi v.1.30.0 package[173]^53 was applied to handle the data. The p value of the probe was calculated to assess the reliability of the probe signal, and a probe was filtered out if its p value was more than 0.01 in any sample.[174]^53 Samples with poor quality, such as GEO: [175]GSM1860483, were excluded according to the mean p values of all probes for each sample. In addition, probes with single-nucleotide polymorphisms at CpG sites or on sex chromosomes were excluded. Preprocessing, background noise reduction, and normalization were also performed. DMP and DMR analysis The batch effects among different datasets were adjusted by the sva v.3.32.1 package.[176]^54 Both beta and M values were calculated to assess the methylation level of each CpG site in each sample. Based on the M value matrix,[177]^55 the Limma package v.3.40.6[178]^56 was used to assess DMPs with the standard of |FC| > 1.5 and p < 0.05. For DMRs, the DMRcate package v.1.20.0[179]^57 was employed to identify methylated regions and perform differential methylation analysis. Methylated regions were identified based on a Gaussian kernel, and DMRs were found by tunable kernel smoothing of the differential methylation signal.[180]^57 The region methylation level was assessed on the basis of the mean beta value of all CpG sites in the corresponding region. Based on 363,416 probes, methylated regions with a false discovery rate < 0.05 were regarded as meaningful regions. The differential analysis was also performed in different ethnicities and sample types. DMR annotation was carried out by an online tool, wANNOVAR: [181]http://wannovar.wglab.org. Functional analysis GO enrichment analysis and KEGG pathway enrichment analysis were performed for DMR-related genes. A PPI network was constructed to identify the hub genes among DMR-associated genes by an online tool, STRING: [182]https://string-db.org/. The node score suggested the importance of this node in the whole PPI network. The interactions of the combined score were set at 0.4. Diagnostic classifier development Incorporating DMRs into classifiers might open the door to new methods for clinical diagnosis of TB. Logistic regression and elastic net regression were used to select classifier variables and construct classifiers. All DMRs underwent binary univariate and multivariable logistic regression. DMRs with p < 0.05 in the final multivariable logistic regression were incorporated. For elastic net regression, the key parameter λ was selected by K-fold cross-validation. In addition to the above two methods, SVM, KNN, random tree, and XGBoost were also used to construct classifiers, and optimal hyperparameters of these modeling methods were chosen by grid search or cross-validation. Grid search was realized by GridSearchCV, which evaluates and compares all scores by adjusting a series of parameter values and outputs the optimal values of the parameters that generated the highest score. Cross-validation divides all data into K parts, calculates the average verification accuracy, and selects the optimal parameter when the verification accuracy reached the best value. The Youden index was calculated to determine suitable cutoff values. Sensitivity, specificity, and AUC were used to assess the diagnostic capacity of the classifiers. Receiver operating characteristic curves were plotted to visually present classifier performance. Validation cohort recruitment and sample preparation A total of 62 Chinese participants admitted to West China Hospital of Sichuan University between January 2019 and October 2020 were enrolled. TB patients met the following criteria: (1) diagnosed with TB according to the Diagnostic Criteria for Tuberculosis (WS 288-2008);[183]^58 (2) age older than 18 years; and (3) free of other lung diseases (lung cancer, chronic obstructive pulmonary disease, etc.), liver diseases (hepatitis, hepatocellular carcinoma, etc.), metabolic diseases (diabetes mellitus, hyperlipidemia, etc.), and autoimmune diseases. Pregnant women were excluded unless specifically indicated. HCs with negative results on TB-related examinations and no history of TB were recruited. The TB and HC groups were age and sex matched. All participants signed written informed consents. EDTA-treated whole blood (3.0 mL) was collected from each subject, and genomic DNA was extracted (Kuang Yuan Diagnostics Q1001, China) according to the manufacturer’s protocol. DNA purity was measured by spectrometry at 260/280 nm. The extracted DNA was stored at −80°C. The protocol of this study was approved by the Clinical Trials and Biomedical Ethics Committee of West China Hospital, Sichuan University (registration number in the Chinese Clinical Trial Registry: ChiCTR1900028670). Region-specific multiple sequencing and data analysis To verify DMRs found in array analysis and test classifier performance and generalization, region-specific multiple sequencing was performed. Ultrasound was used to fragment the genomic double-stranded DNA into pieces of 300 bp. After purification with 2× magnetic beads, DNA was bisulfite-treated with an EZ DNA Methylation-Gold Kit (Zymo Research D5005, USA). Bisulfite-converted DNA samples further underwent phosphorylation modification by T4 polynucleotide kinase (Thermo Scientific EK0031, USA) and the addition of 5′ adapters. Bisulfite-specific PCR was conducted to amplify regions of interest (primers used in this step are listed in [184]Table S1) using EpiTaq HS (TaKaRa R110A, Japan). Then, Taq (TaKaRa R001WZ, Japan) was used for a second PCR to add sample barcodes and sequencing adapters. PCR amplicons were visualized by gel electrophoresis, purified by 1.2× magnetic beads and sequenced on an Illumina Sequencer. Raw data were subjected to quality control and read trimming. Bismark was applied to align trimmed reads to the reference genome sequence and calculate the percentage of 5-methylated cytosine (5-mC). According to DMRfinder, the methylation degree of each region was assessed by the following formula: methylation fraction of a region = sum of methylated counts at CpG sites within a region/sum of methylated and unmethylated counts at CpG sites within a region. Differences between TB patients and HCs were evaluated by Limma package v.3.40.6. Acknowledgments