Abstract

With human guidance, computers now use machine learning (ML), a branch of artificial intelligence (AI), to learn from data, detect trends, and make predictions. Software can adapt and improve with new information. In medical imaging, pattern recognition is used to predict outcomes, diagnose disorders, and suggest treatments. Tuberculosis (TB) remains the most common bacterial disease affecting humans. The World Health Organisation reported that 1.3 million people died from tuberculosis in 2022, and the death rate can reach 66% if proper treatment is not provided. We trained supervised ML algorithms, namely XGBoost, Logistic Regression, Random Forest Classifier, AdaBoost, and Support Vector Machine, to classify TB patients from large RNA-sequence count data. These algorithms achieved prediction accuracies of 0.963, 0.739, 0.773, 0.866, and 0.866, respectively. This article highlights feature importance techniques using XGBoost, the ML model with the highest prediction accuracy (0.963), to identify significant genes in TB RNA-sequence count data. We then applied computational techniques, including pathway analysis, gene ontology (GO), hub-protein and protein–protein interaction (PPI) analysis, transcription factor and miRNA interaction analysis, and drug-protein interaction analysis, to the 100 highly expressed genes. Using these key machine learning features, we identified 20 pathways, 24 gene ontologies, 20 hub genes, and 22 drugs.

Keywords: ML, Bioinformatics, TB, DEGs, Gene ontology, PPIs, Hub gene, Potential drug

Subject terms: Computational biology and bioinformatics, Microbiology

Introduction

Machine learning (ML) is increasingly used across sectors including the military, cybersecurity, and healthcare. ML algorithms are also employed to address intricate and emerging challenges in biomedical research, including text mining, drug discovery, single-cell RNA sequencing, and the early diagnosis and prognosis of diseases^1.
ML techniques are frequently used to analyze and interpret a variety of data types, including multi-omics, imaging, clinical records, medication details, and disease progression. Biomedical data science encompasses a variety of data formats, including genome sequences, omics data, medical imaging, and clinical records^2. Machine learning methods are increasingly favored within the medical sector for their efficacy in aiding decision-making^3. Studies of RNA-sequence data have shown that ML algorithms are effective at discovering splice variants, using a variety of ML methods for different detection objectives and to clarify sequence correlations^4.

For thousands of years, tuberculosis (TB) has been a common infectious illness that has caused high rates of death in several vulnerable populations. Despite advances in medical science, this trend continues in the modern period. According to the World Health Organization's most recent TB report, there were an estimated 10.6 million new cases of TB and 1.3 million deaths from the disease in 2022 (World Health Organization, 2023). Until the advent of SARS-CoV-2, TB was the leading infectious cause of mortality in humans. Approximately 1.6 million people died from TB in 2021, mostly in low- and middle-income countries^5. Early identification, drug resistance screening, and thorough treatment with short-course regimens can effectively cure tuberculosis^6. Automated techniques for tuberculosis diagnosis and classification have improved the precision of disease identification, enabling healthcare professionals to make more informed decisions. Additionally, innovators in advanced biotechnology have created a number of bioinformatics tools that have facilitated the study of disease.
Many researchers have employed machine learning (ML) algorithms to predict chronic and viral diseases^7,8, including the prevalence of tuberculosis based on associated risk factors^9,10. Within artificial intelligence (AI), ML builds mathematical models from training data to make precise diagnoses and decisions without explicit manual programming for each task. In this study of TB patients, we therefore used an XGBoost-based feature importance approach to identify highly expressed genes from RNA sequencing count data, rather than relying only on adjusted P-values and log fold changes to identify important genes. We then developed a classification pipeline for tuberculosis diagnosis using RNA-Seq count data, enabling rapid and reliable analysis of large transcriptome datasets. Before any further analysis, the first step is to assess the quality of the raw sequence data. ML algorithms were then used to identify significant genes, followed by bioinformatics analyses such as pathway enrichment, gene ontology, and drug prediction. These analyses clarify the biological processes and roles associated with the identified genes, as well as possible therapeutic options for TB treatment. The workflow is depicted in Fig. 1. Future efforts will use ML algorithms to quickly identify key genes and provide an immediate prognosis estimate for TB patients based on gene sequence data, which should make the disease easier for medical personnel to manage. These findings may help to control or reduce the risks faced by tuberculosis patients and may assist researchers and drug developers.

Fig. 1. The suggested approach and the workflow.
Methods and materials

Data sets and pre-processing

GEO, located at https://www.ncbi.nlm.nih.gov/geo/, is a global public archive for genetic and genomic data that gives access to large datasets created by scientific communities, such as data from microarrays, next-generation sequencing, and other analytical methods. GEO makes this information easier to share and more accessible to researchers around the world^11. We retrieved our dataset from GEO under accession ID GSE103147^12. We obtained the raw count data through "GREIN", a tool that facilitates the exploration of publicly available gene expression data^13. The dataset was generated on the "Illumina HiSeq 2000 (Homo sapiens)" platform. In the underlying study, 6,363 healthy adolescents aged between 12 and 18 years were enrolled and followed for 24 months or longer, with blood collected at six-month intervals. Among them, 41 were diagnosed with active TB, while 104 served as asymptomatic controls^12. The period between blood collection and the diagnosis of active TB, referred to as "time to diagnosis", varied from 1 to 894 days. The count data cover 28,091 genes for each individual, and the dataset has dimensions of 28,089 rows and 1,608 columns. The count data are presented in tabular form, giving the number of sequence fragments associated with each gene in every sample^14. An overview of the dataset is provided in Table 1.

Table 1. A nutshell of the datasets and findings from the transcriptome study.

| Diseases | GEO ID | Sequencing platform | Tissue | Cell | Gene of every sample | Case samples | Control samples |
|---|---|---|---|---|---|---|---|
| Tuberculosis | GSE103147 | Illumina HiSeq 2000 | Whole blood | T cells and monocytes | 28,091 | 724 | 884 |

After acquiring the data, we conducted pre-processing by utilizing the functionalities provided by the "pandas" library.
The "pandas" library offers built-in, user-friendly functions for handling routine data manipulations and conducting analyses on datasets like this one. It aims to serve as a foundational layer for statistical computing in Python^15.

Machine learning algorithms

This section discusses the experimental use of several ML algorithms for the prediction of tuberculosis.

Random forest (RF) classifier

Random Forest, a machine learning algorithm developed by Breiman and Cutler, combines many decision trees for both regression and classification problems, offering versatility and ease of use^16. The classifier builds multiple decision trees on a dataset and takes the majority vote of their predictions, which improves predictive accuracy and limits overfitting^16. Random Forest requires relatively little training, predicts with high accuracy, and performs efficiently on large datasets. It works in two phases: first N decision trees are combined to create the random forest, and then predictions are made with each tree^17. The process involves choosing K data points at random from the training set, building the decision tree associated with these points, choosing the number N of decision trees to build, and repeating the first two steps. A new data point is then assigned to the class that receives the most votes across the trees.

AdaBoost classifier

AdaBoost is an ensemble boosting classifier proposed by Freund and Schapire in 1996. It combines multiple weak classifiers to increase accuracy. AdaBoost is an iterative method that sets classifier weights and re-weights the training samples so that unusual observations are predicted accurately^18. It aims to reduce the training error in each iteration by increasing the weight of incorrectly classified instances, using a Decision Tree as the base learner. It is recommended to train the classifier iteratively with a range of weighted training samples.
However, this technique cannot be parallelized, because each predictor must be trained after the preceding one. The AdaBoost algorithm assigns every observation the same initial weight, builds a model on a subset of the data, computes errors by comparing predicted and actual values, assigns higher weights to misclassified data points, and repeats these steps until the error function stops changing or the maximum number of estimators is reached^19.

Logistic regression classifier

Logistic regression is a popular machine learning algorithm that predicts a categorical dependent variable from independent variables. Rather than returning exact values of 0 or 1, it predicts probabilities between 0 and 1^20.

Procedure for logistic regression

Implementing logistic regression in Python follows the same steps as other regression models:

1. Pre-process the data
2. Fit logistic regression to the training set
3. Predict the test result
4. Test the accuracy of the result (create the confusion matrix)
5. Visualize the test set result

Logistic regression equation

The logistic regression equation can be obtained from the linear regression equation in the following steps. The straight-line equation may be expressed as:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

In logistic regression y can only lie between 0 and 1, so we take the ratio of y to (1 − y), which is 0 for y = 0 and tends to infinity as y approaches 1:

$$\frac{y}{1 - y}$$

Because we need a range from −∞ to +∞, we take the logarithm of this ratio:

$$\log\left[\frac{y}{1 - y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

This is the logit form of logistic regression, in which the log-odds are linear in the predictors. Solving for y, it can equivalently be written as:

$$y = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n)}}$$

XGBoost classifier

XGBoost is a gradient-boosting machine learning algorithm used in regression, classification, and ranking tasks, and is known for its computational efficiency, feature importance analysis, and handling of missing values. XGBoost was introduced by Dr. Chen of the University of Washington as an ensemble learning model based on boosting. It constructs multiple CART trees by splitting nodes, reduces variance, and exploits CPU multi-threading to improve performance. It has applications in artificial intelligence, data analysis, statistics, and data mining^21.

Support vector machine

Support Vector Machine (SVM) is a popular supervised learning algorithm used for classification and regression problems. It aims to find the best decision boundary, a hyperplane, that separates n-dimensional space into classes. Vapnik and colleagues developed SVMs in the 1990s and presented their findings in 1995^22.

Support vector regression

Support vector regression (SVR) extends linear SVMs to continuous targets and is primarily used for time series prediction. It finds a hyperplane with the maximum margin between data points, unlike linear regression, which requires the relationships between variables to be specified.

Differential gene expression

Differential gene expression (DGE) analysis is a key bioinformatics technique for understanding biological processes by comparing gene expression levels between conditions or treatments. It uses statistical analysis of normalized read count data to identify quantitative changes in expression levels between experimental groups. Differential expression testing aims to identify genes expressed at different levels between conditions, providing biological insight into the processes affected by the conditions of interest^23.
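As a rough illustration of comparing normalized read counts between groups, the sketch below scales raw counts to counts per million (CPM) and computes a per-gene log2 fold change. The CPM normalization, the pseudocount of 1, the function names, and the toy gene labels are all assumptions made for this example; the source does not state which normalization or test was applied.

```python
import math


def counts_per_million(sample):
    """Scale one sample's raw gene counts so that they sum to one million."""
    total = sum(sample.values())
    return {gene: count * 1e6 / total for gene, count in sample.items()}


def mean_expression(samples):
    """Average CPM of each gene across a list of samples (dicts of gene -> count)."""
    normalized = [counts_per_million(s) for s in samples]
    genes = normalized[0].keys()
    return {g: sum(s[g] for s in normalized) / len(normalized) for g in genes}


def log2_fold_change(case_samples, control_samples, pseudocount=1.0):
    """Per-gene log2 ratio of mean CPM, case over control.

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes with no reads in one group.
    """
    case_mean = mean_expression(case_samples)
    control_mean = mean_expression(control_samples)
    return {
        g: math.log2((case_mean[g] + pseudocount) / (control_mean[g] + pseudocount))
        for g in case_mean
    }


# Hypothetical toy data: two genes, one case sample and one control sample.
case = [{"GENE_A": 300, "GENE_B": 700}]
control = [{"GENE_A": 100, "GENE_B": 900}]
lfc = log2_fold_change(case, control)
```

A positive value indicates higher normalized expression in the case group, a negative value lower expression. A full DGE analysis would also attach a statistical test and multiple-testing correction to each gene.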
Feature importance

"Feature importance" is a method that assigns a score to each input feature of a model, with higher scores indicating a greater influence of that feature on the model's prediction of the target variable^24. Understanding feature importance can improve a model's clarity and effectiveness by providing insight into the relationships between input and output variables. It can help remove unnecessary or redundant features and identify important features that might otherwise go unnoticed, opening up new research possibilities. Methods such as Gradient Boosting and XGBoost feature importance can be used to score features, offering different perspectives on feature significance and identifying the features most influential in a model's predictions^25.

Molecular pathway enrichment and gene ontology analysis

The Gene Ontology (GO) provides a framework and a set of concepts for describing the roles of gene products from all species^26. The GO knowledge base is a major resource for understanding gene functions. Because it is both machine- and human-readable, it enables computational analysis of large-scale molecular biology and genetics studies in biomedical research. Functional enrichment analysis is a systematic effort to categorize biological features, such as biological mechanisms or chromosomal regions, that are linked to a range of related disorders. EnrichR, a comprehensive web-based tool for gene set enrichment, was used to study pathway enrichment and gene ontology, which covers biological processes, cellular components, and molecular functions. Several databases, namely KEGG^27, Reactome^28, WikiPathways^29, Elsevier, and BioCarta^30, were used to characterise the major pathways, with an adjusted P-value of less than 0.05 taken as the significance threshold.
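The adjusted P-value threshold of 0.05 described above can be sketched as follows. This example assumes the Benjamini-Hochberg correction, which the source does not name explicitly; the function names and the placeholder term labels are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_terms(results, alpha=0.05):
    """Keep (term, adjusted_p) pairs with adjusted_p < alpha, sorted ascending.

    `results` is a list of (term, raw_p_value) pairs.
    """
    adjusted = benjamini_hochberg([p for _, p in results])
    kept = [(term, q) for (term, _), q in zip(results, adjusted) if q < alpha]
    return sorted(kept, key=lambda pair: pair[1])


# Hypothetical raw p-values for four placeholder terms.
raw = [("term_a", 0.01), ("term_b", 0.04), ("term_c", 0.03), ("term_d", 0.5)]
selected = significant_terms(raw)
```

In this toy input only `term_a` survives, because the correction raises the other raw p-values above 0.05 even though two of them were below it before adjustment.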
Protein–protein interactions (PPIs) analysis

Protein–protein interactions (PPIs) are physical contacts formed between proteins through biochemical or electrostatic processes. They are crucial for biological processes such as cell-to-cell contacts, cell cycle progression, signal transmission, and metabolic pathways. The Search Tool for the Retrieval of Interacting Genes (STRING)^31 is an online tool widely used to evaluate PPI data. The dataset was used to evaluate relationships among DEGs, and a PPI network was created using Cytoscape, revealing significantly correlated proteins based on topological characteristics^32.

Hub gene identification

Hub genes are those genes in the network that interact with a large number of other genes and are frequently essential for biological functions and gene regulation. Furthermore, hub genes have been shown to be the genes most strongly linked to illness^33. A Cytoscape plugin called CytoHubba (Version 0.1)^34 was used to determine the network hub genes. ClueGO^35 is another Cytoscape plugin that can be used to generate and display functionally grouped networks of biological terms and pathways. The CluePedia Cytoscape plugin, a useful addition to ClueGO, searches for additional markers that may be connected to pathways.

Assessing transcription factors (TF) and miRNAs network

Transcription factors are proteins that control how genes are transcribed into RNA, which is the first step in producing proteins. They bind to DNA at enhancer and silencer sites to activate or repress genes in specific parts of the body, enabling cells to perform logic operations^36. We used the NetworkAnalyst tool to explore the JASPAR^37 and ChEA^38 databases and identify topologically plausible transcription factors likely to bind to our DEGs.
JASPAR is a publicly available database containing carefully selected, non-redundant transcription factor (TF) binding profiles for TFs in many species spanning six taxonomic groupings. These profiles are stored as TF flexible models (TFFMs) and position frequency matrices (PFMs). ChIP-X Enrichment Analysis (ChEA)^39 is a gene-set enrichment technique designed to determine whether query gene sets are enriched with genes that may be transcription factor targets. In ChEA, sets of probable target genes are collected from published ChIP-chip, ChIP-seq, ChIP-PET, and DamID studies and labeled with transcription factors using a gene-set library (Fig. 2).

Fig. 2. Supervised learning model to diagnose tuberculosis.

miRNA-target interactions (MTIs) are stored in the miRTarBase database. The collected MTIs are experimentally confirmed by reporter assays, western blots, or microarray studies with miRNA overexpression or knockdown.

Analysis of potential drugs

The term "potential drug-drug interaction" (or "potential DDI") describes the possibility that one drug alters the effects of another when the two are taken concurrently. To have an effect, a medication must interact with a matching protein or enzyme at the receptor. The interaction between proteins and bioactive molecules is crucial for many pharmacological or prophylactic approaches. Drug-protein interaction refers to the interaction between a protein molecule and a drug molecule, which significantly affects the activity of medications. DrugBank Online is an important resource in the drug industry, supporting data-driven medicine through its comprehensive referencing and detailed data descriptions. The NetworkAnalyst program was used to predict drug-protein interactions.
Result

Evaluation metrics and results of statistical models

Here, we describe the measures used to determine the effectiveness of the proposed approach. Models trained on RNA-sequence data need to be assessed with particular care (Fig. 3). Although the efficacy of a model is frequently evaluated using accuracy alone, accuracy by itself can be misleading. Therefore, in addition to accuracy, the evaluation metrics RocAuc, Precision, Recall, Specificity, F1-Score, TPR, and FPR are used to obtain a more thorough understanding of a model's effectiveness. Each of these metrics was used in our research to evaluate the reliability of the proposed model. The confusion matrix summarizes the quantities from which these metrics for a classification model are computed. It has four components: TP, FP, TN, and FN, which stand for "True Positive," "False Positive," "True Negative," and "False Negative," respectively. Four outcomes are possible. A true positive is counted when a positive instance is classified as positive; a false negative is counted when a positive instance is classified as negative. A true negative is counted when a negative instance is classified as negative; a false positive is counted when a negative instance is classified as positive. For a well-performing model, correct outcomes (TP and TN) are more frequent than incorrect ones (FP and FN). Accuracy is the proportion of correct predictions among all predictions. Correct predictions consist of true positives (TP) and true negatives (TN). All predictions consist of the actual positive (P) and actual negative (N) instances, where P is made up of TP and false negatives (FN), and N is made up of TN and false positives (FP).

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

Fig. 3. Evaluation among the ML models.
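The four confusion-matrix counts and the accuracy formula above can be illustrated with a short sketch. The function names and the toy labels are hypothetical and are not taken from the study's code; label 1 is assumed to denote a TB case.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, and FN for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # positive instance predicted positive
            else:
                fp += 1  # negative instance predicted positive
        else:
            if truth == positive:
                fn += 1  # positive instance predicted negative
            else:
                tn += 1  # negative instance predicted negative
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


def accuracy(counts):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = counts["TP"] + counts["TN"] + counts["FP"] + counts["FN"]
    return (counts["TP"] + counts["TN"]) / total


# Hypothetical labels for eight samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(counts)
```

Here six of the eight toy predictions are correct, so the accuracy is 0.75.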
When evaluating a classification model's capacity to predict tuberculosis, precision and recall are essential metrics to consider. Precision is the proportion of correctly predicted TB patients among all patients predicted to have TB^40. In other words, it is the number of true positives divided by the total number of positive predictions (true and false positives). Recall, often referred to as sensitivity, is calculated by dividing the number of true positives by the number of all actual positives^41. It compares the number of correctly predicted TB patients with the total number of TB patients. The formulas used for these metrics are:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

The harmonic mean of a classification model's precision and recall is known as the F1 score, or F-measure^42. Because precision and recall contribute equally to it, the F1 score gives a balanced reflection of a model's reliability. The equation used for the F1 score is:

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

Specificity measures the proportion of actual negatives that the model correctly identifies. The remaining actual negatives are misclassified as positive; these are the false positives. High specificity means that the model classifies most negative cases correctly, whereas low specificity indicates that many negative cases are being incorrectly labeled as positive. High specificity is desirable when false positives are costly, as in medical treatment^43.
Specificity can be computed using the formula below:

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{5}$$

In the context of a confusion matrix, the True Positive Rate (TPR) is another name for sensitivity or recall, and the True Negative Rate (TNR) is another name for specificity. The False Positive Rate (FPR) is the complement of specificity. A low FPR is crucial to prevent needless additional testing and possible patient harm, whereas a high TPR is crucial to ensure that every case of disease is identified (Fig. 4). Keeping diagnostic and screening tests effective and safe requires striking a balance between TPR and FPR. They are computed as follows:

$$\text{TPR} = \frac{TP}{TP + FN} \tag{6}$$

$$\text{FPR} = \frac{FP}{FP + TN} \tag{7}$$

Fig. 4. Supervised learning to diagnose tuberculosis.

We employed five supervised models, XGBoost, Logistic Regression, Random Forest Classifier, AdaBoost, and Support Vector Machine, to predict tuberculosis from RNA-sequence count data. Table 2 shows that XGBoost performed best, with Precision 0.95, Recall 0.964, RocAuc 0.985, Specificity 0.962, F1-Score 0.957, TPR 0.964, and FPR 0.038, together with the highest prediction accuracy (0.963) and the lowest Log Loss (0.139). The AdaBoost and Support Vector Machine models shared the second-highest prediction accuracy, 0.866. Their Log Losses, 0.666 and 0.661 respectively, are the highest among the five models. AdaBoost achieved Precision 0.847, F1-Score 0.844, TPR 0.840, FPR 0.114, Recall 0.840, Specificity 0.886, and RocAuc 0.060. Its ROC AUC, which measures a model's ability to distinguish between positive and negative classes, is the lowest for this dataset. The Support Vector Machine achieved F1-Score 0.838, TPR 0.804, FPR 0.087, Precision 0.874, Recall 0.804, RocAuc 0.115, and Specificity 0.913.
On the other hand, with an accuracy of 0.739, Logistic Regression is the least accurate model overall. The Random Forest Classifier ranked just above it, with an accuracy of 0.773.

Table 2. Model efficiency assessment: evaluation metric scores.

| ML models | Accuracy | Log loss | Precision | Recall | RocAuc | Specificity | F1-score | TPR | FPR |
|---|---|---|---|---|---|---|---|---|---|
| XGBoost | 0.963 | 0.139 | 0.95 | 0.964 | 0.985 | 0.962 | 0.957 | 0.964 | 0.038 |
| Logistic regression | 0.739 | 0.553 | 0.737 | 0.609 | 0.193 | 0.837 | 0.667 | 0.609 | 0.163 |
| Random forest | 0.773 | 0.555 | 0.815 | 0.609 | 0.099 | 0.897 | 0.697 | 0.609 | 0.103 |
| AdaBoost | 0.866 | 0.666 | 0.847 | 0.840 | 0.060 | 0.886 | 0.844 | 0.840 | 0.114 |
| Support vector machine | 0.866 | 0.661 | 0.874 | 0.804 | 0.115 | 0.913 | 0.838 | 0.804 | 0.087 |

Comparative transcription sequencing utilizing the significance of features

To discover highly expressed genes, differentially expressed genes (DEGs) were identified in this study with two feature-importance methodologies, applying ML algorithms to RNA-sequence count data from TB patients and non-TB controls. Five supervised techniques were trained to find the best performer: XGBoost, Logistic Regression, Random Forest Classifier, AdaBoost, and Support Vector Machine. XGBoost achieved a prediction accuracy of 0.963, whereas the other algorithms all remained below 0.9, which is why we selected it. The top 100 genes ranked by the XGBoost model were then selected to minimize uncertainty. These genes are listed in Supplementary Table S1. Typically, P-value, adjusted P-value, and log fold change (Log-FC) are used to identify important genes; here we instead identified them through feature selection, which gave an effective result.
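The top-100 selection step can be sketched as a simple ranking over per-gene importance scores. This is a minimal illustration under stated assumptions: the scores would come from a fitted model (for example the `feature_importances_` attribute of a fitted `xgboost.XGBClassifier`), the source does not say which importance type was used, and the function name and gene labels below are hypothetical.

```python
def top_k_features(importance, k=100):
    """Return up to k feature names ranked by descending importance score.

    `importance` maps a feature (gene) name to its importance score.
    Ties are broken alphabetically so the result is deterministic, and
    features with a zero score are dropped because the model never used them.
    """
    ranked = sorted(importance.items(), key=lambda item: (-item[1], item[0]))
    return [gene for gene, score in ranked[:k] if score > 0]


# Hypothetical importance scores for four placeholder genes.
scores = {"GENE_A": 0.40, "GENE_B": 0.0, "GENE_C": 0.35, "GENE_D": 0.25}
top_two = top_k_features(scores, k=2)
top_all = top_k_features(scores, k=10)
```

With the toy scores, `top_two` holds the two highest-scoring genes, and `top_all` holds three genes because the zero-importance gene is excluded even though k is larger than the number of features.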
Assessment of gene ontology and pathway enrichment analysis

Because Gene Ontology (GO) provides a comprehensive description of protein functions, it is considered one of the essential components of physiological description. GO is a controlled, structured vocabulary whose entries are called GO terms^44. A GO analysis usually yields an ordered list of GO terms with a P-value for each term^45. Pathway analysis is an effective method for identifying differentially acting genes, proteins, and metabolites produced by current high-throughput screening, and it is also useful in studying physiology^46. It is used in genome-wide association research and other genomics tools for the preliminary identification and understanding of a diseased or physiological state^47. Ontologies and pathways are therefore essential components of a comprehensive physiological analysis. We used a gene set enrichment strategy to identify pathways with the web-based tool EnrichR. Five pathway resources were queried with the DEGs of TB. Figure 5 displays the top 20 signaling pathways. Table 3 lists the top 10 terms each for biological processes and cellular components and the top 4 for molecular functions. Both the GO terms and the pathways are filtered by adjusted P-value, with a threshold of less than 0.05, and the results are arranged in ascending order.

Fig. 5. An overview of pathway enrichment for tuberculosis DEGs. The Y-axis shows the signaling pathways and the X-axis the negative log10 P-value. Asthma has the highest negative log10 P-value.

Table 3. Investigation of DEGs through an ontological standpoint.
| GO category | GO ID | Term | P-value | Genes |
|---|---|---|---|---|
| Biological process | GO:0002842 | Positive regulation of T cell-mediated immune response to tumor cell | 0.000245087 | HLA-DRB3;HLA-DRB1 |
| | GO:0002839 | Positive regulation of immune response to tumor cell | 0.000366432 | HLA-DRB3;HLA-DRB1 |
| | GO:0002840 | Regulation of T cell-mediated immune response to tumor cell | 0.000366432 | HLA-DRB3;HLA-DRB1 |
| | GO:2000514 | Regulation of CD4-positive, alpha–beta T cell activation | 0.000511333 | HLA-DRB3;HLA-DRB1 |
| | GO:2000516 | Positive regulation of CD4-positive, alpha–beta T cell activation | 0.001581045 | HLA-DRB3;HLA-DRB1 |
| | GO:0002399 | MHC class II protein complex assembly | 0.002165765 | HLA-DRB3;HLA-DRB1 |
| | GO:0002503 | Peptide antigen assembly with MHC class II protein complex | 0.002165765 | HLA-DRB3;HLA-DRB1 |
| | GO:0002501 | Peptide antigen assembly with MHC protein complex | 0.003594237 | HLA-DRB3;HLA-DRB1 |
| | GO:0002381 | Immunoglobulin production involved in immunoglobulin mediated immune response | 0.004004046 | HLA-DRB3;HLA-DRB1 |
| | GO:0046635 | Positive regulation of alpha–beta T cell activation | 0.004004046 | HLA-DRB3;HLA-DRB1 |
| Cellular component | GO:0042613 | MHC class II protein complex | 0.002490834 | HLA-DRB3;HLA-DRB1 |
| | GO:0030662 | Coated vesicle membrane | 0.002635341 | TEPSIN;HLA-DRB3;HLA-DRB1 |
| | GO:0042611 | MHC protein complex | 0.004885387 | HLA-DRB3;HLA-DRB1 |
| | GO:0030134 | COPII-coated ER to Golgi transport vesicle | 0.006547998 | TMED3;HLA-DRB3;HLA-DRB1 |
| | GO:0098553 | Lumenal side of endoplasmic reticulum membrane | 0.008008107 | HLA-DRB3;HLA-DRB1 |
| | GO:0032588 | Trans-Golgi network membrane | 0.013821758 | TEPSIN;HLA-DRB3;HLA-DRB1 |
| | GO:0012507 | ER to Golgi transport vesicle membrane | 0.030955668 | HLA-DRB3;HLA-DRB1 |
| | GO:0030658 | Transport vesicle membrane | 0.036310368 | HLA-DRB3;HLA-DRB1 |
| | GO:0030669 | Clathrin-coated endocytic vesicle membrane | 0.045552255 | HLA-DRB3;HLA-DRB1 |
| | GO:0030666 | Endocytic vesicle membrane | 0.045588079 | HVCN1;HLA-DRB3;HLA-DRB1 |
| Molecular function | GO:0032395 | MHC class II receptor activity | 0.000870867 | HLA-DRB3;HLA-DRB1 |
| | GO:0023026 | MHC class II protein complex binding | 0.006889074 | HLA-DRB3;HLA-DRB1 |
| | GO:0042609 | CD4 receptor binding | 0.034484249 | HLA-DRB1 |
| | GO:0005388 | P-type calcium transporter activity | 0.039313504 | ATP2B2 |

Protein–protein interactions (PPIs) analysis and hub genes identification

The capacity of compounds to act as drugs and the activity of the target protein are largely determined by protein–protein interactions. Most proteins and genes produce the resulting phenotype through a set of interactions rather than in isolation. Cell-to-cell contacts, regulation of metabolism, developmental control, and other biological functions are all managed by protein–protein interactions (PPIs)^48. The PPI network was analyzed using STRING, and the interaction networks and recurring connections among DEGs were visualized using Cytoscape. Highly connected proteins were designated using topological measures, such as a degree greater than 15. The PPI network (Fig. 6) has 40 nodes, which are the most notable DEGs, and 76 edges connecting them. Hub genes fall within the top 10% by connectivity and are strongly connected to other members of the network. Because of these interactions, hub genes usually have a major function in biological systems. We used the CytoHubba plugin in Cytoscape to identify the top 20 DEGs, or hub genes. Figure 7 depicts the hub genes identified using the MCC approach: CREBBP, SPR, H2AX, CD84, LILRB2, UTY, TFEB, LILRB1, FOXI2, and HVCN1. The Bottleneck approach identified LILRB2, CD84, LILRB4, HLA-DPB1, HLA-DQB1, HLA-DRB1, LILRB1, C1QB, CD160, and CREBBP as hub genes.

Fig. 6. The PPI network made up of DEGs for tuberculosis. Differentially expressed protein genes are represented by the circular nodes in the picture, and the interactions between the nodes are shown by the edges. The PPI is made up of 40 nodes and 76 edges.
STRING was used to build the PPI network, and Cytoscape was used to visualize it.

Fig. 7. Identification of hub genes within the cluster using cytoHubba: application of the MCC (Maximal Clique Centrality) and Bottleneck algorithms and network comparison. Dark green highlights indicate the links between the top 10 hub genes from each method and additional genes (yellow). The (A) Bottleneck network has 30 nodes and 65 edges, while the (B) MCC network has 22 nodes and 55 edges.

Discovering the miRNAs and transcription factors (TF) that bind to their neighboring DEGs

We used a network-based approach to analyze the regulatory TFs and miRNAs, in order to identify significant expression changes and discover additional signaling molecules associated with the hub proteins. Transcription factors are proteins that control transcription and gene activity in all living organisms^49. MiRNAs are small RNA molecules involved in the post-transcriptional modulation of expression. Figure 8 depicts the interactions between DEGs and TFs, while Fig. 9 shows the relationships between DEGs and miRNAs. The major TF regulators of the DEGs were STAT3, GATA2, KLF4, MYC, FLI1, TP53, REST, HNF4A, FOXP1, POU5F1, SPI1, NANOG, SOX2, PPARG, CREM, GATA2, NFKB1, E2F1, JUN, USF2, PPARG, HOXA5, FOXC1, and YY1 (Table 4). The miRNAs hsa-mir-383-3p, hsa-mir-520h, hsa-mir-520g-3p, hsa-mir-7977, hsa-mir-218-5p, hsa-mir-6499-3p, hsa-mir-30a-5p, hsa-let-7b-5p, hsa-mir-26b-5p, hsa-mir-27a-3p, hsa-mir-129-2-3p, hsa-mir-34a-5p, hsa-mir-1-3p, hsa-mir-16-5p, hsa-mir-124-3p, and hsa-let-7b-5p were identified as post-transcriptional regulators of the DEGs. The transcriptional and post-transcriptional regulatory elements of the differentially regulated genes associated with TB are compiled in Tables 4 and 5, respectively.

Fig. 8. NetworkAnalyst networks of regulatory interactions between DEGs and TFs, built with the (a) ChEA and (b) JASPAR databases. Network (a) contains 47 nodes and 196 edges, while (b) has 31 nodes and 95 edges. Square nodes represent transcription factors, and circular nodes represent the genes connected to them.

Fig. 9. Networks of regulatory interactions between miRNAs and DEGs. Circular nodes represent genes, and square nodes represent the miRNAs linked to them. Network (a) contains 22 nodes and 29 edges, while (b) has 23 nodes and 54 edges; they were constructed using the miRTarBase and TarBase databases, respectively.

Table 4. Overview of transcriptional factor (TF) biomolecules of differentially expressed genes of tuberculosis.

| TF name | Description | Function |
|---|---|---|
| FOXC1 | Forkhead box C1 | Transcription factor activity that binds DNA and transcription factor binding |
| HOXA5 | Homeobox A5 | Transcription factor activity that binds DNA and RNA polymerase II |
| PPARG | Peroxisome proliferator-activated receptor gamma | Transcription factor activity that binds DNA and chromatin linking |
| USF2 | Upstream transcription factor 2, c-Fos interacting | Transcription factor activity that binds DNA and sequence-specific DNA binding |
| JUN | Jun proto-oncogene, AP-1 transcription factor subunit | RNA binding and sequence-specific DNA binding |
| E2F1 | E2F transcription factor 1 | Transcription factor activity that binds DNA and transcription factor binding |
| NFKB1 | Nuclear factor kappa B subunit 1 | Transcription factor activity that binds DNA and sequence-specific DNA binding |
| GATA2 | GATA binding protein 2 | Transcription factor activity that binds DNA and chromatin binding |
| YY1 | YY1 transcription factor | Transcription factor activity that binds DNA and transcription coactivator activity |
| RUNX1 | RUNX family transcription factor 1 | Transcription factor activity that binds DNA and protein homodimerization activity |
| FOXP1 | Forkhead box P1 | Transcription factor activity that binds DNA and sequence-specific DNA binding |
| HNF4A | Hepatocyte nuclear factor 4 alpha | Transcription factor activity that binds DNA and sequence-specific DNA binding |
| REST | RE1 silencing transcription factor | Transcription factor activity that binds DNA and transcription factor binding |
| TP53 | Tumor protein P53 | Transcription factor activity that binds DNA and protein heterodimerization activity |
| FLI1 | Fli-1 proto-oncogene, ETS transcription factor | Transcription factor activity that binds DNA and chromatin binding |
| MYC | MYC proto-oncogene, BHLH transcription factor | Transcription factor activity that binds DNA and RNA polymerase II cis-regulatory region sequence-specific DNA binding |
| KLF4 | KLF transcription factor 4 | Transcription factor activity that binds DNA and transcription factor binding |
| STAT3 | Signal transducer and activator of transcription 3 | Transcription factor activity that binds DNA and sequence-specific DNA binding |
| CREM | cAMP responsive element modulator | Transcription factor activity that binds DNA and core promoter sequence-specific DNA binding |
| SOX2 | SRY-box transcription factor 2 | Transcription factor activity that binds DNA and protein heterodimerization activity |
| NANOG | Nanog homeobox | Transcription factor activity that binds DNA and chromatin binding |
| SPI1 | Spi-1 proto-oncogene | Transcription factor activity that binds DNA and RNA binding |
| POU5F1 | POU class 5 homeobox 1 | RNA binding and sequence-specific DNA binding |

Table 5. Overview of miRNA biomolecules of differentially expressed genes of tuberculosis.
| Name | Description | Function |
|---|---|---|
| hsa-mir-383-3p | MicroRNA 383 | Tumor inhibitor in several cancer types, such as thyroid, kidney, and cervical cancer^1 |
| hsa-mir-520h | MicroRNA 520h | Downregulates death-associated protein kinase 2 (DAPK2) expression^2; decreases side population cells and prevents cell proliferation and invasion in the PANC1 cell line^3 |
| hsa-mir-520g-3p | MicroRNA 520g | Associated with reduced procoagulant and signaling potential of cancer cells^4 |
| hsa-mir-7977 | MicroRNA 7977 | Novel indicator for patients with lung adenocarcinoma; may act as a tumor suppressor in lung cancer^5 |
| hsa-mir-218-5p | MicroRNA 218-1 | Plays a vital role in preventing lung cancer cells from proliferating and migrating, most likely by binding to EGFR^6 |
| hsa-mir-6499-3p | MicroRNA 6499-3p | Predicted to target the NFKBIA gene, which has a role in acute-phase response signaling, including T cell and neutrophil activation^7 |
| hsa-mir-30a-5p | MicroRNA 30a | Increases the paclitaxel sensitivity of lung cancer cells by suppressing BCL-2, a crucial apoptosis regulator, thus promoting chemotherapy-induced apoptosis^8 |
| hsa-let-7b-5p | MicroRNA Let-7b | Inhibits the migration and proliferation of platelet-derived growth factor (PDGF)-stimulated pulmonary artery smooth muscle cells (PASMCs) by modulating IGF1^9 |
| hsa-mir-26b-5p | MicroRNA 26b | Enhanced apoptosis, decreased fibroblasts, reduced lung fibrosis, and decreased mortality in mouse bronchial tissues due to upregulation of miR-26a-5p^10 |
| hsa-mir-27a-3p | MicroRNA 27a | Differentiation of lung fibroblasts into myofibroblasts was promoted by miR-27a-3p suppression and hindered by its overexpression^11 |
| hsa-mir-129-2-3p | MicroRNA 129-2 | Members of the miR-129 family are widely acknowledged to be tumor suppressors, exhibiting reduced expression in several types of cancer^12 |
| hsa-mir-34a-5p | MicroRNA 34a | In the p53 wild-type colon cancer cell line HCT116, miR-34a-5p markedly reduced cell proliferation, migration, invasion, and metastasis^13 |
| hsa-mir-1-3p | MicroRNA 1-1 | Restrained the tumorigenic potential of lung adenocarcinoma (LUAD) cells, as shown by markedly decreased viability, migration, and invasion of LUAD cells upon miR-1-3p upregulation; miR-1-3p was also conspicuously downregulated in human cell lines and LUAD tissues^14 |
| hsa-mir-16-5p | MicroRNA 16-1 | Since miR-16-5p is not highly expressed in breast cancer tissues, it can prevent breast cancer by inhibiting the NF-κB pathway and raising the level of AKT3^15 |
| hsa-mir-124-3p | MicroRNA 124-3 | Low miR-124 expression in breast cancer tissue was linked to poor patient prognosis^16 |
| hsa-let-7b-5p | MicroRNA Let-7b | May prevent glioma cells from migrating, invading, and progressing through the cell cycle^17 |

Potential medication

To understand the molecular components associated with signal transmission, a protein-drug interaction study must be carried out^50. Using NetworkAnalyst with drug-protein interactions from the DrugBank library, we identified 22 prospective drugs targeting frequently occurring DEGs as promising therapeutic options in TB. Figure 10 displays these 22 widely used medicinal substances, which were identified in the protein-drug associations of the TB DEGs: Bevacizumab, Daclizumab, Palivizumab, Natalizumab, Efalizumab, Alefacept, Alemtuzumab, Tositumomab, Ibritumomab tiuxetan, Muromonab, Basiliximab, Rituximab, Trastuzumab, Gemtuzumab ozogamicin, Abciximab, Adalimumab, Etanercept, Cetuximab, N-Acetylserotonin, Biopterin, and 2'-Monophosphoadenosine 5'-Diphosphoribose.

Fig. 10. This figure depicts 22 potential medications for tuberculosis treatment identified through the protein-drug interaction approach.
Among them, 18 drugs target the C1QB gene, while the others interact with the SPR gene. In the diagram, medications are represented by rectangular nodes, and their corresponding gene targets are depicted as circular nodes.

Discussion

Machine learning is becoming increasingly important in bioinformatics as artificial intelligence advances, allowing for in-depth data analysis. While most RNA-sequence data can be processed using machine learning approaches, a tuberculosis dataset was examined in this study because TB is an infectious disease with potentially fatal consequences. We introduced a count-oriented classification workflow for RNA-sequence analysis that uses well-established machine learning techniques to effectively detect tuberculosis. The methods employed here also allow us to handle substantial transcriptome data and make trustworthy inferences about TB proteins. Notably, we sought to create a novel approach by applying the feature-importance technique of the machine learning model to the TB data in order to identify genes with elevated expression in the RNA-sequence data. We also employed a variety of bioinformatics tools, which helps us to understand tuberculosis more fully and pinpoint its related biomarkers.

We used multiple trained machine learning methods, such as Random Forest, AdaBoost, Gradient Boosting, Logistic Regression, Decision Tree, and XGBoost, in our detailed examination of tuberculosis (TB). Figure 3 shows their accuracy, losses, and performance indicators graphically, and Table 2 provides more detail on each of these indicators. We examined pathway enrichment (Fig. 5), Gene Ontology terms (Table 3), protein-protein interactions (Fig. 6), hub-protein interactions (Fig. 7), transcription factor interactions (Fig. 8), miRNA interactions (Fig. 9), and drug-protein interactions (Fig. 10).

There are various reasons why our top model, XGBoost, performed better on our dataset than the others. With integrated L1 and L2 regularization and the capacity to represent intricate non-linear relationships, it excelled at handling high-dimensional data. Its efficiency is further increased by special features including tree pruning, effective column block computation, and internal handling of missing values. The model's edge was probably also influenced by its flexibility in hyperparameter tuning, its resistance to outliers, and its capacity to capture feature interactions. Furthermore, the intrinsic characteristics of some datasets may be more compatible with gradient-boosted trees, indicating that the underlying patterns in our data were especially well suited to XGBoost.

The objective and procedure of the XGBoost model are as follows. For a dataset comprising n samples and m features, the model's prediction for the i-th instance at the t-th iteration is denoted $\hat{y}_i^{(t)}$. In every iteration of XGBoost, the following objective function is optimized:

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \tag{8}$$

where $f_t$ is the tree added at iteration $t$, $l$ is a differentiable convex loss function, and the regularization term $\Omega$ is defined as:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} \tag{9}$$

Here, $w_j$ is the score given to the $j$-th leaf, $T$ is the number of leaves in the tree, and $\gamma$ and $\lambda$ are regularization coefficients. The tree's final structure is determined by minimizing this objective, which amounts to choosing at each node the split with the largest gain:

$$\mathrm{Gain} = \frac{1}{2}\left[\frac{g^{2}}{h+\lambda} + \frac{(G-g)^{2}}{H-h+\lambda} - \frac{G^{2}}{H+\lambda}\right] - \gamma \tag{10}$$

where $G$ and $H$ are the sums of the first- and second-order gradients of the loss function over the instances in the current node, and $g$ and $h$ are the corresponding sums over the instances that reach the left child node.

In bioinformatics, gene ontology (GO) and enrichment analysis are widely used statistical methods that help researchers uncover the biological significance of large gene sets.
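As a rough illustration of the statistic that commonly underlies this kind of enrichment analysis, the sketch below computes a one-sided hypergeometric (over-representation) P value using only the Python standard library. It is an assumption that this test matches the one used by the enrichment tool in this study, which may apply a different statistic or a multiple-testing correction. The function name `enrichment_p_value` and the gene counts in the usage example are hypothetical and are not taken from the study's data.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, overlap: int) -> float:
    """One-sided hypergeometric P value, P(X >= overlap).

    population: number of genes in the background set
    annotated:  background genes annotated with the GO term or pathway
    sample:     number of genes in the query list (e.g. the selected DEGs)
    overlap:    query genes that carry the annotation
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie between 0 and population")
    overlap = max(overlap, 0)
    max_overlap = min(annotated, sample)
    # Count the ways of drawing k annotated and (sample - k) unannotated genes.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / comb(population, sample)


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    p = enrichment_p_value(population=20000, annotated=50, sample=100, overlap=2)
    print(f"P value for the hypothetical term: {p:.4g}")
```

A small P value indicates that the query list contains more annotated genes than would be expected from random sampling of the background.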
In GO, biological processes can involve multiple molecular interactions, the molecular function aspect focuses on molecular-level activities, and the cellular component refers to the location where a gene product exerts its function. Pathway analysis, meanwhile, offers a distinct approach by exploring the connections between physiologically or molecularly complex diseases. It is an efficient way to reveal how an organism reacts to changes occurring within it.

In Table 3, the leading terms for biological process, cellular component, and molecular function are listed in order of their P values. The most significant genes across these three categories are HLA-DRB3 and HLA-DRB1. HLA-DRB1 alleles could represent an increased risk of developing active TB^51. Other important DEGs in these three categories are TEPSIN, TMED3, HVCN1, and ATP2B2. Figure 5 shows the pathway enrichment summary of the tuberculosis DEGs. According to the figure, the top pathways are asthma, allograft rejection, graft-versus-host disease, and type I diabetes mellitus.

In our study, the protein-protein interaction (PPI) network was analyzed using Cytoscape and STRING. Our method combined the robust display and analysis capabilities of Cytoscape with the large PPI database of STRING. By pairing Cytoscape's sophisticated network analysis tools with STRING's curated data, this integration enables a thorough investigation of the interaction network and a deeper, more interpretable picture of protein interactions. In our opinion, this makes our method a useful addition to existing computational techniques, especially where computational resources or data availability are restricted. In contrast to machine learning techniques that may need large amounts of training data and feature engineering, our methodology uses Cytoscape's network measures and STRING's precomputed interaction scores directly to identify key proteins.
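The idea of ranking proteins directly by a network measure can be sketched with node degree, the simplest such measure. This is a minimal sketch only: cytoHubba's MCC and Bottleneck scores used in this study are different, more involved measures, and the helper names `node_degrees` and `hub_genes`, the placeholder gene labels, and the toy edge list are all assumptions made for illustration.

```python
from collections import defaultdict


def node_degrees(edges):
    """Return {node: degree} for an undirected edge list, ignoring self-loops."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {node: len(linked) for node, linked in neighbors.items()}


def hub_genes(edges, min_degree):
    """List (gene, degree) pairs with degree > min_degree, highest degree first."""
    degrees = node_degrees(edges)
    hubs = [(gene, deg) for gene, deg in degrees.items() if deg > min_degree]
    return sorted(hubs, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy network with placeholder labels, not interactions from this study.
    toy_edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G2", "G3"), ("G4", "G5")]
    print(hub_genes(toy_edges, min_degree=2))
```

On a real STRING export, the edge list would come from the interaction table and the cutoff would be set to the degree threshold chosen for the analysis.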
In conclusion, our suggested workflow presents a well-balanced strategy for combining the interactive features of Cytoscape with reliable data from STRING, making PPI network analysis easy to use and understand.

Hub proteins with high or low expression levels in TB patients have been found. TB quickly triggers the transcription factor cAMP response element-binding protein (CREB), which controls a variety of cellular reactions in macrophages^52. M. tuberculosis challenge experiments using CD84-deficient C57BL/6 mice indicate that CD84 expression probably contributes to immunosuppression of T and B cells during M. tuberculosis pathogenesis and also inhibits B cell activation^53. Myeloid-derived suppressor cells (MDSCs) are immunosuppressive cells that LILRB2 antagonists reprogram to become pro-inflammatory, which kills Mtb^54. Compared to healthy controls, patients with active tuberculosis had reduced CD160 expression in their B cells and monocytes^55.

The TFs we found are connected to TB. Table 4 gives an overview of the TF biomolecules of the differentially expressed genes. NFKB1, one of the most influential hub genes, was identified as a potential therapeutic target in tuberculosis^56. GATA2 is downregulated by sustained stimulation with M. tuberculosis antigen^57. After M. tuberculosis infection, TLR4, which is YY1-activated, may exacerbate mycobacterial damage in human microbes^58.

Tuberculosis is also associated with the miRNAs we discovered. Serum exosomal miR-7977 may be a useful marker that is crucial for lung cancer diagnosis^59. In both MTB-infected and uninfected THP-1 cells, forced overexpression of miR-30a restrained TLR/MyD88 activation and cytokine production^60. miR-26b-5p might function as a biomarker for intraocular tuberculosis (IOTB)^61.
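Returning to the XGBoost split criterion described with Eqs. 8 to 10 earlier in this Discussion, the following sketch evaluates the gain of a single candidate split numerically. It assumes the standard second-order XGBoost formulation, in which G and H are the sums of first- and second-order gradients in the current node and g and h are the corresponding sums for the left child. The function names, the values of λ and γ, and the toy gradients are illustrative assumptions, not values from this study.

```python
def leaf_weight(G: float, H: float, lam: float = 1.0) -> float:
    """Optimal leaf score w* = -G / (H + lambda) for a leaf with gradient sums G, H."""
    return -G / (H + lam)


def split_gain(g: float, h: float, G: float, H: float,
               lam: float = 1.0, gamma: float = 0.0) -> float:
    """Gain of splitting a node into a left child (g, h) and a right child (G - g, H - h)."""
    left = g * g / (h + lam)
    right = (G - g) ** 2 / (H - h + lam)
    parent = G * G / (H + lam)
    return 0.5 * (left + right - parent) - gamma


if __name__ == "__main__":
    # Toy node with squared-error loss: gradient = prediction - label, hessian = 1.
    labels = [1.0, 1.0, 0.0, 0.0]
    prediction = 0.5
    grads = [prediction - y for y in labels]
    hess = [1.0] * len(labels)

    G, H = sum(grads), sum(hess)
    g, h = sum(grads[:2]), sum(hess[:2])  # candidate split: first two samples go left

    print("gain:", split_gain(g, h, G, H, lam=1.0, gamma=0.0))
    print("left leaf weight:", leaf_weight(g, h, lam=1.0))
```

A split is worth keeping only when its gain is positive, so a larger γ prunes more aggressively and a larger λ shrinks the leaf scores.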
We want to delineate the particular benefits of machine learning strategies such as AdaBoost, XGBoost, and Logistic Regression classifiers in comparison to more conventional approaches such as R-based statistical methodologies and other well-established bioinformatics tools. Machine learning models, especially ensemble techniques like AdaBoost and XGBoost, combine multiple weak learners to create a strong prediction model. Compared to single statistical methods, this approach frequently yields better generalization and higher accuracy. The ML classifiers were able to identify complicated, non-linear correlations in the data, which improved the accuracy of identifying differentially expressed genes. Gene expression profiles, genetic variants, and pathway information are just a few of the data types that our machine learning algorithms effectively manage and integrate. Traditional approaches, even strong ones, can have trouble integrating diverse data sources. With their capacity to process and learn from intricate, multi-dimensional datasets, machine learning techniques offer a more comprehensive perspective and reveal interactions that traditional tools would overlook. Because machine learning algorithms are flexible, they can find new patterns and connections in the data that might be missed by conventional tools. For instance, when we used XGBoost to identify hub genes, we were able to identify possible important regulators that were previously missed by traditional centrality metrics. Large-scale genomic investigations can benefit from our ML algorithms' better computing efficiency and quicker processing times compared to other conventional methods.

For TB patients, our identified potential therapeutic compounds may be beneficial.
Treatment with bevacizumab reduces hypoxia in TB granulomas, enhances small molecule delivery, and encourages vascular normalization^62. It remains unclear how daclizumab, palivizumab, natalizumab, efalizumab, tositumomab, alefacept, muromonab, and the other candidates act in TB. Thus, preclinical and clinical trials together with additional research are needed.

Conclusion

Our study employs a machine learning method with feature-importance tools to detect important genes, with the aim of confronting the global challenge of tuberculosis and improving the well-being and livelihoods of individuals. We explored both machine learning techniques and conventional bioinformatics methods. The integration of machine learning algorithms, statistical analysis, and bioinformatics tools gives us a more comprehensive understanding of the molecular pathways behind tuberculosis. This approach allows us to identify expressed genes more effectively, moving beyond reliance only on conventional bioinformatics methods focused on detecting DEGs. In addition, when dealing with a very large number of features, a machine learning approach proves to be effective as well as more reliable in predicting DEGs. We analyzed vast transcriptomic data successfully using multiple ML models. The XGBoost model performed strongly, achieving 96.3% accuracy. This suggests that ML can enhance TB diagnosis and may support tailored treatments in clinical practice. In our TB biomarker study, we analyzed various molecular components: 20 pathways, 24 gene ontologies, 23 transcription factors, 20 hub genes, 16 miRNAs, and 22 possible drugs. We validated our findings against previous studies. The three main pathways are hematopoietic cell lineage, Th1 and Th2 cell development, and antigen processing and presentation (Fig. 5).
The reported hub proteins, such as CD84, LILRB2, LILRB4, and CD160 (Table 6), might provide promising avenues for advancing disease research and therapy. Similarly, transcription factors like FOXC1, GATA2, YY1, SPI1, MYC, and SOX2 (Fig. 8), together with miRNAs like hsa-mir-7977, hsa-let-7b-5p, hsa-mir-26b-5p, hsa-mir-16-5p, hsa-mir-124-3p, and hsa-mir-1-3p (Fig. 9), shed light on the regulatory dynamics at the core of TB. Moreover, our study pinpointed prospective drug candidates such as biopterin, oxaloacetate ion, bevacizumab, and other compounds with potential for therapeutic application (Fig. 10). Although more preclinical and clinical research is required to confirm their safety and efficacy profiles, these discoveries present opportunities to transform TB therapy approaches and improve patient outcomes. This research marks a notable step forward in using machine learning to detect expressed genes from RNA-sequence count data. By combining essential bioinformatics techniques and verifying results against previous studies on TB, we provide a strong method for understanding and tackling TB at the molecular level.

Table 6. Hub genes among the DEGs identified by analysis of protein-protein interactions.
| Gene symbol | Description | Feature |
|---|---|---|
| CREBBP | CREB binding protein | DNA-binding transcription factor activity and transcription factor binding |
| SPR | Sepiapterin reductase | Oxidoreductase activity and aldo-keto reductase (NADP) activity |
| H2AX | H2A.X variant histone | Plays a central role in transcription regulation, DNA repair, DNA replication, and chromosomal stability |
| CD84 | CD84 molecule | Signaling receptor activity |
| LILRB2 | Leukocyte immunoglobulin like receptor B2 | Signaling receptor activity and protein phosphatase 1 binding |
| UTY | Ubiquitously transcribed tetratricopeptide repeat containing, Y-linked | Dioxygenase activity and histone H3K27me2/H3K27me3 demethylase activity |
| TFEB | Transcription factor EB | DNA-binding transcription factor activity and protein dimerization activity |
| LILRB1 | Leukocyte immunoglobulin like receptor B1 | Protein homodimerization activity and protein phosphatase 1 binding |
| FOXI2 | Forkhead box I2 | DNA-binding transcription factor activity and DNA-binding transcription factor activity, RNA polymerase II-specific |
| HVCN1 | Hydrogen voltage-gated channel 1 | Monoatomic ion channel activity and voltage-gated proton channel activity |
| LILRB4 | Leukocyte immunoglobulin like receptor B4 | Signaling receptor activity and antigen binding |
| HLA-DPB1 | Major histocompatibility complex, class II, DP beta 1 | Peptide antigen binding |
| HLA-DQB1 | Major histocompatibility complex, class II, DQ beta 1 | Peptide antigen binding and MHC class II receptor activity |
| HLA-DRB1 | Major histocompatibility complex, class II, DR beta 1 | Peptide antigen binding and MHC class II receptor activity |
| C1QB | Complement C1q B chain | Protein homodimerization activity |
| CD160 | CD160 molecule | Signaling receptor binding and MHC class I receptor activity |

Limitations of the study

A limitation of this research is that the precision of the models needs to be improved.
Although our present methodology has yielded some insightful findings, incorporating machine learning models in future research may considerably improve the precision and accuracy of our results. Furthermore, a subset of the biomarkers still needs to be validated against the research evidence currently available. Thorough validation of those biomarkers is crucial, as it may give researchers a better understanding of expressed genes, eventually allowing more accurate analysis and better TB management strategies.

Supplementary Information

Supplementary Information (15.8 KB, docx)

Acknowledgements