Abstract

   Classification of adenocarcinoma (AC) and squamous cell carcinoma (SCC)
   poses significant challenges for cytopathologists, often necessitating
   clinical tests and biopsies that delay treatment initiation. To address
   this, we developed a machine learning-based approach utilizing resected
   lung-tissue microbiome of AC and SCC patients for subtype
   classification. Differentially enriched taxa were identified using
   LEfSe, revealing ten potential microbial markers. Linear discriminant
   analysis (LDA) was subsequently applied to enhance inter-class
   separability. Next, benchmarking was performed across six different
   supervised-classification algorithms viz. logistic-regression,
   naïve-bayes, random-forest, extreme-gradient-boost (XGBoost), k-nearest
   neighbor, and deep neural network. Notably, XGBoost, with an accuracy
   of 76.25% and an AUROC (area under the receiver operating
   characteristic curve) of 0.81, with 69% specificity and 76%
   sensitivity, outperformed the other five classification algorithms
   using LDA-transformed features. Validation on
   an independent dataset confirmed its robustness with an AUROC of 0.71,
   with minimal false positives and negatives. This study is the first to
   classify AC and SCC subtypes using lung-tissue microbiome.

   Subject terms: Programming language, Cancer, Biomarkers

Introduction

   Adenocarcinoma (AC) and squamous cell carcinoma (SCC) are two major
   subtypes of non-small-cell lung cancer (NSCLC), accounting for more
   than 80% of lung cancer (LC) cases^[32]1,[33]2. A late diagnosis of
   NSCLC at an advanced stage, a high rate of metastasis, and the
   development of therapy resistance are responsible for approximately 95%
   of mortality^[34]3. Despite advancements in therapeutic strategies,
   NSCLC patient survival has not increased significantly. In the United
   States, the 5-year relative survival for patients with stage I disease
   was 65%, whereas for patients with stage IV disease it was less than
   10%^[35]4. Even when patients undergo surgical resection, the
   postoperative recurrence rate remains high: 50–90% of postoperative
   recurrences occur within the first 2 years^[36]5. Moreover, targeted
   and contemporary immune-based therapies have only minimally improved
   NSCLC survival^[37]6–[38]8. The poor survival in NSCLC may be due to
   the underlying heterogeneity and variance in the cell
   populations^[39]9.

   AC and SCC have significant heterogeneity, different features,
   prognostic consequences, and therapeutic responses^[40]10,[41]11. Owing
   to high heterogeneity and variance in NSCLC subtypes, their early-stage
   classification poses a significant challenge for medical professionals.
   AC originates from the epithelial cells lining alveoli and airways,
   where gaseous exchange occurs. These malignant cells often form
   gland-like structures and produce mucin, a hallmark of AC.
   Histologically, these cells exhibit columnar or cuboidal shapes, with
   hyperchromatic nuclei and prominent nucleoli, and show growth patterns
   such as acinar, papillary, micropapillary, solid, and
   lepidic^[42]12,[43]13.
   Additionally, these cells express specific markers like thyroid
   transcription factor-1 (TTF-1) and napsin A, which aid in
   differentiating AC from other LC types^[44]14. Environmental factors,
   including smoking, air pollution, and exposure to carcinogens such as
   asbestos and radon, are strongly associated with an increased risk of
   AC^[45]15–[46]17. Molecularly, AC is related to mutations in oncogenes
   such as EGFR, KRAS, ALK, and BRAF, driving uncontrolled cellular
   proliferation. Treatment response in AC is primarily determined by
   these genetic mutations, with EGFR inhibitors (e.g., osimertinib) and
   ALK inhibitors (e.g., crizotinib) improving outcomes in targeted
   populations^[47]18,[48]19. Immunotherapies like PD-1/PD-L1 inhibitors
   (e.g., pembrolizumab) have shown significant benefits in patients with
   high PD-L1 expression^[49]20. In contrast, SCC arises from squamous
   epithelial cells in the bronchial airways, typically in the central
   lung regions. It is strongly linked to prolonged exposure to
   carcinogens, particularly cigarette smoke, which induces mutations in
   critical genes like TP53, PIK3CA, FGFR1, and CDKN2A. These mutations
   disrupt cell cycle regulation, apoptosis, and oncogenic signaling,
   promoting tumorigenesis. Specifically, mutations in TP53 and CDKN2A
   impair tumor suppressive functions, disrupting cell cycle control and
   apoptosis. Alterations in PIK3CA and FGFR1 enhance oncogenic signaling
   pathways, promoting uncontrolled cell growth and
   survival^[50]21,[51]22. Other risk factors, including air pollution,
   asbestos, and radon exposure, also contribute to SCC
   development^[52]23. Histologically, SCC is characterized by keratinized
   squamous cells, hyperchromatic nuclei, high nuclear-to-cytoplasmic
   ratios, and keratin pearl formation. This cellular architecture,
   coupled with marked pleomorphism, reflects the aggressive phenotype of
   SCC^[53]24. Growth
   patterns include solid and papillary structures, with occasional
   clear-cell or basaloid features^[54]12.

   Although overall survival rates have improved due to immunotherapy and
   therapies targeting driver mutations, resistance to these therapies
   remains a significant challenge in AC. Furthermore, unlike AC, SCC
   lacks frequent targetable mutations like EGFR or ALK, limiting
   therapeutic options^[55]25. Additionally, only a small portion of NSCLC
   patients achieve sustained and optimal responses to immunotherapy.
   Resistance to immunotherapy presents a critical clinical challenge:
   patients who develop it currently have no effective treatment options
   and, consequently, a poor prognosis. This issue is further exacerbated
   in NSCLC patients diagnosed at advanced stages, who face poor prognosis
   and limited treatment modalities, underscoring the need for novel
   therapeutic strategies. Thus, early detection and optimized treatment
   strategies for NSCLC subtypes are urgently needed, and a deeper
   understanding of the commensal lung microbiome could help achieve
   them.

   The lung microbiota entails a wide range of microorganisms that play a
   pivotal role in NSCLC etiology^[56]19–[57]25. Growing research has
   suggested that an imbalanced lung microbiota, termed “dysbiosis,” can
   cause carcinogenesis through genomic instability, affecting metabolism,
   induction of inflammatory pathways, and immune response in the
   host^[58]22,[59]26–[60]28. For example, studies have shown that certain
   bacteriotoxins and other pro-inflammatory factors from microorganisms
   such as Haemophilus influenzae, E. coli, Enterobacter spp., and the
   genera Moraxella and Legionella drive inflammation of lung tissue and
   contribute to tumor formation, thereby promoting
   carcinogenesis^[61]29. Zheng et al. indicated that Firmicutes and
   Enterobacter are associated with NSCLC through chronic
   inflammation-associated carcinogenesis. This suggests a possible
   causative relationship between specific microbes and cancer
   development^[62]30.
   Su et al. analyzed lung microbiome in NSCLC patients from four
   different sites, including lung tumor tissues, matched para-tumor
   tissues, matched bronchial tissues, and distal normal lung tissues.
   They found the predominance of Firmicutes, Desulfobacterota,
   Proteobacteria, and Bacteroidota and suggested their role in
   carcinogenesis. Moreover, they also found an inverse relationship
   between host physiological conditions like BMI and alpha diversity
   within the lung microbiome. This implies that individuals with higher
   BMI may exhibit lower microbial diversity within their lung microbiome.
   The interplay between host physiological conditions and microbiome
   diversity provides novel clues for mechanistic insights into NSCLC
   development and progression^[63]31.

   Recent research has highlighted significant differences in the
   microbial communities of AC and SCC, which need to be explored for
   their potential as predictive and diagnostic markers for early
   screening of NSCLC, complemented with ML methods. For instance,
   Druzhinin et al. investigated variations in the sputum microbiome among
   patients with AC, SCC, and healthy controls. Using the LEfSe method,
   they identified an increased abundance of genera Streptococcus,
   Bacillus, and Rothia in SCC patients compared to those with AC.
   However, no significant differences in bacterial composition were
   observed between AC patients and healthy controls. Sputum samples from
   SCC patients also exhibited higher levels of genera such as
   Streptococcus, Bacillus, and Peptostreptococcus compared to controls.
   These findings indicate distinct microbiome profiles associated with
   NSCLC histological subtypes^[64]32. Similarly, Gomes et al.
   characterized microbial communities from bronchoalveolar lavage fluid
   (BALF) samples of AC and SCC patients. They identified a predominance
   of phylum Proteobacteria across LC cases, with a distinct enrichment of
   family Enterobacteriaceae in SCC. In AC, the microbiota was
   characterized by an increased abundance of genera Acinetobacter and
   Propionibacterium. These findings suggest distinct microbial profiles
   between AC and SCC, highlighting their potential as biomarkers for
   patient stratification, prognosis, and therapeutic intervention^[65]33.
   Another recent study by Jang et al. evaluated the BALF of AC and SCC
   patients. They suggest most patients exhibit overlapping lung
   microbiome signatures in NSCLC subtypes with varying compositions. They
   also found that the species Actinomyces graevenitzii was predominant in
   SCC, while Haemophilus parainfluenzae, Neisseria subflava,
   Porphyromonas endodontalis, and Fusobacterium nucleatum were enriched
   in AC. This study highlighted the role of microbiome diversity in lung
   homeostasis and its potential association with NSCLC progression and
   prognosis^[66]34. However, while these studies identified significant
   differences between AC and SCC, they did not explore these variations
   for potential microbial markers or NSCLC subtype classification
   complemented with ML-based methods. The reliance on BALF and sputum
   samples also introduces potential contamination from the upper
   respiratory tract and environmental sources, which may confound
   results.

   In this work, we aimed to analyze the significant differences in
   resected lung tumor microbiome composition from AC and SCC patients,
   which are predominant subtypes in NSCLC. Resected lung tissue
   microbiomes more accurately represent the lung’s microbial community as
   they are less influenced by external contaminants such as those from
   the oral cavity or upper airways. Unlike non-invasive methods like
   sputum or bronchoalveolar lavage, which may be contaminated by microbes
   from the upper respiratory tract or environmental particles, resected
   tissue is directly sampled from the lung parenchyma. This minimizes the
   effects of exogenous microbes, providing a more precise and reliable
   profile of the endogenous lung microbiome in NSCLC patients^[67]35.
   Additionally, metadata such as patient age, sex, smoking history, and
   environmental material (malignant or not) were also considered for
   analysis. These variables were not analyzed in earlier studies, but
   they are necessary to predict whether the NSCLC subtype is curable.
   Furthermore,
   we improved existing methodologies by optimizing the selection of
   differential microbial features and integrating relevant metadata. By
   fine-tuning the hyperparameters of existing algorithms, we tailored
   their ability to achieve enhanced stratification of AC and SCC. We
   highlight novel microbial markers and provide deeper insights into
   subtype-specific microbial dynamics associated with the NSCLC subtype.
   This refined approach underscores the unique contributions of this
   study to the understanding of NSCLC microbiome heterogeneity.

Results

Taxonomic profiling of lung microbiota composition in AC and SCC

   The relative taxon abundance in AC and SCC patients was classified and
   analyzed at the phylum, class, order, family, genus, and species
   levels. The per-sample bacterial taxonomic distribution at the genus
   and phylum levels, which varies significantly between individuals, is
   shown in Fig. [68]1. The phylum/genus taxonomic characteristics of the
   lung tissue microbiota in the AC and SCC groups are presented in
   Fig. [69]1A, B, and the extended microbiota taxonomic profiles at the
   class, order, and family levels are given in Supplementary
   Fig. [70]S1. At the phylum level, Proteobacteria (65.99%), followed by
   Deinococcota (9.92%), Firmicutes (9.68%), and Bacteroidota (8.53%),
   were the most common in both the AC and SCC groups, as shown in
   Fig. [71]1A. At the genus level, Acinetobacter (14.64%), Pseudomonas
   (14.21%), and Thermus (9.85%) were the core genera present in the AC
   and SCC samples, as shown in Fig. [72]1B. Additionally, at the family
   level, Pseudomonadaceae (14.21%), Comamonadaceae (13.16%),
   Moraxellaceae (15%), Thermaceae (9.85%), and Weeksellaceae (5.44%)
   were the most abundant families in both groups, as shown in
   Supplementary Fig. [73]S1.

Fig. 1. Taxonomic and diversity analysis.

   Taxonomic characterization of the lung microbiota in NSCLC patients
   with AC and SCC at A Phylum level; B Genus level. Comparison of
   relative taxa abundance of the lung microbiota in NSCLC patients with
   AC and SCC at C Phylum level; D Genus level. (Only the relative
   abundance within each group >0.05% is shown). E Richness was used to
   evaluate the α-diversity in AC & SCC, respectively. The p-values were
   computed using the Wilcoxon rank sum test. F The difference in lung
   microbiota composition between the two groups was displayed by
   principal coordinate analysis (PCoA) based on Bray–Curtis metrics
   (p-value calculated based on the PERMANOVA test). The magenta and
   orange dots represent AC and SCC, respectively.

   Figure [76]1C, D compares the relative taxa abundance of the lung
   tissue microbiota in AC and SCC patients at the phylum and genus
   levels. Only taxa with a relative abundance above 0.05% within each
   group are shown. AC was found to be dominated by the phylum
   Proteobacteria (29.66%), followed by Deinococcota (7.88%) and
   Firmicutes (4.45%). In contrast, SCC was dominated by Proteobacteria
   (36.36%), followed by Firmicutes (5.24%), Bacteroidota (4.20%), and
   Deinococcota (2.05%) as represented in Fig. [77]1C. The genus Thermus
   (7.81%) dominated the microbiome, followed by Acinetobacter (7.07%) and
   Pseudomonas (4.23%) in AC, whereas, in SCC, Pseudomonas (9.98%)
   dominated the microbiome, followed by Acinetobacter (7.57%) and
   Streptococcus (2.64%) as shown in Fig. [78]1D. Additionally, at the
   family level, Thermaceae (7.81%) dominated the microbiome, followed by
   Moraxellaceae (7.31%), Comamonadaceae (6.83%), and Pseudomonadaceae
   (4.23%) in AC, whereas, in SCC samples, Pseudomonadaceae (9.98%)
   dominated, followed by Moraxellaceae (7.68%), Comamonadaceae (6.33%)
   and Sphingomonadaceae (3.20%), which are displayed in Supplementary
   Fig. [79]S2. The shift in microbiome composition across different tumor
   stages I to IIIA is also compared and presented in [80]Supplementary
   material (Supplementary Table [81]S1).

Bacterial diversity and community structure analysis in AC and SCC

   The richness and Shannon diversity index were calculated to measure
   differences in taxonomic diversity among AC and SCC groups.
   Comparatively, the taxonomic diversity of samples from the AC and SCC
   groups was significantly distinct. The richness and Shannon index
   revealed that α-diversity was higher in the AC group than in the SCC
   group, with a statistically significant p-value of 0.002, as shown in
   Fig. [82]1E. To evaluate the variation in microbiome structure between
   AC and SCC, ecological distances were calculated based on the
   Bray–Curtis metric and visualized by principal coordinates analysis
   (PCoA) in Fig. [83]1F. PERMANOVA with 10,000 permutations yielded a
   statistically significant p-value of 4e−04, indicating that overall
   community composition differed between the groups. Nevertheless, the
   PCoA showed no clear visual separation and only relatively weak
   clustering of the AC and SCC samples, consistent with the findings
   reported by Gomes et al.^[84]33.
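
   The diversity measures named in this subsection can be illustrated
   with a minimal sketch. It assumes the textbook definitions of
   richness, the Shannon index, and Bray–Curtis dissimilarity; the
   function names and the two count vectors are hypothetical and are not
   taken from the study, whose exact software settings are not given
   here.

```python
import math


def richness(counts):
    """Number of taxa observed (count > 0) in one sample."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator


# Hypothetical taxon counts for two samples (illustrative only).
sample_a = [10, 10, 10, 10]   # even community
sample_b = [30, 10, 0, 0]     # fewer taxa, dominated by one

print("richness:", richness(sample_a), richness(sample_b))
print("shannon:", shannon(sample_a), shannon(sample_b))
print("bray-curtis:", bray_curtis(sample_a, sample_b))
```

   In this toy case the even sample has both higher richness and a
   higher Shannon index. Group-level significance testing (Wilcoxon rank
   sum, PERMANOVA) is deliberately left out of the sketch.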

Differential microbiota compositions in AC and SCC

   The LEfSe algorithm explores potential bacterial biomarkers by
   detecting taxa with differential abundance between the two groups (AC
   vs. SCC). This identified the specific lung tissue bacterial taxa
   associated with each group. Taxa with an LDA score above 2.0, from the
   phylum to the species level, were considered significantly different,
   as shown in Fig. [85]2A. We found 74 differentially abundant taxa at
   different levels; order Thermales, family Thermaceae, phylum
   Deinococcota, class Deinococci, genus Thermus, etc., were enriched in
   AC. On the other hand, SCC was enriched with the phylum
   Proteobacteria, class Alphaproteobacteria, family Oxalobacteraceae,
   genus Herbaspirillum, etc., as shown in Fig. [86]2A.

Fig. 2. Taxonomic feature selection and correlation analysis for ML
classifiers.

   A Linear discriminant analysis effect size (LEfSe) of the dominant
   bacterial communities in AC and SCC groups. The pink and yellow bars
   indicate the taxa identified in greater relative abundance in patients
   with AC and SCC, respectively, with p < 0.05 and LDA ≥ 2.0. B The
   heatmap shows the Pearson correlation between taxonomic features
   selected from the AC and SCC groups. Features with a correlation
   coefficient > 0.95, likely due to taxonomic relatedness (e.g., a genus
   and an order from the same lineage), were filtered to retain only one
   representative feature per correlated pair. This selection
   process reduced the feature set to 44 taxonomic features for further
   consideration.

   After LEfSe, Pearson's correlation analysis was performed to remove
   highly correlated features, leaving 44 features. For instance, the
   genus Thermus and the order Thermales belong to the same lineage and
   had a correlation coefficient greater than 0.95. Therefore, only one
   feature from each set of highly correlated features was retained. The
   correlation heatmap of these taxonomic features is shown in Fig.
   [89]2B. Next, we selected five taxa for each of the AC and SCC classes
   for our classifier. Thus, a total of 10 taxonomic features were used
   for model training. These ten bacterial features belong to the phyla
   Patescibacteria and Proteobacteria, class Alphaproteobacteria, orders
   Thermales, Micrococcales, and Enterobacterales, family
   Corynebacteriaceae, genera Herbaspirillum and Corynebacterium, and
   species Kocuria palustris. The metadata features included environment
   material (malignant or non-malignant), host sex, host age, and smoking
   history.
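
   The correlation filter described above can be sketched as follows.
   This is an illustration under stated assumptions, not the authors'
   code: the greedy "keep the first feature seen" rule, the function
   names, and the abundance values are all hypothetical, since the text
   does not say which member of a correlated pair was retained.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def drop_correlated(features, threshold=0.95):
    """Greedy filter: keep a feature only if |r| <= threshold
    against every feature that has already been kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical relative abundances across five samples (illustrative only).
features = {
    "g__Thermus":        [0.10, 0.02, 0.30, 0.05, 0.20],
    "o__Thermales":      [0.11, 0.02, 0.31, 0.05, 0.21],  # nearly identical to Thermus
    "g__Herbaspirillum": [0.01, 0.09, 0.02, 0.08, 0.01],
}

selected = drop_correlated(features)
print(selected)
```

   With these toy values the order-level feature is dropped because it
   tracks the genus-level feature almost perfectly, mirroring the
   Thermus/Thermales example in the text.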

Dimensionality reduction and data transformation using LDA

   AC and SCC subtypes have similar lung tissue microbiomes but with
   varying compositions. Applying LDA transformation to our dataset
   reduced its dimensionality while maximizing the separation between AC
   and SCC. Supplementary Fig. [90]S3 shows the dataset after LDA
   transformation. LDA identifies new axes that maximize the inter-class
   distance while minimizing the intra-class distances at the same time.
   In our case, as the number of classes is two, LDA reduced our number
   of features to one. The number of features after LDA transformation
   is at most one less than the number of classes in the dataset. We then
   used the LDA-transformed data to classify the two classes using the
   previously mentioned features.
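
   To make the two-class case concrete, the sketch below computes a
   textbook Fisher discriminant direction, w proportional to
   Sw^-1 (m1 - m0), for two hypothetical features and projects each
   sample onto it, giving a single transformed feature per sample. The
   data, names, and the restriction to two input features are
   assumptions made for brevity; the study's own LDA implementation and
   settings are not specified in this section.

```python
def mean_vec(rows):
    """Column-wise mean of a list of 2-element samples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows, m):
    """2x2 within-class scatter matrix: sum of (x - m)(x - m)^T."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_direction(class0, class1):
    """Direction w = Sw^-1 (m1 - m0) for two classes and two features."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(rows, w):
    """Project each sample onto w, yielding one value per sample."""
    return [r[0] * w[0] + r[1] * w[1] for r in rows]


# Hypothetical two-feature samples for each subtype (illustrative only).
ac = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
scc = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.0]]

w = fisher_direction(ac, scc)
z_ac, z_scc = project(ac, w), project(scc, w)
print("direction:", w)
print("AC projections:", z_ac)
print("SCC projections:", z_scc)
```

   Each sample ends up with one coordinate, which is what "LDA reduced
   our number of features to one" means for two classes. Real
   microbiome data would overlap far more than this toy example.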

Model evaluation

Performance assessment on the original test dataset

   Given the differences in lung microbiome in AC and SCC, we hypothesized
   that the selected bacterial features, together with metadata
   information, might classify these subtypes in patients with LC. To
   this end, we constructed several machine learning classifiers and a
   deep neural network to distinguish AC from SCC using the differential
   features. The top 14 features after LEfSe and LDA selection were
   chosen for the classification task. We employed six different
   supervised classification algorithms, viz. LR, NB, RF, XGBoost, KNN,
   and DNN, to evaluate their performance. As detailed in Table [91]1,
   the performance of these classifiers was evaluated using metrics such
   as accuracy, precision, recall, specificity, sensitivity, and AUROC (a
   detailed description of model parameters is provided in Supplementary
   material Note [92]1). A stacked bar graph representing the performance
   metrics of all models has been included, as shown in Supplementary Fig.
   [93]S4. Among these, XGBoost demonstrated the most balanced and
   reliable performance, with an accuracy of 76.25% and an AUROC of 0.81,
   making it the most robust model for distinguishing between the two
   subtypes. Its superior performance can be attributed to its
   ensemble-based architecture, which leverages gradient boosting to
   minimize prediction errors and enhance model robustness iteratively.
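
   As a reminder of what the AUROC values reported here measure, the
   sketch below computes AUROC directly from its rank interpretation:
   the probability that a randomly chosen positive sample receives a
   higher score than a randomly chosen negative one, with ties counted
   as one half. The labels and scores are hypothetical and are not model
   outputs from this study.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    labels: 1 for the positive class, 0 for the negative class.
    scores: predicted score for the positive class.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores (illustrative only).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]

print(auroc(labels, scores))
```

   An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect
   separation. Unlike accuracy, AUROC does not depend on a single
   decision threshold.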

Table 1.

   Comparison of performance metrics of different classification models on
   the original test dataset
   Classifier  TP  FP  FN  TN  Accuracy  Precision  Specificity  Sensitivity/Recall  F1 score  AUROC
   LR          40  21   1  18  0.72      0.80       0.46         0.72                0.70      0.79
   NB          10   1  31  38  0.60      0.73       0.97         0.61                0.54      0.86
   RF          36  16   5  23  0.74      0.76       0.59         0.73                0.73      0.80
   XGBoost     34  12   7  27  0.76      0.77       0.69         0.76                0.76      0.81
   KNN         22  13  19  26  0.60      0.60       0.67         0.60                0.60      0.60
   DNN         39  18   2  21  0.75      0.80       0.54         0.74                0.74      0.82

   TP True positive, FP False positive, FN False negative, TN True
   negative, AUROC Area under receiver operating characteristics curve, LR
   Logistic regression, NB Naïve Bayes, RF Random forest, XGBoost Extreme
   gradient boost, KNN K nearest neighbor, DNN Deep neural network.
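
   The relationship between the confusion-matrix counts and the reported
   metrics can be checked with a short sketch, shown here for the
   XGBoost row of Table 1. One caveat is our inference and is not stated
   in the source: the tabulated precision and sensitivity/recall values
   appear consistent with averaging over both classes (macro-averaging)
   rather than with the positive-class values alone. The function and
   key names are hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Derive common metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    recall_pos = tp / (tp + fn)   # sensitivity of the positive class
    recall_neg = tn / (tn + fp)   # specificity (recall of the negative class)
    prec_pos = tp / (tp + fp)     # precision of the positive class
    prec_neg = tn / (tn + fn)     # precision of the negative class
    return {
        "accuracy": (tp + tn) / total,
        "specificity": recall_neg,
        "sensitivity_positive_class": recall_pos,
        "macro_recall": (recall_pos + recall_neg) / 2,
        "macro_precision": (prec_pos + prec_neg) / 2,
    }


# Counts taken from the XGBoost row of Table 1.
xgb = metrics(tp=34, fp=12, fn=7, tn=27)
for name, value in xgb.items():
    print(f"{name}: {value:.4f}")
```

   Under this reading, accuracy (0.7625), specificity (0.69), macro
   recall (0.76), and macro precision (0.77) reproduce the tabulated
   values, whereas the positive-class sensitivity computed from the same
   counts is about 0.83.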

   The variations observed across the classifiers’ performance metrics can
   be attributed to their distinct assumptions and underlying
   architectures. For instance, NB, despite achieving the highest
   specificity (97%), showed lower sensitivity (61%) and accuracy (60%),
   limiting its reliability in detecting AC patients. This discrepancy can
   be attributed to NB’s assumption of feature independence, which hinders
   its ability to capture complex feature interactions, leading to poor
   sensitivity. Conversely, RF exhibited moderate accuracy (74%), and an
   AUROC of 0.80, but its overfitting compromised its performance, mainly
   when applied to smaller datasets. This overfitting reduces the model’s
   generalizability, limiting its efficacy on unseen data. KNN, which
   recorded the lowest accuracy (60%) and AUROC (0.60), struggled due to
   its sensitivity to feature scaling and the curse of dimensionality,
   impeding its ability to generalize in high-dimensional spaces. DNN
   achieved a high AUROC of 0.82 but exhibited poor specificity (54%),
   likely due to its need for larger datasets to prevent overfitting.
   Finally, LR, constrained by its linear assumptions, achieved moderate
   accuracy (72%) and AUROC (0.79) but struggled to model the complex,
   non-linear relationships inherent in the data, resulting in suboptimal
   performance.

   A comprehensive evaluation of classifier performance requires the
   consideration of key metrics such as accuracy, precision, recall, F1
   score, and AUROC, all of which are critical for assessing a model’s
   clinical utility. In this regard, XGBoost performed consistently well
   across these metrics, with the highest accuracy (76.25%) and recall
   (76%) and a precision of 77%, making it particularly well-suited for
   clinical applications where both true positive identification and
   avoiding false positives are essential. The F1 score of 0.76 further
   underscores its balanced performance, especially in the context of
   imbalanced datasets. In contrast, while NB exhibited high specificity
   (97%), its lower sensitivity (61%) limited its ability to accurately
   detect AC cases, making it more suitable for contexts where SCC
   detection is prioritized but at the risk of underdiagnosing AC. RF and
   DNN, despite performing well in certain metrics, faced challenges: RF
   suffered from overfitting on small datasets, while DNN, despite a high
   AUROC (0.82), exhibited poor specificity, limiting its clinical
   applicability.

   The clinical utility of these classifiers depends on the balance
   between sensitivity and specificity, as these directly influence the
   reliability of subtype classification for AC and SCC. XGBoost
   demonstrated the best balance, with the highest sensitivity (76%) and
   a specificity of 69%, indicating its robustness in identifying true
   positive cases while maintaining reasonable accuracy in identifying
   negative cases. This makes XGBoost a practical choice for clinical
   applications, where minimizing misclassification of both subtypes is
   critical. The high AUROC of 0.81 further supports XGBoost’s reliability
   in distinguishing between AC and SCC, suggesting its suitability for
   the diagnosis or prognosis of NSCLC patients. However, in clinical
   scenarios where SCC detection is paramount, classifiers like NB, which
   achieved the highest specificity (97%), may be more appropriate,
   despite NB’s lower sensitivity (61%), which could lead to
   underdiagnosis of AC.
   Similarly, DNN, with its high AUROC (0.82), faced challenges with poor
   specificity (54%), limiting its clinical utility in settings where
   balanced performance across subtypes is required. While XGBoost emerged
   as the most reliable classifier overall, we further validated its
   performance using an external dataset to strengthen its robustness and
   ensure its reliability across different datasets of similar ancestry.

Performance assessment on the independent dataset for validation

   We also evaluated the performance of our classifiers in discriminating
   the AC and SCC subtypes using the identified bacterial biomarkers on an
   independent dataset (PRJNA327258). XGBoost showed the highest accuracy
   (64.4%) among the classifiers, as demonstrated by the confusion matrix
   in Supplementary Fig. [95]S6. In addition, XGBoost achieved the
   highest AUROC of 0.71 with 58% specificity and 65% sensitivity, as
   given in Supplementary Fig. [96]S7.
   The performance metrics, including sensitivity, specificity, and F1
   score, reflect the classifier’s balanced capability to discriminate
   between AC and SCC subtypes, reinforcing the robustness of
   microbiome-based biomarkers for subtype classification.

   The successful validation of the XGBoost model on the independent
   dataset PRJNA327258, which shares similar patient demographics and
   metadata with the training dataset, highlights the model’s robustness
   and generalizability. This validation addresses concerns about
   overfitting and confirms that the model’s performance extends beyond
   the original dataset. Both datasets, derived from the Lombardy region
   of Italy and representing European ancestry, minimize genetic and
   environmental variability, ensuring consistency between the datasets.
   Furthermore, using resected tumor samples in both datasets enhances the
   reliability of the comparison.

   This validation demonstrates the model’s capability to accurately
   classify NSCLC subtypes in independent datasets, supporting its
   reliability for analyzing similar populations and sample types.
   Moreover, using independent yet comparable data highlights the
   potential for adapting this approach to broader populations and diverse
   sample types, supporting its generalization across different settings.

Functional enrichment analysis of key bacterial biomarkers

Analysis of metabolic pathway

   The analysis of bacterial communities using PICRUSt2 has elucidated
   differential functional profiles that may influence the progression of
   NSCLC, particularly in AC and SCC. The predicted KEGG-KO analysis,
   visualized through the ggpicrust2 package, reveals significant
   differences between these two subtypes, as depicted in Fig. [97]3G.
   Notably, bacterial biomarkers associated with various pathway classes,
   including genetic information processing and metabolism, have been
   identified, with specific pathways such as carbohydrate, energy, and
   amino acid metabolism exhibiting more significant enrichment in
   adenocarcinoma. In AC, arginine and proline metabolism pathways may
   enhance nitric oxide synthesis, fostering a tumor-supportive
   microenvironment conducive to angiogenesis and immune evasion^[98]36.
   Furthermore, pyruvate metabolism and glycolysis are implicated in the
   Warburg effect, facilitating rapid ATP production and supplying
   biosynthetic precursors essential for tumor progression^[99]37,[100]38.
   Additionally, carbon fixation pathways contribute by providing vital
   biomolecules under nutrient-limited conditions. Conversely, in SCC,
   secondary metabolite biosynthesis and xenobiotic degradation pathways
   appear to play a pivotal role in tumor adaptation. These pathways
   enhance resistance to oxidative stress while facilitating the
   degradation of therapeutic agents, thus conferring drug resistance.
   Recent research underscores the significance of secondary metabolite
   biosynthesis in NSCLC progression, as it enables tumor cells to adapt
   to adverse conditions and diminishes the efficacy of anticancer
   treatments^[101]39,[102]40. Moreover, genetic information processing
   pathways, including transcription, protein folding, and degradation, are
   critical for tumor viability. These pathways help meet the biosynthetic
   demands of rapidly proliferating cancer cells while managing cellular
   stress. The interplay of these pathways suggests that specific
   metabolic functions within the lung microbiome can significantly
   influence NSCLC progression and treatment responses^[103]41. This
   understanding highlights potential therapeutic targets aimed at
   disrupting these tumor-supporting mechanisms.

Fig. 3. Model evaluation and functional enrichment of taxonomic features.

   Evaluation of taxonomic features classification performance using
   confusion matrix in A Logistic regression (LR); B Naïve Bayes
   classifier (NB); C Random forest (RF); D Extreme gradient boost (XGBoost); E K
   nearest neighbor (KNN); F Deep neural network (DNN); G Downstream
   pathway enrichment analysis of the identified taxonomic features using
   PICRUSt2. The columns in red and blue represent the AC and SCC groups,
   respectively.

Comparison with previous findings

   The involvement of the identified bacterial taxa in NSCLC progression
   and development has been supported by several studies. For instance,
   members of the phylum Proteobacteria, including
   Alphaproteobacteria^[106]42, Herbaspirillum^[107]43, and
   Enterobacterales^[108]44, have been reported to influence NSCLC
   progression by modulating immune responses, promoting inflammation, and
   altering the tumor microenvironment. These bacteria may contribute to
   the inflammatory processes crucial in tumor development and
   progression. The genus Thermus from the order Thermales might
   contribute to cancer development in AC through mechanisms such as amino
   acid metabolism and reduced signal transduction, both of which are vital
   for tumor growth^[109]45. Additionally, bacteria from the Corynebacterium
   genus, belonging to the Corynebacteriaceae family, are implicated in
   chronic inflammation and immune evasion, further supporting NSCLC
   progression^[110]44. Kocuria palustris from the order Micrococcales
   might also play a role in local immune modulation and chronic
   inflammation, processes that are critical in fostering tumor
   growth^[111]44. Finally, the presence of Patescibacteria could indicate
   disruptions in metabolic pathways relevant to cancer biology,
   potentially aiding in cancer cell survival and proliferation^[112]46.
   These bacteria may significantly impact NSCLC pathogenesis through
   diverse mechanisms, contributing to tumor development and progression.

Discussion

   The intricate relationship between humans and microbes spans millennia,
   shaping the very fabric of our existence. As we delve into the depths
   of this symbiotic relationship, we uncover a world teeming with
   microbial life—a vast ecosystem known as the human microbiome^[113]47.
   Comprising a diverse array of microorganisms, including bacteria,
   yeasts, archaea, fungi, protozoa, and viruses, the microbiome
   orchestrates essential functions within our bodies, from immune system
   regulation to digestion and disease prevention^[114]48. At the heart of
   this microbial symphony lies the respiratory tract, a complex landscape
   inhabited by its distinct microbiota. From the upper reaches of the
   nasal cavity and nasopharynx to the depths of the lungs, each region
   harbors a unique community of microorganisms intricately woven into the
   fabric of respiratory health^[115]42,[116]49. In the upper respiratory
   tract, bacteria such as Moraxella, Staphylococcus, Corynebacterium,
   Haemophilus, and Streptococcus reign supreme, forming a protective
   barrier against invading pathogens^[117]50. Meanwhile, the lower
   respiratory tract maintains a lower microbial biomass, essential for
   the delicate balance of gas exchange and mucosal immunology^[118]51.

   Recent advancements unveiled the pivotal role of the lung microbiome in
   the early detection and screening of non-small cell lung cancer
   (NSCLC)^[119]38–[120]40. Armed with state-of-the-art machine learning
   algorithms and 16S rRNA gene sequencing technology, researchers
   embarked on a groundbreaking journey to decipher the intricate
   relationship between the lung microbiome and NSCLC
   subtypes^[121]52,[122]53. By analyzing resected lung tissue samples, a
   source less susceptible to contamination than other sampling methods,
   scientists sought to unravel the unique microbial signatures
   associated with adenocarcinoma (AC) and squamous cell carcinoma (SCC)
   subtypes^[123]54. In this work, we attempted to distinguish two major
   NSCLC subtypes using ML-based classification algorithms and 16S rRNA
   gene sequencing data obtained from resected lung tissue samples of 263
   patients of European ancestry. To our knowledge, this is the first
   study to classify NSCLC subtypes using lung microbiome data from
   resected lung tissue samples.

   Our findings showed significant differences in microbial composition
   between AC and SCC subtypes, with distinct taxa prevailing in each.
   Interestingly, there was a substantial difference in the relative
   abundance of Firmicutes, Bacteroidota, and Deinococcota in AC and SCC.
   At the genus level, Thermus was found to be dominant in AC, while
   Pseudomonas was dominant in SCC patients. A trend toward differentiation
   between the two subtypes was observed in the overall structure of the
   microbial communities based on α diversity. This suggested a
   significant difference in the richness and diversity of microbiota in
   the lung tissue samples of the AC and SCC patients. This observed
   heterogeneity in AC and SCC might be due to differences in the smoking
   status, age, sex, and environment (malignant or not) of
   patients^[124]52,[125]55. In line with a previous study^[126]33, we
   also observed no difference in the β diversity between AC and SCC.

   Further, through the meticulous application of advanced analytical
   techniques such as Linear Discriminant Analysis Effect Size (LEfSe) and
   Pearson correlation analysis, we identified taxonomic features that
   could serve as potential biomarkers for subtype classification. We
   found that the AC group was associated with a significantly higher
   amount of order Thermales, followed by the family Corynebacteriaceae
   and order Micrococcales. Conversely, the species Kocuria palustris,
   followed by the phylum Patescibacteria, was differentially abundant in
   SCC. Pearson correlation was applied to remove the highly
   correlated features, and after selecting the top five features of each
   of the AC and SCC subtypes, we obtained ten taxonomic features, including
   order Thermales, family Corynebacteriaceae, order Micrococcales, species
   Kocuria palustris, phylum Patescibacteria, and five others, which were
   used for classifying AC and SCC.

   We then applied LDA, a supervised machine learning algorithm, to reduce
   the dimensionality of the dataset while also maximizing the inter-class
   distances and minimizing the intra-class distances between AC and SCC.
   Leveraging these findings, a classification approach using six
   supervised ML algorithms was built for subtype classification. We used
   metadata including age, sex, malignant status, and smoking history as
   features, together with the ten LDA-transformed differential taxonomic
   features, to train our classification models with hyperparameters
   optimized by Bayesian optimization. Our benchmarking analysis showed the
   XGBoost algorithm to be the best performer with an accuracy of 76.25%,
   likely due to its ensemble-based approach, which is particularly
   effective at capturing the unique characteristics of AC and SCC. Our
   work revealed ten taxonomic features and four metadata features
   potentially able to classify AC and SCC.

   Moreover, our analysis revealed differential metabolic pathways
   associated with AC and SCC, which shed light on the role of the lung
   microbiome in cancer progression and development. Notably, pathways
   such as pyruvate metabolism, glycolysis, and carbon fixation
   were predominant in the AC group. In contrast, pathways related to
   amino acid and carbohydrate metabolism, such as the TCA cycle, were
   predominant in the SCC group. Emerging evidence demonstrates that
   specific cancer cells, particularly those with deregulated oncogene and
   tumor suppressor expression, rely heavily on the TCA cycle for energy
   production and macromolecule synthesis^[127]56. Increased nucleotide
   metabolism, a critical pathway for DNA replication, RNA synthesis, and
   cellular bioenergetics, suggests uncontrolled growth of tumors and is a
   hallmark of NSCLC^[128]57. Further studies are required, focusing on
   the mechanism of these associated pathways for a better understanding
   of host-microbiome interactions in AC and SCC.

   While these findings represent a significant advance in understanding
   the lung microbiome’s role in NSCLC subtypes, further exploration of
   microbiome-based biomarkers is needed to translate these findings into
   clinical practice. The identified lung microbiome, as a potential
   classifier for AC and SCC, holds promise for diagnostic applications.
   Since these microbiomes were derived from resected tumors, more
   representative of the actual lung microbial communities, their presence
   can typically only be detected in patients who have undergone biopsy.
   To address this limitation, we compared identified bacterial
   microbiomes from resected tumor samples across various minimally
   invasive or non-invasive sample sites, including sputum, saliva, and
   BALF. Our findings show that Proteobacteria was also detected in these
   sample sites, underscoring its potential as a diagnostic biomarker for
   AC^[129]35. Future longitudinal studies with larger sample sizes,
   conducted across diverse geographical regions, should aim to compare
   microbiome profiles from the upper respiratory tract (e.g., saliva and
   sputum) and lower respiratory tract (e.g., BALF, bronchial washing
   fluid, protected specimen brushing, lung tissue, and resected lung
   tumor tissue) within the same individuals. Additionally, these studies
   could implement our ML-based classification method to detect NSCLC
   subtypes by identifying common microbiome markers from minimally
   invasive or non-invasive samples. Such efforts would enhance the
   robustness and generalizability of our proposed approach. Integrating
   microbiome profiling in clinical diagnostics could revolutionize cancer
   detection by providing a non-invasive method for detecting early-stage
   cancers and assessing treatment responses. One promising avenue for
   integration is the development of microbiome-based diagnostic panels,
   which could be used alongside conventional diagnostic approaches such
   as bronchoscopy or biopsy. These panels would analyze microbial
   signatures in bodily fluids, such as sputum, saliva, or BALF, enabling
   clinicians to detect the presence of disease or track tumor progression
   non-invasively before morphological changes are visible on imaging.

   In addition, future studies must address confounding factors that can
   significantly influence microbiome composition, including antibiotic
   use, medical conditions (e.g., COPD), and environmental or lifestyle
   factors such as smoking, diet, and occupational exposures. Recent
   findings suggest that antibiotic treatment before immunotherapy is
   associated with poorer outcomes in NSCLC patients, underscoring the
   significance of microbial influences on treatment efficacy^[130]58.
   This suggests that antibiotic use can alter microbiome diversity and
   complicate the interpretation of cancer-specific microbial changes.
   Comorbidities, such as COPD, environmental and lifestyle factors,
   smoking history, dietary habits, and occupational exposures, also
   significantly modulate microbiome composition, necessitating their
   inclusion as covariates in analyses^[131]26,[132]59. Integrating
   confounder analysis in future studies will enhance the specificity and
   reliability of microbiome data, improving the clinical utility of
   microbial signatures for NSCLC diagnostics and prognostics.

   ML methods offer significant potential for classifying NSCLC subtypes,
   and future studies should focus on advancing key areas to enhance their
   clinical applicability. One critical area is reducing bias in model
   training. This can be achieved by expanding sample sizes across diverse
   geographical regions and incorporating various sample types from the
   upper and lower respiratory tracts using non-invasive and minimally
   invasive methods. Additionally, ensuring adequate representation of
   patients across all age groups, geographical locations, and ethnicities
   will further enrich the training data, enhancing model robustness and
   generalizability. Such diversity will make the model more reliable and
   applicable to a wide range of clinical settings. Another key focus
   should be optimizing model generalization to address the challenge of
   overfitting, where models perform well on training data but struggle
   with unseen, real-world data. Future research should prioritize
   hyperparameter optimization to balance bias and variance. In our study,
   we employed a Bayesian optimization algorithm, which improved model
   robustness and performance across various datasets. By addressing these
   factors,
   future ML models for NSCLC subtype classification will be more reliable
   and applicable in clinical practice.

   In conclusion, our study addresses key limitations of previous
   research, such as the reliance on non-invasive samples prone to
   contamination and the lack of ML integration for subtype
   classification. By analyzing resected lung tumor tissue and
   incorporating clinical metadata, we provide a more accurate
   representation of the lung microbiome. Our approach, leveraging
   optimized feature selection and fine-tuned ML models, identified novel
   microbial markers and achieved enhanced stratification of AC and SCC.
   These findings advance our understanding of subtype-specific microbial
   dynamics and highlight the potential avenues for further research into
   microbiome-based biomarkers for NSCLC.

Methods

   The data analysis workflow of our study is illustrated in Fig. [133]4
   and described in detail below. Figure [134]4A gives an overview of
   the study design, starting from data collection and preprocessing,
   followed by feature selection and model building, and ending with
   evaluation of the model on selected metrics. All statistical analyses
   were performed using R software (version 4.3.1) and Python (version
   3.6.7), as applicable. All the data analysis code is publicly available
   in the GitHub repository (URL:
   [135]https://github.com/kashpk/Lung-microbiome-biomarker-analysis).

Fig. 4. Study design and ML strategy.

   A Overview of the proposed methodology for ML-based lung microbiome
   data analysis, including steps like data collection, its preprocessing,
   feature selection, model building & cross-validation, and its
   evaluation; B Summary of hyperparameter tuning using the Bayesian
   optimization algorithm. (“Created with BioRender.com”).

Data collection

   The raw 16S rRNA sequencing data on NSCLC patients was collected from
   the NCBI database using the sequence read archive (SRA) toolkit
   (version 2.9.1). The complete dataset is available at NCBI under the
   identifier [138]PRJNA303190. This data was collected from patients of
   European ancestry (Italy) using Illumina high-throughput sequencing to
   target the V3-V4 region of the 16S rRNA gene. We used lung microbiome
   data collected from their resected lung tissue samples. This dataset
   includes 294 samples, with 159 AC and 135 SCC. Moreover, patient
   metadata, such as age, sex, smoking history, and environmental material
   (malignant or not), were considered for analysis. To clarify,
   non-malignant and tumor samples were segregated and treated as two
   distinct classes for stratification in our analysis. The patient’s
   NSCLC stage information was obtained directly from the original study’s
   authors via email^[139]45. Most patients were in the early stages
   (i.e., stages I to IIIA). We also used another independent dataset with
   the identifier [140]PRJNA327258 to validate our findings. This dataset
   comprises 97 samples, with 43 AC and 50 SCC. This dataset was selected
   due to its similarity in characteristics to the training data,
   including patient demographics, geographic region (Lombardy, Italy),
   European ancestry, and metadata relevant to NSCLC subtype
   classification. Both datasets include samples derived from resected
   tumors. The PRJNA327258 study was unpublished and was available
   separately in SRA.

Data pre-processing and taxonomy profiling

   The 16S rRNA gene sequencing data was processed using a standardized
   pipeline implemented in the QIIME2 tool^[141]60. Briefly, paired-end
   demultiplexed reads were joined using fastq-pair^[142]61 followed by
   primary and secondary level filtering as described by Bokulich et
   al.^[143]62. First, the reads with a Phred score < 30 were removed.
   Then, the remaining reads were checked for ambiguous base calls and
   adapters. In addition, reads were trimmed to a uniform length of
   300 bp, with any reads shorter than 300 bp excluded from the analysis.
   In total, 26 of the 294 samples were removed based on the criteria
   mentioned above. Next, the Deblur approach was used to obtain error-free
   biological sequences referred to as amplicon sequence variants
   (ASV)^[144]63. The denoised ASV data from 268 samples (136 AC and 132
   SCC) was obtained after quality filtering and denoising. The ASVs
   were then assigned a taxonomic lineage using a sklearn
   classifier pre-trained on the Silva database (version 138-99)^[145]64.
   Sequence reads originating from chloroplast or human mitochondrial
   DNA were excluded from further analysis. A total of 5 samples that did
   not have metadata information on smoking status were removed, resulting
   in 263 (136 AC and 127 SCC) samples for further analysis. Then,
   richness indices were computed as a measure of α-diversity to
   reflect the microbial diversity within a single sample. Similarly,
   β-diversity was assessed based on Bray-Curtis dissimilarity and
   visualized using principal coordinate analysis (PCoA) in R.
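
   As a minimal illustration of the two diversity measures named above,
   the following sketch computes observed richness per sample and the
   pairwise Bray-Curtis dissimilarity matrix that a PCoA would take as
   input. The count table and sample names are hypothetical, and the
   study itself performed these steps in QIIME2 and R, so this is an
   illustration of the definitions rather than the authors' code.

```python
from typing import Dict, List


def richness(counts: List[int]) -> int:
    """Observed richness: number of ASVs with a non-zero count in a sample."""
    return sum(1 for c in counts if c > 0)


def bray_curtis(u: List[float], v: List[float]) -> float:
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


# Hypothetical ASV count table: one row per sample, one column per ASV.
table: Dict[str, List[int]] = {
    "sample_1": [10, 0, 5, 0],
    "sample_2": [8, 2, 0, 0],
    "sample_3": [0, 0, 0, 12],
}

# Alpha diversity (within-sample): observed richness.
alpha = {name: richness(counts) for name, counts in table.items()}

# Beta diversity (between-sample): symmetric dissimilarity matrix.
names = list(table)
beta = [[bray_curtis(table[a], table[b]) for b in names] for a in names]
```

   A value of 0 means two samples have identical composition, and a value
   of 1 means they share no ASVs.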

Feature selection

   The LEfSe^[146]65 method was used to eliminate redundant features and
   select features relevant to AC and SCC. This method uses the
   non-parametric Wilcoxon rank-sum and Kruskal–Wallis tests to find
   statistically significant differences between the AC and SCC groups.
   Features with an LDA score greater than 2.0 and a p-value < 0.05 were
   considered significant and of large effect size. We then eliminated
   features with a correlation coefficient >0.95, retaining only one
   feature from each correlated pair for further consideration. Such
   correlations arose because some features selected by LEfSe were similar
   to one another, for example when they belonged to the same
   classification level.
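
   The correlation filter can be sketched as follows. This is a minimal
   sketch under stated assumptions: the feature names and abundance values
   are hypothetical, the absolute value of the Pearson coefficient is
   compared with the threshold, and the first feature encountered in each
   correlated pair is the one retained. The text does not specify the
   last two choices.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant feature has no defined correlation
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(features: Dict[str, List[float]],
                    threshold: float = 0.95) -> List[str]:
    """Keep a feature only if it is not highly correlated with one already kept."""
    kept: List[str] = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical relative abundances across five samples.
features = {
    "order_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "family_A": [2.1, 4.0, 6.2, 8.1, 9.9],  # nearly proportional to order_A
    "genus_B": [5.0, 1.0, 4.0, 2.0, 3.0],
}

selected = drop_correlated(features, threshold=0.95)
```

   Here the nested taxon `family_A` is dropped because it tracks `order_A`
   almost perfectly, which mirrors the situation in which a parent and a
   child taxon carry the same signal.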

Data transformation using LDA

   The data were transformed to reduce the dimensionality of the
   dataset resulting from feature selection while maximizing inter-class
   distances and minimizing intra-class distances. For this purpose, we
   divided this dataset comprising 263 samples (136 AC and 127 SCC) into
   training and test datasets with a 7:3 ratio. The split was stratified
   to maintain the class ratio in both datasets. The training and test
   datasets comprised 183 and 80 samples, respectively. Next, we fitted
   LDA on the training dataset and used the transformation learned from
   the training data to transform the test dataset.
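
   The key point of this step, learning the projection from the training
   data only and then reusing it unchanged on the test data, can be
   illustrated with a two-class Fisher discriminant. The sketch below is
   restricted to two features so that the matrix inverse can be written
   out by hand; the data and the 0/1 labels are hypothetical, and the
   study's own implementation may differ.

```python
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter(rows: Sequence[Sequence[float]], mu: Sequence[float]) -> List[List[float]]:
    """2x2 within-class scatter matrix: sum of outer products of deviations."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mu[0], r[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fit_lda(X: Sequence[Sequence[float]], y: Sequence[int]) -> List[float]:
    """Fisher direction w = Sw^-1 (mu_1 - mu_0), fitted on training data only."""
    x0 = [x for x, label in zip(X, y) if label == 0]
    x1 = [x for x, label in zip(X, y) if label == 1]
    mu0, mu1 = mean_vector(x0), mean_vector(x1)
    s0, s1 = scatter(x0, mu0), scatter(x1, mu1)
    a, b = s0[0][0] + s1[0][0], s0[0][1] + s1[0][1]
    c, d = s0[1][0] + s1[1][0], s0[1][1] + s1[1][1]
    det = a * d - b * c
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    # Explicit inverse of the 2x2 matrix [[a, b], [c, d]] applied to diff.
    return [(d * diff[0] - b * diff[1]) / det, (-c * diff[0] + a * diff[1]) / det]


def transform(X: Sequence[Sequence[float]], w: Sequence[float]) -> List[float]:
    """Project samples onto the discriminant axis."""
    return [x[0] * w[0] + x[1] * w[1] for x in X]


# Hypothetical training and test data (label 0 and label 1 stand for the two subtypes).
X_train = [(1, 2), (2, 1), (2, 3), (3, 2), (6, 5), (7, 6), (7, 4), (8, 5)]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
X_test = [(2, 2), (7, 5)]

w = fit_lda(X_train, y_train)        # learned from the training set only
z_train = transform(X_train, w)
z_test = transform(X_test, w)        # test set reuses the same projection
```

   Fitting on the training set alone keeps information from the test
   samples out of the learned projection.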

Classification using machine and deep learning methods

   We further implemented six ML/DL algorithms to determine the predictive
   potential of the features selected post-data transformation step for
   classifying NSCLC subtypes. Table [147]2 provides an overview of the
   six ML/DL algorithms used in our analysis. These classification
   algorithms were used from publicly available Python libraries and are
   briefly described below.

Table 2.

   Hyperparameter selection using the Bayesian optimization algorithm for
   different classification models

   Classifier  Hyperparameter              Range                         Selected value
   LR          L2 regularization (λ)       0–1                           0.5
   RF          Max_depth                   1–10                          1
               Min_sample_split            2–50                          6
               Min_sample_leaf             1–50                          11
               Max_leaf_nodes              5–100                         20
               N_estimators                10–100                        88
   XGBoost     Max_depth                   1–10                          1
   KNN         N_neighbors                 1–10                          2
               Weights                     Uniform, distance             Distance
               Distance metric (p)         Manhattan (1), Euclidean (2)  Euclidean (2)
   DNN         Neurons                     1–500                         228, 240, 80, 63, 224
               L2 regularization term (α)  10^−5–10^−2                   0.0010
               Learning rate (lr)          10^−5–10^−1                   0.0016
               Hidden layers               1–5                           5

   LR Logistic regression, RF Random forest, XGBoost Extreme gradient
   boost, KNN K nearest neighbor, DNN Deep neural network.

ML and DL algorithms and hyperparameters tuning

   We used a Bayesian optimization-based algorithm for hyperparameter
   tuning^[149]66. It is an optimization algorithm that utilizes a
   Gaussian process and tries to minimize the loss with respect to the
   hyperparameters. It is computationally more efficient than exhaustive
   search and is designed to approach the global optimum, whereas other
   approaches such as random search^[150]67 and grid search^[151]68 may
   miss it within the same evaluation budget. All the ML
   and DL algorithms were optimized using Bayesian optimization to balance
   model complexity and generalization, thereby reducing the risk of
   overfitting. For each hyperparameter configuration proposed by the
   Bayesian optimization process, we performed 5-fold cross-validation on
   the training dataset, evaluating the model on the held-out fold each
   time to compute the F1 score. The average F1 score across all folds was
   calculated and assigned to the respective hyperparameter configuration
   as its performance metric. This iterative process was repeated for a
   predefined maximum number of evaluations, allowing Bayesian
   optimization to efficiently explore the hyperparameter space and
   identify configurations that maximized the average F1 score. Finally,
   the hyperparameter configuration with the highest average F1 score was
   selected as the optimal setup for model training. This is illustrated
   in Fig. [152]4B. Subsequently, the ML and DL algorithms were trained
   with the optimized hyperparameters using training data and evaluated
   using test data.
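
   The scoring loop described above, in which each candidate configuration
   receives the mean F1 score over stratified folds and the highest-scoring
   configuration is selected, can be sketched as follows. This is an
   illustration under loud assumptions: a fixed list of candidates stands
   in for the proposals that Bayesian optimization would generate, a
   one-dimensional nearest-neighbour rule stands in for the actual
   classifiers, and the data are hypothetical.

```python
from statistics import mean
from typing import List


def f1_score(y_true: List[int], y_pred: List[int], positive: int = 1) -> float:
    """F1 = 2TP / (2TP + FP + FN) for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def stratified_folds(y: List[int], k: int = 5) -> List[List[int]]:
    """Deal the indices of each class round-robin into k folds."""
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(set(y)):
        members = [i for i, v in enumerate(y) if v == label]
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


def knn_predict(train_x, train_y, query, n_neighbors):
    """Majority vote among the n_neighbors closest training points (1-D)."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    votes = [label for _, label in ranked[:n_neighbors]]
    return 1 if 2 * sum(votes) > len(votes) else 0


def mean_cv_f1(x: List[float], y: List[int], n_neighbors: int, k: int = 5) -> float:
    """Average F1 over k folds for one hyperparameter configuration."""
    scores = []
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train_x = [x[i] for i in range(len(x)) if i not in held]
        train_y = [y[i] for i in range(len(y)) if i not in held]
        predicted = [knn_predict(train_x, train_y, x[i], n_neighbors) for i in held_out]
        scores.append(f1_score([y[i] for i in held_out], predicted))
    return mean(scores)


# Hypothetical one-dimensional feature with partly overlapping classes.
x = [0.2, 0.5, 0.9, 1.1, 1.4, 1.6, 1.9, 2.3, 2.6, 3.1,
     2.0, 2.4, 2.8, 3.0, 3.3, 3.7, 4.0, 4.4, 4.9, 5.2]
y = [0] * 10 + [1] * 10

# Assumption: fixed candidates replace the optimizer's proposals.
candidates = [1, 3, 5]
cv_scores = {n: mean_cv_f1(x, y, n) for n in candidates}
best = max(cv_scores, key=cv_scores.get)
```

   In a Bayesian optimization setting, the value returned by `mean_cv_f1`
   would be the objective reported back to the optimizer for each proposed
   configuration.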
     * (i)
       Logistic regression (LR): LR is a standard classification
       algorithm^[153]69 that models the relationship between features and
       labels using a sigmoid function in such a way that the loss is
       minimized. For classifying AC and SCC, LR was trained using the top
       selected features, with L2 regularization (λ) applied to prevent
       overfitting and improve model generalization. The algorithm outputs
       probabilities for each subtype, which are converted into discrete
       labels based on a default threshold.
     * (ii)
       Naïve Bayes classifier (NB): The NB algorithm is a probabilistic and
       generative classification technique based on Bayes' theorem. It
       assumes that the features are conditionally independent given the
       class label (AC or SCC) and computes class-conditional
       probabilities by estimating the likelihood of each feature. In our
       study, the likelihood is approximated using Gaussian
       distributions, where each feature’s mean and standard deviation
       within each class were utilized to estimate the probabilities. NB
       does not require adjustable hyperparameters, making it a simple yet
       effective model for classification tasks with an assumption of
       feature independence^[154]70,[155]71.
     * (iii)
       Random forest classifier (RF): For binary classification of our
       subtypes, we applied an RF classifier, an ensemble learning
       algorithm^[156]72. Here, the RF classifier assessed the input
       features corresponding to the AC and SCC subtypes, constructing
       decision boundaries that best separate the two classes based on the
       features. Each tree independently evaluated the features, and their
       predictions were aggregated to determine the most likely class for
       each test sample. The model’s final decision reflected the majority
       vote from the ensemble of decision trees, effectively combining the
       predictive power of multiple models to improve accuracy and reduce
       overfitting. In this study, we optimized key hyperparameters of the
       RF model using Optuna. These parameters
       included the number of trees (N_estimators), the maximum depth of
       each tree (Max_depth), the maximum number of leaf nodes
       (Max_leaf_nodes), the minimum number of samples required to split a
       node (Min_samples_split), and the minimum number of samples needed
       to be at a leaf node (Min_sample_leaf). The hyperparameters were
       tuned using Stratified K-Fold cross-validation to ensure robust
       evaluation.
     * (iv)
       Extreme gradient boost classifier (XGBoost): XGBoost is an example
       of ensemble learning based on a gradient boosting framework.
       However, unlike the RF algorithm, where multiple decision trees are
       built independently, XGBoost builds trees sequentially. Each new
       decision tree is fitted to the errors of the current ensemble, that
       is, to the gradient of the loss function with respect to the current
       predictions, so that samples the ensemble still predicts poorly
       have the most influence on the next tree. This strategy allows the
       model to focus on hard-to-classify
       instances, improving the accuracy of the ensemble through iterative
       refinement^[157]73. In our study, to perform AC and SCC
       classification, we optimized a key hyperparameter, specifically the
       maximum depth of each tree (Max_depth), which controls the
       complexity of individual trees within the ensemble. During
       training, the XGBoost model sequentially adjusted its decision
       boundaries to minimize the misclassification error, with each
       subsequent tree refining the decision boundaries of the previous
       ones. The final classification decision for each test sample was
       derived from the aggregated predictions of all the trees in the
       ensemble.
     * (v)
       K-nearest neighbor (KNN): This is a non-parametric distance-based
       ML algorithm^[158]74, where a sample is assigned to the class that
       is most common among its nearest neighbors. At prediction time, the
       KNN algorithm assigned each test sample to the class of its closest
       neighbors based on the Euclidean distance, effectively
       distinguishing between the NSCLC subtypes. The model was fitted by
       tuning key hyperparameters, including the number of neighbors
       (N_neighbors), the weighting scheme for neighbors (weights), and
       the distance metric parameter (p). The weights parameter allowed
       for different voting strategies, either uniform or distance-based,
       where closer neighbors were given more influence in the
       classification decision.
     * (vi)
       Deep neural network (DNN): A DNN is a neural network with more than
       one hidden layer. DNNs have wide applications in fields such as
       computer vision and natural language processing. However, there
       is very little research on the application of DNNs in biomarker
       analysis. In our study, we employed DNNs because their multiple
       hidden layers can capture complex, non-linear
       relationships in the data, which are essential for accurately
       predicting NSCLC classification. The performance of the DNN model
       was influenced by its architecture, including the number of hidden
       layers and the number of neurons within each layer. The optimized
       hyperparameters included the L2 regularization
       term (α) to prevent overfitting, the learning rate (lr), batch
       size, and dropout probability (p_drop)^[159]75. During training,
       the DNN adjusted its weights through backpropagation to minimize
       the loss function while incorporating early stopping to avoid
       overfitting. The model’s final classification decision was based on
       the outputs of its learned layers, which effectively differentiated
       between the AC and SCC. The stratified K-fold cross-validation was
       used to evaluate the model’s performance, ensuring a balanced
       representation of both subtypes in each fold.
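
   To make item (ii) concrete, the following sketch implements a Gaussian
   naïve Bayes classifier from the quantities named there: a per-class
   mean and standard deviation for each feature, combined under the
   conditional independence assumption. The two-feature data are
   hypothetical, and the small constant added to the standard deviation
   is an assumption made here for numerical stability.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def fit_gaussian_nb(X: Sequence[Sequence[float]], y: Sequence[str]) -> Dict[str, dict]:
    """Estimate the class prior and per-feature (mean, std) for each class."""
    model: Dict[str, dict] = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        stats = []
        for j in range(len(rows[0])):
            column = [r[j] for r in rows]
            stats.append((mean(column), pstdev(column) + 1e-9))  # avoid zero std
        model[label] = {"prior": len(rows) / len(X), "stats": stats}
    return model


def log_gaussian(value: float, mu: float, sigma: float) -> float:
    """Log density of a normal distribution N(mu, sigma^2) at value."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (value - mu) ** 2 / (2 * sigma ** 2)


def predict(model: Dict[str, dict], x: Sequence[float]) -> str:
    """Return the class with the highest log posterior (up to a constant)."""
    scores = {}
    for label, params in model.items():
        score = math.log(params["prior"])
        for value, (mu, sigma) in zip(x, params["stats"]):
            score += log_gaussian(value, mu, sigma)  # independence: log-likelihoods add
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical two-feature training data for the two subtypes.
X_train = [[1.0, 5.0], [1.2, 4.8], [0.8, 5.2], [3.0, 2.0], [3.2, 2.2], [2.8, 1.8]]
y_train = ["AC", "AC", "AC", "SCC", "SCC", "SCC"]

nb_model = fit_gaussian_nb(X_train, y_train)
pred_a = predict(nb_model, [1.1, 5.1])
pred_b = predict(nb_model, [3.1, 1.9])
```

   Working in log space turns the product of per-feature likelihoods into
   a sum, which avoids numerical underflow when many features are used.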

   To address the “black box” nature of the DNN model, we employed SHAP
   (Shapley Additive Explanations) to gain interpretability and
   transparency in the decision-making process. SHAP provides a robust
   method for explaining ML models by attributing each prediction to the
   contribution of individual features. This approach helped us to
   deconstruct the model’s output and understand which features were most
   influential in classifying the NSCLC subtypes. Using the
   KernelExplainer to compute SHAP values, we examined the impact of the
   top 10 selected microbial features on the model’s predictions. The SHAP
   summary plot was generated to visualize the distribution of these
   contributions, thereby addressing the complexity and opacity inherent
   in DNN. The results, highlighting the relative importance of these
   features, are provided in the [160]Supplementary Material
   (Supplementary Fig. [161]S3).
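
   The idea of attributing a prediction to individual features can be
   illustrated by computing Shapley values exactly for a very small model.
   Note the substitution: KernelExplainer estimates these values
   approximately, whereas the sketch below enumerates every feature subset,
   which is only feasible for a handful of features. The toy model, the
   input, and the all-zero baseline are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(predict: Callable[[List[float]], float],
                   x: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley values; features outside a subset take their baseline value."""
    n = len(x)

    def value(subset: set) -> float:
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a subset of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features: List[float]) -> float:
    """Hypothetical model with one interaction term between features a and c."""
    a, b, c = features
    return 2.0 * a + 1.0 * b + 0.5 * a * c


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

   The attributions sum to the difference between the prediction for `x`
   and the prediction for the baseline, and the interaction term is shared
   equally between the two features involved in it.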

Model evaluation and comparison

   The model’s prediction performance in classifying two groups (AC vs.
   SCC) based on our selected features was assessed using several
   evaluation metrics: recall, precision, F1 score, accuracy, and the area
   under the receiver operating characteristic curve (AUROC), as explained
   below.

   For these definitions, one NSCLC subtype is treated as the positive
   class and the other as the negative class. TP, FP, TN, and FN denote
   the number of test samples falling into each of the following
   categories.

   True Positive (TP): The actual class is positive and the predicted
   class is also positive.

   False Positive (FP): The actual class is negative, but the predicted
   class is positive.

   True Negative (TN): The actual class is negative and the predicted
   class is also negative.

   False Negative (FN): The actual class is positive, but the predicted
   class is negative.

   The evaluation metrics are calculated using the above values:

   Accuracy: It is the most common segmentation and classification metric,
   defined by the ratio of correct predictions to the total number of
   predictions, as shown in Eq. ([162]1),
   \[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]

   Precision: It is the ratio of correctly categorized positive samples to
   the total number of samples predicted as positive (either correctly or
   incorrectly). It is shown in Eq. ([163]2),
   \[ \mathrm{Precision} = \frac{TP}{FP + TP} \tag{2} \]

   Sensitivity/Recall: It is the ratio of correctly classified positive
   samples to the total number of actual positive samples. It measures the
   ability of the model to detect positive samples. It is also called
   recall and is shown in Eq. ([164]3),
   \[ \mathrm{Sensitivity/Recall} = \frac{TP}{FN + TP} \tag{3} \]

   F1 score: It is the harmonic mean of precision and recall. Its value
   is a single number between 0 and 1, with 1 being the best score,
   derived as shown in Eq. ([165]4),
   \[ F_1\ \mathrm{score} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \tag{4} \]

   Specificity: It is the ratio of correctly classified negative samples
   as negative to the total number of actual negative samples. It measures
   the ability of the model to correctly identify non-positive (negative)
   samples, minimizing false positives. It is shown in Eq. ([166]5),
   \[ \mathrm{Specificity} = \frac{TN}{FP + TN} \tag{5} \]

   The receiver operating characteristic (ROC) curve was used to
   graphically illustrate the trade-off between a model’s ability to
   correctly identify positive cases (sensitivity) and its ability to
   correctly identify negative cases (specificity) across classification
   thresholds. The area under this curve (AUROC) summarizes that
   trade-off in a single value.
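
   As a worked example, the sketch below computes these metrics from the
   four confusion-matrix counts using the standard definitions, and
   computes AUROC as the probability that a randomly chosen positive
   sample receives a higher score than a randomly chosen negative one.
   The counts and scores are hypothetical and are not results from this
   study.

```python
from typing import Dict, List


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard metrics derived from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),          # sensitivity
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def auroc(y_true: List[int], scores: List[float]) -> float:
    """Fraction of positive/negative pairs ranked correctly (ties count as 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical confusion-matrix counts for a 20-sample test set.
metrics = classification_metrics(tp=8, fp=2, tn=6, fn=4)

# Hypothetical predicted probabilities for four samples.
area = auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
```

   With these counts, accuracy is 14/20, precision is 8/10, recall is
   8/12, specificity is 6/8, and the F1 score is 16/22. Three of the four
   positive/negative pairs are ranked correctly, so the AUROC is 0.75.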

Pathway analysis

   To investigate the functional composition and potential association of
   specific taxa with NSCLC subtypes, we used the Phylogenetic
   Investigation of Communities by Reconstruction of Unobserved States 2
   (PICRUSt2)^[167]76 tool. The ASV abundance table was used as input to
   predict the metagenome functions. Then, the predicted metagenome
   functions were mapped to the Kyoto Encyclopedia of Genes and Genomes
   (KEGG) Orthology (KO)^[168]77 to find the pathways. The differentially
   enriched pathways were analyzed descriptively using ggpicrust2 in R
   software. This was done to map the predicted functional profiles to
   KEGG pathways and identify enriched or depleted pathways in samples
   associated with AC or SCC.

Supplementary information

   [169]Supplementary material^ (583.6KB, pdf)

Acknowledgements