Graphical abstract [41]graphic file with name ga1.jpg [42]Open in a new tab Keywords: Papillary thyroid carcinoma, Microbiome, Gender, Subtype Abstract While the intratumor microbiome has become increasingly implicated in cancer development, the microbial landscape of papillary thyroid carcinoma (PTC) is essentially uninvestigated. PTC is characterized by varied prognosis between gender and cancer subtype, but the cause for gender and subtype-based dissimilarities is unclear. Women are more frequently diagnosed with PTC, while men suffer more advanced-staged PTC. In addition, tall cell variants are more aggressive than classical and follicular variants of PTC. We hypothesized that intratumor microbiome composition distinctly alters the immune landscape and predicts clinical outcome between PTC subtypes and between patient genders. Raw whole-transcriptome RNA-sequencing, Level 3 normalized mRNA expression read counts, and DNA methylation 450 k sequencing data for untreated, nonirradiated tumor, and adjacent normal tissue were downloaded from the Genomic Data Commons (GDC) legacy archive for 563 thyroid carcinoma patients. Microbe counts were extracted using Pathoscope 2.0 software. We correlated microbe abundance to clinical variables and immune-associated gene expression. Gene-set enrichment, mutation, and methylation analyses were conducted to correlate microbe abundance to characterize microbes’ roles. Overall, PTC tumor tissue significantly lacked microbes that are populated in adjacent normal tissue, which suggests presence of microbes may be critical in controlling immune cell expression and regulating immune and cancer pathways to mitigate cancer growth. In contrast, we also found that microbes distinctly abundant in tall cell and male patient cohorts were also correlated with higher mutation expression and methylation of tumor suppressors. Microbe dysbiosis in specific PTC types may explain observable differences in PTC progression and pathogenesis. These microbes provide a basis for developing specialized prebiotic and probiotic treatments for varied PTC tumors. 1. Introduction Thyroid cancer is the fifth most common cancer in women, and it is estimated that over 12,000 men and 40,000 women will be diagnosed with thyroid cancer in 2021 [43][1]. Advanced-stage thyroid cancer presents itself more frequently in men than women, but the cause of this disparity is unclear [44][2], [45][3]. Despite advancements in diagnostic technology for thyroid cancer, precision medicine and nonsurgical treatment options for papillary thyroid carcinoma (PTC) and its most common subtypes, classical (CPTC), follicular variant (FVPTC), and the more aggressive tall cell papillary thyroid carcinoma (TCPTC) are severely lacking [46][4], [47][5], [48][6], [49][7], [50][8], [51][9]. A recent study identified the presence of microbes in blood and thyroid cancer tissue, suggesting that there exists a predictable and significant abundance of microbes within PTC issues [52][10]. Furthermore, there is mounting evidence that the intratumor microbiome is implicated in cancer progression and mitigation through their metabolites. For example, the absence of folate-producing Bifidobacterium and Lactobacillus was correlated with hypomethylation at regions of the p53 gene in colorectal cancer [53][11]. This research suggests that tissue-specific microbes may play a tumor suppressive role, specifically by causing or preventing the repair of DNA damage. Past microbiome studies have used The Cancer Genome Atlas (TCGA) data and machine learning to identify diagnostic microbial signatures in various cancers but have not elucidated how microbe composition varies in PTC and how microbes interact with pathways to influence cancer progression [54][10]. We hypothesize that the intratumor microbiome in thyroid carcinoma plays a critical role in determining the varying immune landscape across different types of PTC. In this study, we characterized the abundance of microbes in PTC using RNA sequencing data obtained from TCGA. We identified microbes differentially abundant between males and females and between different PTC subtypes. To determine the mechanisms by which specific microbes may influence the immune landscape, we correlated microbe abundance to clinical variables, immune-associated (IA) gene expression, immune-cell expression, and cancer and immunological pathways. Finally, we correlated microbe abundance with known copy number variations (CNV) and mutations and methylation sites ([55]Fig. 1). With this study, we aim to characterize the thyroid carcinoma intratumor microbiome and investigate microbiome’s unique clinical significance in thyroid carcinoma prognosis. Fig. 1. [56]Fig. 1 [57]Open in a new tab Schematic of project workflow and analyses. After obtaining sequencing data from The Cancer Genome Atlas (TCGA), we used the Pathoscope software to extract microbial read counts. We normalized the microbial counts using Aitchison’s log transformation. 1) We performed differential abundance to identify which microbes were significant differentially abundant in tumor vs. normal samples. The overlap in differentially abundant microbial species between various cohorts is demonstrated through venn diagrams: male versus female thyroid carcinoma and classical papillary thyroid carcinoma (CPTC) versus follicular variant papillary thyroid carcinoma (FVPTC) versus tall cell papillary thyroid carcinoma (TCPTC). 2) Three methods of contamination correction were used to identify potential contaminants among the significantly differentially abundant microbes we identified in Step 1. We performed correction by sequencing plate, sequencingdata, and abundance in one individual sample vs. all samples. We also compiled a list of common contaminants from previously published research and eliminated these microbes from further study. 3) MACIS (distant Metastasis, patient Age, Completeness of resection, local Invasion, and tumor Size) score and clinical variable analysis of differentially abundant microbes were performed using the Kruskal-Wallis statistical test. Clinical variable data (pathologic stage, TNM stage) for each sample were obtained from the Broad Institute GDAC Firehose database. 4) Correlation of microbial abundance to immune-associated gene expression and immune cell infiltration: We performed gene differential expression analysis on Level 3 normalized mRNA expression read count data obtained from Genomic Data Commons legacy archive/TCGA. We then correlated microbe abundance with dysregulated immune-associated gene expression using Kruskal-Wallis and visualized and categorized the data using Cytoscape Reactome FIViz software. We used CIBERSORTx software to obtain immune cell expression correlated with microbe abundance. 5) We used Gene Set Enrichment Analysis to identify immunological and oncogenic signatures correlated with microbial abundance. 6) We used the Repeated Evaluation of Variables conditionAL Entropy and Redundancy (REVEALER) to identify chromosomal amplifications, deletions, and mutations correlated with microbial abundance. 7) We used DNA methylation 450k sequencing data and a modified workflow of methylationArrayAnalysis from Bioconductor to identify differentially methylated sites in thyroid carcinoma. We correlated differentially methylated sites with microbial abundance using the Kruskal-Wallis statistical test. 2. Materials and methods 2.1. Data acquisition Raw whole-transcriptome RNA-sequencing, Level 3 normalized mRNA expression read counts, and DNA methylation 450 k sequencing data for untreated, nonirradiated tumor and adjacent normal tissue were downloaded from the Genomic Data Commons (GDC) legacy archive for 563 thyroid carcinoma (THCA) patients (354 CPTC, 101 FVPTC, 35 TCPTC, 135 male, 366 female tumor samples) [58][12], [59][13]. The GDC provides web-based access to data and metadata from cancer genomics studies aggregated in TCGA, and The Broad Institute developed a suite of tools and pipelines to assist in aggregating and analyzing various types of large-scale genomic and proteomic data. Clinical information for all patients was downloaded from Broad GDAC Firehose [60][14]. Genomic alteration information for each patient was obtained from the last analysis report (2016) of the Broad Institute TCGA Genome Data Analysis Center [61][15]. 2.2. Extraction of microbial reads RNA-sequencing data were filtered for bacterial reads via direct alignment to a reference library of bacterial sequences using the Pathoscope 2.0 software and the NCBI nucleotide database [62][16]. To account for the inherent compositionality of microbe data, we performed a ratio transformation of the microbe abundance data. Ratio transformations capture the relationships between samples in the data, regardless of whether the data are counts or proportions. Log-ratios make the dataset linearly related, so that log-ratio abundances of features can be compared to other samples in the data [63][17], [64][18]. We used the R “compositions” package to perform the variance-stabilizing Aitchison’s log transformation to eliminate negative correlation bias and account for large amounts of variability between samples. 2.3. Evaluation of contamination Contamination of RNA-sequencing data may originate from appliance contamination or sequencing errors [65][19]. In order to account for contamination, we employed multiple contamination correction methods based on TCGA sampling and sequencing methodology [66][20]. First, scatterplots of microbial abundance versus date of sequencing were produced; potential contaminants were determined based on unrepresentative overabundance of bacteria, or overabundance solely across three (or less) consecutive dates. Second, boxplots of microbe abundance versus sequencing plates were produced to identify unrepresentative overabundance of bacteria, or overabundance of bacteria solely in samples sequenced on the same plate. Third, total abundance across all microbes was compared against the abundance of each individual microbe, per patient. As the total number of read counts for each sample increases, it is reasonable to assume that the abundance of any specific microbe will also increase if the microbe is not a contaminant. We identified potential contaminants in scatter plots of individual microbe abundance versus total abundance with a slope of zero (margin around 0 of ±0.1) 2.4. Differential abundance between patient cohorts and correlation to MACIS (distant metastasis, patient age, completeness of resection, local Invasion, and tumor Size) score and clinical variables The Kruskal-Wallis test was used to determine differential abundance (p < 0.05) and correlations between microbe abundance and lymph node metastasis, MACIS score, neoplasm cancer status, pathologic stage, pathologic tumor-node-metastasis stage, and residual tumor (p < 0.05). 2.5. Correlation of microbial abundance to immune-associated elements, copy number variants and mutations, and methylation sites Dysregulated immune-associated (IA) genes are defined as genes with altered expression between tumor and normal tissue. The genes were determined using edgeR analysis of mRNA expression data obtained from TCGA (FDR < 0.05 and |log fold change| > 1) [67][13]. EdgeR offers highly accurate statistical methods for multigroup experiments [68][21], [69][22]. Dysregulated IA genes were clustered using the Cytoscape Reactome FIViz software. The Kruskal-Wallis test was used to determine the presence of correlation between differentially abundant microbes and dysregulated IA genes (p < 0.01). The p-value cut-off was reduced to 0.01 to focus on the most statistically significant correlations. The threshold was determined by p-value in order to evaluate effect size and assess the probability of the results supporting the hypothesis. Using the software tool CIBERSORTx and mRNA expression data, we used gene expression profiles to estimate immune infiltration [70][23]. The input was a matrix of gene expression read counts by THCA samples, compiled from mRNA sequencing data [71][13]. CIBERSORTx software returned expression values for 22 immune cell types. We conducted Gene Set Enrichment Analysis (GSEA) to correlate microbe abundance to known cancer pathways and immunological signatures. The GSEA software and signatures from the C2, C6, and C7 collections were downloaded from the Molecular Signatures Database [72][24]. Statistically significantly (p < 0.05) enriched signatures were plotted. The surface-level trends of mutation presence were analyzed by calculating the percentage of patients with each mutation, indicated by a binary value per mutation. The GDAC files were compiled into input files for the REVEALER (repeated evaluation of variables conditional entropy and redundancy) algorithm, which identifies sets of specific CNVs, including chromosomal deletions, amplifications, and alterations, and mutations that are most likely implicated in changes to the target expression profile. The target profile was identified as the expression of a single immune-associated gene. The REVEALER algorithm runs in multiple iterations in order to identify the most prominent genomic alterations. For our study, we set the maximum number of iterations to three. The algorithm also allows for the use of a seed, or a particular mutation of CNV gain or loss event that may account for target activity. Because we did not know the individual genetic alterations that were responsible for each IA genes’ dysregulation, the seed was set to null. Significant correlations were indicated by p < 0.05 and CIC > 0.15 [73][25]. Using a modified workflow of methylationArrayAnalysis from Bioconductor, we converted B-values to M-values and then performed probe-wise differential methylation analysis on the matrix of M-values in limma to obtain t-statistics and p-values for each CpG site [74][26], [75][27]. The Kruskal-Wallis test was used to correlate microbe abundance to the extent of methylation. 2.6. Validation with external dataset 77 CPTC and 48 FVPTC samples were obtained from the European Nucleotide Archive (Project ID: PRJEB11591) [76][28]. Pathoscope 2.0 was used to align and extract microbial reads. Microbe abundance was plotted to compare relative abundance between tumor samples from TCGA and ENA. 3. Results 3.1. Microbial landscapes of tumor tissue between genders are distinct Principal component analysis (PCA) was conducted to measure heterogeneity in microbe species between tumor and normal samples, tumor samples of varied subtype, and tumor samples from different genders ([77]Fig. 2). Overall, the principal component plots show major clustering and indicate that the microbe populations between the tumor samples are relatively homogenous. However, the third PCA plot contains a clustering of female patient tumor samples in the top right corner which makes clear that the microbial landscapes of tumor tissues from the different genders may be distinct. We found the microbes that contributed most to the top-right cluster to be cyanobacteria species. Fig. 2. [78]Fig. 2 [79]Open in a new tab Principal component analysis plots to visualize microbial landscape in the following comparisons: Tumor versus normal tissue, classical papillary thyroid carcinoma versus follicular variant papillary thyroid carcinoma versus tall cell papillary thyroid carcinoma, and males versus females. 3.2. Contamination correction To verify that our measurement of intratumor microbial abundance was consistent across samples and not due to contamination, we conducted three methods of contamination correction. We identified 17 and 21 potential contaminants using correction by sequencing date and plate respectively ([80]Fig. 3A, B). We compared individual microbial abundance to the abundance of all microbes to identify aberrant abundance potential contaminants ([81]Fig. 3C). Finally, we compiled a list of 175 commonly known contaminants from studies of bacterial contamination from reagents and of the hospital microbiome [82][29], [83][30]. In all, we identified 202 potential contaminants to be eliminated from further investigation (Supplementary Table 1). Fig. 3. [84]Fig. 3 [85]Open in a new tab Contamination correction. (A) Potential microbial contaminants were identified by producing scatterplots of individual microbe abundance versus date of sequencing of sample. Independent clusters of high abundance across a few dates indicates that the microbe is not consistently differentially abundant across all samples. (B) Potential microbial contaminants were identified by producing boxplots of individual microbe abundance versus sequencing plate. Singular boxplots of high abundance in one sequencing plate indicates that the microbe is not consistently differentially abundant across all samples. Instead, the microbe appears to be differentially abundant due to contamination in one sequencing plate. (C) Read counts of total abundance across all microbes was compared against the abundance of each individual microbe, per patient. Potential contaminants were identified in scatter plots with a slope of zero. This method elicited zero potential contaminants, as all plots had a slope with magnitude significantly greater than zero. 3.3. Differential abundance of intratumor microbes We found 45 microbes in CPTC, 34 in FVPTC, and 33 in TCPTC to be differentially abundant between tumor and normal tissue. We found 33 microbes in male samples and 49 microbes in female samples to be differentially abundant between tumor and normal tissue. Of all the statistically significant correlations, we focused on investigating the strongest correlations (p < 0.05) of microbes that have been previously discovered and studied in the context of cancer and inflammatory disease. Micrococcus luteus, Frankia sp., Anabaena sp. K119, and uncultured Gammaproteobacteria bacterium were all similarly overabundant in normal tissue in CPTC, FVPTC, and TCPTC ([86]Fig. 4A). Trueperella pyogenes and Stenotrophomonas maltophilia K279a were similarly dysregulated in CPTC and FVPTC ([87]Fig. 4B). We also identified the strongest correlations of microbes uniquely differentially abundant in each subtype cohort ([88]Fig. 4C). Rhodococcus fascians D188 was overabundant in normal samples in CPTC. Acinetobacter baumannii AB0057 was overabundant in normal samples in FVPTC. Bradyrhizobium sp. BTAi1 was overabundant in normal samples in TCPTC. Fig. 4. [89]Fig. 4 [90]Open in a new tab Differential abundance of microbes. (A) Microbes identified to be similarly differentially abundant across all three papillary thyroid carcinoma (PTC) subtypes. Yellow corresponds to classical PTC (CPTC) data, green corresponds to follicular variant PTC (FVPTC) data, and red corresponds to tall cell PTC (TCPTC) data. (B) Microbes identified to be similarly differentially abundant in CPTC and FVPTC. (C) Microbes identified to be uniquely differentially abundant in CPTC, FVPTC, and TCPTC. (D) Synechococcus sp. CC9311 was significantly differentially abundant in both male and female thyroid tumor samples. However, the microbe was absent in male tumor tissue versus normal tissue but overabundant in female tumor tissue versus normal tissue. (E) Venn diagram schematics summarizing microbes found to be differentially abundant in patient cohorts. The number indicates the number of most significant correlations, and the down arrow indicates that the microbe was sparse in tumor samples compared to normal samples. (For interpretation of the references to colour in this figure legend, the