Graphical abstract

   [41]graphic file with name ga1.jpg
   [42]Open in a new tab

   Keywords: Papillary thyroid carcinoma, Microbiome, Gender, Subtype

Abstract

   While the intratumor microbiome has become increasingly implicated in
   cancer development, the microbial landscape of papillary thyroid
   carcinoma (PTC) is essentially uninvestigated. PTC is characterized by
   varied prognosis between gender and cancer subtype, but the cause for
   gender and subtype-based dissimilarities is unclear. Women are more
   frequently diagnosed with PTC, while men suffer more advanced-staged
   PTC. In addition, tall cell variants are more aggressive than classical
   and follicular variants of PTC. We hypothesized that intratumor
   microbiome composition distinctly alters the immune landscape and
   predicts clinical outcome between PTC subtypes and between patient
   genders. Raw whole-transcriptome RNA-sequencing, Level 3 normalized
   mRNA expression read counts, and DNA methylation 450 k sequencing data
   for untreated, nonirradiated tumor, and adjacent normal tissue were
   downloaded from the Genomic Data Commons (GDC) legacy archive for 563
   thyroid carcinoma patients. Microbe counts were extracted using
   Pathoscope 2.0 software. We correlated microbe abundance to clinical
   variables and immune-associated gene expression. Gene-set enrichment,
   mutation, and methylation analyses were conducted to correlate microbe
   abundance to characterize microbes’ roles. Overall, PTC tumor tissue
   significantly lacked microbes that are populated in adjacent normal
   tissue, which suggests presence of microbes may be critical in
   controlling immune cell expression and regulating immune and cancer
   pathways to mitigate cancer growth. In contrast, we also found that
   microbes distinctly abundant in tall cell and male patient cohorts were
   also correlated with higher mutation expression and methylation of
   tumor suppressors. Microbe dysbiosis in specific PTC types may explain
   observable differences in PTC progression and pathogenesis. These
   microbes provide a basis for developing specialized prebiotic and
   probiotic treatments for varied PTC tumors.

1. Introduction

   Thyroid cancer is the fifth most common cancer in women, and it is
   estimated that over 12,000 men and 40,000 women will be diagnosed with
   thyroid cancer in 2021 [43][1]. Advanced-stage thyroid cancer presents
   itself more frequently in men than women, but the cause of this
   disparity is unclear [44][2], [45][3]. Despite advancements in
   diagnostic technology for thyroid cancer, precision medicine and
   nonsurgical treatment options for papillary thyroid carcinoma (PTC) and
   its most common subtypes, classical (CPTC), follicular variant (FVPTC),
   and the more aggressive tall cell papillary thyroid carcinoma (TCPTC)
   are severely lacking [46][4], [47][5], [48][6], [49][7], [50][8],
   [51][9].

   A recent study identified the presence of microbes in blood and thyroid
   cancer tissue, suggesting that there exists a predictable and
   significant abundance of microbes within PTC issues [52][10].
   Furthermore, there is mounting evidence that the intratumor microbiome
   is implicated in cancer progression and mitigation through their
   metabolites. For example, the absence of folate-producing
   Bifidobacterium and Lactobacillus was correlated with hypomethylation
   at regions of the p53 gene in colorectal cancer [53][11]. This research
   suggests that tissue-specific microbes may play a tumor suppressive
   role, specifically by causing or preventing the repair of DNA damage.
   Past microbiome studies have used The Cancer Genome Atlas (TCGA) data
   and machine learning to identify diagnostic microbial signatures in
   various cancers but have not elucidated how microbe composition varies
   in PTC and how microbes interact with pathways to influence cancer
   progression [54][10]. We hypothesize that the intratumor microbiome in
   thyroid carcinoma plays a critical role in determining the varying
   immune landscape across different types of PTC.

   In this study, we characterized the abundance of microbes in PTC using
   RNA sequencing data obtained from TCGA. We identified microbes
   differentially abundant between males and females and between different
   PTC subtypes. To determine the mechanisms by which specific microbes
   may influence the immune landscape, we correlated microbe abundance to
   clinical variables, immune-associated (IA) gene expression, immune-cell
   expression, and cancer and immunological pathways. Finally, we
   correlated microbe abundance with known copy number variations (CNV)
   and mutations and methylation sites ([55]Fig. 1). With this study, we
   aim to characterize the thyroid carcinoma intratumor microbiome and
   investigate microbiome’s unique clinical significance in thyroid
   carcinoma prognosis.

Fig. 1.

   [56]Fig. 1
   [57]Open in a new tab

   Schematic of project workflow and analyses. After obtaining sequencing
   data from The Cancer Genome Atlas (TCGA), we used the Pathoscope
   software to extract microbial read counts. We normalized the microbial
   counts using Aitchison’s log transformation. 1) We performed
   differential abundance to identify which microbes were significant
   differentially abundant in tumor vs. normal samples. The overlap in
   differentially abundant microbial species between various cohorts is
   demonstrated through venn diagrams: male versus female thyroid
   carcinoma and classical papillary thyroid carcinoma (CPTC) versus
   follicular variant papillary thyroid carcinoma (FVPTC) versus tall cell
   papillary thyroid carcinoma (TCPTC). 2) Three methods of contamination
   correction were used to identify potential contaminants among the
   significantly differentially abundant microbes we identified in Step 1.
   We performed correction by sequencing plate, sequencingdata, and
   abundance in one individual sample vs. all samples. We also compiled a
   list of common contaminants from previously published research and
   eliminated these microbes from further study. 3) MACIS (distant
   Metastasis, patient Age, Completeness of resection, local Invasion, and
   tumor Size) score and clinical variable analysis of differentially
   abundant microbes were performed using the Kruskal-Wallis statistical
   test. Clinical variable data (pathologic stage, TNM stage) for each
   sample were obtained from the Broad Institute GDAC Firehose database.
   4) Correlation of microbial abundance to immune-associated gene
   expression and immune cell infiltration: We performed gene differential
   expression analysis on Level 3 normalized mRNA expression read count
   data obtained from Genomic Data Commons legacy archive/TCGA. We then
   correlated microbe abundance with dysregulated immune-associated gene
   expression using Kruskal-Wallis and visualized and categorized the data
   using Cytoscape Reactome FIViz software. We used CIBERSORTx software to
   obtain immune cell expression correlated with microbe abundance. 5) We
   used Gene Set Enrichment Analysis to identify immunological and
   oncogenic signatures correlated with microbial abundance. 6) We used
   the Repeated Evaluation of Variables conditionAL Entropy and Redundancy
   (REVEALER) to identify chromosomal amplifications, deletions, and
   mutations correlated with microbial abundance. 7) We used DNA
   methylation 450k sequencing data and a modified workflow of
   methylationArrayAnalysis from Bioconductor to identify differentially
   methylated sites in thyroid carcinoma. We correlated differentially
   methylated sites with microbial abundance using the Kruskal-Wallis
   statistical test.

2. Materials and methods

2.1. Data acquisition

   Raw whole-transcriptome RNA-sequencing, Level 3 normalized mRNA
   expression read counts, and DNA methylation 450 k sequencing data for
   untreated, nonirradiated tumor and adjacent normal tissue were
   downloaded from the Genomic Data Commons (GDC) legacy archive for 563
   thyroid carcinoma (THCA) patients (354 CPTC, 101 FVPTC, 35 TCPTC, 135
   male, 366 female tumor samples) [58][12], [59][13]. The GDC provides
   web-based access to data and metadata from cancer genomics studies
   aggregated in TCGA, and The Broad Institute developed a suite of tools
   and pipelines to assist in aggregating and analyzing various types of
   large-scale genomic and proteomic data. Clinical information for all
   patients was downloaded from Broad GDAC Firehose [60][14]. Genomic
   alteration information for each patient was obtained from the last
   analysis report (2016) of the Broad Institute TCGA Genome Data Analysis
   Center [61][15].

2.2. Extraction of microbial reads

   RNA-sequencing data were filtered for bacterial reads via direct
   alignment to a reference library of bacterial sequences using the
   Pathoscope 2.0 software and the NCBI nucleotide database [62][16]. To
   account for the inherent compositionality of microbe data, we performed
   a ratio transformation of the microbe abundance data. Ratio
   transformations capture the relationships between samples in the data,
   regardless of whether the data are counts or proportions. Log-ratios
   make the dataset linearly related, so that log-ratio abundances of
   features can be compared to other samples in the data [63][17],
   [64][18]. We used the R “compositions” package to perform the
   variance-stabilizing Aitchison’s log transformation to eliminate
   negative correlation bias and account for large amounts of variability
   between samples.

2.3. Evaluation of contamination

   Contamination of RNA-sequencing data may originate from appliance
   contamination or sequencing errors [65][19]. In order to account for
   contamination, we employed multiple contamination correction methods
   based on TCGA sampling and sequencing methodology [66][20]. First,
   scatterplots of microbial abundance versus date of sequencing were
   produced; potential contaminants were determined based on
   unrepresentative overabundance of bacteria, or overabundance solely
   across three (or less) consecutive dates. Second, boxplots of microbe
   abundance versus sequencing plates were produced to identify
   unrepresentative overabundance of bacteria, or overabundance of
   bacteria solely in samples sequenced on the same plate. Third, total
   abundance across all microbes was compared against the abundance of
   each individual microbe, per patient. As the total number of read
   counts for each sample increases, it is reasonable to assume that the
   abundance of any specific microbe will also increase if the microbe is
   not a contaminant. We identified potential contaminants in scatter
   plots of individual microbe abundance versus total abundance with a
   slope of zero (margin around 0 of ±0.1)

2.4. Differential abundance between patient cohorts and correlation to MACIS
(distant metastasis, patient age, completeness of resection, local Invasion,
and tumor Size) score and clinical variables

   The Kruskal-Wallis test was used to determine differential abundance
   (p < 0.05) and correlations between microbe abundance and lymph node
   metastasis, MACIS score, neoplasm cancer status, pathologic stage,
   pathologic tumor-node-metastasis stage, and residual tumor (p < 0.05).

2.5. Correlation of microbial abundance to immune-associated elements, copy
number variants and mutations, and methylation sites

   Dysregulated immune-associated (IA) genes are defined as genes with
   altered expression between tumor and normal tissue. The genes were
   determined using edgeR analysis of mRNA expression data obtained from
   TCGA (FDR < 0.05 and |log fold change| > 1) [67][13]. EdgeR offers
   highly accurate statistical methods for multigroup experiments
   [68][21], [69][22]. Dysregulated IA genes were clustered using the
   Cytoscape Reactome FIViz software. The Kruskal-Wallis test was used to
   determine the presence of correlation between differentially abundant
   microbes and dysregulated IA genes (p < 0.01). The p-value cut-off was
   reduced to 0.01 to focus on the most statistically significant
   correlations. The threshold was determined by p-value in order to
   evaluate effect size and assess the probability of the results
   supporting the hypothesis.

   Using the software tool CIBERSORTx and mRNA expression data, we used
   gene expression profiles to estimate immune infiltration [70][23]. The
   input was a matrix of gene expression read counts by THCA samples,
   compiled from mRNA sequencing data [71][13]. CIBERSORTx software
   returned expression values for 22 immune cell types.

   We conducted Gene Set Enrichment Analysis (GSEA) to correlate microbe
   abundance to known cancer pathways and immunological signatures. The
   GSEA software and signatures from the C2, C6, and C7 collections were
   downloaded from the Molecular Signatures Database [72][24].
   Statistically significantly (p < 0.05) enriched signatures were
   plotted.

   The surface-level trends of mutation presence were analyzed by
   calculating the percentage of patients with each mutation, indicated by
   a binary value per mutation. The GDAC files were compiled into input
   files for the REVEALER (repeated evaluation of variables conditional
   entropy and redundancy) algorithm, which identifies sets of specific
   CNVs, including chromosomal deletions, amplifications, and alterations,
   and mutations that are most likely implicated in changes to the target
   expression profile. The target profile was identified as the expression
   of a single immune-associated gene. The REVEALER algorithm runs in
   multiple iterations in order to identify the most prominent genomic
   alterations. For our study, we set the maximum number of iterations to
   three. The algorithm also allows for the use of a seed, or a particular
   mutation of CNV gain or loss event that may account for target
   activity. Because we did not know the individual genetic alterations
   that were responsible for each IA genes’ dysregulation, the seed was
   set to null. Significant correlations were indicated by p < 0.05 and
   CIC > 0.15 [73][25].

   Using a modified workflow of methylationArrayAnalysis from
   Bioconductor, we converted B-values to M-values and then performed
   probe-wise differential methylation analysis on the matrix of M-values
   in limma to obtain t-statistics and p-values for each CpG site
   [74][26], [75][27]. The Kruskal-Wallis test was used to correlate
   microbe abundance to the extent of methylation.

2.6. Validation with external dataset

   77 CPTC and 48 FVPTC samples were obtained from the European Nucleotide
   Archive (Project ID: PRJEB11591) [76][28]. Pathoscope 2.0 was used to
   align and extract microbial reads. Microbe abundance was plotted to
   compare relative abundance between tumor samples from TCGA and ENA.

3. Results

3.1. Microbial landscapes of tumor tissue between genders are distinct

   Principal component analysis (PCA) was conducted to measure
   heterogeneity in microbe species between tumor and normal samples,
   tumor samples of varied subtype, and tumor samples from different
   genders ([77]Fig. 2). Overall, the principal component plots show major
   clustering and indicate that the microbe populations between the tumor
   samples are relatively homogenous. However, the third PCA plot contains
   a clustering of female patient tumor samples in the top right corner
   which makes clear that the microbial landscapes of tumor tissues from
   the different genders may be distinct. We found the microbes that
   contributed most to the top-right cluster to be cyanobacteria species.

Fig. 2.

   [78]Fig. 2
   [79]Open in a new tab

   Principal component analysis plots to visualize microbial landscape in
   the following comparisons: Tumor versus normal tissue, classical
   papillary thyroid carcinoma versus follicular variant papillary thyroid
   carcinoma versus tall cell papillary thyroid carcinoma, and males
   versus females.

3.2. Contamination correction

   To verify that our measurement of intratumor microbial abundance was
   consistent across samples and not due to contamination, we conducted
   three methods of contamination correction. We identified 17 and 21
   potential contaminants using correction by sequencing date and plate
   respectively ([80]Fig. 3A, B). We compared individual microbial
   abundance to the abundance of all microbes to identify aberrant
   abundance potential contaminants ([81]Fig. 3C). Finally, we compiled a
   list of 175 commonly known contaminants from studies of bacterial
   contamination from reagents and of the hospital microbiome [82][29],
   [83][30]. In all, we identified 202 potential contaminants to be
   eliminated from further investigation (Supplementary Table 1).

Fig. 3.

   [84]Fig. 3
   [85]Open in a new tab

   Contamination correction. (A) Potential microbial contaminants were
   identified by producing scatterplots of individual microbe abundance
   versus date of sequencing of sample. Independent clusters of high
   abundance across a few dates indicates that the microbe is not
   consistently differentially abundant across all samples. (B) Potential
   microbial contaminants were identified by producing boxplots of
   individual microbe abundance versus sequencing plate. Singular boxplots
   of high abundance in one sequencing plate indicates that the microbe is
   not consistently differentially abundant across all samples. Instead,
   the microbe appears to be differentially abundant due to contamination
   in one sequencing plate. (C) Read counts of total abundance across all
   microbes was compared against the abundance of each individual microbe,
   per patient. Potential contaminants were identified in scatter plots
   with a slope of zero. This method elicited zero potential contaminants,
   as all plots had a slope with magnitude significantly greater than
   zero.

3.3. Differential abundance of intratumor microbes

   We found 45 microbes in CPTC, 34 in FVPTC, and 33 in TCPTC to be
   differentially abundant between tumor and normal tissue. We found 33
   microbes in male samples and 49 microbes in female samples to be
   differentially abundant between tumor and normal tissue. Of all the
   statistically significant correlations, we focused on investigating the
   strongest correlations (p < 0.05) of microbes that have been previously
   discovered and studied in the context of cancer and inflammatory
   disease. Micrococcus luteus, Frankia sp., Anabaena sp. K119, and
   uncultured Gammaproteobacteria bacterium were all similarly
   overabundant in normal tissue in CPTC, FVPTC, and TCPTC ([86]Fig. 4A).
   Trueperella pyogenes and Stenotrophomonas maltophilia K279a were
   similarly dysregulated in CPTC and FVPTC ([87]Fig. 4B). We also
   identified the strongest correlations of microbes uniquely
   differentially abundant in each subtype cohort ([88]Fig. 4C).
   Rhodococcus fascians D188 was overabundant in normal samples in CPTC.
   Acinetobacter baumannii AB0057 was overabundant in normal samples in
   FVPTC. Bradyrhizobium sp. BTAi1 was overabundant in normal samples in
   TCPTC.

Fig. 4.

   [89]Fig. 4
   [90]Open in a new tab

   Differential abundance of microbes. (A) Microbes identified to be
   similarly differentially abundant across all three papillary thyroid
   carcinoma (PTC) subtypes. Yellow corresponds to classical PTC (CPTC)
   data, green corresponds to follicular variant PTC (FVPTC) data, and red
   corresponds to tall cell PTC (TCPTC) data. (B) Microbes identified to
   be similarly differentially abundant in CPTC and FVPTC. (C) Microbes
   identified to be uniquely differentially abundant in CPTC, FVPTC, and
   TCPTC. (D) Synechococcus sp. CC9311 was significantly differentially
   abundant in both male and female thyroid tumor samples. However, the
   microbe was absent in male tumor tissue versus normal tissue but
   overabundant in female tumor tissue versus normal tissue. (E) Venn
   diagram schematics summarizing microbes found to be differentially
   abundant in patient cohorts. The number indicates the number of most
   significant correlations, and the down arrow indicates that the microbe
   was sparse in tumor samples compared to normal samples. (For
   interpretation of the references to colour in this figure legend, the