Abstract Background The detection rate of thyroid cancer (TC) has been continuously improved due to the development of detection technology. Epithelial–mesenchymal transition (EMT) is thought to be closely related to the malignant progression of tumors. However, the relationship between EMT-related genes (ERGs) characteristics and the diagnosis and prognosis of TC patients has not been studied. Methods Four datasets from Gene Expression Omnibus (GEO) were used to perform transcriptomic profile analysis. The overlapping differentially expressed ERGs (DEERGs) were analyzed using the R package “limma”. Then, the hub genes, which had a higher degree, were identified by the protein–protein interaction (PPI) network. Gene expression analysis between the TC and normal data, the disease-free survival (DFS) analysis of TC patients from The Cancer Genome Atlas Thyroid Cancer (TCGA-THCA) cohort, function analysis, and immunohistochemistry (IHC) were performed to verify the importance of the hub genes. Finally, a prognostic risk scoring was constructed to predict DFS in patients with the selected genes. Results A total of 43 DEERGs were identified and 10 DEERGs were considered hub ERGs, which had a high degree of connectivity in the PPI network. Then, the differential expressions of FN1, ITGA2, and KIT between TC and normal tissues were verified in the TCGA-THCA cohort and their protein expressions were also verified by IHC. DFS analysis indicated upregulations of FN1 expression (P<0.01) and ITGA2 expression (P<0.01) and downregulation of KIT expression (P=0.01) increased risks of decreased DFS for TCGA-THCA patients. Besides, by building a prognostic risk scoring model, we found that the DFS of TCGA-THCA patients was significantly worse in high-risk groups. Conclusion In summary, these hub ERGs were potential biomarkers for diagnosis and prognosis of TC, which can provide a basis for further exploring the efficacy of EMT in patients with TC. Keywords: thyroid cancer, EMT, signature, bioinformatic analysis, immunohistochemistry Introduction Thyroid cancer (TC) is the most common endocrine malignancy. In the past 40 years, the incidence of TC has risen steadily in many countries around the world.[34]^1 With the improvement of detection technology, the incidence of recent TC in China in 2014 was 12.40/100,000, making it one of the fastest-growing malignant tumors.[35]^2 Although the incidence of TC is increasing year by year, the mortality rate of TC is stable or decreased slightly due to early diagnosis.[36]^3^,[37]^4 Ultrasonography is the main method for stratifying the risk of early cancerous thyroid nodules. Patients with suspected TC could take a fine-needle aspiration biopsy (FNAB) or surgical resection, and followed by a pathological examination to assess the nature of cancer.[38]^5 However, in some cases, due to the overlapping of cytological features of malignant and benign thyroid nodules, approximately 15%-30% of biopsies are uncertain.[39]^6 And the uncertain diagnosis may bring inappropriate treatment leading to poor quality of life, such as the surgical removal of the thyroid will seriously affect the quality of life of patients. Ultrasound and FNAB can only be diagnosed and cannot respond to the question of whether to perform surgery. Therefore, supplementing and improving traditional diagnostic methods of TC patients is still an urgent problem to be addressed. Epithelial–mesenchymal transition (EMT) is one of the markers of carcinogenesis, which includes the re-differentiation of epithelial cells into mesenchymal cells, making the cell phenotype malignant.[40]^7–9 At present, some studies have found that EMT-related molecular markers are significantly correlated with TC occurrence and adverse outcomes.[41]^10−13 However, these studies only focused on a single genetic biomarker, which has been less effective in predicting outcomes. Studies have shown that gene signatures composed of multiple genes may be the better predictors of patients. Some researchers have studied the relationship between EMT-related signatures and the prognosis of patients with gastric adenocarcinoma,[42]^14 endometrial cancer,[43]^15 and glioma[44]^16 by analyzing public databases. However, there has been no bioinformatics research in the field of thyroid cancer. In this study, through the analysis of existing genetic data, a transcriptomic profile of thyroid cancer of EMT-related genes (ERGs) was constructed, and further analysis was performed, as well as exploring the potential signatures for diagnosis and prognostic guidance. Materials and Methods Data Collection Data screening was completed independently by two researchers, and a consistent comparison was conducted after completion. We searched in Gene Expression Omnibus (GEO) database ([45]http://www.ncbi.nlm.nih.gov/geo) for all published data related to gene expression in thyroid cancer until August 1, 2020, by using the following keywords: (thyroid cancer OR cancerous goiter OR carcinoma of thyroid OR thyroid carcinoma). Finally, a total of 4 thyroid cancer data sets from the GEO database ([46]GSE29265, [47]GSE33630, [48]GSE60542, and [49]GSE65144) were included in this study.[50]^17 Datasets inclusion criteria: (1) they contained human TC samples and normal tissue samples. (2) their sample sizes at least 20. The information of four GEO datasets was shown in [51]Supplementary Table S1. Differentially Expressed ERGs Analysis A list of ERGs containing 1184 ERGs was downloaded from the EMT gene database ([52]Supplementary Table S2).[53]^18 All data were processed using the R software (version 3.6.1). The limma package was used to identify the differentially expressed ERGs (DEERGs) between the TC samples and normal samples.[54]^19 The adjusted P-value < 0.05 and the absolute fold change (FC) ≥ 2 were considered statistically significant. Overlapping DEERGs between GEO datasets were selected using a Venn diagram ([55]https://bioinfogp.cnb.csic.es/tools/venny/index.html). Functional Enrichment Analysis Gene Ontology (GO) can annotate genes, gene products and sequences with potential biological phenomena in three aspects: biological process (BP), cellular component (CC), and molecular function (MF);[56]^20 Kyoto Encyclopedia of Genes and Genomes (KEGG) is a comprehensive database for understanding advanced functions and utilities of biological systems.[57]^21 In this study, GO and KEGG analysis of the DEERGs were performed using the R package clusterProfiler.[58]^22 Hub ERGs Selection and Analyses The online tool Search Tool for the Retrieval of Interacting Genes database (STRING) ([59]https://string- db.org/) was used to analyze the interactive relationships of the DEERGs.[60]^23 The combined score of ≥0.4 was the cut-off value. Cytoscape software (version 3.6.0) was then used to make PPI networks.[61]^24 And top 10 DEERGs with higher node degree scores were identified hub ERGs using the Cytoscape plug-in cytoHubba. Verification of Differential Expression of Hub ERGs Raw read counts for hub ERGs of 497 TC samples and 56 normal ones from the TCGA database were extracted using R package TCGAbiolinks.[62]^25 They were loaded into R package DESeq2 and normalized by variance-stabilizing transformation (VST).[63]^26^,[64]^27 VST counts were used for Student’s t‐test analysis by R package ggstatsplot.[65]^28 The statistical significance cut‑off level was P<0.01. Moreover, the protein levels of hub ERGs in 5 PTC tissues and paired normal tissues from the Shanghai Tong Ren Hospital were verified using immunohistochemistry (IHC) analysis. These tissues were enrolled from October 2018 to October 2019 and all diagnoses were based on pathological and/or cytological evidence. Ethical approval was obtained from the Ethics Committee of the Shanghai Tong Ren Hospital and informed consent was given by all patients before sample collection. Five human thyroid normal and tumor tissues were fixed in 4% formaldehyde and embedded in paraffin. Paraffin-embedded samples were cut into 6 μm thick sections, and then they were placed in xylene and ethanol solution for deparaffinating and rehydration in PBS. 3% hydrogen peroxide solution and 5% donkey serum were used to block endogenous peroxidase activity and nonspecific sites after samples were placed in the citric acid solution for antigen retrieval. Sections were stained with rabbit antibodies against FIBRONECTIN (Abcam, ab268021, 1:600), ITGA2 (Abcam, ab133557, 1:200), and KIT (Abcam, ab32363, 1:200) for incubating at 4°C overnight, and then incubation in the second antibody (Abcam, ab97051, 1:1000) was carried out at room temperature for 2h. Antigen-antibody complexes were detected using ChemMateTM EnVisonTM/HRP complex (DAB) as a peroxidase substrate (GK500705, Gene Tech). Results were visualized under an optical microscope (OLYMPUS IX73, JAPAN). All pictures were taken under the same microscope camera (OLYMPUS DP80, JAPAN) with uniform parameters (OLYMPUS cellSens Standard 1.18) after sections were mounted with xylene sealant. Brown staining areas were selected with an eyedropper tool after dropping colors represented by fewer than 10 pixels, and the mean option density was calculated in the manner of integrated option density (IOD)/area using Image Pro Plus 6.0 software. Survival Analysis of Hub ERGs, Construction of the Risk Score Model, and Correlation Analysis Between Risk Score and Clinical Characteristics Clinical characteristics including age, gender, stage, disease-free interval (DFI) time, and DFI status for TCGA-THCA patients were downloaded from UCSC Xena datasets.[66]^29 After deleting patients with incomplete clinical characteristics and DFI time less than 0, a total of 350 patients were included in the disease-free survival (DFS) analysis. The optimal threshold was determined by the R package Survminer[67]^30 to divide patients into high- or low-expression groups and DFS was calculated using the R package survival.[68]^31 ERGs associated with DFS (P<0.05) were retained for subsequent multivariable Cox analysis to construct the ERG-related prognostic signatures. Then, a prognostic risk score model was established as follows: risk score = Σ (βn × expression of gene n). Patients were divided into low-risk group and high-risk group according to the median risk score, and the prognostic difference between the two groups was performed by the Kaplan–Meier curve. Harrell’s concordance index and calibration curves comparing predicted DFS and observed survival were used to evaluate the performance of the prognostic nomogram. Time-dependent receiver operating characteristic (ROC) curves and multivariable Cox analysis including patients’ clinical characteristics were calculated to evaluate the effectiveness of the prediction model. The relationship between risk score and age was analyzed through Pearson correlation coefficient. And the relationships between risk score and other clinical characteristics (gender, stage, and race), which are classified variables, were analyzed through t-test and ANOVA. P < 0.05 was considered statistically significant. Results Identification of DEERGs in TC and Data Integration There were 116 DEERGs in [69]GSE29265, 147 DEERGs in [70]GSE33630, 114 DEERGs in [71]GSE60542, 298 DEERGs in [72]GSE65144. Among the DEERGs, 60, 86, 67, and 155 ERGs were upregulated while 56, 61, 47, and 143 ERGs were downregulated in [73]GSE29265, [74]GSE33630, [75]GSE60542, and [76]GSE65144, respectively ([77]Figure 1A-[78]D). After integrated analyses of the four GEO datasets, 43 ERGs (21 upregulated and 22 downregulated) were identified as the DEERGs ([79]Figure 1E and [80]F), and the cluster heatmap of these DEERGs is shown in [81]Figure 1G. Figure 1. [82]Figure 1 [83]Open in a new tab Identification of the DEERGs in TC and data integration. (A) Volcano plot of [84]GSE29265, (B) volcano plot of [85]GSE33630, (C) volcano plot of [86]GSE60542, (D) volcano plot of [87]GSE65144, (E) Venn plots of overlapping upregulated DEERGs, (F) Venn plots of overlapping downregulated DEERGs, and (G) the cluster heatmap of the overlapping DEERGs. Red, white, and blue represent higher expression levels, no expression differences, and lower expression levels, respectively. The rows and columns of the heatmap represent DEERGs and datasets, respectively, and the numbers in it represent the logFC between TC tissue and normal ones from each GEO dataset. Significant Functions and Pathway Enrichment Analysis For 43 DEERGs, a pathway and process enrichment analysis were carried out using R package clusterProfiler. Terms with an adjusted P-value <0.05 were collected. GO analysis demonstrated that DEERGs were mainly related to the process of ossification, extracellular matrix (ECM), and cell growth ([88]Figure 2A). KEGG pathway analysis showed that DEERGs were mainly related to the PI3K−Akt signaling pathway, ECM−receptor interaction, and focal adhesion ([89]Figure 2B). Figure 2. [90]Figure 2 [91]Open in a new tab GO and KEGG pathway enrichment analysis. (A) GO circle plot of overlapping DEERGs, (B) KEGG circle plot of overlapping DEERGs. Hub DEERGs Identification in the PPI Network We used the STRING database and Cytoscape to construct a PPI network to further analyze the interaction between the DEERGs ([92]Figure 3A). There were 43 nodes and 78 edges in the network. As a result, fibronectin 1 (FN1), the tissue inhibitor of metalloproteinase 1 (TIMP1), integrin alpha 2 (ITGA2), receptor tyrosine kinase (KIT), secreted phosphoprotein 1 (SPP1), cadherin 2 (CDH2), runt-related transcription factor 2 (RUNX2), galectin 3 (LGALS3), bone morphogenetic protein 2 (BMP2), and versican (VCAN), which have a higher degree of connectivity according to the “Degree” algorithm, were identified as hub DEERGs ([93]Figure 3B). Figure 3. [94]Figure 3 [95]Open in a new tab The PPI network of the DEERGs. The PPI network of (A) the overlapping DEERGs, (B) top 10 hub DEERGs. Red nodes and blue nodes represent upregulated DEERGs and downregulated DEERGs, respectively in. (A). The size of the interaction score between genes was positively correlated with the depth of nodes color in (B). Verification of Differential Expression of Hub DEERGs Using mRNA expression data of 497 TC samples and 56 normal ones from the TCGA-THCA dataset, upregulations of FN1 expression (P<0.01) and ITGA2 expression (P<0.01) and downregulation of KIT expression (P<0.01) was identified in TC samples compared with normal samples ([96]Figure 4A-[97]C). To confirm the reliability of the hub DEERGs, the IHC was used to verify the protein expression levels of the 3 hub DEERGs. Consistent with mRNA expression, the IHC results showed that FN1 and ITGA2 proteins were upregulated, while KIT protein was downregulated in the TC tissues ([98]Figure 5A-[99]D) Figure 4. [100]Figure 4 [101]Open in a new tab Validation of hub DEERG expressions in the TCGA-THCA dataset. (A) FN1, (B)ITGA2, and (C) KIT gene expression differences between TC tissues and normal ones. Figure 5. [102]Figure 5 [103]Open in a new tab Immunohistochemical staining results of the protein expressions of hub DEERGs from the clinical specimens. Brown staining represents the expression of 3 proteins and blue staining represents nucleus. Bar: 50um. A zoomed-in 40 magnification of each IHC picture is shown in better resolution at the lower left corner. (A) IHC of FN1, (B) IHC of ITGA2, (C) IHC of KIT, and (D) IHC of Blank. Prognostic Value of Hub DEERGs, Construction of the Risk Score Model, and Correlation between Risk Score and Clinical Characteristics A total of 350 TCGA-THCA patients were enrolled in our survival analysis. The univariate Cox regression results showed that the high expression of FN1 (P<0.01) and ITGA2 (P<0.01) and low expression of KIT (P=0.01) were significantly associated with the worst DFS in TC patients ([104]Figure 6A-[105]C). Then, the prognostic signature including the above-mentioned DEERGs was developed using the multivariate Cox regression model and risk scores for every patient were determined using the coefficients (−1.32 for FN1, −0.32 for ITGA2, and 0.69 for KIT). Then, according to the median risk score, all patients were divided into high-risk (n=175) or low-risk groups (n=175) ([106]Figure 7A). And high-risk group had worse DFS compared with low-risk group using the Kaplan–Meier curve (P<0.01) ([107]Figure 7B). The C-index of the 3-ERG signature was 0.67, and the calibration curve further revealed that the signature had a good performance in predicting the DFS of TC patients ([108]Figure 7C). ROC curves show that the risk score accurately predicts TC patient’s DFS ([109]Figure 7D). In addition, the risk score calculated from the 3-ERG signature was an independent prognostic factor using multivariate Cox regression analysis ([110]Figure 7E). The correlation analysis showed that the risk score was negatively correlated with age (P<0.01, r=−0.17) and not with other clinical characteristics (gender, stage, and race) ([111]Supplementary Figure 1S). Figure 6. [112]Figure 6 [113]Open in a new tab Survival analysis of hub DEERGs using TCGA-THCA dataset. Differences in DFS in patients with high and expression of (A) FN1, (B) ITGA2, and (C) KIT. Figure 7. [114]Figure 7 [115]Open in a new tab Survival analysis of the 3 ERGs signature using TCGA-THCA dataset. (A) Risk score based on 3 ERGs signature, survival status, and heatmap for the expressions of 3 ERGs for each patient, (B) DFS difference between high-risk and low-risk groups using the Kaplan–Meier curve, (C) calibration curves for 1-, 3-, and 5-year from the 3 ERGs signature, (D) ROC curves for 1-, 3-, and 5-year DFS predictions for the 3 ERGs signature using the ROC curves, and (E) forest plot of the risk score using multivariate Cox regression analysis. Discussion Thyroid cancer (TC) is the most common endocrine malignant tumor, and its incidence increases year by year.[116]^32 Despite good prognosis, patients with recurrence or metastasis still have a high risk of death.[117]^33^,[118]^34 Some studies have demonstrated that the aggressiveness of thyroid cancer is closely related to processes such as epithelial–mesenchymal transformation (EMT).[119]^34–37 At present, many studies have constructed ERG signatures to predict the survival of cancer patients. Zhong et al developed a reliable EMT-related lncRNA risk signature, which can independently predict the OS and DFS of ccRCC.[120]^14 In hepatocellular carcinoma, a total of 5 prognostic correlated ERGs were included to develop a novel prognostic risk model.[121]^38 Cao et al developed a novel EMT-related gene signature as an independent prognostic factor for bladder cancer.[122]^39 However, researches on the relationship between ERGs and the diagnosis and prognosis of TC patients are still very limited. In this study, for the first time, we used the ERG signature with higher prediction accuracy to predict the survival of TC patients and achieved good prediction results. In this study, we comprehensively analyzed four microarray datasets from GEO, which were all published data related to gene expression in thyroid cancer and normal ones in the GEO database until August 1, 2020. A total of 43 DEERGs (21 upregulated and 22 downregulated) were identified between TC tissues and normal ones. Functional enrichment analyses demonstrated that DEERGs were mainly related to the process of ossification, extracellular matrix (ECM), and PI3K−Akt signaling pathway. In patients with thyroid cancer, there are changes in thyroid-stimulating hormone, which is closely associated with osteogenesis.[123]^40–43 The ECM plays an important role in the process of tumor shedding, adhesion, movement, degradation, and hyperplasias, such as prostate cancer,[124]^44 gastric cancer,[125]^45 breast cancer,[126]^46 and anaplastic thyroid cancer.[127]^47 The phosphoinositide 3-kinase–protein kinase B/AKT (PI3KPKB/AKT) pathway is one of the most prominent molecular signaling pathways implicated in cellular growth, proliferation, apoptosis, and metabolism.[128]^48 Activation of the PI3K-AKT signaling pathway is critical to the occurrence and development of thyroid cancer.[129]^49–51 We finally screened out that FN1, ITGA2, and KIT were hub DEERGs from the PPI network, TCGA differential expression verification, and IHC verification. Otherwise, the DFS analyses on the TCGA-THCA dataset showed that upregulations of FN1 expression (P<0.01) and ITGA2 expression (P<0.01) and downregulation of KIT expression (P=0.01) increased risks of decreased DFS for patients. Fibronectin 1 (FN1) participates in cell adhesion and migration processes and is expressed in multiple cell types.[130]^52 FN1 has also turned out to be associated with the ECM changes.[131]^53 In thyroid cancer, Sponziello M et al demonstrated that the overexpression of FN1 mediated thyroid tumor cell migration and invasion by 86 patients.[132]^54 Shaohua Zhan et al identified FN1 as a novel prognostic biomarker associated with sporadic medullary thyroid cancer pathophysiological changes.[133]^55 Furthermore, some studies found FN1 was highly expressed in TC from a series of data sets, but they did not offer a survival analysis of FN1.[134]^56–59 Integrin alpha 2 (ITGA2) is an alpha subunit, which often combines with the beta subunit to form a heterodimer α2β1, and then participates in the adhesion of platelets and other cells to the extracellular matrix.[135]^60–62 More studies suggested that ITGA2 might be closely related to tumor cell migration, invasion, and metastasis.[136]^63^,[137]^64 In thyroid cancer, one study identified that ITGA2 expression was upregulated in PTC tissues but not in BRAF-positive samples,[138]^65 and another study showed that PTC patients with high ITGA2 expression had poorer relapse-free survival than PTC patients with low ITGA2 expression.[139]^66 Receptor tyrosine kinase (KIT), a form of myeloid receptor that binds the stem cell factor, plays a major role in cancer occurrence.[140]^67 Papers are showing that KIT is highly expressed in small cell lung cancer,[141]^68 leukemia cells,[142]^69 colon cancer,[143]^70 and neuroblastoma.[144]^71 But KIT expression is lower in breast cancer[145]^72 and melanoma.[146]^73 Numerous studies have investigated the expression of KIT in TC,[147]^74–76 suggesting its role in thyroid epithelial cell differentiation and growth control. These findings indicate that different ERGs jointly promote the occurrence of EMT, and these ERGs may have different functions in different tumors. Several previous studies have constructed prognostic signatures of TC patients. Most research has focused on immune-related genes, like Gan et al[148]^77 Li et al[149]^78 Peng et al,[150]^79 and so on. They established different immune-related signatures to predict the prognosis of patients with TC. The remaining thyroid prediction models mainly focused on the study of characteristic gene sets are mainly m6A-related signature[151]^80^,[152]^81 and autophagy-related signature.[153]^82 These results indicate that the characteristic signature not only has a good prediction effect on the prognosis of thyroid cancer patients but is more conducive to the comprehensive interpretation of the occurrence and development of TC and provides a theoretical basis for the use of drugs that act on this characteristic. However, no studies have been conducted to analyze the EMT characteristics of TC, which is closely related to the occurrence and development of TC, and establish relevant prediction signatures. In this study, a 3 EMT-related genes were used to construct a prognostic risk signature for TC patients. The univariate and multivariate Cox regression results showed that high-risk group had worse DFS in TC patients. The study has several advantages. First, to our knowledge, this is the first study to analyze the EMT gene expression profile of thyroid cancer and its function. In addition, our research has constructed an excellent ERG signature as a potential predictor of TC patients, which helps to develop new strategies for diagnosing and predicting the prognosis of TC patients. Finally, calibration curves were used for internal validation of the signature, and the results showed that the signature’s prediction effect was good. However, our observations still have some limitations: 1) as the cumulation of TC samples, subtype analysis was required to perform; 2) because patients with TC have a longer survival period and fewer patients have observed death outcomes, this study only performed DFS analysis; 3) and the constructed predictive model has not been validated in a prospective cohort. Conclusions In conclusion, this study used existing data to construct a more comprehensive transcriptomic profile of TC and found potential biomarkers for TC diagnosis and prognosis. However, it is necessary to further study the specific role of these genes in TC. Funding Statement This work was supported by Startup Fund for Youngman Research at SJTU (to TF, 17×100040015); Shanghai Jiao Tong University Medical and Industrial Cross Project (to TF, YG2017QN70); Young Talent Program of China National Nuclear Corportion (to SJ, CNNC201948); and The Scientific Project of Shanghai Municipal Health Commission (to BY, Grant No.2018ZHYL0202). Abbreviations TC, thyroid cancer; EMT, epithelial–mesenchymal transition; ERGs, EMT-related genes; GEO, Gene Expression Omnibus; DEERGs, differentially expressed ERGs; PPI, protein–protein interaction; DFS, disease-free survival; TCGA-THCA, The Cancer Genome Atlas Thyroid Cancer; IHC, immunohistochemistry; FNAB, fine-needle aspiration biopsy; PTC, papillary thyroid cancer; ATC, anaplastic thyroid carcinoma; FC, fold change; GO, Gene Ontology; BP, biological process; CC, cellular component; MF, molecular function; KEGG, Kyoto Encyclopedia of Genes and Genomes; STRING, Search Tool for the Retrieval of Interacting Genes; VST, variance-stabilizing transformation; DFI, disease free interval; ROC, receiver operating characteristic; ECM, extracellular matrix; FN1, fibronectin 1; TIMP1, the tissue inhibitor of metalloproteinase 1; ITGA2, integrin alpha 2; KIT, receptor tyrosine kinase; SPP1, secreted phosphoprotein 1; CDH2, cadherin 2; RUNX2, runt-related transcription factor 2; LGALS3, galectin 3; BMP2, bone morphogenetic protein 2; VCAN, versican; PI3KPKB/AKT, phosphoinositide 3-kinase–protein kinase B/AKT. Author Contributions All authors made a significant contribution to the work reported, whether that is in the conception, study design, execution, acquisition of data, analysis and interpretation, or in all these areas; took part in drafting, revising or critically reviewing the article; gave final approval of the version to be published; have agreed on the journal to which the article has been submitted; and agree to be accountable for all aspects of the work. Disclosure The authors report no conflicts of interest in this work. References