Abstract Objective: Papillary thyroid carcinoma (PTC) accounts for 80% of thyroid malignancy, and the occurrence of PTC is increasing rapidly. The present study was conducted with the purpose of identifying novel and important gene panels and developing an early diagnostic model for PTC by combining artificial neural network (ANN) and random forest (RF). Methods and results: Samples were searched from the Gene Expression Omnibus (GEO) database, and gene expression datasets ([34]GSE27155, [35]GSE60542, and [36]GSE33630) were collected and processed. [37]GSE27155 and [38]GSE60542 were merged into the training set, and [39]GSE33630 was defined as the validation set. Differentially expressed genes (DEGs) in the training set were obtained by “limma” of R software. Then, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis as well as immune cell infiltration analysis were conducted based on DEGs. Important genes were identified from the DEGs by random forest. Finally, an artificial neural network was used to develop a diagnostic model. Also, the diagnostic model was validated by the validation set, and the area under the receiver operating characteristic curve (AUC) value was satisfactory. Conclusion: A diagnostic model was established by a joint of random forest and artificial neural network based on a novel gene panel. The AUC showed that the diagnostic model had significantly excellent performance. Keywords: papillary thyroid carcinoma, random forest, artificial neural network, early diagnostic, biomarker Introduction Thyroid cancer (TC), accounting for 3.4% of all cancers diagnosed annually, is the most common endocrine malignancy ([40]Seib and Sosa, 2019). TC was divided into three types: differentiated thyroid cancer, representing over 90% of thyroid cancer, consists of papillary thyroid carcinoma (PTC) and follicular thyroid carcinoma (FTC). In addition, anaplastic thyroid cancer (ATC, 1%) and poorly differentiated thyroid cancer (PDTC, 5%) are rare tumors ([41]Prete et al., 2020). With the popularization of physical examination, the incidence of PTC has increased rapidly ([42]Wang and Sosa, 2018). PTC is usually an inert and curable tumor with a 10-year survival rate>90% ([43]Alvarado et al., 2009). However, more than 25% of advanced-stage PTC patients are characterized by their invasiveness and metastasis; these traits usually result in poor prognosis ([44]Huang et al., 2021). Among PTC patients, cervical lymph node metastasis (LNM) occurs in 50%–80% of patients, which is a tested risk factor for recurrence and a reduced survival rate ([45]Ling et al., 2021). Distant metastasis occurs in 2% of advanced-stage PTC, which is the main reason for death. Lung is the most common metastatic site (53.4%), and 26.3% presented with the multiple-organ metastasis ([46]Toraih et al., 2021). So, we intended to discover novel and important gene panels and develop a diagnostic model for PTC of early stage. The availability of microarray technology and more precise RNA-sequencing technology improves the research of disease pathogenesis ([47]Xie et al., 2020). Discovering the most meaningful variables for classification is the primary question about developing a classification model using gene expression data. To resolve this, a variety of machine learning algorithm, including random forest (RF) ([48]Kursa, 2014; [49]Cai et al., 2015) and artificial neural network (ANN) ([50]Chen et al., 2014), were used. Differing from common statistical methods, machine learning involves learning from cases ([51]Van Calster et al., 2019). Therefore, RF and ANN were joined to develop a novel diagnostic model of PTC by learning from the training set and then testing the model in the validation set. The results of this study reveal a novel gene panel for early clinical diagnosis of PTC. Materials and methods Research design The study flowchart is shown in [52]Figure 1. Three gene expression datasets ([53]GSE27155, [54]GSE60542 and [55]GSE33630) were collected from the GEO database. [56]GSE27155 and [57]GSE60542 were integrated into the training set, and [58]GSE33630 was selected as the validation set. Differentially expressed genes (DEGs) were defined by the R package “limma.” Gene Ontology (GO) function enrichment analyses and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were conducted through the “clusterProfiler” package of R based on the DEGs in the training set. Also, immune cell infiltration analysis was performed by R. We determined 10 important genes through random forest from the DEGs by the “randomForest” package. Further, an artificial neural network diagnostic model was established through the “neuralnet” package using the 10 important genes and was assessed by AUC. Finally, the validity of the ANN diagnostic model was validated with the performance of a validation set. FIGURE 1. FIGURE 1 [59]Open in a new tab Flowchart. Data download and process The [60]GSE27155, [61]GSE60542, and [62]GSE33630 datasets were collected via the Gene Expression Omnibus database (GEO; [63]https://www.ncbi.nlm.nih.gov/geo/). Then, gene names were obtained by transforming probe names by R software. If several probes could be matched to identical gene, the expression data with gene was replaced by its mean expression value. Finally, 55 samples (4 normal samples and 51 PTC samples) in [64]GSE27155, 63 samples (30 normal samples and 33 PTC samples) in [65]GSE60542 and 94 samples (45 normal samples and 49 PTC samples) in [66]GSE33630 were utilized in this research ([67]Table 1). [68]GSE27155 and [69]GSE60542 were combined into a training set, and [70]GSE33630 as a validation set. TABLE 1. Source of datasets. Datasets Platform Normal PTC [71]GSE27155 [72]GPL96 4 51 [73]GSE60542 [74]GPL570 30 33 [75]GSE33630 [76]GPL570 45 49 [77]Open in a new tab Screening for differentially expressed genes Differentially expressed genes (DEGs) were identified in 34 normal samples and 84 PTC samples of the training set through the R package “limma” using the classic Bayesian data analysis. |log2FC| >1.5 and adjusted p Value <0.05 were set as threshold. Then, we obtained 94 DEGs involving 53 up-regulated genes and 41 down-regulated genes. R package “pheatmap” and “ggplot2” were utilized to conduct the heatmap and volcano plot. Gene Ontology function and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis To explore the biological significance of these DEGs, Gene Ontology (GO) enrichment analysis (adjusted p value < 0.05) categorizing genes into biological process (BP), cellular component (CC), and molecular function (MF) was performed. Meantime, Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis was used to describe metabolic pathways (p value < 0.05). Enrichment analysis of DEGs was conducted through the R package “clusterProfiler.” Immune cell infiltration analysis As a crucial component of the tumor environment, immune cells impacted the development and prognosis of tumors, their composition as well as function in various types of tumors were different. On the one hand, some immune cells have the roles to be favorable targets for immunotherapy. On the other hand, some may react negatively, even resulting in drug resistance. Given these causes, figuring out the ingredient and possible influence of immune cells in PTC was beneficial to identify valuable therapeutic targets. We downloaded a reference including human gene expression in immune cells. Then, we exploited the reference and a file including the gene expression of each sample to complete immune cell infiltration analysis and obtain the result involving the expression of immune cells of each sample by R software. Based on the result, we performed correlation analysis of immune cell by the “corrplot” package and difference analysis of immune cells by the “vioplot” package. Random forest analysis to determine important genes The important genes were identified by a random forest classifier via “randomForest” package of R. Firstly, the parameter “ntree” was set as 500 to find the best number of trees. By calculating the error of cross validation, 15 was selected as the best number of trees represented the minimum error of cross validation. Then, parameter 15 was used to reconduct random forest and the importance score of genes was obtained. Genes with an importance score greater than 2 were seen as PTC significantly related genes. The R package “heatmap” was utilized to draw a heatmap based on important genes. Development and validation of an artificial neural network model Based on 10 important genes identified by RF, an artificial neural network model was developed via the R package “neuralnet.” At first, the expression of 10 important genes was converted into “gene tag” based on their expression levels. In the case of a certain sample, the expression level of a specific gene was compared to the median of all sample expression level. Among the up-regulated genes, if the expression level is higher than the median, it will be valued as 1, otherwise 0. Among the down-regulated gene, if expression level is higher than median, it will be valued as 0, otherwise 1. Then, we finished a “gene tag” sheet. Next, the hidden layers of ANN were set as 5 to obtain a gene weight calculated by “gene tag”. Finally, the ANN diagnostic model was established. We assessed the model in the training set. Also, we validated the model in the validation set, and its diagnostic performance was evaluated by AUC. Results Differential expression analysis In total, 53 significantly up-regulated genes and 41 significantly down-regulated genes were determined between normal samples and PTC samples based on |log2FC| >1.5 and adjusted p value <0.05 as threshold in training sets [78]GSE27155 and [79]GSE60542. Heatmaps ([80]Figure 2A) and volcano plots ([81]Figure 2B) of DEGs showed favorable discrimination of gene expression. FIGURE 2. [82]FIGURE 2 [83]Open in a new tab Heatmap and volcano plot of DEGs. (A) The heatmap of differential expression analysis result. The colors of the heatmap, from red to blue, indicate high to low expression of genes in normal and PTC samples. On the upper part of the heatmap, the blue band represents normal samples and the red band stands for PTC samples. (B) The volcano plot of DEGs. The X-axis is logFC, and the Y-axis is –log10 (adj.p value). The high-expression genes’ adj. p value <0.05 and logFC >1.5 are located on the top-right. The down-regulated genes’ adj. p value <0.05 and logFC < -1.5 are located on the top-left. In addition, the black dots indicate the remaining stable genes. Gene Ontology/Kyoto Encyclopedia of Genes and Genomes enrichment analysis To reveal the biological importance of DEGs in the mechanism of PTC, we conducted GO and KEGG enrichment analysis using the package “clusterProfiler” of R software. Amid GO enrichment analysis results (adjusted p value cutoff = 0.05), concerning BP included wound healing, regulation of body fluid levels, blood coagulation and hemostasis; CC involved collagen-containing extracellular matrix; MF contained extracellular matrix structural constituents and other important functions ([84]Figure 3A). KEGG pathway enrichment analysis (p value cutoff = 0.05) indicated that the DEGs were significantly related to tyrosine metabolism, complement and coagulation cascades, ECM–receptor interaction, and cell adhesion molecules ([85]Figure 3B). FIGURE 3. [86]FIGURE 3 [87]Open in a new tab The results of GO and KEGG enrichment analysis. (A) The GO function enrichment analysis of DEGs. The X-axis stands for count of genes. The Y-axis represents BP, CC, and MF. (B) The KEGG pathways enrichment analysis of DEGs. The abscissa shows the count of genes, and the ordinate exhibits pathways. Immune cell infiltration analysis We utilized a reference including human gene expression in immune cells and a file involving gene expression of each sample to perform immune cell infiltration analysis. Then, we conducted difference analysis of immune cells by the “vioplot” package and correlation analysis of immune cells by the “corrplot” package. Compared with normal samples, T cells gamma delta, macrophages M2, dendritic cells resting, dendritic cells activated and mast cells resting were higher in PTC samples (p < 0.05). Conversely, macrophages M1, mast cells activated, and eosinophils were lower in PTC(p < 0.05) ([88]Figure 4A). And macrophages M1 were correlated with dendritic cells activated and T cells gamma delta (correlation, 0.46 and 0.47) ([89]Figure 4B). FIGURE 4. [90]FIGURE 4 [91]Open in a new tab The results of immune cell infiltration analysis. (A) The distribution of immune cells between normal samples and PTC samples. The abscissa shows various immune cells, and the ordinate shows fraction. p < 0.05 was regarded as statistically significant. (B) Correlation analysis of immune cells. The abscissa and ordinate are on behalf of immune cells, and the number stands for correlation. Red color represents a positive correlation, and blue color represents a negative correlation. The larger absolute value or deeper color explains the higher correlation. Diagnosis-related important genes with random forest We performed a random forest based on 94 DEGs to determine important genes. Considering the relationship plot between error of cross validation and the number of decision trees, we chose 15 trees as the parameter of the final model, which revealed a minimum error of cross validation in the model ([92]Figure 5A). Eventually, 10 important genes were determined on the condition that their importance score was greater than 2 for the following analysis ([93]Figure 5B). Amid the 10 important genes, GALNT7 was the most important. With heatmap, 10 important genes could be differed from normal samples and PTC samples. Among them, CITED1, AMIGO2, PSD3, GALNT7, and PROS1 were down-regulated in normal samples and up-regulated in PTC samples. TFF3, SLC4A4, AOX1, IPCEF1, and TFCP2L1 were up-regulated in normal samples and down-regulated in PTC samples ([94]Figure 5C). FIGURE 5. [95]FIGURE 5 [96]Open in a new tab (A) Correlation between the number of decision trees and the error of cross validation. The number of decision trees is shown on the abscissa; the error of cross validation is exhibited on the ordinate. The best number of decision trees was 15 for which the error of cross validation was minimum. (B) The X-axis stands for the importance score of genes calculated by the Gini coefficient method. The Y-axis represents the names of genes. (C) The heatmap of 10 important genes determined by random forest. Red color represents up-regulated genes in both samples, and blue color is on behalf of down-regulated genes in both samples. Above the picture, PTC samples are showed as a red band, and blue band indicates normal samples. Development and validation of an artificial neural network model Expression of important genes determined by RF was transformed into “gene tag” marked as 0/1. The weight of all genes was calculated for optimal discrimination between normal samples and PTC samples. Then, an ANN diagnostic model based on gene weight was established ([97]Figure 6). Performance of the model had an AUC of 0.988 ([98]Figure 7A) in the training set and 0.968 ([99]Figure 7B) in the validation set, indicating that the model was very satisfactory in diagnosing PTC. The results demonstrated that we had developed a precise diagnostic model between PTC and normal samples. FIGURE 6. FIGURE 6 [100]Open in a new tab The visualization of the ANN diagnostic model. The neural network topology with 10 input layers consists of important genes; 5 hidden layers; and 2 output layers, including normal and PTC. FIGURE 7. [101]FIGURE 7 [102]Open in a new tab Assessment and testification of the ANN diagnostic model by the ROC curve. (A) The assessment result of the training set; (B) the testification result of the validation set. Discussion To date, employment of machine learning algorithms and wide application of gene expression data in public databases offer methods to find biomarkers for cancer diagnosis in a list of fields ([103]Wang et al., 2018; [104]Tabl et al., 2019). The integration of RF and ANN can be employed to establish a stable diagnostic model for some diseases, like ulcerative colitis and abdominal aortic aneurysm ([105]Li et al., 2020; [106]Duan et al., 2022). Papillary thyroid carcinoma is characterized by slow progression and good prognosis. Early-stage PTC patients have a high postoperative survival rate, but advanced-stage PTC patients still have the risk of lymph node metastasis and distant metastasis, which seriously affect treatment and prognosis. Though ultrasound is viewed as primary screening approach for PTC, the diagnostic accuracy relying on node size is changeable, especially for nodules ≤10 mm ([107]Sutherland et al., 2021). Other factors such as equipment, scan gain, dynamic range, frequency, and doctors also significantly impact the accuracy of ultrasound diagnosis ([108]Gong et al., 2022). Owing to the absence of a perfect diagnostic approach and a lack of potential biomarkers that can be utilized in clinical practice, it is critical to establish a model for early diagnosis and screening of PTC. Among 10 important genes identified by RF, CITED1, AMIGO2, PSD3, GALNT7, and PROS1 were down-up-regulated in normal samples and up-regulated in PTC samples, TFF3, SLC4A4, AOX1, IPCEF1, and TFCP2L1 were up-regulated in normal samples and down-regulated in PTC samples. Then, an ANN diagnostic model was established, and the performance was assessed by AUC (0.988). Also, we testified the diagnostic ability in the validation set and the AUC was 0.968, which had great efficiency. Together, the developed diagnostic model could offer a novel perspective on our research of the mechanism of PTC. In a previous study, 9 genes have been verified among 10 important genes related to PTC. The incidence and progression of a series of malignancies are relevant to the high expression of Polypeptide N-Acetylgalactosaminyltransferase7 (GALNT7) and its family members ([109]Nakagawa et al., 2017; [110]Detarya et al., 2020). Wang argued that GALNT7 by activating EGFR/PI3K/AKT kinase pathway to promote cell proliferation and invasion of papillary thyroid cancer ([111]Wang et al., 2021a). In recent years, various human Pleckstrin and sec7 domain-containing 3 (PSD3) have been believed to be related to some tumors, like acute myeloid leukemia ([112]Walker et al., 2021), breast cancer metastasis ([113]Thomassen et al., 2009), astrocytoma progression ([114]van den Boom et al., 2006). PSD3 inhibits apoptosis in papillary thyroid cancer by promoting proliferation, migration, invasion and G1/S transition ([115]Jin et al., 2021). Abnormal expression of Protein S (PROS1) affects human papillary thyroid cancer progression, especially associated with lymph node metastasis ([116]Wang et al., 2021b). Low expression of SLC4A4 affects invasion, metastasis, and the MAPK signaling pathway in PTC. Huang argued the down-regulation of SLC4A4 might be on account of the excessive iodine intakes of patients ([117]Huang et al., 2021). Glu/Asp rich carboxy-terminal domain 1 (CITED1) via Wnt/β-catenin signaling pathway results in the development of PTC ([118]Wang et al., 2019). Li showed that CITED1 increased phosphorylation of pRb as well as E2F1 transcriptional activity when p21 and p27 were expressed at low levels, and verified CITED1 was involved in PTC cell proliferation and tumorigenesis ([119]Li et al., 2018). Yu reported that aldehyde oxidase 1 (AOX1) protein level in blood plasma was lower in patients with PTC, which indicated that AOX1 level in blood plasma had the potential to differ in PTC from healthy humans. Furthermore, low levels of AOX1 were highly related to poor survival of PTC ([120]Yu et al., 2021). IPCEF1 was viewed as a significant biomarker for PTC. Moreover, a study showed that the hsa_circ_IPCEF1/hsa-miR-3619–5p axis was associated with the mechanism of PTC, which offers a new idea for further diagnosis and treatment of PTC ([121]Guo et al., 2021). Several researchers revealed noticeable differences in Trefoil factor 3 (TFF3) between benign thyroid nodules and thyroid malignancy ([122]Krause et al., 2008; [123]Karger et al., 2012). Low expression of TFCP2L1 can promote the progression of PTC and CircHACE1 curbs PTC development by upregulating TFCP2L1 through adsorbing miR-346 ([124]Li et al., 2021). Interestingly, we identified another important gene, Adhesion Molecule With Ig Like Domain 2 (AMIGO2), which has never been reported to be associated with PTC. In addition, the research also has several limitations. Firstly, though we have validated 10 significantly PTC-related genes, the sample size is relatively small. Secondly, the ANN diagnostic model was conducted using datasets from the GEO database, so it should be tested in laboratory experiments and clinical practice. Conclusion In this study, 10 genetic biomarkers related to PTC were determined and the ANN model established by the 10-gene panel displayed satisfactory performance when diagnosing PTC. Moreover, the present research offers a useful basis for early screening of PTC and promotes further study for development of PTC as well as provides potential genes as targets for clinical treatment. In conclusion, our finding has a certain clinical value that can be valuable for early diagnosis of PTC. Data availability statement The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found in the article/[125]Supplementary Material. Author contributions SW and MG conceived of and designed this study. WL, ZY, and XX made literature searches. SW performed figures and tables. WL and ZY conducted data analysis. SW wrote the manuscript, and MG critically reviewed it. MG supervised the research. All authors have read and approved the final manuscript. Conflict of interest The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. Publisher’s note All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher. Supplementary material The Supplementary Material for this article can be found online at: [126]https://www.frontiersin.org/articles/10.3389/fgene.2022.957718/ful l#supplementary-material [127]Click here for additional data file.^ (10.8KB, XLS) [128]Click here for additional data file.^ (2KB, XLS) [129]Click here for additional data file.^ (7.4KB, XLS) [130]Click here for additional data file.^ (4.8KB, XLS) [131]Click here for additional data file.^ (22.8KB, XLS) [132]Click here for additional data file.^ (3.7KB, XLS) References