Abstract

Background
Acute myeloid leukemia (AML) is a clonal malignant disease with a poor prognosis and a low overall survival rate. Although many studies on the treatment and detection of AML have been conducted, the molecular mechanism of AML development and progression has not been fully elucidated. The present study was designed to investigate the molecular mechanism of AML using a comprehensive bioinformatics analysis, and to build a clinically applicable model for predicting the survival probability of AML patients.

Methods
To simplify the complicated regulatory networks, we constructed gene co-expression and protein-protein interaction (PPI) networks based on WGCNA and the STRING database using a modularization design. Two machine learning methods, the least absolute shrinkage and selection operator (LASSO) algorithm and support vector machine-recursive feature elimination (SVM-RFE), were used to filter the common hub genes with five-fold cross-validation. The candidate hub genes were used to build a predictive model of AML by Cox proportional hazards analysis, which was validated in The Cancer Genome Atlas (TCGA) cohort and the OHSU cohort. The hub genes were further verified experimentally by qRT-PCR and western blotting at the mRNA and protein levels.

Results
Three hub genes, FLT3, CD177, and TTPAL, were used to build a clinically applicable model that predicts the survival probability of AML patients and divides them into high-risk and low-risk groups. To compare the predictive ability of the model with that of classical clinical features, we generated a nomogram. The model contributed the most risk points compared with the other clinical characteristics, which was consistent with the results of the multivariate Cox regression.

Conclusion
This study reveals a novel molecular mechanism of AML and constructs a clinical model significantly related to AML patient prognosis.
We showed the integrated roles of critical pathways and their associated hub genes, which provide potential targets and new research ideas for the treatment and early detection of AML.

Keywords: AML, modularization, machine learning, prognostic model, FLT3

Introduction
AML is a devastating hematological malignancy. Differentiation arrest and unscheduled proliferation of immature cells of the myeloid lineage are characteristic of this disease (Hansrivijit et al., 2019). A variety of chemotherapy regimens, biological agents, and stem cell transplantation are the main treatment options for AML (Liu et al., 2019a,b; Zhou et al., 2019). However, the toxicity of chemotherapy drugs may lead to acute and life-threatening complications. Compared with standard chemotherapy, allogeneic stem cell transplantation is a suitable method to reduce the risk of AML recurrence, but it also increases the risk of serious complications. Although they are continuously being improved, traditional treatments do not lead to a complete cure or an ideal duration of survival for AML in clinical practice (Manola et al., 2013). Genomics, proteomics, and bioinformatics methods have been used to develop new personalized treatment strategies. Studying the functions of related biomolecules and collecting information on emerging trends in matching genomic data with clinical data are effective ways to improve the prognosis of patients (Wang et al., 2015; Bret et al., 2016; Cai and Levine, 2019). Although many studies have analyzed genome variation in AML, the association between genome variation and the molecular mechanism of AML is still unclear. Therefore, a comprehensive study of AML is urgently needed. In this study, we aimed to explore the molecular mechanism of AML using a comprehensive bioinformatics analysis, and to construct a clinical model based on survival-associated hub genes of AML patients.
We first performed functional annotation and modularization of the differentially expressed genes; we then filtered candidate hub genes in the GEO training cohort using machine learning algorithms with five-fold cross-validation, and validated those hub genes in the TCGA and OHSU cohorts. We also hope that the results of this study can help identify key pathways and genes related to AML, and provide possible targets and new research ideas for the treatment and early detection of AML.

Materials and Methods

Sample Collection
Bone marrow AML samples and non-leukemia samples were collected from 24 patients at The Second Affiliated Hospital of Qiqihar Medical University. Two independent pathologists made the diagnosis of AML and assessed the samples. Patient characteristics are summarized in Supplementary Table 1.

Microarray Data Source and Pre-processing
The gene expression profiles of AML were obtained from three data sets, GSE6891, GSE10358, and GSE15061, of the NCBI GEO database, which are based on the Affymetrix HT HG-U133A and HG-U133A 2.0 Array. A total of 223 biochips were analyzed, including 154 AML tumor samples and 69 non-leukemia samples. The raw data of the three datasets were downloaded from GEO, and the R package Simpleaffy was used for Affymetrix quality control and data analysis (Yu et al., 2010). Probes were annotated with gene symbols using the annotation of each respective platform. Expression data from all 223 samples were then merged into a single gene expression matrix. When multiple probe sets mapped to a single gene symbol, their mean expression value was used. Batch correction was performed before further analysis using the ComBat method in the sva R package (Johnson et al., 2007; Leek et al., 2012).
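The probe-averaging step described above (one expression value per gene symbol, taken as the mean over its probe sets) can be sketched as follows. This is a minimal illustration in plain Python rather than the R pipeline used in the study; the probe IDs, gene labels, and expression values are invented toy data, and `collapse_probes_to_genes` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(probe_expr, probe_to_symbol):
    """Average the expression of all probe sets that map to the same gene symbol.

    probe_expr: dict of probe ID -> list of expression values (one per sample)
    probe_to_symbol: dict of probe ID -> gene symbol (unannotated probes absent)
    Returns a dict of gene symbol -> list of per-sample mean values.
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_expr.items():
        symbol = probe_to_symbol.get(probe_id)
        if symbol is None:
            continue  # skip probes without an annotation
        grouped[symbol].append(values)

    gene_expr = {}
    for symbol, rows in grouped.items():
        # zip(*rows) walks the samples column by column across the probes
        gene_expr[symbol] = [mean(column) for column in zip(*rows)]
    return gene_expr


# Toy data (hypothetical probe IDs and values, two samples)
toy_expr = {
    "probe_1": [8.0, 6.0],
    "probe_2": [10.0, 8.0],
    "probe_3": [5.0, 5.5],
    "probe_4": [1.0, 1.0],  # no annotation, so it is dropped
}
toy_annotation = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
gene_matrix = collapse_probes_to_genes(toy_expr, toy_annotation)
```

In this toy case the two probes for GENE_A are averaged sample by sample, while GENE_B keeps the values of its single probe. Batch correction would be applied to the merged matrix afterwards and is not shown here.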

Functional Analysis
The limma package was used to identify differentially expressed genes (DEGs) (Ritchie et al., 2015), and GO and KEGG analyses were then performed using clusterProfiler (Yu et al., 2012). Genes with |log2(fold change)| > 1 and a BH-adjusted p value < 0.05 were considered statistically significant. GSEA was used to analyze in more depth the variation in biological functions and pathways between AML and non-leukemia samples (Subramanian et al., 2005).

Module Analysis Using WGCNA Based on PPI Network
To modularize the biological variation in AML, the WGCNA package was used for co-expression analysis of the DEGs (Langfelder and Horvath, 2008). The data were then superimposed onto the STRING PPI database (Szklarczyk et al., 2015). The co-expression clusters were delineated using the dynamic tree cut package with the minimum height for each module set at 0.2 (Langfelder et al., 2008). The expression trend of each module was summarized by its eigengene, and the members of each module were collected through Pearson correlation from among the DEGs and their interactors. Moreover, a topological overlap matrix was used to filter the PPI network (Ravasz et al., 2002). Finally, individual modules were annotated using clusterProfiler (Yu et al., 2012) and visualized in Cytoscape (Shannon et al., 2003).

Construction of the Predictive Model of AML
First, the least absolute shrinkage and selection operator (LASSO) algorithm and support vector machine-recursive feature elimination (SVM-RFE) were each used to filter the hub genes with five-fold cross-validation (Tibshirani, 1996; Huang et al., 2014). We then combined the results of the LASSO and SVM-RFE algorithms to obtain the common hub genes. Finally, Cox proportional hazards analysis was performed using the glmnet R package, and hub genes that were selected more than 900 times in 1,000 repetitions were retained as risk-associated hub genes (Friedman et al., 2010; Xu et al., 2017).
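The two filtering steps just described (intersecting the LASSO and SVM-RFE selections, then keeping genes chosen in more than 900 of 1,000 repetitions) reduce to set and counting operations once the per-run selections are available. The sketch below assumes those selections already exist as sets of gene symbols; the model fitting itself is not shown, and the function names and gene labels are hypothetical.

```python
from collections import Counter


def common_hub_genes(lasso_genes, svm_rfe_genes):
    """Genes selected by both feature-selection methods."""
    return set(lasso_genes) & set(svm_rfe_genes)


def stable_genes(selections, min_count):
    """Keep genes selected in more than `min_count` of the repeated runs.

    selections: iterable of sets, one set of selected genes per repetition
    """
    counts = Counter()
    for selected in selections:
        counts.update(set(selected))  # count each gene once per run
    return {gene for gene, n in counts.items() if n > min_count}


# Toy example with hypothetical gene labels
lasso = {"GENE_A", "GENE_B", "GENE_C"}
svm_rfe = {"GENE_B", "GENE_C", "GENE_D"}
common = common_hub_genes(lasso, svm_rfe)

# 1000 simulated repetitions: GENE_B is always kept,
# GENE_C is kept in 950 runs and GENE_D in 400 runs
runs = []
for i in range(1000):
    chosen = {"GENE_B"}
    if i < 950:
        chosen.add("GENE_C")
    if i < 400:
        chosen.add("GENE_D")
    runs.append(chosen)
stable = stable_genes(runs, min_count=900)
```

With these toy inputs, GENE_D is discarded by the frequency filter because it appears in only 400 of the 1,000 simulated runs. The strict inequality mirrors the "more than 900 times" wording of the text.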
X-tile 3.6.1 software (Yale University, New Haven, CT, United States) was used to determine the best cutoff for categorizing AML patients as low risk or high risk. The predictive capacity of the prognostic model was assessed by the log-rank test and Kaplan-Meier survival analysis in the GEO training cohort, the TCGA testing cohort, and the OHSU testing cohort.

Assessment of Nomogram Performance
To predict the 1-, 3-, and 5-year survival probability of AML patients, we built a nomogram based on the results of the multivariate analysis, including age, gender, race, chemotherapy status, radiation therapy status, gene fusion, and risk type. Moreover, a calibration plot was used to compare the predicted probabilities against the observed ones.

Quantitative Real-Time PCR (qRT-PCR)
qRT-PCR was used to verify the results. Total RNA of AML samples and non-leukemia samples was extracted using TRIzol reagent. The genes of interest were then quantified by qRT-PCR using a One-Step qPCR Kit (Invitrogen, United States) on a CFX Connect™ Real-Time System (BIO-RAD, United States), according to the manufacturer's instructions. The results were analyzed using the 2^(-ΔΔCT) method, with GAPDH as the reference gene (Livak and Schmittgen, 2001). The primer sequences of the target genes are shown in Supplementary Table 2.

Western Blotting Analysis
The tissues were lysed, and total protein was quantified using the Pierce™ Detergent Compatible Bradford Assay Kit (Thermo Scientific). For each sample, 20 μg of protein was used for SDS-PAGE. After transfer onto a PVDF membrane, the blot was incubated with the indicated antibodies. Antibodies were purchased from Abcam and CST: CD177 (ab203025, Abcam), TTPAL (ab103740, Abcam), FLT3 (#3462, CST), and GAPDH (#5174, CST).

Statistical Analysis
All experiments were performed at least in triplicate.
For comparisons between two groups (AML samples versus non-leukemia samples), Student's t test was used. Data are presented as mean ± SD, except when indicated otherwise. A p value < 0.05 was considered statistically significant.

Results

Identification of Differential Expression Genes (DEGs) and Functional Variation
We used the limma package to screen out DEGs from 154 AML samples and 69 non-leukemia samples. The inclusion criteria for DEGs were an absolute log2FC > 1 and a BH-adjusted p value < 0.01. A total of 1,084 DEGs, including 202 significantly upregulated genes and 882 significantly downregulated genes, were obtained. The volcano plot of the DEGs is shown in Figure 1A.

FIGURE 1. Functional analysis. (A) The volcano plot of the 1,084 DEGs between AML tumor samples and non-leukemia samples. Genes known to be upregulated and downregulated are displayed in different colors. Three hub genes chosen for model construction are indicated; (B) The KEGG pathway enrichment analysis showed that transcriptional misregulation in cancer, hematopoietic cell lineage, cell cycle, and Th1, Th2, and Th17 cell differentiation were the most significantly affected pathways in AML; (C) The most enriched GO terms were involved in neutrophil activation, neutrophil degranulation, neutrophil activation involved in immune response, neutrophil mediated immunity, and leukocyte migration; (D) The GSEA results for AML patients and non-leukemia tissues, performed on all genes at the transcription level. Three hub genes chosen for model construction are indicated.

To further analyze the DEGs, we explored the functional variation between the two groups using the clusterProfiler package. A total of 107 GO terms were identified with a BH-adjusted p value < 0.01. The GOSemSim package was used to remove redundant terms, keeping only one representative term, which resulted in 49 unique GO terms (Yu et al., 2010).
The results of the GO analysis showed that the most enriched GO terms were involved in neutrophil activation, neutrophil degranulation, neutrophil mediated immune response, and leukocyte migration (Figure 1B). The KEGG pathway enrichment analysis showed that transcriptional misregulation in cancer, hematopoietic cell lineage, cell cycle, and Th1, Th2, and Th17 cell differentiation were the most significantly affected pathways in AML (Figure 1C). These results complemented the results of the GO enrichment analysis. To further verify the relationship between the phenotype and the functionally differentiated genes, we performed GSEA on all genes at the transcription level. The transcripts of AML were found to be remarkably associated with downregulated genes related to three pathways (Figure 1D).

Integrative Network Analysis Reveals New Functional Modules
An integrative analysis method was used to model the dynamics of proteome changes upon cancer progression, as previously described (Tan et al., 2017). We applied WGCNA to all DEGs to cluster correlated proteins that had similar molecular functions or biological processes (Jansen et al., 2002). These proteins were then superimposed onto the PPI network to identify the functional modules. As a result, we identified 143 modules with the number of proteins in each ranging from 2 to 25 (Figure 2A), and 122 of these modules were highly interconnected by their members (Figure 2B). Each module was annotated using known functional terms or signaling pathways. We found that many modules, including modules 3, 10, 15, 22, and 24 (Figure 2C), were notably enriched in hematopoietic system related progression. In addition, module 27 was found to be involved in RNA splicing, module 38 in autophagy, module 54 in the regulation of transcription, and module 83 in translational initiation (Figure 2D).
In summary, the progression of AML involves the balanced regulation and extensive reprogramming of mutually connected functional modules.

FIGURE 2. Expression profiling of proteome reveals co-expression clusters and functional modules in AML. (A) Distribution of 120 out of 143 modules. Each node represents an individual module, with node size reflecting module size. Edges connect modules that share PPIs. Boxed modules are further enlarged in C and D; (B) 143 modules with the number of proteins ranging from 2 to 25 were identified; (C) Modules 3, 10, 15, 22, and 24 were notably enriched in hematopoietic system related progression, shown with protein names and representative functional terms; (D) Module 27 was found to be involved in RNA splicing, module 38 in autophagy, module 54 in the regulation of transcription, and module 83 in translational initiation.

FIGURE 3. Two algorithms were used for hub gene selection. (A) LASSO; (B) SVM-RFE; (C) Common genes selected by the two algorithms.

Construction, Validation and Assessment of the Predictive Model of AML
Considering the variation between AML patients and healthy people, we aimed to estimate the predictive potential of the DEGs. Differential expression analysis yielded 1,084 DEGs in AML patients. Next, we applied two distinct machine learning algorithms, LASSO and SVM-RFE, to screen the most significant DEGs for building the prognostic model. Using the LASSO algorithm, we identified a set of 16 hub genes. Using the SVM-RFE algorithm, we selected a set of 17 hub genes. After integrating the hub genes from the LASSO and SVM-RFE algorithms, we obtained 40 hub genes with 7 hub genes identified simultaneously by the two machine learning algorithms with five-fold cross-validation.
In detail, the training set was randomly divided into five equal portions. During each of the five iterations, we first applied LASSO and SVM-RFE as feature selection methods on 4/5 of the training data and trained the classifiers with the selected features. Next, we applied the trained classifiers to the remaining 1/5 of the training data for prediction. Finally, the predictions from all five iterations were combined and compared with the true labels. The 7 significant hub genes are VPREB3, CYP4F3, TTPAL, CTSE, RBP7, CD177, and FLT3 (Figures 3A–C). Then, Cox proportional hazards analysis was used to stratify the AML patients into high-risk and low-risk subgroups. We established the predictive model by calculating a risk score to predict survival in the GEO training cohort (Figure 4A):

risk score = normalized expression level of FLT3 × 0.261 + normalized expression level of CD177 × 0.327 - normalized expression level of TTPAL × 0.555

The cutoff point between high-risk and low-risk patients was obtained using X-tile software. Figures 4B,C show the predictive ability of the prognostic model in the TCGA and OHSU testing cohorts, respectively. The results of the Kaplan-Meier survival analysis are shown in Figures 4D,E. Moreover, to determine whether the FLT3, CD177, and TTPAL genes influence AML prognosis independently, we performed survival analysis and found that these three hub genes were associated with the prognosis of AML both individually and within the established model. We then created a nomogram to predict the 1-, 3-, and 5-year overall survival of AML patients. The model contributed the most risk points compared with the other clinical characteristics, which was consistent with the results of the multivariate Cox regression (Figure 5A). Finally, a calibration plot was used to assess the consistency between prediction and observation. As expected, the results were found to be close to the ideal curve (Figure 5B).

FIGURE 4.
Prognostic analysis of the predictive model. (A–C) Association between the risk score (upper) and the expression of the three prognostic hub genes (bottom) in the GEO training cohort, the TCGA testing cohort, and the OHSU testing cohort; (D–F) Kaplan-Meier survival analysis showed that OS was significantly higher in the low-risk score subgroup than in the high-risk score subgroup in the GEO training cohort, the TCGA testing cohort, and the OHSU testing cohort.

FIGURE 5. AML survival nomogram. (A) Nomogram for predicting the probability of 1-, 3-, and 5-year OS for AML patients; (B–D) Calibration plots of the nomogram for predicting the probability of OS at 1, 3, and 5 years.

Experimental Verification of Candidate Genes in mRNA and Protein Levels
To confirm the DEGs, total RNA of 24 paired AML samples was isolated for qRT-PCR validation. A total of 40 target DEGs were selected, as shown in Figure 6. The DEGs were successfully validated and showed good correspondence with the results of the transcriptome analysis, indicating that the microarray results were precise and reliable.

FIGURE 6. (A,B) Validation of DEGs by qRT-PCR. Boxplots indicate the medians and dispersions of 40 AML and normal samples. P-values were calculated by Student's t test, *p < 0.05, **p < 0.01, ***p < 0.001.

At the same time, we confirmed the 3 hub genes at the protein level. FLT3 protein expression was found to be upregulated in AML, whereas CD177 and TTPAL were downregulated in AML, which is consistent with the results of the qRT-PCR (Figure 7). One of the most widely studied genes in the hematopoiesis of AML is FLT3. FLT3 is a class III receptor tyrosine kinase that plays a crucial role in hematopoiesis (Reilly, 2002). The pathogenesis of several malignant tumors is associated with the overexpression of FLT3 (Fassunke et al., 2010).
In particular, the FLT3 gene has been intensively studied in childhood AML (Liang et al., 2002; Boissel et al., 2006). CD177 is mostly expressed in neutrophils, and is upregulated in tumor tissues of patients with colitis associated cancer (CAC). CD177 has been shown to predict a favorable prognosis in colorectal cancer (Bai et al., 2017; Zhou et al., 2018).

FIGURE 7. Detection at the protein level. Western blotting detection of the indicated proteins. Lysates from three pairs of AML and normal samples were subjected to western blotting with antibodies to FLT3, CD177, TTPAL, and GAPDH. GAPDH served as the reference.

Discussion
AML is a clonal malignant disease with a poor prognosis and a low overall survival rate. It originates from primordial hematopoietic cells of the bone marrow. Immature leukocytes grow rapidly and interfere with the production of normal blood cells. The median survival time of AML patients is only 5–10 months (Hansrivijit et al., 2019). The overall survival (OS) rate with traditional treatment (chemotherapy and stem cell transplantation) for AML is low; chemotherapy is often accompanied by complications, while stem cell transplantation is costly and carries a risk of rejection (Vaughn et al., 2019). The molecular mechanism of AML development and progression is not fully understood, and it is particularly important to find new targets and strategies for individualized therapy. In this study, we combined three datasets from the Gene Expression Omnibus (GEO) as the GEO cohort, including 377 AML samples and 69 non-leukemia samples. The ComBat algorithm in the sva R package was used to remove the batch effect.
Through GO and KEGG analyses, we found that the dysregulated genes in AML patients were primarily enriched in cytokine-cytokine receptor interaction, transcriptional misregulation in cancer, the chemokine signaling pathway, and neutrophil related functions, such as neutrophil activation, neutrophil degranulation, and neutrophil mediated immunity. Traditional strategies for gene expression analysis have focused on identifying individual genes that exhibit differences between two or more states of interest. However, some specific pathways might be significantly affected even though the changes in expression of individual genes are relatively subtle. To address this problem, we performed GSEA using MSigDB (c5.bp.v6.2.symbols.gm) as the reference gene set. The results of GSEA supported this conjecture and were in good agreement with the GO and KEGG results.

In addition to the functional annotation of the differentially expressed genes, we also explored the gene co-expression and PPI networks based on WGCNA and the STRING database using a modularization design. Modules were generated from a hierarchical cluster tree algorithm and a topological overlap matrix, and then functionally annotated (Ravasz et al., 2002; Langfelder et al., 2008). With this approach, the intricate regulatory networks were simplified into modules that are easier to interpret, which helped to ascertain the connections of hub genes in the biological processes. From the modularization analysis, we found that the progression of AML was related to the balanced regulation and extensive reprogramming of mutually connected functional modules, such as leukocyte migration, the T cell receptor (TCR) signaling pathway, autophagy, and RNA splicing. The function of autophagy in cancer, as a driver of oncogenic transformation or an inhibitor of tumor progression, remains a controversial topic. Watson et al.
found that hematopoietic stem and progenitor cells possess higher autophagic flux than mature hematopoietic cells, but that the flux of AML cells tends to decrease. This, combined with the fact that autophagy-related genes were subject to copy number variation (CNV) loss in AML, may imply a connection between decreased autophagy and the progression of AML (Watson et al., 2015). Recently, the dysfunction of gene splicing in AML development and drug resistance has received attention. Several recent studies have emphasized that splicing factor mutations are important drivers of hematological malignancies (de Necochea-Campion et al., 2016; Zhou and Chng, 2017; Tyner et al., 2018). For example, the oncogene Wilms' tumor gene 1 (WT1) is a target for immunotherapy and a biomarker in AML, and a large number of WT1 isoforms have been validated. Among them, +5/+KTS is the notable variant with respect to prognosis, although the isoform ratio varies (Siehl et al., 2004; Lopotová et al., 2012).

Benefiting from machine learning methods, we established a clinically applicable model to predict the survival probability of AML patients. This model was built in the GEO cohort and validated in the TCGA and OHSU cohorts. Our results showed that AML patients could be stratified into two subgroups with high or low risk with respect to OS. Kaplan-Meier survival analysis was used to evaluate the predictive capacity. The clinical features and the accuracy of the model were assessed with the nomogram and the calibration plot. Furthermore, the three hub genes identified by the machine learning algorithms were confirmed experimentally by qRT-PCR and western blotting at the mRNA and protein levels. Together, these results suggest that the integrated strategy is feasible. In addition, the three final hub genes are all associated with cancer, especially FLT3. FLT3 is considered to be a treatment target for AML, and the development of FLT3-related clinical targets is currently very active.
FLT3 is characterized by the presence of five immunoglobulin-like motifs within its extracellular region. These motifs are exclusively expressed in hematopoietic cells (Blume-Jensen and Hunter, 2001). FLT3 mutations occur as secondary events during AML clonal evolution (Shlush et al., 2014). The FLT3-ITD mutation has a negative impact on the prognosis of AML; only a minority of patients with the FLT3-ITD mutation in leukemic blasts are cured through chemotherapy.

Overall, we explored the molecular mechanisms that influence the occurrence and development of AML at the genome level using an integrated method, and built a model to predict the survival probability of AML patients for clinical use. We also hope that the results of this study may help to identify critical pathways and genes associated with AML and provide potential targets and new research ideas for the treatment and early detection of AML.

Data Availability Statement
Publicly available datasets were analyzed in this study. The TCGA database: https://portal.gdc.cancer.gov. The Gene Expression Omnibus: https://www.ncbi.nlm.nih.gov/geo/.

Ethics Statement
The study protocol was approved by the Ethics Board of the Second Affiliated Hospital of Qiqihar Medical University and complied with the Declaration of Helsinki.

Author Contributions
HZ designed the study and wrote the manuscript. YiQ analyzed the data. SZ, YaQ, and HG collected the data. YiQ, SW, XW, and TH performed the experiments. All authors read and approved the manuscript.

Conflict of Interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments