Abstract

In 2019 it is estimated that more than 21,000 new acute myeloid leukemia (AML) patients will be diagnosed in the United States, and nearly 11,000 are expected to die from the disease. AML is primarily diagnosed among the elderly (median 68 years old at diagnosis). Prognoses have significantly improved for younger patients, but as many as 70% of patients over 60 years old will die within a year of diagnosis. In this study, we conducted a reanalysis of 2,213 acute myeloid leukemia patients compared to 548 healthy individuals, using curated publicly available microarray gene expression data. We analyzed normalized, batch-corrected data using a linear model that included disease, age, sex, and tissue as factors. We identified 974 differentially expressed probe sets and 4 significant pathways associated with AML. Additionally, we identified 375 age-related and 70 sex-related probe set expression signatures relevant to AML. Finally, we trained a k-nearest neighbors model to classify AML and healthy subjects with 90.9% accuracy. Our findings provide a new reanalysis of public datasets that enabled the identification of new gene sets relevant to AML, which can potentially be used in future experiments and possibly in stratified disease diagnostics.

Subject terms: Acute myeloid leukaemia, Gene expression, Transcriptomics

Introduction

Acute myeloid leukemia (AML) is a heterogeneous malignant disease of the myeloid cell lineage of the hematopoietic system^1–5. AML is best characterized by terminal differentiation in normal blood cells and excessive production and release of cells at various stages of incomplete maturation (leukemia cells). As a result of this faster-than-normal, uncontrolled growth of leukemia cells, healthy myeloid precursors involved in hematopoiesis are suppressed, and the disease can ultimately lead to death within months of diagnosis if untreated^1,6.
AML accounts for 70% of myeloid leukemia and nearly 80% of acute leukemia cases, making it the most common form of both myeloid and acute leukemia^1,7. The number of new AML cases is increasing each year: in 2019 alone, an estimated 21,450 new AML patients will be diagnosed, and 10,920 are expected to die from the disease^8. According to the 2016 World Health Organization (WHO) newly revised myeloid neoplasms and acute leukemia classification system^9, AML prognosis criteria for classification are highly dependent on the presence of chromosomal abnormalities, including chromosomal deletions, duplications, translocations, inversions, and gene fusions. AML is diagnosed predominantly through microscopic, cytogenetic, and molecular genetic analyses of patients’ blood and/or bone marrow samples. Microscopic examination may be used to detect distinctive features in cell morphology (e.g., Auer rods), cytogenetic analysis to identify chromosomal structural aberrations (e.g., t(8;21), inv(16), t(16;16), or t(9;11)), and molecular genetic analysis to identify gene fusions (e.g., RUNX1-RUNX1T1 and CBFB-MYH11) and mutations in genes frequently mutated in AML (e.g., NPM1, CEBPA, RUNX1, FLT3)^1,3,5,10–12. Such cytogenetic and molecular genetic analyses are used to identify prognostic markers for classifying AML patients into three risk categories: favorable, intermediate, and unfavorable, currently based primarily on the European LeukemiaNet (ELN) 2017 classification^3,10 (see Estey^3 for a recent review, including ELN assessments). A large group of AML patients present normal karyotypes and lack chromosomal abnormalities^3,5,10,11,13. These patients are classified as intermediate risk, and often have heterogeneous clinical outcomes with standard therapy, including a risk of AML relapse^3,5,14.
Additionally, AML prognosis worsens with age, and older patients respond less to current treatments, with poorer clinical outcomes compared to younger patients^15,16. AML can occur in people of all ages but is primarily diagnosed among the elderly (>60 years old), with a median age of 68 years at diagnosis^8. Recent advances in AML biology have expanded our understanding of its complex genetic landscape, and led to significant improvement in prognoses and therapeutic strategy for younger patients^2,16. For elderly patients, prognoses remain grim and the main therapeutic strategy, remission induction therapy followed by an intensive consolidation phase (post-remission), had remained nearly unchanged over the past three decades^1,2,4,5,10,16,17. More recently, however, new therapeutic agents have been approved for older AML patients^4,5, and these include venetoclax (combined with decitabine or azacitidine)^18, midostaurin (combined with standard chemotherapy)^19, and gilteritinib^20. These new therapeutic agents are expected to improve prognosis for older AML patients; in the past, up to 70% of AML patients aged 65 or older were reported to die within a year following diagnosis^21. While it is apparent that the nature of AML changes with age, little is still known about the extent of these associations and how they vary with patient age^2,22,23, and current indications from ELN and the National Comprehensive Cancer Network (NCCN) essentially consider age as a surrogate variable that is used only in conjunction with other treatment-related mortality factors^3,5,10. Taking age into consideration when identifying changes in AML global gene expression may lead to improved early diagnosis and better treatment approaches for elderly patients.
To further complicate matters, AML has multiple driver mutations and competing clones that evolve over time, making it a very dynamic disease^13,24. Multiple gene expression analyses of AML have been carried out, 25 of which have been systematically compared by Miller and Stamatoyannopoulos^25, who analyzed information on 4,918 genes, and identified 25 genes reported across multiple studies, with potential prognostic features. In this study, we performed a comprehensive gene expression analysis of 2,213 AML patients and 548 healthy subjects, by re-analyzing publicly available gene expression microarray data from 37 curated studies (a reanalysis following strict inclusion criteria) and identified disease-, age- and sex-related gene expression changes associated with AML. The differentially expressed gene sets were associated with signaling pathways relevant in AML, and were also used to train and test a predictive model of AML or healthy status. We believe that our results may lead to improved early detection of AML and to diagnostic testing with target genes, which collectively can potentially serve as age- and sex-dependent biomarkers for AML prognosis, as well as new treatment targets with mechanisms of action different from those used in conventional chemotherapy.

Results

Data curation and gene expression pre-processing

We searched the Gene Expression Omnibus (GEO) public repository, based on our systematic workflow and inclusion criteria (Fig. 1a,b). Overall, 2,132 datasets were screened, and 643 selected (577 were excluded as non-Affymetrix, various platform arrays). From the 66 remaining corresponding studies, 34 were excluded for one or more of the following reasons: lack of metadata, use of tissues other than peripheral blood or bone marrow, being cell line or cell-type specific, or analyzing treated subjects. After this curation we obtained 34 age-annotated gene expression datasets from 32 different studies covering 2,213 AML patients and 548 healthy individuals.
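The screening logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `DatasetRecord` fields, the `exclusion_reason` helper, the toy records, and the order in which the criteria are checked are all hypothetical. Only the platform and tissue restrictions, the exclusion reasons, and the counts in the final arithmetic check are taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Platforms and tissues named in the text and Table 1; the rest is illustrative.
ALLOWED_PLATFORMS = {"GPL570", "GPL96", "GPL97"}
ALLOWED_TISSUES = {"PB", "BM"}  # peripheral blood, bone marrow


@dataclass
class DatasetRecord:
    """Hypothetical, simplified metadata for one candidate dataset."""
    accession: str
    platform: str
    tissue: str
    has_metadata: bool
    is_cell_line: bool
    is_treated: bool


def exclusion_reason(rec: DatasetRecord) -> Optional[str]:
    """Return why a record is excluded, or None if it passes every criterion."""
    if rec.platform not in ALLOWED_PLATFORMS:
        return "non-Affymetrix or other platform"
    if rec.tissue not in ALLOWED_TISSUES:
        return "tissue is not peripheral blood or bone marrow"
    if not rec.has_metadata:
        return "lack of metadata"
    if rec.is_cell_line:
        return "cell line or cell-type specific"
    if rec.is_treated:
        return "treated subjects"
    return None


# Toy records (not real GEO entries), one per outcome of interest.
candidates = [
    DatasetRecord("TOY001", "GPL570", "BM", True, False, False),
    DatasetRecord("TOY002", "OTHER_PLATFORM", "PB", True, False, False),
    DatasetRecord("TOY003", "GPL570", "PB", False, False, False),
    DatasetRecord("TOY004", "GPL96", "BM", True, True, False),
    DatasetRecord("TOY005", "GPL570", "PB", True, False, True),
]

reasons = {rec.accession: exclusion_reason(rec) for rec in candidates}
kept = [acc for acc, why in reasons.items() if why is None]
dropped = {acc: why for acc, why in reasons.items() if why is not None}

# Arithmetic check of the counts reported in the text.
selected, excluded_by_platform, excluded_by_criteria = 643, 577, 34
remaining = selected - excluded_by_platform      # studies left after platform filter
studies_kept = remaining - excluded_by_criteria  # studies left after curation
```

With these inputs, only the first toy record survives, and the arithmetic reproduces the 66 remaining and 32 retained studies reported above.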
These 34 datasets were reanalyzed, starting from raw microarray data, to perform a gene expression analysis of variance and functional pathway enrichment analysis (see online Methods). Table 1 provides a description of each dataset with a sub-table summary of all curated data used in this study. After pre-processing each individual dataset separately (Fig. 1b), we performed the statistical analysis on 44,754 probe sets which were common across all samples (Affymetrix expression microarray data).

Figure 1. General approach, data curation, and analysis workflow summary. The flowchart shows in (a) the five main steps that summarize the approach used in our study, and in (b) the curation and screening criteria for raw gene expression and annotation data files, data pre-processing, supervised machine learning for missing metadata prediction, and batch effects correction. (c) The analysis included a linear model analysis of variance (ANOVA) coupled with Tukey’s Honestly Significant Difference (HSD) post-hoc tests, and KEGG pathway and GO enrichment. Finally, we performed a machine learning classification of AML based on our findings.

Table 1. Summary of gene expression datasets used in this study.
(A) Curated datasets used in linear model analysis (34 datasets from 32 studies)

| Author, Year | GEO accession | Disease status* | Affymetrix platform id: number of samples used & sample source* | Refs* |
|---|---|---|---|---|
| Zatkova et al., 2009 | GSE10258 | AML | GPL570: 8 BM | ^68 |
| Tomasson et al., 2008 | GSE10358 | AML | GPL570: 300 BM | ^69 |
| Metzeler et al., 2008 | GSE12417 | AML | GPL570: 73 BM & 5 PB; GPL96/97: 160 BM & 2 PB | ^55 |
| Wouters et al., 2009; Taskesen et al., 2011 | GSE14468 | AML | GPL570: 482 BM & 43 PB | ^70, 71 |
| Figueroa et al., 2009 | GSE14479 | AML | GPL570: 16 BM | ^72 |
| Klein et al., 2009 | GSE15434 | AML | GPL570: 231 BM & 20 PB | ^73 |
| Lück et al., 2011 | GSE29883 | AML | GPL570: 10 BM & 2 PB | ^74 |
| Li et al., 2013; Herold et al., 2014; Janke et al., 2014; Jiang et al., 2016 | GSE37642 | AML | GPL570: 140 BM; GPL96/97: 422 BM | ^56–59 |
| Bullinger et al., 2014 | GSE39363 | AML | GPL570: 11 BM & 2 PB | NYP |
| Opel et al., 2015 | GSE46819 | AML | GPL570: 8 BM & 4 PB | ^75 |
| TCGA et al., 2015 | GSE68833 | AML | GPL570: 183 BM | NYP |
| Cao et al., 2016 | GSE69565 | AML | GPL570: 12 PB | ^76 |
| Bohl et al., 2016 | GSE84334 | AML | GPL570: 25 BM & 20 PB | NYP |
| Li et al., 2011 | GSE23025 | AML | GPL570: 21 BM & 13 PB | ^77 |
| Warren et al., 2009 | GSE11375 | Healthy | GPL570: 26 PB | ^78 |
| Green et al., 2009 | GSE14845 | Healthy | GPL570: 1 PB | NYP |
| Wu et al., 2012 | GSE15932 | Healthy | GPL570: 8 PB | NYP |
| Karlovich et al., 2009 | GSE16028 | Healthy | GPL570: 22 PB | ^79 |
| Krug et al., 2011 | GSE17114 | Healthy | GPL570: 14 PB | NYP |
| Kong et al., 2012 | GSE18123 | Healthy | GPL570: 17 PB | ^80 |
| Sharma et al., 2009 | GSE18781 | Healthy | GPL570: 25 PB | ^81 |
| Rosell et al., 2011 | GSE25414 | Healthy | GPL570: 12 PB | ^82 |
| Schmidt et al., 2006 | GSE2842 | Healthy | GPL570: 2 PB | ^83 |
| Meng et al., 2015 | GSE71226 | Healthy | GPL570: 3 PB | NYP |
| Tasaki et al., 2017 | GSE84844 | Healthy | GPL570: 30 PB | ^84 |
| Leday et al., 2018 | GSE98793 | Healthy | GPL570: 64 PB | ^85 |
| Shamir et al., 2017 | GSE99039 | Healthy | GPL570: 121 PB | ^86 |
| Tasaki et al., 2018 | GSE93272 | Healthy | GPL570: 35 PB | ^87 |
| Clelland et al., 2013 | GSE46449 | Healthy | GPL570: 24 PB | ^88 |
| Lauwerys et al., 2013; Ducreux et al., 2016 | GSE39088 | Healthy | GPL570: 46 PB | ^89, 90 |
| Xiao et al., 2011 | GSE36809 | Healthy | GPL570: 35 PB | ^91 |
| Zhou et al., 2010 | GSE19743 | Healthy | GPL570: 63 PB | ^92 |

(B) Covariate datasets (used for batch correction and for testing predictive models)

| Author, Year | GEO accession | Disease status* | Affymetrix platform id: number of samples used & sample source* | Refs* |
|---|---|---|---|---|
| Jiang et al., 2018^# | GSE107968^* | 2 AML; 1 Healthy | GPL570: 3 BM | NYP |
| Greiner et al., 2015^# | GSE68172^* | 20 AML; 5 Healthy | GPL570: 25 PB | ^64 |
| Majeti et al., 2009^# | GSE17054^* | 9 AML; 4 Healthy | GPL570: 13 BM | ^65 |
| Bacher et al., 2012^# | GSE33223^* | 20 AML; 10 Healthy | GPL570: 30 PB | ^66 |
| Mills et al., 2009^# | GSE15061^* | 404 AML; 138 Healthy | GPL570: 542 BM | ^67 |

(C) Analysis datasets summary statistics

| Measure | Group | Value |
|---|---|---|
| Disease state | AML | 2,213 |
| Disease state | Healthy | 548 |
| Sample source | BM | 2,090 |
| Sample source | PB | 671 |
| Affymetrix platform id | GPL570 | 2,177 |
| Affymetrix platform id | GPL96/97 | 584 |
| Unique probe sets | GPL570 | 54,675 |
| Unique probe sets | GPL96/97 | 44,760 |

Summary of datasets used in our analysis and disease classification. *GEO, Gene Expression Omnibus; AML, acute myeloid leukemia; Refs., references; NYP, not yet published; GPL570, Affymetrix Human