Abstract
Machine learning (ML) is a useful tool for advancing our understanding
of the patterns and significance of biomedical data. Given the growing
trend on the application of ML techniques in precision medicine, here
we present an ML technique which predicts the likelihood of complete
remission (CR) in patients diagnosed with acute myeloid leukemia (AML).
In this study, we explored the question of whether ML algorithms
designed to analyze gene-expression patterns obtained through RNA
sequencing (RNA-seq) can be used to accurately predict the likelihood
of CR in pediatric AML patients who have received induction therapy. We
employed tests of statistical significance to determine which genes
were differentially expressed in the samples derived from patients who
achieved CR after 2 courses of treatment and the samples taken from
patients who did not benefit. We tuned classifier hyperparameters to
optimize performance and used multiple methods to guide our feature
selection as well as our assessment of algorithm performance. To
identify the model which performed best within the context of this
study, we plotted receiver operating characteristic (ROC) curves. Using
the top 75 genes from the k-nearest neighbors algorithm (K-NN) model
(K = 27) yielded the best area-under-the-curve (AUC) score that we
obtained: 0.84. When we finally tested the previously unseen test data
set, the top 50 genes yielded the best AUC = 0.81. Pathway enrichment
analysis for these 50 genes showed that the guanosine diphosphate
fucose (GDP-fucose) biosynthesis pathway is the most significant with
an adjusted P value = .0092, which may suggest the vital role of
N-glycosylation in AML.
Keywords: acute Myeloid Leukemia (AML), machine Learning (ML), gene
expression profiling, remission induction
Introduction
RNA sequencing (RNA-seq) and other high-throughput next-generation
sequencing platforms have emerged as powerful approaches for
discovering pathogenic pathways and potential targets for clinical
intervention in patients with acute myeloid leukemia (AML).^[31]1 Using
whole-transcriptome sequencing, our previous work compared the profiles
of core-binding factor acute myeloid leukemia (CBF-AML) cases with
those characterized by normal karyotype (NK), illuminating similarities
and differences with respect to gene-expression signatures and splicing
events as well as RNA fusions that typify the inv(16) vs the t(8;21)
AML subtypes.^[32]2
In concert with the rise of large-scale omics-oriented sequencing,
machine-learning (ML) algorithms have increasingly been applied to
gene-expression analysis aimed at classifying tumors, predicting
survival, identifying therapeutic targets, and classifying genes
according to function.^[33]3–[34]7 Significant results have been shown
for predicting outcomes of large B-cell lymphoma,^[35]8 hepatitis B
virus–positive metastatic hepatocellular carcinomas^[36]9 as well as
documenting diverse pathologic responses to chemotherapy in patients
with breast cancer.^[37]10 Using gene-expression profiling of data
generated by microarrays in conjunction with both supervised and
unsupervised learning, Bullinger et al^[38]11 identified prognostic
subclasses in adult AML; the research group also constructed an optimal
133-gene predictor of overall survival. Yeoh et al^[39]12 performed
classification, subtype discovery, and outcome prediction in patients
with pediatric acute lymphoblastic leukemia (ALL). However, no previous
study has specifically addressed expression differences among large
cohorts of pediatric and young-adult AML patients with regard to
complete remission (CR). In this study, we compare pre-treatment
gene-expression profiles using 3 supervised learning algorithms to
discover predictors of CR.
Materials and Methods
We obtained 473 bone marrow specimens from 473 patients, both children
and young adults with ages ranging between 8 days and 28 years who had
been diagnosed with de novo AML. For comparison, we acquired an
additional 20 bone marrow specimens from normal, healthy individuals.
All samples were obtained by written consent from the parents/guardians
of minors from the Children’s Oncology Group clinical trial AAML1031.
The Institutional Review Board at Fred Hutchinson Cancer Research
Center has reviewed and approved this study. It is filed under
Institutional Review File #9950 (Biology of the Alterations of the
Signal Transduction Pathway in Pediatric Cancer). The number of samples
with clinical information regarding CR used in this study was 414. RNA
sequencing was performed on all 493 samples using the Illumina platform
HiSeq2000 ([40]https://www.illumina.com). Reads were mapped to Ensembl
Gene IDs ([41]http://useast.ensembl.org/), which belong to 31 biotypes,
including protein-coding sequences, non-coding sequences, and
pseudogenes, among others. RPKM (reads per kilobase per million mapped
reads) values were calculated for each gene. Genes that had a count of
at least 1 per million (CPM) in at least 3 samples were retained.
Quantile normalization was applied among all samples. Python library
scikit-learn ([42]http://scikit-learn.org/stable/) modules of commonly
used statistical models and algorithms were directly implemented in the
scripts. Gene set enrichment analysis (GSEA) was performed using the
online tool Enrichr ([43]http://amp.pharm.mssm.edu/Enrichr/), as well
as our in-house OmicPath (v 0.1) R package.
Violin plots showing gene-expression distribution patterns were
generated using the in-house OmicPlot (v 0.1) R package.
Feature selection
Principal components analysis (PCA) was performed to examine the
general pattern of the data, remove outliers, and select algorithms
appropriate for our data.
RNA sequencing expression data of m samples by n genes were used as
inputs and learn the mapping using
[MATH:
F:x→{CR
,NotinCR} :MATH]
[MATH:
X∈Rm×n,y∈{0,1}m×1 :MATH]
Samples were divided into a training set (N = 331) and a test set
(N = 83). Three classifiers—k-nearest neighbors algorithm (K-NN),
support vector machine (SVM), and random forest (RF)—were applied to
select features for the training set via 5-fold cross-validation. With
the features selected, the classifier was tested on the same training
set (N = 331). The classifier with the best performance was then tested
on the remaining test set (N = 83).
K-NN classifier
We performed 100 iterations of a 5-fold cross-validation. In each fold,
we first carried out a t test for initial feature selection to identify
the 100 most statistically significant genes, ie, those that were the
most differentially expressed between the CR (positive class) and
non-complete remission (NCR) (negative class). We found that using more
than 100 genes did not improve performance. For further feature
selection out of the genes identified by t testing, we compared the
performance of 2 algorithms: Hill Climbing^[44]13 (sequential feature
addition) and Randomized Lasso^[45]14 (using the model’s feature
weights as ranks and selecting the highest ranking feature). At each
fold, an area under the curve (AUC) was computed using a selected
subset of genes and the fold’s validation set. Following the 100
iterations, the features (genes) were ranked by the average of AUCs
computed using those genes across different folds. Essentially, the
genes that on average helped yield the best AUCs were ranked highest.
SVM classifier
To overcome the issue of class imbalance, downsampling was
applied,^[46]15–[47]17 ie, a smaller subset of 114
samples—N[(CR)] = 57, N[(NCR)] = 57; 91 for the training set and 23 for
the test set—was used as input for the SVM classifier. Processes
similar to those described above for K-NN were applied to SVM
classifiers with 1 exception: we used a third method Recursive Feature
Elimination for the second feature selection in addition to Hill
Climbing and Randomized Lasso.
RF classifier
We trained RF classifiers using scikit-learn’s
ensemble.RandomForestClassifier module. To select parameters for the RF
classifier, we performed a grid search for the following parameters:
number of trees (estimators), maximum number of features, and maximum
tree depth. The remaining parameters were set to their default values.
Then, the optimal parameters in terms of AUC were selected together
with the best performing feature selection approach. For feature
selection, we performed a comparison between 2 approaches: (1) nested
5-fold with built-in RF feature selection—we trained the classifier on
4/5 of the training set using the built-in “feature importance”
attribute to rank the features (genes). Those genes were then used a
second time on the same 4/5 of the training set. We then tested the
classifier on the remaining 1/5 of the training set (ie, validation
set) to assess performance. (2) We carried out 100 iterations of a
5-fold cross-validation while aggregating the feature importance values
computed at each fold. We then computed a Spearman correlation between
each gene’s importance values and the AUC computed in each fold. We
used the genes with the highest correlation scores to train on the same
4/5 training set and then tested the method on the remaining validation
set.
Following feature selection
At the end of each feature selection, the same cross-validation
procedure was employed to generate the AUC results when testing the
validation set. The final AUC result of the (chosen) K-NN classifier
was a simple, 1-episode period of training on the training set with the
selected genes followed by testing on the unseen test set with the same
selected genes.
Results
According to PCA based on all genes, these 414 AML samples with
clinical information regarding CR did not cluster by CR or NCR status,
nor by age/year of diagnosis. There is no obvious outliers, so all of
them were included in this study.
Area-under-the curve results from different K values were used to
estimate optimal K for the K-NN classifier. [48]Figure 1 shows that
statistically significant genes identified from the t test can help
improve the AUC results and that K = 27 yielded the best average AUC.
With the optimal K = 27, receiver operating characteristic (ROC) curves
were produced using 2 feature selection methods: Hill Climbing and
Randomized Lasso ([49]Figure 2). Overall, the Hill Climbing resulted in
better results with the best AUC = 0.84.
Figure 1.
Figure 1.
[50]Open in a new tab
Area under the curves from different Ks used to estimate an optimal K
value for K-NN classifier. AUC indicates area under the curve; K-NN,
k-nearest neighbors algorithm.
Figure 2.
Figure 2.
[51]Open in a new tab
Receiver operating characteristic curves of K-NN (with the optimal
K = 27) using 2 feature selection methods: (A) Hill Climbing and (B)
Randomized Lasso. K-NN indicates k-nearest neighbors algorithm; ROC,
receiver operating characteristic; FS, feature selection; HC, Hill
Climbing; R.LASSO, Randomized Lasso.
To compare the performance of K-NN and SVM classifier, the balanced
data set with N[(CR)] = 57 and N[(NCR)] = 57 was split into training
set (N = 91) and the test set (N = 23). Using a 5-fold cross-validation
performed on the training set, ROC curves of K-NN and SVM algorithms
were calculated using 3 feature-election methods: Hill Climbing,
Recursive Feature Elimination, and Randomized Lasso. The K-NN
outperformed SVM, and Hill Climbing still resulted in better AUC
results for K-NN ([52]Supplemental Figure S1).
Hyperparameter tuning for RF suggested using 100 trees to have the best
performance (AUC = 0.74). The simple method resulted in better results
with the best (training set) AUC = 0.73 compared with the more complex
approach ([53]Supplemental Figure S2).
Based on the above observations, K-NN with Hill Climbing performed the
best on the training data (N = 331), yielding an AUC score of 0.84.
When we tested this model on the remaining 1/5 of the data (N = 83),
using the top 50 genes with the best AUC scores from the training set
yielded an AUC score of 0.81 ([54]Figure 3).
Figure 3.
Figure 3.
[55]Open in a new tab
Final K-NN model performance on test data (N = 83). ROC indicates
receiver operating characteristic.
Based on using these top 50 genes, our GSEA analysis using OmicPath
showed that BATF (basic leucine zipper transcriptional factor ATF-like)
and RAC2 (Ras-related C3 botulinum toxin substrate 2) are related to a
decreased IgM (Immunoglobulin M) level with FDR (false discovery
rate) =0.0073. TSTA3 (GDP-l-fucose synthase) and RAC2 are related to an
increased neutrophil cell number (FDR = 0.0073). Pathway enrichment
analysis using Enrichr showed that TSTA3 and FPGT (fucose-1-phosphate
guanylyltransferase) were mapped to the GDP-fucose biosynthesis pathway
(Reactome 2016; [56]https://reactome.org) with an adjusted P value of
.0092. These 2 genes were also mapped to the pathway’s parent terms
“Synthesis of substrates in N-glycan biosynthesis” and “Biosynthesis of
the N-glycan precursor (dolichol lipid-linked oligosaccharide, LLO) and
transfer to a nascent protein.” This indicates the vital role of
N-glycosylation in AML pathology and patient prognosis. The expression
of these top 50 genes compared with normal bone marrow (NBM) samples
are shown in violin plot ([57]Figure 4).
Figure 4.
[58]Figure 4.
[59]Open in a new tab
Expressions of top 50 genes with the best AUC scores from K-NN with
Hill Climbing. AML indicates acute myeloid leukemia; AUC, area under
the curve; CR, complete remission; K-NN, k-nearest neighbors algorithm;
NBM, normal bone marrow; NCR, non-complete remission; RPKM, reads per
kilobase per million.
Discussion
This study explored and evaluated different ML algorithms for
predicting CR in AML patients based on their pre-treatment
gene-expression signatures. It revealed a significant underlying
genetic difference between patients with contrasting outcomes following
treatment. Gene set enrichment analysis results highlighted specific
biological features that carry prognostic value for further
exploration. For example, low IgM and leukocyte count >50 × 10^9/liter
have been demonstrated as 2 of the adverse predictors for the duration
of complete continuous remission in childhood ALL.^[60]18
Fucose-containing glycans play important roles in selectin-mediated
leukocyte-endothelial adhesion as well as various immunity and
signaling processes. Alterations in expression or structure of
fucosylated oligosaccharides have also been observed in cancer
pathology. Conditional impairment in fucosylated glycan expression in
mice exhibited altered myeloid development including aberrant
proliferation of myeloid progenitors and an increased production of
granulocytes which leads to neutrophilia. The loss of AB blood group
antigen expression along with the increases in H and Lewisy expression
are associated with poor prognosis. Increased expression of Lewisx/a
structures, Tn/sialyl-Tn/T antigens, and β1,6 GlcNAc branching of
N-linked core structures were observed in advanced cancers and related
with poor prognosis.^[61]19–[62]22 This information may help physicians
select more suitable courses of treatment, whether the treatment be
more aggressive chemotherapy or an altogether novel alternative
therapy.
Supplemental Material
Predict_CR_AML_Supplemental_xyz13921555f86f3 – Supplemental material
for Predicting Complete Remission of Acute Myeloid Leukemia: Machine
Learning Applied to Gene Expression
[63]Click here for additional data file.^ (1.2MB, pdf)
Supplemental material, Predict_CR_AML_Supplemental_xyz13921555f86f3 for
Predicting Complete Remission of Acute Myeloid Leukemia: Machine
Learning Applied to Gene Expression by Ophir Gal, Noam Auslander, Yu
Fan and Daoud Meerzaman in Cancer Informatics
Acknowledgments