Keywords: Pan-cancer classification, Genetic mutation map, Image-based
deep learning, Guided Grad-CAM visualization, Tumor-type-specific
genes, Pathway analysis
Abstract
Accurate cancer type classification based on genetic mutation can
significantly facilitate cancer-related diagnosis. However, existing
methods usually use feature selection combined with simple classifiers
to quantify key mutated genes, resulting in poor classification
performance. To circumvent this problem, a novel image-based deep
learning strategy is employed to distinguish different types of cancer.
Unlike conventional methods, we first convert gene mutation data
containing single nucleotide polymorphisms, insertions and deletions
into a genetic mutation map, and then apply the deep learning networks
to classify different cancer types based on the mutation map. We
outline these methods and present results obtained in training VGG-16,
Inception-v3, ResNet-50 and Inception-ResNet-v2 neural networks to
classify 36 types of cancer from 9047 patient samples. Our approach
achieves overall higher accuracy (over 95%) compared with other widely
adopted classification methods. Furthermore, we demonstrate the
application of a Guided Grad-CAM visualization to generate heatmaps and
identify the top-ranked tumor-type-specific genes and pathways.
Experimental results on prostate and breast cancer demonstrate that our
method can be applied to various types of cancer. Powered by deep
learning, this approach can potentially provide a new solution for
pan-cancer classification and cancer driver gene discovery. The source
code and datasets supporting the study are available at
https://github.com/yetaoyu/Genomic-pan-cancer-classification.
1. Introduction
Cancer is a deadly genetic disease characterized by abnormal cell
growth [1], [2]. Globally, more than 18 million new cancer cases were
diagnosed in 2018, resulting in 9.6 million deaths [3]. Genetic
mutations have been shown to be associated with
different types of cancer [4], [5], [6]. Cancer
classification based on genetic mutations can be readily achieved
through increased usage of high-throughput sequencing techniques. A
large amount of mutation data has been generated and publicly released.
Among them, The Cancer Genome Atlas (TCGA) is a cohort cataloguing
genetic mutation data for more than 30 types of cancer from more than
10,000 patients [7]. TCGA contains various types of genetic mutation
data, including single-nucleotide polymorphism (SNP), small insertions
or deletions (INDEL), copy number variations (CNV), etc. By handling
this massive amount of data, researchers are now able to design new
analytical methods for accurate cancer classification and detection
based on gene alteration. However, accurate and reliable cancer
classification is particularly challenging as a result of the
complexity and scale of the data. Sequencing covers thousands of genes,
but most of them do not contain informative mutations, which makes
classification based on all of those genes difficult [8], [9]. To
prevent the mutation data from being too sparse (even all zero), most
analytical methods screen genes before classification [10], [11]. These
methods are simple and effective in some cases, but important features
(genes) may be removed during the screening process.
Recent advances in deep learning underpin a collection of algorithms
with an impressive ability to analyze molecular data without prior
feature selection or human-directed training. Prior deep learning
approaches usually work well for a specific type of cancer, such as
brain cancer [12], gliomas [13], acute myeloid leukemia
[14], breast cancer [15], [16], soft tissue sarcomas
[17] and lung cancer [18]. Given the complexity of pan-cancer
data, directly applying these approaches to multiple types of cancer
might not be appropriate. Recently, some studies have begun to consider
the importance of genetic mutations in classifying multiple types of
cancer. By analyzing the genetic mutation profiles of more than 8000
samples from 12 cancer types obtained from TCGA,
Sun et al. [19] reported a novel method, Genome Deep Learning
(GDL), for cancer subtyping. However, more than 12 specific models had
to be constructed. Limited by the number of models required, this
approach is unlikely to scale reliably to more cancer types and larger
cancer mutation datasets. Yuan et al. [20] described DeepGene, an
advanced Deep Neural Network (DNN) based cancer type classifier.
Experimental results on 12 selected types of cancer from TCGA
demonstrated improved classification performance compared with
Support Vector Machine (SVM), k-Nearest Neighbors (KNN) and Naïve Bayes
(NB) classifiers. However, the DNN classifier reached a best accuracy
of only 65.5%, which limits its development as an accurate
cancer classifier.
In addition, most of these studies used only one type of
genetic mutation data as input for cancer classification, which limits
the performance of the classifier. For instance, Yuan et al. proposed
DeepGene on somatic point mutation data for cancer classification
[20]. AlShibli et al. [21] proposed three deep learning
techniques to classify six cancer types based on CNV data. Although
these methods are effective, the feature information they use is still
not comprehensive enough. As far as we know, no existing work is
specifically designed to combine multiple types of mutation data.
As a result, a general algorithm for easy and reliable cancer
classification based on multiple types of genetic mutation data is
still missing. Previous works tend to use a variety of modeling
methods, sometimes in combination. In this context, merely adopting
deep learning approaches developed in other settings might not be
appropriate for pan-cancer classification based on different gene
mutation data. Given these challenges, a new and simple approach is
necessary.
Motivated by work on deep learning in image analysis, we describe a
novel image-based deep learning strategy for cancer classification and
mutated gene discovery. The proposed strategy consists of three
main steps: construction of the genetic mutation map, classification
using deep Convolutional Neural Networks (CNN), and identification of
cancer driver genes by Guided Grad-CAM (a combination of Guided
backpropagation and Gradient-weighted Class Activation Mapping)
visualization [22].
This novel strategy makes the following research contributions:
* (1)
A genetic mutation map is constructed for each cancer patient,
documenting gene alterations, including single-nucleotide
polymorphism (SNP), insertion (INS) and deletion (DEL), together
with chromosome position information. Prior knowledge of the
mutated genes is not necessary, avoiding bias caused by
hand-picking. The correlation between mutated genes and cancer
types can be built without gene prescreening.
* (2)
Used in combination, the genetic mutation map and popular deep
neural networks produce high accuracy in pan-cancer
classification. Compared with other widely used classification
methods (such as SVM and KNN), the classifiers we tested, including
VGG-16 [23], Inception-v3 [24], ResNet-50 [25] and
Inception-ResNet-v2 [26], can effectively extract deep features
from complex genetic mutation data and significantly improve the
classification accuracy.
* (3)
Guided Grad-CAM visualization is applied to generate heatmaps,
which are used to identify tumor-type-specific genes and
pathways.
* (4)
The systematic examination of gene mutations in 36 types of
cancer from 9,047 patient samples demonstrates the advantages of
our method, allowing a deeper understanding of the mutation
landscape of cancer. The constructed genetic mutation map dataset
is publicly released at
https://github.com/yetaoyu/Genomic-pan-cancer-classification/tree/master/DNN-models/dataset.
2. Materials and methods
2.1. Cancer types and samples statistics
The genetic mutation data from various types of cancer in TCGA are
collected from the Firebrowse portal (http://firebrowse.org/). The
dataset is assembled by selecting, across all samples of the 36 cancer
types, the genes that contain mutations. As shown in Supplementary Fig.
S1, the upper line chart represents the number of mutated genes for
each type of cancer, and the lower bar chart represents the total
number of mutations in those genes, including SNP, INS and
DEL. As shown on the horizontal axis, the number of samples per tumor
type ranges from 35 for Cholangiocarcinoma (CHOL) to 982 for Breast
invasive carcinoma (BRCA). Using 9,047 TCGA samples with 23,231 mutated
genes, we demonstrate the general applicability of our image-based deep
learning method to 36 types of cancer.
2.2. Mutation map construction
To construct the mutation landscape of cancer, we create the genetic
mutation map. Assuming that the size of the mutation map is
$N \times N$, all the mutated genes from the 36 types of cancer are
collected, grouped and placed in the matrix map according to their
positions on the chromosomes.
Firstly, mutated genes from each type of cancer are sorted according to
their positions on the chromosomes (chromosomes 1–22, X and Y). For
cancer $j$, the list of mutated genes on chromosome $i$ is $r_{ij}$,
where $0 \leqslant i \leqslant 23$ and $0 \leqslant j \leqslant 35$ in
this paper. Then the mutated genes on the same chromosome from
different types of cancer are grouped according to their positions. For
chromosome $i$, the size of the mutated gene set
$R_{i\cdot} = r_{i0} \cup r_{i1} \cup \cdots \cup r_{i35}$
collected from different cancers is $L_i$. Therefore, the number of
columns occupied by the genes on chromosome $i$ in the mutation map is
$k_i \times 3$, where

$$k_i = \begin{cases} L_i / N + 1, & \text{if } L_i \,\%\, N \neq 0 \\ L_i / N, & \text{if } L_i \,\%\, N = 0 \end{cases}$$

Specifically, each gene occupies three pixels in the same row
of the mutation map, where each pixel represents the mutation
condition of the gene. These three pixels are colored blue, green
or red to represent SNP, INS or DEL, respectively. Genes on all
chromosomes occupy $K$ columns, where

$$K = \sum_{i=0}^{23} 3 k_i = \sum_{i=0}^{23} 3 \times \begin{cases} L_i / N + 1, & \text{if } L_i \,\%\, N \neq 0 \\ L_i / N, & \text{if } L_i \,\%\, N = 0 \end{cases} \quad \text{and} \quad K \leqslant N$$
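The column bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that the division of L_i by N in the definition of k_i is integer division (so that k_i is the ceiling of L_i / N), and the function names and the toy gene counts are hypothetical.

```python
def gene_columns(num_genes, n):
    """k_i: gene columns needed for L_i genes when each column holds n genes."""
    # Assumption: L_i / N in the text denotes integer division.
    if num_genes % n != 0:
        return num_genes // n + 1
    return num_genes // n


def total_pixel_columns(gene_counts, n):
    """K: sum over chromosomes of 3 * k_i (three pixels per gene: SNP, INS, DEL)."""
    return sum(3 * gene_columns(count, n) for count in gene_counts)


def smallest_valid_size(gene_counts):
    """Smallest N that satisfies K <= N, found by a simple linear search."""
    n = 1
    while total_pixel_columns(gene_counts, n) > n:
        n += 1
    return n


# Hypothetical gene counts L_i for three chromosomes (not taken from the paper).
toy_gene_counts = [25, 10, 7]
toy_n = smallest_valid_size(toy_gene_counts)
toy_k = [gene_columns(count, toy_n) for count in toy_gene_counts]
```

The linear search is only one possible way to pick a map size. Any N that satisfies the constraint K ≤ N is admissible, and a larger N simply leaves more zero-valued padding pixels.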
According to the above description, we can choose an appropriate value
of $N$. In our experimental data, $N$ is 310. Next, the collected
mutated genes are arranged and aligned vertically in the matrix map,
according to their positions and order on chromosomes 1 to 22, X, and
Y, thereby forming an $N \times N$ genetic mutation map for each tumor
sample (Fig. 1A). Each chromosome occupies $k_i \times 3$ columns in
the genetic map, containing $p$ ($p \leqslant k_i \times N$) genes, and
the extra pixels in the image are set to zero. Finally, we
output the genetic mutation maps for all 9,047 patient samples from
36 types of cancer. All the mutation maps are normalized by the maximum
value over the RGB channels.
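The map construction for a single sample can be sketched as follows. This is a hedged illustration rather than the released implementation: the input format (a per-gene tuple of SNP, INS and DEL counts), the use of mutation counts as pixel intensities, the top-to-bottom filling order within each three-pixel column group, and the name `build_mutation_map` are all assumptions made for this example.

```python
# Pixel slot inside a gene's three-pixel group -> RGB channel index.
# Slot 0 = SNP (blue), slot 1 = INS (green), slot 2 = DEL (red).
SLOT_TO_CHANNEL = {0: 2, 1: 1, 2: 0}


def build_mutation_map(genes_per_chrom, sample_counts, n):
    """Return an n x n x 3 nested list (row, column, RGB) for one sample.

    genes_per_chrom: one list of gene names per chromosome, sorted by position.
    sample_counts: dict mapping gene -> (snp, ins, del) counts (hypothetical format).
    """
    # Each chromosome gets 3 * k_i pixel columns, with k_i = ceil(L_i / n).
    widths = [3 * (-(-len(genes) // n)) for genes in genes_per_chrom]
    if sum(widths) > n:
        raise ValueError("K exceeds N; choose a larger map size")

    image = [[[0.0, 0.0, 0.0] for _ in range(n)] for _ in range(n)]
    offset = 0
    for genes, width in zip(genes_per_chrom, widths):
        for index, gene in enumerate(genes):
            # Genes fill a column group top to bottom, then move to the next group.
            block, row = divmod(index, n)
            counts = sample_counts.get(gene, (0, 0, 0))
            for slot, channel in SLOT_TO_CHANNEL.items():
                image[row][offset + 3 * block + slot][channel] = float(counts[slot])
        offset += width

    # Normalize by the maximum value over all RGB channels; unused pixels stay zero.
    peak = max(value for row in image for pixel in row for value in pixel)
    if peak > 0:
        image = [[[value / peak for value in pixel] for pixel in row] for row in image]
    return image


# Toy example with two chromosomes and made-up gene names.
toy_genes = [["GENE_A", "GENE_B", "GENE_C"], ["GENE_D"]]
toy_counts = {"GENE_A": (2, 0, 0), "GENE_C": (0, 1, 0), "GENE_D": (0, 0, 4)}
toy_map = build_mutation_map(toy_genes, toy_counts, 6)
```

In this toy case the two chromosomes each take three pixel columns, so a 6 × 6 map is exactly filled in width, and genes without recorded mutations remain black.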
Fig. 1.
Schematic representation of image-based deep learning for genomic
pan-cancer classification. (A) The protocol of genetic mutation map
construction. The gene mutation conditions, including single-nucleotide
polymorphism (SNP), insertion (INS) and deletion (DEL), together with
chromosome position information, are transformed into the genetic
mutation map. Each chromosome occupies $k_i \times 3$ columns in the
genetic map, containing $p$ ($p \leqslant k_i \times N$) genes. Each
gene occupies three pixels in the same row of the mutation
map, where each pixel represents the mutation condition of the gene,
colored blue, green, or red for SNP, INS and DEL, respectively. These
pixels are arranged and aligned vertically in the mutation map
according to their positions on the chromosomes, thereby forming an
$N \times N$ matrix map for each patient, referred to as the genetic
mutation map. (B) Workflow of establishing the image-based deep
learning models. All patient samples are transformed into mutation maps
and then divided into training, validation, and testing sets. The
images of mutation maps are fed into different deep learning
architectures for training and testing on pan-cancer classification.
(C) Guided Grad-CAM is employed to generate heatmaps for the
identification of top distinct candidate genes that help the pan-cancer
classification. (For
interpretation of the references to color in this figure legend, the