Graphical abstract

   [Graphical abstract image: ga1.jpg]

   Keywords: Pan-cancer classification, Genetic mutation map, Image-based
   deep learning, Guided Grad-CAM visualization, Tumor-type-specific
   genes, Pathway analysis

Abstract

   Accurate cancer type classification based on genetic mutation can
   significantly facilitate cancer-related diagnosis. However, existing
   methods usually use feature selection combined with simple classifiers
   to quantify key mutated genes, resulting in poor classification
   performance. To circumvent this problem, a novel image-based deep
   learning strategy is employed to distinguish different types of cancer.
   Unlike conventional methods, we first convert gene mutation data
   containing single nucleotide polymorphisms, insertions and deletions
   into a genetic mutation map, and then apply deep learning networks to
   classify different cancer types based on the mutation map. We outline
   these methods and present results obtained by training VGG-16,
   Inception-v3, ResNet-50 and Inception-ResNet-v2 neural networks to
   classify 36 types of cancer from 9047 patient samples. Our approach
   achieves higher overall accuracy (over 95%) than other widely adopted
   classification methods. Furthermore, we demonstrate the application
   of Guided Grad-CAM visualization to generate heatmaps and identify
   the top-ranked tumor-type-specific genes and pathways. Experimental
   results on prostate and breast cancer demonstrate that our method can
   be applied to various types of cancer. Powered by deep learning, this
   approach can potentially provide a new solution for pan-cancer
   classification and cancer driver gene discovery. The source code and
   datasets supporting the study are available at
   https://github.com/yetaoyu/Genomic-pan-cancer-classification.

1. Introduction

   Cancer is a deadly genetic disease characterized by abnormal cell
   growth [1], [2]. In 2018, more than 18 million new cancer cases were
   diagnosed globally, with 9.6 million deaths [3]. Genetic mutations
   have been shown to be associated with different types of cancer [4],
   [5], [6]. Cancer classification based on genetic mutations can be
   readily achieved through the increased use of high-throughput
   sequencing techniques. A large amount of mutation data has been
   generated and publicly released. Among these resources, The Cancer
   Genome Atlas (TCGA) is a cohort cataloguing genetic mutation data for
   more than 30 types of cancer from more than 10,000 patients [7]. TCGA
   contains various kinds of genetic mutation data, including
   single-nucleotide polymorphisms (SNP), small insertions or deletions
   (INDEL) and copy number variations (CNV). With this massive amount of
   data, researchers are now able to design new analytical methods for
   accurate cancer classification and detection based on gene
   alterations. However, accurate and reliable cancer classification
   remains particularly challenging because of the complexity and scale
   of the data. Sequencing covers thousands of genes, but most of them
   do not contain informative mutations, which makes classification
   based on all genes difficult [8], [9]. To avoid mutation data that
   are too sparse (or even all zero), most analytical methods screen
   genes before classification [10], [11]. These methods are simple and
   effective in some cases, but important features (genes) may be removed
   during the screening process.

   Recent advances in deep learning underpin a collection of algorithms
   with an impressive ability to analyze molecular data without prior
   feature selection or human-directed training. Prior deep learning
   approaches usually work well for a specific type of cancer, such as
   brain cancer [12], gliomas [13], acute myeloid leukemia [14], breast
   cancer [15], [16], soft tissue sarcomas [17] and lung cancer [18].
   Given the complexity of pan-cancer data, directly applying these
   approaches to multiple types of cancer might not be appropriate.
   Recently, some studies have begun to consider the importance of
   genetic mutations in classifying multiple types of cancer. By
   analyzing the genetic mutation profiles of more than 8000 samples
   from 12 cancer types obtained from TCGA, Sun et al. [19] reported a
   novel method, Genome Deep Learning (GDL), for cancer subtyping.
   However, more than 12 specific models were constructed. Limited by
   the number of models required, this approach is poorly suited to the
   analysis of more cancer types and larger mutation datasets. Yuan et
   al. [20] described DeepGene, an advanced Deep Neural Network (DNN)
   based cancer type classifier. Experimental results on 12 selected
   types of cancer from TCGA demonstrated improved classification
   performance compared with Support Vector Machine (SVM), k-Nearest
   Neighbors (KNN) and Naïve Bayes (NB) classifiers. However, the DNN
   classifier reaches a best accuracy of only 65.5%, which limits its
   development as an accurate cancer classifier.

   In addition, most of these studies used only one type of genetic
   mutation data as input for cancer classification, which limits the
   performance of the classifier. For instance, Yuan et al. proposed
   DeepGene for cancer classification based on somatic point mutation
   data [20]. AlShibli et al. [21] proposed three deep learning
   techniques to classify six cancer types based on CNV data. Although
   these methods are effective, the feature information they use is
   still not comprehensive enough. As far as we know, no existing work
   is specifically designed to combine multiple types of mutation data.

   As a result, a general algorithm for easy and reliable cancer
   classification based on multiple types of genetic mutation data is
   still missing. Previous works tend to use a variety of modeling
   methods, sometimes in combination. In this context, merely adopting
   deep learning approaches developed in other settings might not be
   appropriate for pan-cancer classification based on different gene
   mutation data. Given these challenges, a new and simple approach is
   necessary.

   Motivated by work on deep learning in image analysis, we describe a
   novel image-based deep learning strategy for cancer classification and
   mutated gene discovery. The proposed strategy consists of three main
   steps: construction of the genetic mutation map, classification using
   deep Convolutional Neural Networks (CNN), and identification of cancer
   driver genes by Guided Grad-CAM (a combination of Guided
   backpropagation and Gradient-weighted Class Activation Mapping)
   visualization [22]. This novel strategy makes the following research
   contributions:
     * (1)
       A genetic mutation map was constructed for each cancer patient,
       documenting the gene alteration conditions, including
       single-nucleotide polymorphism (SNP), insertion (INS) and deletion
       (DEL), with chromosome position information. Prior knowledge of the
       mutated genes is not necessary, which avoids the bias caused by
       hand-picking. The correlation between mutated genes and cancer
       types can be built without gene prescreening.
     * (2)
       Used in combination, the genetic mutation map and popular deep
       neural networks produce high accuracy in pan-cancer
       classification. Compared with other widely used classification
       methods (such as SVM and KNN), our tested classifiers, including
       VGG-16 [23], Inception-v3 [24], ResNet-50 [25] and
       Inception-ResNet-v2 [26], can effectively extract deep features
       from complex genetic mutation data and significantly improve the
       classification accuracy.
     * (3)
       Guided Grad-CAM visualization was applied to generate heatmaps,
       which were used to identify tumor-type-specific genes and
       pathways.
     * (4)
       The systematic examination of gene mutations in 36 types of
       cancer from 9,047 patient samples demonstrates the advantages of
       our method, allowing a deeper understanding of the mutation
       landscape of cancer. The constructed genetic mutation map dataset
       is publicly available at
       https://github.com/yetaoyu/Genomic-pan-cancer-classification/tree/master/DNN-models/dataset.

2. Materials and methods

2.1. Cancer types and samples statistics

   The genetic mutation data for various types of cancer in TCGA are
   collected from the Firebrowse portal (http://firebrowse.org/). The
   dataset is assembled by selecting, across all samples of the 36
   cancer types, the genes that contain mutations. As shown in
   Supplementary Fig. S1, the upper line chart represents the number of
   mutated genes for each type of cancer, and the lower bar chart
   represents the total number of mutation conditions in those genes,
   including SNP, INS and DEL. As shown on the horizontal axis, the
   number of samples per tumor type ranges from 35 for
   cholangiocarcinoma (CHOL) to 982 for breast invasive carcinoma
   (BRCA). Using 9,047 TCGA samples with 23,231 mutated genes, we
   demonstrate the general applicability of our image-based deep
   learning method to 36 types of cancer.
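
   The per-cancer statistics described above (samples, mutated genes and
   counts of SNP, INS and DEL) could be tallied roughly as in the
   following sketch. The record layout (cancer type, sample ID, gene,
   mutation type), the function name and all values are hypothetical;
   they are not taken from the Firebrowse files or the released code.

```python
from collections import Counter, defaultdict


def summarize_by_cancer(records):
    """Count samples, distinct mutated genes and mutation conditions
    (SNP/INS/DEL) for each cancer type."""
    samples = defaultdict(set)
    genes = defaultdict(set)
    conditions = defaultdict(Counter)
    for cancer, sample_id, gene, mutation_type in records:
        samples[cancer].add(sample_id)
        genes[cancer].add(gene)
        conditions[cancer][mutation_type] += 1
    return {
        cancer: {
            "samples": len(samples[cancer]),
            "mutated_genes": len(genes[cancer]),
            "conditions": dict(conditions[cancer]),
        }
        for cancer in samples
    }


# Made-up records for illustration only.
records = [
    ("CHOL", "sample-1", "GENE_A", "SNP"),
    ("CHOL", "sample-1", "GENE_B", "DEL"),
    ("BRCA", "sample-2", "GENE_A", "SNP"),
    ("BRCA", "sample-3", "GENE_A", "INS"),
]
summary = summarize_by_cancer(records)
```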

2.2. Mutation map construction

   To construct the mutation landscape of cancer, we create the genetic
   mutation map. Assuming that the size of the mutation map is
   $N \times N$, all the mutated genes from the 36 types of cancer are
   collected, grouped and placed in the matrix map according to their
   positions on the chromosomes.

   First, the mutated genes from each type of cancer are sorted according
   to their positions on the chromosomes (chromosomes 1–22, X and Y). For
   cancer $j$, the list of mutated genes on chromosome $i$ is $r_{ij}$,
   where $0 \leqslant i \leqslant 23$ and $0 \leqslant j \leqslant 35$ in
   this paper. Then the mutated genes on the same chromosome from
   different types of cancer are grouped according to their positions.
   For chromosome $i$, the length of the mutated gene set
   $R_{i\cdot} = r_{i0} \cup r_{i1} \cup \cdots \cup r_{i35}$ collected
   from different cancers is $L_i$. Therefore, the number of columns
   occupied by the genes on chromosome $i$ in the mutation map is
   $k_i \times 3$, where

   \[
   k_i =
     \begin{cases}
       \lfloor L_i / N \rfloor + 1, & \text{if } L_i \bmod N \neq 0 \\
       L_i / N, & \text{if } L_i \bmod N = 0
     \end{cases}
   \]

   Specifically, each gene occupies three pixels in the same row of the
   mutation map, where each pixel represents one mutation condition of
   the gene. These three pixels are colored blue, green or red to
   represent SNP, INS or DEL, respectively. Genes on all chromosomes
   occupy $K$ columns, where

   \[
   K = \sum_{i=0}^{23} 3 k_i
     = \sum_{i=0}^{23} 3 \times
       \begin{cases}
         \lfloor L_i / N \rfloor + 1, & \text{if } L_i \bmod N \neq 0 \\
         L_i / N, & \text{if } L_i \bmod N = 0
       \end{cases}
     \quad \text{and} \quad K \leqslant N
   \]
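
   The column bookkeeping above can be sketched in a few lines of Python.
   This is a minimal illustration under stated assumptions, not the
   released implementation: the (position, name) gene tuples, the
   function names and the toy values are invented for the example. Note
   that the piecewise definition of $k_i$ is equivalent to taking the
   ceiling of $L_i / N$.

```python
def merged_gene_list(per_cancer_lists):
    """Union of the per-cancer gene lists r_ij of one chromosome (R_i),
    sorted by chromosomal position. Genes are (position, name) tuples,
    a hypothetical representation chosen for this sketch."""
    merged = set()
    for genes in per_cancer_lists:
        merged.update(genes)
    return sorted(merged)  # tuples sort by position first


def gene_columns(gene_count, n):
    """k_i: number of gene columns needed for L_i genes with n rows."""
    if gene_count % n != 0:
        return gene_count // n + 1
    return gene_count // n


def total_pixel_columns(gene_counts, n):
    """K: three pixel columns (SNP, INS, DEL) per gene column."""
    return sum(3 * gene_columns(count, n) for count in gene_counts)


# Toy example with made-up genes on one chromosome and two cancer types.
r_i0 = [(100, "GENE_A"), (300, "GENE_C")]
r_i1 = [(200, "GENE_B"), (300, "GENE_C")]
R_i = merged_gene_list([r_i0, r_i1])
L_i = len(R_i)  # 3 distinct genes

# Made-up gene counts for three chromosomes, checked against N = 310.
N = 310
counts = [620, 311, 1]
K = total_pixel_columns(counts, N)  # 3 * (2 + 2 + 1) = 15
assert K <= N, "N is too small to hold all chromosomes"
```

   Under this reading, one way to pick $N$ for a given dataset is to
   increase it until the final check passes.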

   According to the above description, we can choose an appropriate value
   of $N$. For our experimental data, $N$ is 310. Next, the collection of
   mutated genes is arranged and aligned vertically in the matrix map,
   according to their positions and order on chromosomes 1 to 22, X and
   Y, thereby forming an $N \times N$ genetic mutation map for each tumor
   sample (Fig. 1A). Each chromosome occupies $k_i \times 3$ columns in
   the genetic map, containing $p$ $(p \leqslant k_i \times N)$ genes,
   and the extra pixels in the image are set to zero. Finally, we output
   the genetic mutation maps for all 9,047 patient samples from the 36
   types of cancer. All the mutation maps are normalized by the maximum
   value over the RGB channels.
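
   As an illustration of the layout described above, the following
   sketch fills the map of one sample in pure Python. It reflects one
   reading of the text and is not the authors' code: it assumes that
   genes fill each group of three pixel columns from top to bottom, that
   pixel intensities are per-gene mutation counts, that the three pixels
   of a gene appear in the order SNP, INS, DEL, and that normalization
   uses the maximum within a single image. The function name and the toy
   counts are hypothetical.

```python
def build_mutation_map(chromosomes, n):
    """Return an n x n image (rows of [R, G, B] pixels) for one sample.

    chromosomes: one list per chromosome (ordered 1-22, X, Y); each entry
    is an (snp, ins, del) count tuple for a gene, with genes already
    sorted by chromosomal position. This input format is an assumption.
    """
    widths = [3 * -(-len(genes) // n) for genes in chromosomes]  # 3 * k_i
    if sum(widths) > n:
        raise ValueError("K exceeds N; choose a larger map size")

    image = [[[0.0, 0.0, 0.0] for _ in range(n)] for _ in range(n)]
    offset = 0  # first pixel column of the current chromosome
    for genes, width in zip(chromosomes, widths):
        for g, (snp, ins, dele) in enumerate(genes):
            row = g % n                  # genes stack vertically
            col = offset + 3 * (g // n)  # then move to the next gene column
            image[row][col][2] = float(snp)       # blue pixel: SNP
            image[row][col + 1][1] = float(ins)   # green pixel: INS
            image[row][col + 2][0] = float(dele)  # red pixel: DEL
        offset += width

    # Normalize by the maximum value over all RGB channels of this image.
    peak = max(value for row in image for pixel in row for value in pixel)
    if peak > 0:
        for row in image:
            for pixel in row:
                for channel in range(3):
                    pixel[channel] /= peak
    return image


# Toy sample: 10 genes on a first chromosome, 1 gene on a second, N = 9.
chrom_a = [(2, 0, 0)] + [(0, 0, 0)] * 8 + [(0, 1, 0)]
chrom_b = [(0, 0, 4)]
toy_map = build_mutation_map([chrom_a, chrom_b], n=9)
```

   Pixels that do not correspond to any gene keep their initial value of
   zero, which matches the zero padding described in the text.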

Fig. 1.

   Schematic representation of image-based deep learning for genomic
   pan-cancer classification. (A) The protocol of genetic mutation map
   construction. The gene mutation conditions, including
   single-nucleotide polymorphism (SNP), insertion (INS) and deletion
   (DEL), together with chromosome position information, are transformed
   into the genetic mutation map. Each chromosome occupies $k_i \times 3$
   columns in the genetic map, containing $p$
   $(p \leqslant k_i \times N)$ genes. Each gene occupies three pixels in
   the same row of the mutation map, where each pixel represents a
   mutation condition of the gene, colored blue, green or red for SNP,
   INS and DEL, respectively. These pixels are arranged and aligned
   vertically in the mutation map according to their positions on the
   chromosomes, thereby forming an $N \times N$ matrix map for each
   patient, referred to as the genetic mutation map. (B) Workflow for
   establishing the image-based deep learning models. All patient samples
   are transformed into mutation maps and then divided into training,
   validation and testing sets. The images of the mutation maps are fed
   into different deep learning architectures for training and testing on
   pan-cancer classification. (C) Guided Grad-CAM is employed to generate
   heatmaps for the identification of the top distinct candidate genes
   that help the pan-cancer classification. (For
   interpretation of the references to color in this figure legend, the