Keywords: Pan-cancer classification, Genetic mutation map, Image-based deep learning, Guided Grad-CAM visualization, Tumor-type-specific genes, Pathway analysis

Abstract

Accurate cancer type classification based on genetic mutations can significantly facilitate cancer diagnosis. However, existing methods usually combine feature selection with simple classifiers to quantify key mutated genes, resulting in poor classification performance. To circumvent this problem, we employ a novel image-based deep learning strategy to distinguish different types of cancer. Unlike conventional methods, we first convert gene mutation data containing single-nucleotide polymorphisms, insertions and deletions into a genetic mutation map, and then apply deep learning networks to classify cancer types based on the mutation map. We outline these methods and present results obtained by training VGG-16, Inception-v3, ResNet-50 and Inception-ResNet-v2 neural networks to classify 36 types of cancer from 9047 patient samples. Our approach achieves higher overall accuracy (over 95%) than other widely adopted classification methods. Furthermore, we demonstrate the application of Guided Grad-CAM visualization to generate heatmaps and identify the top-ranked tumor-type-specific genes and pathways. Experimental results on prostate and breast cancer demonstrate that our method can be applied to various types of cancer. Powered by deep learning, this approach can potentially provide a new solution for pan-cancer classification and cancer driver gene discovery. The source code and datasets supporting the study are available at https://github.com/yetaoyu/Genomic-pan-cancer-classification.

1. Introduction

Cancer is considered a deadly genetic disease characterized by abnormal cell growth [1], [2].
Globally, more than 18 million new cancer cases were diagnosed in 2018, resulting in 9.6 million deaths [3]. Genetic mutations have been shown to be associated with different types of cancer [4], [5], [6]. Cancer classification based on genetic mutations can be readily achieved thanks to the increased use of high-throughput sequencing techniques. A large amount of mutation data has been generated and publicly released. Among these resources, The Cancer Genome Atlas (TCGA) is a cohort cataloguing genetic mutation data for more than 30 types of cancer from more than 10,000 patients [7]. TCGA contains various kinds of genetic mutation data, including single-nucleotide polymorphisms (SNP), small insertions or deletions (INDEL), copy number variations (CNV), etc. With this massive amount of data, researchers are now able to design new analytical methods for accurate cancer classification and detection based on gene alterations. However, accurate and reliable cancer classification is particularly challenging because of the complexity and scale of the data. Sequencing covers thousands of genes, but most genes do not contain informative mutations, which makes classification by analyzing all of those genes difficult [8], [9]. To avoid mutation data that are too sparse (or even all zero), most analytical methods screen genes before classification [10], [11]. These methods are simple and effective in some cases, but important features (genes) may be removed during the screening process.

Recent advances in deep learning underpin a collection of algorithms with an impressive ability to analyze molecular data without prior feature selection or human-directed training. Prior deep learning approaches usually work well for a specific type of cancer, such as brain cancer [12], gliomas [13], acute myeloid leukemia [14], breast cancer [15], [16], soft tissue sarcomas [17] and lung cancer [18].
Given the complexity of pan-cancer data, directly applying these approaches might not be appropriate for multiple types of cancer. Recently, some studies have begun to consider the importance of genetic mutations in classifying multiple types of cancer. By analyzing the genetic mutation profiles of more than 8000 samples from 12 cancer types obtained from TCGA, Sun et al. [19] reported a novel method, Genome Deep Learning (GDL), for cancer subtyping. However, more than 12 specific models had to be constructed. Because it depends on this many separate models, the approach is poorly suited to analyzing more cancer types and larger cancer mutation datasets. Yuan et al. [20] described DeepGene, an advanced Deep Neural Network (DNN) based cancer type classifier. Experimental results on 12 selected types of cancer from TCGA demonstrated improved classification performance compared with Support Vector Machine (SVM), k-Nearest Neighbors (KNN) and Naïve Bayes (NB) classifiers. However, the DNN classifier reaches a best accuracy of only 65.5%, which limits its development as an accurate cancer classifier. In addition, most of these studies used only one type of genetic mutation data as input for cancer classification, which limits the performance of the classifier. For instance, Yuan et al. built DeepGene on somatic point mutation data [20]. AlShibli et al. [21] proposed three deep learning techniques to classify six cancer types based on CNV data. Although these methods are effective, the feature information they use is still not comprehensive enough. As far as we know, no existing work is specifically designed to combine multiple types of mutation data. As a result, a general algorithm for easy and reliable cancer classification based on multiple types of genetic mutation data is still missing. Previous works tend to use a variety of modeling methods, sometimes combining them.
In such a context, merely adopting deep learning approaches developed in other settings might not be appropriate for pan-cancer classification based on different gene mutation data. Given these challenges, a new and simple approach is necessary.

Motivated by work on deep learning in image analysis, we describe a novel image-based deep learning strategy for cancer classification and mutated gene discovery. The proposed strategy consists of three main steps: construction of the genetic mutation map, classification using deep Convolutional Neural Networks (CNN), and identification of cancer driver genes by Guided Grad-CAM (a combination of Guided backpropagation and Gradient-weighted Class Activation Mapping) visualization [22]. This novel strategy makes the following research contributions:

* (1) A genetic mutation map is constructed for each cancer patient, documenting gene alteration conditions including single-nucleotide polymorphism (SNP), insertion (INS) and deletion (DEL) with chromosome position information. Prior knowledge of the mutated genes is not necessary, which avoids the bias caused by hand-picking. The correlation between mutated genes and cancer types can be built without gene prescreening.
* (2) Used in combination, the genetic mutation map and popular deep neural networks produce high accuracy in pan-cancer classification. Compared with other widely used classification methods (such as SVM and KNN), our test classifiers, including VGG-16 [23], Inception-v3 [24], ResNet-50 [25] and Inception-ResNet-v2 [26], can effectively extract deep features from complex genetic mutation data and significantly improve the classification accuracy.
* (3) Guided Grad-CAM visualization is applied to generate heatmaps, which are used to identify tumor-type-specific genes and pathways.
* (4) A systematic examination of gene mutations in 36 types of cancer from 9,047 patient samples demonstrates the advancement of our method, allowing a deeper understanding of the mutation landscape of cancer. The constructed genetic mutation map dataset is publicly released at https://github.com/yetaoyu/Genomic-pan-cancer-classification/tree/master/DNN-models/dataset.

2. Materials and methods

2.1. Cancer types and samples statistics

The genetic mutation data for various types of cancer in TCGA were collected from the Firebrowse portal (http://firebrowse.org/). The dataset was assembled by selecting, across all samples for 36 cancer types, the genes that contain mutations. As shown in Supplementary Fig. S1, the upper line chart shows the number of mutated genes for each type of cancer, and the lower bar chart shows the total number of mutation conditions in those genes, including SNP, INS and DEL. As shown on the horizontal axis, the sample number of each tumor type ranges from Cholangiocarcinoma (CHOL, n = 35) to Breast invasive carcinoma (BRCA, n = 982). Using 9,047 TCGA samples with 23,231 mutated genes, we demonstrate the general applicability of our image-based deep learning method to 36 types of cancer.

2.2. Mutation map construction

To construct the mutation landscape of cancer, we create the genetic mutation map. Assuming that the size of the mutation map is $N \times N$, all the mutated genes from the 36 types of cancer are collected, grouped and placed in the matrix map according to their positions on the chromosomes. First, the mutated genes from each type of cancer are sorted according to their positions on the chromosomes (chromosomes 1–22, X and Y). For cancer $j$, the list of mutated genes on chromosome $i$ is $r_{ij}$, where $0 \le i \le 23$ and $0 \le j \le 35$ in this paper.
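As a rough illustration of the sorting step just described, the following Python sketch builds the per-chromosome, per-cancer gene lists $r_{ij}$ from a flat table of mutation records. The record layout (cancer type, gene, chromosome, start position), the function name `build_sorted_gene_lists` and the toy gene names and positions are assumptions made for illustration; they are not taken from the released source code.

```python
from collections import defaultdict

# Chromosome order used for the index i (0..23): chromosomes 1-22, then X and Y.
CHROMOSOMES = [str(c) for c in range(1, 23)] + ["X", "Y"]
CHROM_INDEX = {name: i for i, name in enumerate(CHROMOSOMES)}


def build_sorted_gene_lists(records, cancer_types):
    """Return r, where r[i][j] lists the genes mutated in cancer j on chromosome i.

    `records` is an iterable of (cancer_type, gene, chromosome, start_position)
    tuples. This layout is a hypothetical simplification of a mutation table.
    Each list is sorted by chromosomal position and contains each gene once.
    """
    cancer_index = {name: j for j, name in enumerate(cancer_types)}

    # positions[(i, j)][gene] = smallest start position seen for that gene
    positions = defaultdict(dict)
    for cancer, gene, chrom, start in records:
        key = (CHROM_INDEX[chrom], cancer_index[cancer])
        seen = positions[key]
        seen[gene] = min(start, seen.get(gene, start))

    r = [[[] for _ in cancer_types] for _ in CHROMOSOMES]
    for (i, j), genes in positions.items():
        # Sort by position; ties are broken by gene name so the order is stable.
        ordered = sorted(genes.items(), key=lambda kv: (kv[1], kv[0]))
        r[i][j] = [gene for gene, _ in ordered]
    return r


if __name__ == "__main__":
    # Toy records with made-up gene names and positions.
    toy_records = [
        ("BRCA", "G2", "17", 500),
        ("BRCA", "G1", "17", 100),
        ("BRCA", "G1", "17", 150),  # second mutation in the same gene
        ("PRAD", "G4", "X", 7),
    ]
    r = build_sorted_gene_lists(toy_records, ["BRCA", "PRAD"])
    print(r[CHROM_INDEX["17"]][0])  # genes on chromosome 17 for BRCA
```

Keeping one entry per gene, rather than one per mutation, matches the description of $r_{ij}$ as a list of mutated genes; the mutation type of each gene would be recorded separately when the map pixels are filled in.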
Then the mutated genes on the same chromosome from different types of cancer are grouped according to their positions. For chromosome $i$, the length of the mutated gene set $R_{i\cdot} = r_{i0} \cup r_{i1} \cup \dots \cup r_{i35}$ collected from the different cancers is $L_i$. Therefore, the number of columns occupied by the genes on chromosome $i$ in the mutation map is $k_i \times 3$, where

$$k_i = \begin{cases} \lfloor L_i / N \rfloor + 1, & \text{if } L_i \,\%\, N \neq 0 \\ L_i / N, & \text{if } L_i \,\%\, N = 0 \end{cases}$$

Specifically, each gene occupies three pixels in the same row of the mutation map, where each pixel represents one mutation condition of the gene. These three pixels are colored blue, green and red to represent SNP, INS and DEL, respectively. Genes on all chromosomes occupy $K$ columns, where

$$K = \sum_{i=0}^{23} 3 k_i = \sum_{i=0}^{23} 3 \times \begin{cases} \lfloor L_i / N \rfloor + 1, & \text{if } L_i \,\%\, N \neq 0 \\ L_i / N, & \text{if } L_i \,\%\, N = 0 \end{cases} \quad \text{and} \quad K \le N$$

According to the above description, we can choose an appropriate value of $N$. In our experimental data, the value of $N$ is 310. Next, the collection of mutated genes is arranged and aligned vertically in the matrix map, according to their positions and order on chromosomes 1 to 22, X, and Y, thereby forming an $N \times N$ genetic mutation map for each tumor sample (Fig. 1A). Each chromosome occupies $k_i \times 3$ columns in the genetic map, containing $p$ ($p \le k_i \times N$) genes, and the extra pixels in the image are set to zero. Finally, we output the genetic mutation maps for all 9,047 patient samples from the 36 types of cancer. All the mutation maps are normalized by the maximum value over the RGB channels.

Fig. 1. Schematic representation of image-based deep learning for genomic pan-cancer classification. (A) The protocol of genetic mutation map construction.
The gene mutation conditions, including single-nucleotide polymorphism (SNP), insertion (INS) and deletion (DEL), are transformed together with chromosome position information into the genetic mutation map. Each chromosome occupies $k_i \times 3$ columns in the genetic map, containing $p$ ($p \le k_i \times N$) genes. Each gene occupies three pixels in the same row of the mutation map, where each pixel represents one mutation condition of the gene, colored blue, green, or red for SNP, INS and DEL, respectively. These pixels are arranged and aligned vertically in the mutation map according to their positions on the chromosomes, thereby forming an $N \times N$ matrix map for each patient, referred to as the genetic mutation map. (B) Workflow of establishing the image-based deep learning models. All patient samples are transformed into mutation maps and then divided into training, validation, and testing sets. The mutation map images are fed into different deep learning architectures for training and testing on pan-cancer classification. (C) Guided Grad-CAM is employed to generate heatmaps for the identification of the top distinct candidate genes that help the pan-cancer classification. (For interpretation of the references to color in this figure legend, the