Abstract Cancer evolves through the accumulation of somatic mutations over time. Although several methods have been developed to characterize mutational processes in cancers, these have not been specifically designed to identify mutational patterns that predict patient prognosis. Here we present CLICnet, a method that utilizes mutational data to cluster patients by survival rate. CLICnet employs Restricted Boltzmann Machines, a type of generative neural network, which allows for the capture of complex mutational patterns associated with patient survival in different cancer types. For some cancer types, clustering produced by CLICnet also predicts benefit from anti-PD1 immune checkpoint blockade therapy, whereas for other cancer types, the mutational processes associated with survival are different from those associated with the improved anti-PD1 survival benefit. Thus, CLICnet has the ability to systematically identify and catalogue combinations of mutations that predict cancer survival, unveiling intricate associations between mutations, survival, and immunotherapy benefit. INTRODUCTION Cancer progression is a stochastic evolutionary process in which cells acquire somatic mutations that allow them to evade growth suppression, resist cell death signals, and enhance replication and immune suppression ([28]1,[29]2). Most cancers are caused by multiple somatic mutations that together lead to the cancer phenotype ([30]1,[31]3). The somatic mutations that cause cancer are often called driver mutations, or simply, drivers. The drivers can cause impairments in a variety of functional pathways, including DNA replication, DNA repair, cell cycle control, and programmed cell death ([32]4,[33]5). In addition, cancers are extremely heterogeneous, such that the driver mutations and affected genes vary greatly between patients, even within the same cancer type ([34]6,[35]7). Although some somatic mutations are indeed drivers that directly contribute to the cancer phenotype, the substantial majority are passengers, that is, mutations that are simply coincidentally present in tumors and have no discernible effect on the cancer phenotype ([36]8,[37]9,[38]10), some of these mutations could result from DNA repair impairments in cancer. Thus, it is in general difficult to pinpoint mutations that are critical for tumor initiation and progression, and to identify clinically relevant combinations of mutations that could facilitate stratifying patients by survival rate and/or treatment response ([39]11,[40]12). The rapid accumulation of cancer genomic data in recent years has enabled the creation of a comprehensive collection of somatic mutations in cancer and evaluation of their impact on tumor progression ([41]13,[42]14,[43]15). In contrast to germline mutations, that is, predisposition variants detected in germline cells, somatic mutations that are the most common cause of cancer occur in diploid cells and are tissue-specific. To systematically characterize the mutational processes that promote cancer, mathematical methods have previously been used to decipher mutational signatures from somatic mutation catalogues ([44]16). These approaches largely involve modelling specific mutation types in trinucleotides using Nonnegative Matrix Factorization (NMF) ([45]16,[46]17,[47]18,[48]19). Although these mutation signatures successfully characterize key mutational processes for numerous cancers ([49]17,[50]18,[51]19), they are not optimized for the prediction of patient survival or treatment efficacy. The recent development of immune checkpoint blockade therapies, and particularly, anti-PD1 (programmed death-1) treatment has demonstrated durable responses in multiple cancer types, especially, melanoma, lung cancer and mismatch repair deficient gastrointestinal and endometrial cancers ([52]20,[53]21,[54]22). However, not all patients respond to this treatment, which can incur severe side effects and costs ([55]23,[56]24), thus adding urgency to the need for the use of mutational data to predict treatment efficacy. Indeed, the first FDA-approved marker for anti-PD1 efficacy is based on high microsatellite instability (MSI-H) ([57]25,[58]26), which results from mismatch repair deficiency and is therefore linked to increased mutagenesis ([59]27,[60]28). More recently, high tumor mutational burden (TMB-H), has also been approved by the FDA as a marker for anti-PD1 efficacy based on similar research ([61]29,[62]30,[63]31,[64]32). However, the MSI-H marker is limited to gastrointestinal and endometrial cancers, where mismatch repair deficiency is observed almost exclusively ([65]33). In addition, the predictive signal of TMB-H status can be confounded by disease subtype ([66]34,[67]35). When considered individually, some cancer types do not show association between TMB-H and survival with anti-PD1 immune checkpoint blockade treatment ([68]31,[69]34,[70]35) although such association is strongly evident in non small cell lung cancers ([71]30,[72]34). Several techniques have been developed to study the associations between cancer mutations and survival ([73]36,[74]37,[75]38), and many studies have reported mutations in distinct genes that are associated with survival in particular cancer types. For example, mutations of TP53, KRAS and PIK3CA are associated with survival in colorectal cancers ([76]39,[77]40,[78]41), mutations of BRCA1 and BRCA2 in breast and ovarian cancers ([79]42,[80]43,[81]44), and mutations of BRCA2 in prostate cancers ([82]45,[83]46). ALK rearrangements are exploited for treatment and prognosis of non-small cell lung cancer ([84]47), B2M mutations for multiple myeloma and leukemia prognoses ([85]48), c-KIT mutations for gastrointestinal stromal tumors ([86]49), BRAF V600E mutations are predictive of papillary thyroid cancer recurrence ([87]50), and MYCN alterations are serves as a marker of spontaneous regression in neuroblastoma ([88]51). In addition, some RNA based signatures are employed in the clinic, including the 21-gene signature to predict breast cancer recurrence ([89]52), and a 17-gene signature to predict prostate cancer risk and recurrence ([90]53,[91]54). However, mutation-based studies usually focus on a single gene or cancer type, and do not include comprehensive analyses of potential gene interactions that could predict survival. Several methods have been developed to identify complex combinations of mutations that correspond to interaction networks ([92]55,[93]56,[94]57) and/or can be used for cancer subtype clustering ([95]9,[96]58,[97]59), but to our knowledge, these precise methods have not been directly harnessed towards survival prediction, despite the observations that subtype clustering often defines survival differences ([98]60,[99]61). We are unaware of efforts to systematically identify and catalogue combinations of mutations that would predict survival across different cancer types. There is therefore a pressing need for an approach that could uncover combinations of mutations that enable clustering cancer patients based on survival rates derived from mutational patterns, and could be used to systematically identify mutational patterns that predict survival in different cancer types. Here, we present CLICnet ([100]http://clicnet.pythonanywhere.com/), a computational method for Clinical Clustering of Cancer patients using neural NETworks, which includes a collection of independent predictors trained for different cancer types. To our knowledge, CLICnet is the first method to systematically identify and catalogue patterns of somatic mutations that are significantly predictive of survival in different cancer types, based on subsets of genes from the MSK-IMPACT panel. CLICnet relies on Restricted Boltzmann Machine (RBM) ([101]62,[102]63) neural networks to cluster cancer patients into high and low risk clusters, based on mutations in cancer type-specific sets of genes. We analyzed 10,141 tumors samples that represent 15 cancer types, from the Cancer Genome Atlas (TCGA) ([103]64,[104]65) and Memorial Sloan Kettering Cancer Center (MSKCC) cohorts ([105]66). CLICnet was trained and validated on the TCGA and MSKCC mutation and survival data, respectively, for each cancer type, to cluster patients into two clusters with significantly different survival rates, based on mutation patterns. We catalogued the top 5 combinations of mutations for each cancer that are predictive of patient survival. In some cancer types, the CLICnet clusters were also predictive of the anti-PD1 immune checkpoint blockade therapy benefit. Thus, CLICnet allows the identification of combinations of mutations that predict survival, provides a catalogue of such combinations across different cancer types, and pinpoints mutation combinations that predict survival under anti-PD1 treatment in three cancer types. MATERIALS AND METHODS Training and validation sample collection and preprocessing The TCGA ([106]65) mutation data was downloaded from the Xena browser ([107]67,[108]68) and the corresponding clinical data was obtained ([109]69); the two data sets were merged using the patient barcode. Survival was set to the maximum value between the ‘last_contact_days_to’ and ‘death_days_to’ columns. The MSKCC mutation and clinical data ([110]66) were downloaded from the cBioPortal ([111]70,[112]71) ([113]https://www.cbioportal.org) and merged using the patient ID. Survival was set to the ‘OS_MONTHS’ column. A large proportion of the samples in the MSK-IMPACT cohort was derived from distal metastases in different tissues (55%), which can bias the analysis, especially because the TCGA training data considered did not include samples from such sites. therefore, only the primary site tumor samples in the MSK-IMPACT cohort were included, by filtering for samples for which the ‘SAMPLE_TYPE’ column was set to ‘Primary’. The analyzed cancer types, which are included in both datasets are detailed in Figure [114]1 and [115]Supplementary Table S4. Given the differences in the assignment of cancer types and the different cancer types that are included in each dataset, the samples were aggregated into a total of 15 types of cancer. We filtered the samples to retain those with one of the 15 cancer types included in both datasets. Colorectal adenocarcinomas (COAD, READ) were aggregated because they have similar mutations and clinical characteristics. Non-small cell lung cancers (LUSC, LUAD), and renal cell carcinomas (KIRC,KIRP) were aggregated by the tissue of origin to increase the sample size, as it has been done in previous studies ([116]72,[117]73,[118]74,[119]75,[120]76). We verified that the identified CLICnet gene sets were also associated with the outcome independently in each of these cancer types ([121]Supplementary Figure S6). In addition, to evaluate the predictive ability of CLICnet gene sets that were selected in non-small cell lung cancer and renal cell carcinoma specific subtypes, we individually trained CLICnet on LUAD, LUSC and KIRC (KIRP was excluded because the MSK data contains only 15 KIPR samples). We could not identify any gene set specifically for LUSC, likely, due either to the smaller sample size or the subtype discrepancy between the TCGA and MSK cohorts. We found five LUAD gene sets after five training iterations, all of which were significantly predictive on the MSK data ([122]Supplementary Figure S7, [123]Supplementary Table S5). In addition, we found five KIRC gene sets that were significantly predictive on the MSK data after seven iterations ([124]Supplementary Figure S7, [125]Supplementary Table S5). The LUAD specific gene sets show a slightly better predictive ability compared with the aggregated non-small cell lung cancer gene sets with fewer training iterations, and the KIRC specific gene sets show a comparable predictive ability to the aggregated renal cell carcinoma gene sets with a comparable number of iterations. Figure 1. [126]Figure 1. [127]Open in a new tab The CLICnet dataset, and the training and validation pipeline. (A) Illustration of the training of CLICnet. The GA is used to select genes that are given as input to the RBM. The RBM produces two clusters of patients, given the mutations provided by the GA, and the survival difference between patients in the two clusters is evaluated (via log-rank P-value), and given back to the GA as the fitness function. (B) The datasets used for training and validation of CLICnet. The numbers in the table refer to the number of samples in each dataset. Mutation values per gene were set to 1 if a non-synonymous mutation was present and to 0 otherwise. Gene level mutations were used rather than nucleotide level mutations to restrict the overall number of features, avoiding having many more features than samples, which would hinder application of machine learning methods. Two additional datasets were obtained for melanoma patients treated with anti-PD1 ([128]77,[129]78), for which the mutational and clinical data were obtained from the cBioPortal ([130]70,[131]71) ([132]https://www.cbioportal.org). RBM structure, training and assessment RBMs are neural networks that are typically utilized for unsupervised learning tasks which involve automatically discovering and learning regularities and patterns in the input data such that the model learns to generate new examples ([133]62,[134]63). The choice of RBMs as the machine learning technique for this study was motivated by the following: (i) First, RBMs are specifically designed for applications to binary and sparse datasets ([135]79). Given the sparse and binary nature of the mutation data, which is a severe bottleneck for the application of most machine learning techniques ([136]80,[137]81), we reasoned that the method of choice should mitigate these issues, by being well fitted to process and utilize this type of datasets. (ii) Second, RBMs are simple, shallow and unsupervised, thus allowing interrogation of the features and weights learned (and therefore, potential interpretability applications), and can learn a distribution over the data without explicitly optimizing a supervised classification task, which might lead to overfitting. (iii) Third, RBMs are generative models. Therefore, they can extract highly informative features from the input data to learn the hidden states, which then can be used for clustering. Because we were interested in clustering patients by mutations in sets of genes, which make a sparse and binary input data, and aimed for a simple method with potentially interpretable features, RBMs were considered to be the optimal choice for this study. Each CLICnet RBM is trained for a specific cancer type, on a specific set of genes. The RBMs used in this work were constructed with Inline graphic visible units and one hidden unit, where Inline graphic is the number of genes in the gene set. The number of epochs for training each RBM was set to 1000, with a 0.1 learning rate. When a trained RBM is applied to mutation data from a new sample, the hidden unit activation can be either zero or one, defining the cluster assignment of the samples. To assess how each RBM clustering predicts patient survival rates, Cox's proportional hazard model was applied to the assigned clusters and the corresponding patient survival data. Patients were additionally stratified by sex, age and stage, to ensure that these were not confounders of the analysis. The subsets of patients with treatment information were additionally stratified by treatment, to ensure that these were not confounders of the analysis. The P-value was extracted, with a lower P-value indicating a stronger association between the defined clusters and overall survival. Although RBMs are inherently stochastic in both training and application, the trained RBMs created for CLICnet include a deterministic procedure to define the clustering (or hidden states), and thus make the subsequent applications deterministic and reproducible. This was achieved by directly using the hidden probabilities to define the hidden state (where the median is set as cutoff), rather than randomly sampling a new distribution over these probabilities, to ensure that CLICnet always returns the same results for the same input once trained. In addition, for training of CLICnet, we set a constant random seed, to ensure that retraining CLICnet with the same input yields the same trained RBM. As a result, for any set of genes, there is one specific clustering of patients that is inferred by the CLICnet RBM. Selecting sets of genes for CLICnet The RBMs were incorporated with the GA feature selection to identify gene sets that yield RBM-inferred clusters with significantly different survival rates. The genes that were initially considered for training were those that are included in the MSK-IMPACT panel ([138]82) and that, across the patients within each cancer, are mutated in the top 0.7 percentile frequency among all MSK-IMPACT panel genes. From the set of genes that is used as initial input to CLICnet, three iterations of GA are applied to select the subset of genes that, when given as input to the RBM, optimizes the difference in survival rates between the two RBM clusters. Hence, the GA step depends on the RBM clustering to evaluate the fitness function, which is the survival difference between the two RBM-inferred clusters. The RBM step receives different solutions (sets of genes) from the GA, and evaluates each solution through the survival difference between the two inferred clusters (Figure [139]1). The following steps of the GA were defined for each cancer type: (a) Initialization of a population of size 50, where 15% of the considered genes for the given cancer type was randomly selected for each instance in the initial population. (b) Evaluation of each instance in the population, where mutations in each gene set in the population were used to train an RBM, define two clusters of patients, and yield a Cox P-value which shows how well the clusters correspond to survival. This P-value was used to evaluate each of the gene sets. (c) The top half ([140]25) instances in the population, that is, those with the lowest Cox P-values, were selected for reproduction, with randomly selected pairing. (d) Crossover was applied to the randomly selected pairs, until a population size of 50 was reached. Three iterations of steps (b)–(d) were repeated, and the best solution was retained, corresponding to the sets of genes that yielded the lowest P-values with the RBM clustering. These parameters (the population size and percentage of considered genes) were optimized for the TCGA training set via a grid search, with a 3-fold cross validation. For each cancer type, the genetic algorithm was applied until five different sets of genes were found, such that each of them yielded an RBM clustering with a Cox's proportional hazard P-value ≤0.05 in the training (TCGA) and validation set (MSKCC). We used 100 iterations of this process as the upper bound, to reduce the risk of overfitting, where for all 15 cancer types, at least five sets were found in fewer than 100 iterations (the precise number of iterations required for each cancer type is shown in Figure [141]2). The number 5 was selected because it is the largest number of gene sets that were found for every cancer type under 100 iterations. Therefore, the top 5 gene sets in each cancer type are reported. The entire training of CLICnet was completed in less than 6 hours on a high performance computing cluster. Figure 2. [142]Figure 2. [143]Open in a new tab Evaluation of the performance and stability of CLICnet. (A) Bar plots show the number of iterations needed to obtain five gene sets that were significant on the TCGA training data, and subsequently significant on the MSKCC validation data, for each cancer type. The numbers within the bars show the percentage of validated gene sets, among those that were significant on the training data. Statistical significance (permutation P-value) is indicated with asterisks (* P < 0.01, ** P < 0.001). (B) Boxplots show the number of genes in the CLICnet gene set for each cancer type. (c) Boxplot with overlaid dot-plots showing the CLICnet log-rank P-values, when applied to randomly sampled 2/3 of the MSKCC validation set. Predicting survival with anti-PD1 using the TMB-H status The survival of MSKCC patients treated with anti-PD1 was predicted using the TMB-H status. To that end, we used different cut-offs of the TMB (ranging from 5 to 24), in order to define the TMB-H status and predict the survival of anti-PD1 treated patients. The prediction was performed separately for all anti-PD1 treated MSKCC samples, for primary samples (which were used for CLICnet), and for metastatic samples estimating the Cox proportional hazard ratio and P-values. Mutational signatures analysis To evaluate the mutational processes underlying the different CLICnet clusters, the mutational signatures reported by Alexandrov et al. were quantified for each TCGA sample ([144]83). These signatures were compared between each of the high and low risk clusters defined by CLICnet, in every cancer type, through a two-sided rank sum P-value, and significant (P-value < 0.05) associations were identified. Whenever a significant association between a cancer type and a mutational signature was detected, at least three CLICnet clusters from the given cancer type were associated with that signature. Statistical analyses Survival analyses: Kaplan-Meier survival curves were plotted, where the two CLICnet clusters define the curves. Cox proportional hazard analyses were applied to estimate how the CLICnet clusters predict the survival time, through a hazard ratio (HR) and P-value. Boxplots and comparisons: For all boxplots, center lines indicate medians, box edges represent the interquartile range, whiskers extend to the most extreme data points not considered outliers, and the outliers are plotted individually. Points are defined as outliers if they are greater than q[3] + w  ×  (q[3]–q[1]) or