Abstract

   Cancer is a serious disease that causes many deaths every year. We
   urgently need to design effective treatments to cure this disease.
   Tumor suppressor genes (TSGs) are a type of gene that can protect cells
   from becoming cancerous. In view of this, correct identification of
   TSGs is an alternative method for identifying effective cancer
   therapies. In this study, we performed gene ontology (GO) and pathway
   enrichment analysis of the TSGs and non-TSGs. Some popular feature
   selection methods, including minimum redundancy maximum relevance
   (mRMR) and incremental feature selection (IFS), were employed to
   analyze the enrichment features. Accordingly, some GO terms and KEGG
   pathways, such as biological adhesion, cell cycle control, genomic
   stability maintenance and cell death regulation, were extracted, which
   are important factors for identifying TSGs. We hope these findings can
   help in building effective prediction methods for identifying TSGs and
   thereby, promoting the discovery of effective cancer treatments.

Introduction

   Currently, cancer is the second most common cause of death, following
   cardiovascular disease. Cancer that originates from the epithelial
   cells or mesenchymal cells is characterized by uncontrolled cell
   proliferation. In malignancy, cancer cells invade adjacent normal
   tissues and metastasize through blood circulation, lymphokinesis or
   body cavity transfer. In this process, proteins that are coded by tumor
   suppressor genes (TSGs) play vital roles in the mechanisms associated
   with cellular growth, DNA damage, apoptosis and metabolic regulation
   [31][1].

   It has been reported that tumor suppressor inactivation and
   haploinsufficiency occur at several different levels in tumor patients.
   In the past decades, many classic TSGs have been widely identified,
   which are silenced by recurrent LOH (loss of heterozygosity) and
   physical deletion in the tumor genome. Increasing evidence has shown
   the abnormal DNA methylation or histone modifications, and non-coding
   RNA affect the expression of TSGs at the epigenetic level and
   post-transcriptional level, respectively [32][2], [33][3].

   The first identified TSG was retinoblastoma protein (Rb), which was
   identified by studies of familial retinoblastoma in early childhood.
   Based on this, the “two-hit” hypothesis was introduced by Knudson in
   1971 [34][4], [35][5]. As a guardian to the normal cell cycle, the Rb
   protein is responsible for the G1/S checkpoint and maintains regular
   cell growth. In addition to loss of heterozygosity, the high frequent
   mutations or partial deletions are mainly located in exon13∼exon17 of
   Rb and have been found in various cancer types, especially in lung
   cancer, breast cancer, osteosarcoma and bladder cancer, with a
   frequency ranging from 15% to 50% [36][6]–[37][10]. Like Rb, the p53
   protein family as a key element of the tumor suppression network,
   exerts much of its growth arrest in the cell cycle and induces
   apoptosis. Changes to p53 are involved in various cancers. Genetic
   variation mainly missense mutations, in p53 are often regarded as the
   driver mutations that confer apoptosis evasion and abnormal cell growth
   of tumor cells, especially those that originate from the epithelial
   tissue. More than 86% of point mutations occur in the evolutionary
   conservative regions, especially four mutation hotspots [38][11],
   [39][12]. In addition, p53 is silenced via LOH in the genome and
   hypermethylation at the epigenetic level in cancer patients [40][13],
   [41][14].

   Like Rb and p53, some tumor suppressor proteins control cell behaviors
   directly by arresting cell proliferation, disturbing the cell cycle and
   inducing apoptosis, and these are called the gate-keepers. The destiny
   of a cell is also affected indirectly by some tumor suppressor proteins
   that are associated with mutation accumulation and genome stability
   maintenance such as BRCA1 and BRCA2, which are also referred to as
   caretakers [42][15], [43][16].

   Additionally inherited mutations of BRCA1 and BRCA2 (breast cancer 1/2)
   are associated with patients who have hereditary breast cancer,
   accounting for 5–10% of all breast cancer patients [44][17]. Loss
   function of their products causes abnormal homologous recombination and
   genome instability, which increases the susceptibility to breast and
   ovarian cancer [45][18].

   Unlike the activated oncogene, suppression of TSGs occurs more
   frequently, providing evidence for understanding the initiation and
   progress of various cancers. The identification and subsequent
   activation of TSGs can facilitate controlling cell proliferation,
   restraining the biological activity of cancer. In this study, we
   attempted to investigate the characteristics of TSGs. The TSGs
   retrieved from the web-based database, TSGene (tumor suppressor gene
   database), facilitated our investigation of TSGs. These genes were
   called ‘positive genes’ and all of the remaining genes in the STRING
   were selected as ‘negative genes’. Gene Ontology (GO) is an
   acknowledged bioinformatics tool for representing gene product
   properties across all species by defined GO terms, the function of the
   genes and their products were represented by the GO terms and predicted
   by the GO annotation effectively [46][19], [47][20]. In contrast, the
   Kyoto Encyclopedia of Genes and Genomes (KEGG) is a comprehensive
   database based on known molecular interaction networks and is usually
   used to understand biological pathways and systems [48][21]. In view of
   this, the enrichment scores of the GO terms and KEGG pathways were used
   to encode all genes investigated in this study. Minimum redundancy
   maximum relevance (mRMR) and incremental feature selection (IFS)
   [49][22] combined with a prediction engine were employed to analyze
   these features. The analysis of the extracted GO terms and KEGG
   pathways suggests that they are related to TSGs. In addition, the
   extracted GO terms and KEGG pathways were used to predict the novel
   TSGs, indicating that they may help build effective computational
   methods for identifying TSGs.

Materials and Methods

Dataset

   We compiled 716 human TSGs in the TSGene database
   ([50]http://bioinfo.mc.vanderbilt.edu/TSGene/download.cgi), which were
   collected from two resources: public databases and literature reports.
   In detail, 187 (human) and 170 (human) known TSGs were retrieved from
   UniProtKB (28 January, 2012) and the TAG database
   ([51]http://www.binfo.ncku.edu.tw/TAG/GeneDoc.php) (29 March, 2012),
   respectively, with only 41 overlapped genes by mapping to the Entrez
   gene symbols. By combining two exhaustive searches, PubMed and Gene
   Reference Into Function (GeneRIF) [52][23], [53][24], and after
   overlapping and synonymous genes with same the Entrez gene ID were
   filtered, 637 protein-coding TSGs and 79 non-coding TSGs were
   identified [54][25]. Because the encoding method described in Section
   “Encoding method” employed the neighbors of each investigated TSG in
   the STRING, we obtained 615 genes with their ensembl protein IDs in the
   STRING. These genes were termed ‘positive genes’ and are given in
   [55]Table S1. The remaining 17,985 ensembl protein IDs in the STRING
   were considered ‘negative genes’.

   The number of negative genes was much larger than that of the positive
   genes. This is an imbalanced dataset. Inspired by some studies dealing
   with this type of data [56][26], [57][27], we divided the 17,985
   negative genes into six datasets, Inline graphic , where Inline graphic
   contained 3,075 negative genes and, Inline graphic contained 2,610
   negative genes. The 615 positive genes were put into each of these
   datasets, comprising six new datasets, Inline graphic , i.e., S[i]
   (i = 1,2,3,4,5,6) consisting of genes in A[i] (i = 1,2,3,4,5,6) and 615
   positive genes.

Encoding method

   To analyze the characteristics of the TSGs, it is very important to
   encode each gene with its essential properties. GO is an acknowledged
   bioinformatics tool for representing gene product properties across all
   species by defined GO terms, while KEGG is a comprehensive database
   based on known molecular interaction networks and usually includes the
   biological pathway and system information [58][21]. Therefore, we
   selected GO terms and KEGG pathways to code each gene. TSGs have a
   strong relationship with some GO terms and KEGG pathways. On the other
   hand, the enrichment method of GO can reflect the relationship between
   the genes and GO terms [59][28]. It is reasonable to use this method to
   encode genes and analyze the relationship of the TSGs and GO terms.
   Furthermore, this method can also be extended to KEGG pathways [60][29]
   to find the relationship between the genes and KEGG pathways.

GO enrichment

   For one gene g and one GO term GO[j], the GO enrichment score is
   defined as the −log[10] of the hypergeometric test P value
   [61][28]−[62][30] of a gene set G containing g’s direct neighbors in
   the protein-protein interaction network of STRING and GO term GO[j],
   which can be calculated by:
   graphic file with name pone.0107202.e005.jpg (1)

   where N is the number of overall proteins in human, M is the number of
   proteins annotated to the GO term GO[j], n is the number of proteins in
   G, and m is the number of proteins in G, which are annotated to the GO
   term GO[j]. The high score for one gene and one GO term implies that
   the gene and GO term have a special relationship. The 12,877 GO terms
   induced 12,877 GO enrichment scores.

KEGG enrichment

   For one gene g and one KEGG pathway P[j], the KEGG enrichment score is
   defined as the –log[10] of the hypergeometric test P value [63][29] of
   a gene set G containing g’s direct neighbors in the protein-protein
   interaction network of STRING and KEGG pathway P[j], which can be
   computed as follows:
   graphic file with name pone.0107202.e006.jpg (2)

   where N is the number of overall proteins in human, M is the number of
   proteins in the KEGG pathway P[j], n is the number of proteins in G, m
   is the number of proteins in both G and P[j]. Additionally, the higher
   the KEGG enrichment score for g and P[j], the stronger the relationship
   between them. The 239 KEGG pathways induced 239 features of KEGG
   enrichment scores.

   Each of the 12,877 GO enrichment scores or each of the 239 KEGG
   enrichment scores can be considered a dimension. Accordingly, each gene
   g can be represented by a vector in 12,877+239 = 13,116-D space, which
   is formulated as:
   graphic file with name pone.0107202.e007.jpg (3)

Prediction method

   Dagging is a well-known meta classifier. The main idea of this
   classifier is to integrate multiple classifiers derived from a single
   learning algorithm that is trained by disjoint samples of the original
   dataset [64][31]. The brief description of this method is as follows.
   For a training dataset Inline graphic with samples Inline graphic ,
   construct k disjoint subsets by randomly taking Inline graphic samples
   in Inline graphic [,] without replacement, such that Inline graphic .
   These subsets were used to train a basic classifier (e.g., support
   vector machine) and derive k classification models, Inline graphic .
   For a query sample, each of these models M[i] (1≤i≤ k) provides a
   predicted result. The predicted result of dagging integrated these
   results by majority voting.

   In Weka 3.6.4 [65][32], the classifier “Dagging” implements the dagging
   classifier mentioned above. Here, it was adopted as the prediction
   engine. For convenience, it was run with its default parameters. In
   detail, the SMO (Sequential Minimal Optimization), which implements
   John Platt’s sequential minimal optimization algorithm for solving the
   optimization problem during the training of a support vector classifier
   using polynomial or Gaussian kernels [66][33], [67][34], is set as the
   basic classifier, and k is set to 10.

Evaluation method

   Ten-fold cross-validation is a widely used cross-validation method for
   evaluating the performance of different classification models
   [68][35]−[69][38]. Compared to the Jackknife test [70][39], [71][40],
   the 10-fold cross-validation test requires less computing time and
   provides similar results for a given dataset. Therefore, the current
   study adopted this cross-validation method to evaluate the performance
   of the prediction method.

   To represent the predicted results of a two-class classification
   problem, a confusion matrix was often employed, which contained the
   following four entries: true positives (TP), true negative (TN), false
   positives (FP), and false negative (FN) [72][41], [73][42]. Based on
   these values, the prediction accuracy (ACC), specificity (SP),
   sensitivity (SN) [74][42] and Matthews’s correlation coefficient (MCC)
   [75][43] were often used to evaluate the predicted results, which can
   be computed by
   graphic file with name pone.0107202.e014.jpg (4)

   As mentioned in Section “Dataset”, five datasets were constructed in
   this study to reduce the size difference of the ‘positive genes’ and
   ‘negative genes’. However, each dataset still had very different class
   sizes. In detail, the number of ‘negative genes’ was at least 4 times
   as many as that of ‘positive genes’. Thus, the ACC is not appropriate
   for evaluating the predicted results on the whole. MCC, as a balanced
   measure even if the classes are of very different sizes, was employed
   as the key measurement.

Feature selection method

   As mentioned in Section “Encoding method”, each gene was represented by
   13,116 features of the enrichment scores, which indicated the
   relationship between the genes and GO terms or KEGG pathways. TSGs are
   related to some GO terms and KEGG pathways. To identify key GO terms
   and KEGG pathways, some feature selection methods were employed in this
   study. The procedure of the feature selection method included two
   stages: (I) Cramer’s coefficient [76][44], [77][45], which used to
   discard non-essential features and (II) minimum redundancy maximum
   relevance (mRMR), incremental feature selection (IFS) [78][22] and
   Dagging [79][31] for further selection.

   The Cramer’s coefficient [80][44], [81][45], derived from the Pearson
   Chi-square test [82][46], is a statistical measure of two variables.
   Its value is between 0 and 1. According to the fact that a high
   Cramer’s coefficient of two variables indicates a strong association of
   two variables, features with low Cramer’s coefficients to samples’
   class labels were deemed non-essential features. Here, we used 0.1 as
   the threshold and features with Cramer’s coefficients lower than 0.1
   were excluded.

   The second stage of the feature selection involved the mRMR, IFS and
   Dagging. In detail, the mRMR method sorted the remaining features in
   two lists, while the IFS and Dagging were used to extract key features
   based on the feature lists obtained by the mRMR method. The mRMR
   method, proposed by Peng et al. [83][22], has two criteria:
   Max-Relevance and Min-Redundancy, producing the following two feature
   lists: (I) MaxRel feature list and (II) mRMR feature list. The MaxRel
   feature list sort features only based on the Max-Relevance criterion,
   while the mRMR feature list sort features based on both the
   Max-Relevance and Min-Redundancy. In this study, these two lists were
   formulated as follows:
   graphic file with name pone.0107202.e015.jpg (5)

   where N is the total number of features. The mRMR method has been
   widely used in recent years to analyze complicated biological problems
   [84][36], . Since the mRMR feature list was built with both the
   Max-Relevance and Min-Redundancy criteria in mind, it was used to
   extract important features by combining the IFS and Dagging. This
   procedure was as follows:

   (I) Construct N feature set from the mRMR features list   F [mRMR], say
   Inline graphic , such that    Inline graphic , i.e. Inline graphic
   consisted   of the first i features in F [mRMR].

   (II) For each Inline graphic , Dagging was conducted on samples
     represented by features in Inline graphic , evaluated by 10-fold
     cross-validation, thereby obtaining ACC, SP, SN and MCC   (cf. Eq.
   4).

   (III) The feature set that can produce the maximum MCC is the   optimal
   feature set. Additionally, an IFS-curve was plotted   with the MCC
   value as its Y-axis and the superscript i of    Inline graphic (the
   number of features that participate in the   classification) as its
   X-axis.

Results and Discussion

Results of the feature selection

   As mentioned in Section “Dataset”, 6 datasets, Inline graphic , were
   constructed. For each, we calculated the Cramer’s coefficients of the
   features and the samples’ class labels. Then, the features with
   Cramer’s coefficients lower than 0.1 were excluded. The remaining
   features were kept for the further selection. The number of remaining
   features for each dataset is shown in [85]Table 1 .

Table 1. The number of remaining features after using Cramer’s coefficient to
exclude non-essential features.

   Dataset Number of remaining features
   S [1]   3,347
   S [2]   3,837
   S [3]   4,632
   S [4]   4,270
   S [5]   4,956
   S [6]   6,661
   [86]Open in a new tab

   The mRMR method, IFS method and Dagging were used to analyze the
   remaining features for each dataset S[i]. The mRMR program, downloaded
   from [87]http://research.janelia.org/peng/proj/mRMR/, was executed on
   each dataset S[i], in which each sample was represented by the
   remaining features. For convenience, the mRMR method was conducted with
   its default parameters. As mentioned in Section “Feature selection
   method”, the MaxRel features list and mRMR features list were obtained
   for each dataset S[i]. However, to reduce the computation time, we only
   obtained the first 500 features in each of the two feature lists, which
   are summarized in [88]Table S2.

   The IFS method and classifier Dagging were executed according to the
   mRMR features list for each dataset S[i], which was evaluated by
   10-fold cross-validation. The SNs, SPs, ACCs and MCCs obtained for each
   dataset S[i] are given in [89]Table S3. For easy observation, we
   plotted an IFS-curve for each dataset S[i]. The six IFS-curves are
   shown in [90]Figure 1 ; the maximum MCCs for datasets Inline graphic
   were 0.3938, 0.4092, 0.4417, 0.4351, 0.4744, and 0.5511, respectively.
   These values are listed in [91]Table 2 , in which the numbers of the
   features used to obtain these maximum MCCs are also listed. In detail,
   by using the first 366, 440, 181, 318, 302, and 261 features in the
   mRMR features lists of the six datasets (see [92]Table S3),
   respectively, the MCCs calculated by Eq. 4 were 0.3938, 0.4092, 0.4417,
   0.4351, 0.4744, and 0.5511, respectively. Accordingly, six optimal
   feature sets, OS [1], OS [2], …, OS [6] can be obtained by selecting
   the first 366, 440, 181, 318, 302, and 261 features in six mRMR feature
   lists of six datasets, respectively.

Figure 1. Six IFS-curves for six datasets.

   [93]Figure 1
   [94]Open in a new tab

   In detail, (A) shows the IFS-curve for the dataset S [1]; (B) shows the
   IFS-curve for the dataset S [2]; (C) shows the IFS-curve for the
   dataset S [3]; (D) shows the IFS-curve for the dataset S[4]; (E) shows
   the IFS-curve for the dataset S [5]; (F) shows the IFS-curve for the
   dataset S [6]. The Y-axis represents the Matthews’s correlation
   coefficient (MCC) and the X-axis represents the number of features
   participating in the classification model.

Table 2. The number of features in the optimal feature set for each dataset
and the MCC values obtained by using these features.

   Dataset Number of features inthe optimal feature set Maximum MCC value
   S [1]   366                                          0.3938
   S [2]   440                                          0.4092
   S [3]   181                                          0.4417
   S [4]   318                                          0.4351
   S [5]   302                                          0.4744
   S [6]   261                                          0.5511
   Mean                                                 0.4509
   [95]Open in a new tab

Analysis of the GO terms in the total optimal feature set

   As mentioned in Section “Results of the feature selection”, six optimal
   feature sets were obtained. We took the union operation of these sets
   and obtained a new dataset denoted by OS ( Inline graphic ) and termed
   the total optimal feature set, consisting of 708 enrichment features of
   the GO terms and 9 enrichment features of the KEGG pathways, which are
   available in [96]Table S4. The analysis of 708 GO terms is described
   below.

   Seven hundred and eight GO terms can be divided into the following
   three parts: (1) Biological Process (BF); (2) Cellular Component (CC);
   and (3) Molecular Function (MF). We mapped the 708 GO terms to the
   children terms of three GO domains. As we can see in [97]Figures 2
   −[98] 4 , the GO terms in the OS were significantly enriched in some
   specific children terms with a high frequency and high ratio, which is
   defined as “the number of each GO term”/“the scale of the number of its
   children terms”.

Figure 2. Frequency and ratio of GO terms of biological process in OS.

   [99]Figure 2
   [100]Open in a new tab

   (A) Frequency of GO terms of biological process in OS. (B) Ratio of GO
   terms of biological process in OS.

Figure 4. Frequency and ratio of GO terms of molecular function in OS.

   [101]Figure 4
   [102]Open in a new tab

   (A) Frequency of GO terms of molecular function in OS. (B) Ratio of GO
   terms of molecular function in OS.

   Biological process GO terms. The top five biological process GO terms
   of the frequency shown in [103]Figure 2(A) are GO: 0065007: biological
   regulation (221), GO: 0044699: single-organism process (117), GO:
   0032502: developmental process (75), GO: 0050896: response to stimulus
   (60) and GO: 0009987: cellular process (35). The top five biological
   process terms with large base numbers that perform fundamental
   functions in organisms and tumor suppressor proteins may be functional
   disturbance in health maintenance of cancer patients.

   For the ratio of the biological process GO terms shown in [104]Figure
   2(B) , the top five are GO: 0022610: biological adhesion (4.39%,
   5/114), GO: 0040007: growth (2.72%, 4/147), GO: 0032502: developmental
   process (2.28%, 75/3294), GO:0065007: biological regulation (2.09%,
   221/10551) and GO:0050896: response to stimulus (2.0%, 60/3001). The GO
   terms biological adhesion and response to stimulus should be noted and
   relevant TS proteins act in the alarm reaction and have protective
   roles in tumorigenesis and the metastasis process. The GO term
   single-organism process involved in death and cell proliferation is
   highlighted too, although its percentage is not high. The destiny of an
   organism is critically regulated by the cell cycle and apoptosis in
   which TSGs play an important part. TSGs act like brakes on a car and
   are involved in maintenance of the cell cycle checkpoints and apoptosis
   induction [105][53].

   Cells are under constant attack by various agents and oncogenic DNA
   variants form because of endogenous (normal cell metabolite) and
   exogenous agents (chemical species and physical mutagens). To maintain
   genome stability, TSGs participated in multiple mechanisms to repair
   DNA damage and arrest cell proliferation. In DNA double-strand break
   repair (DSBR), several TS genes, including ATM, NBS1, BRCA1 and BRCA2,
   are activated by DNA damage to induce cell cycle checkpoint arrest and
   DSB repair complex formation [106][54]. The highly conserved DNA
   mismatch repair (MMR) proteins composed of MSH2, MLH1, PMS1 and PMS2
   tumor suppressor proteins in people, are required to correct base
   mismatches that are formed in response to exogenous or endogenous
   substances. If the expression of MLH1 or MSH2 is suppressed, cells lose
   the ability to perform mismatch repair and have resistance to
   alkylation mutagens that would normally activate G2/M checkpoint or
   apoptosis [107][55]. In nucleotide excision repair (NER), the DNA
   repair genes are regulated by p53 to remove bulky DNA adducts including
   pyrimidine dimmers induced by UV [108][56]. Normal, unrepaired DNA
   variants promote cells apoptosis.

   Normally, cell proliferation is tightly regulated in different periods
   of the cell cycle. The pRb (retinoblastoma protein), known as the first
   TSG, maintains the G1/S checkpoint through its regulation of the E2F
   family. Inactivation of pRb, which caused by mutations, promoter
   methylation or interaction with oncoproteins, results in loss of
   control of the checkpoint R, allowing for uncontrolled cell
   proliferation [109][57], [110][58]. In addition, cancer cells inhibit
   the expression of many other tumor suppressor proteins to gain
   malignant proliferation ability. For example, with mutations or the low
   expression of TGF-βR II (transforming growth factor βreceptor II) and
   its downstream proteins Smad2/3/4 (SMAD family member 2/3/4), cancer
   cells will be insensitive to the proliferation inhibition of TGF
   [111][59], [112][60]. Similar to pRb, the INK4 (cyclin -dependent
   kinase inhibitor, e.g., p16INK4A) family, which is regulated by TGF-β,
   can block CDK, causing cell growth arrest in a different period of the
   cell cycle. The dysfunction of INK4 or TGF-βR II will allow cells to
   pass through the checkpoint abnormally and accumulate variations
   [113][61], [114][62].

   Apoptosis, known as programmed cell death, can be initiated by two
   distinct signaling pathways, BCL2 induced and death receptor induced,
   which ultimately converge in the caspase cascade. The most famous TSG,
   p53, is mutated in ∼50% of human cancers and related to some tumor
   suppression network [115][14]. p53 is a transcriptional regulator that
   can be activated by DNA damage, certain oncogenes and other cytotoxic
   stress signals, triggering cell cycle arrest (G1/S checkpoint), DNA
   repair and apoptosis. Dysfunction of p53 caused by mutations or
   methylation prevents the damage-induced cell cycle arrest and apoptosis
   [116][63], [117][64]. As a TSG, PTEN (phosphatase with tensin homology)
   negatively regulates the PI3K (the phosphatidylinositol 3-kinase)
   pathway, preventing inappropriate metabolism via effects on TOR and
   promoting cell proliferation via effects on proapoptotic proteins
   [118][65]. CYLD(cylindromatosis), first identified as a TSG in the
   familial cylindromatosis, is a DUB (deubiquitinase) of the USP
   subfamily. Multiple myeloma patients with dysfunction of CYLD have
   abnormal activation of NF-kB and cell cycle and apoptosis dysfunction
   [119][66], [120][67]. The insufficient activation of caspase 8
   (apoptosis-related cysteine peptidase), a key TS gene in the caspase
   cascade, leads to the interruption of signal transduction from death
   receptors, inducing normal apoptosis [121][68], [122][69].

   Many tumor cell types acquire the capacity to invade and metastasize
   though loss of cell-cell adhesion or cell-ECM (extracellular matrix)
   junctions. The silencing or suppression of E-cadherin, which is
   regulated by promoter methylation, histone methylation, transcriptional
   repression or frequent mutations cause EMT (epithelial-mesenchymal
   transition), disruption of cell contacts, tumor cell detachment and
   invasion [123][70], [124][71]. Integrins, a family of heterodimeric
   transmembrane proteins, mediate cell–ECM (extracellular matrix)
   interactions. Aberrant integrin can induce the activation of
   proteolytic enzymes and cause degradation of the extracellular matrix
   and basement membrane, promoting tumor cells metastasis [125][72]. MMPs
   (matrix metalloproteinase) are endopeptidases that are involved in the
   breakdown of the extracellular matrix; they are regulated by
   inhibitors, TIMPs (Tissue Inhibitor of Metalloproteinases). Loss of
   function of TIMPs, which are TSGs, may cause a MMP/TIMP equilibrium
   shift into a malignant status [126][73], [127][74].

   Except the features above, which help us comprehend the relevance
   between tumor suppressors and specific GO terms or pathways, some rare
   investigated terms were highlighted such as metabolic process
   (GO:0008152), reproductive process (GO:0022414), locomotion
   (GO:0040011), localization (GO:0051179)/establishment of localization
   (GO:0051234) and multi-organism process (GO:0051704). These features
   remind us tumor suppressors participate in protein localization
   intracellular, various cells migration and locomotion intercellular,
   complex metabolic process and multi-organism process in the whole
   organism. Particularly, in some tumor types, tumor suppressors play key
   roles in reproductive process, usually related to hormone and hormone
   receptors. These features are not studies deeply as others, but need
   more attention to mine novel tumor suppressors.

   Cellular component GO terms. It can be seen from [128]Figure 3(A) that
   the top five CC GO terms with regard to frequency are GO:0044464: cell
   part (19), GO:0044422: organelle part (10), GO:0032991: macromolecular
   complex (9), GO:0030054: cell junction (3), and GO:0044425: membrane
   part (3), which also have a corresponding high percentage. Their ratios
   (cf. [129]Figure 3(B) ) are GO: 0030054: cell junction (8.57%, 3/35),
   GO: 0044422: organelle part (0.73%, 10/1361), GO: 0044464: cell part
   (0.67%, 19/2823), GO: 0032991: macromolecular complex (0.49%, 9/1824),
   and GO:0044425: membrane part (0.41%, 3/724). Cell junction is a
   cellular component that forms connections between two cells or between
   a cell and the extracellular matrix. As discussed above, TSGs such as
   E-cadherin and integrin play critical roles in tumor cell adhesion and
   metastasis. Additionally, organelles, including the mitochondria,
   ribosomes and UPS (ubiquitin-proteasome system), participate in the
   biological process involved in carcinogenesis. Many macromolecular
   complexes consist of tumor suppressor protein inside cells, such as
   TSgene SMAD2/3(SMAD family member 2/3) in the SMAD protein complex
   [130][75] and SMARCB1(SWI/SNF related, matrix associated, actin
   dependent regulator of chromatin, subfamily b, member 1) in the Swi/Snf
   complex [131][76].

Figure 3. Frequency and ratio of GO terms of cellular component in OS.

   [132]Figure 3
   [133]Open in a new tab

   (A) Frequency of GO terms of cellular component in OS. (B) Ratio of GO
   terms of cellular component in OS.

   Molecular function the GO terms. It can be observed from [134]Figure
   4(A) that the five highest frequency of MF GO terms are GO: 0005488:
   binding (45), GO: 0003824: catalytic activity (19), GO: 0030234: enzyme
   regulator activity (19), GO:0004872: receptor activity (4), and
   GO:0060089: molecular transducer activity (3). On one hand, these high
   frequency MF GO terms consist of a huge number of proteins that perform
   basic biological functions; on the other hand, the catalytic activity
   and enzyme regulator are involved in most vital biological processes,
   including cell proliferation, DNA damage repair and apoptosis. The cell
   junction requires protein binding and enzymes catalyze, which can
   involve biological processes such as phosphorylation, acetylation, the
   cell-extracellular matrix link and cell cycle control. The
   transcription factor Dp (DPDP-polypeptide) forms a complex with E2F1 to
   regulate its binding to DNA and the expression of certain genes (such
   as myc) catalyzed by enzymes [135][77]. Genomic instability is
   essential in almost all tumor factors, and mutations in ATM (ataxia
   telangiectasia mutated) which belongs to the PI3/PI4-kinase family,
   leave DSBs (DNA double-strand breaks) unrepaired [136][78]. The
   receptor proteins transduce extracellular or intracellular messenger to
   the biological effectors, triggering a serial biochemical reaction. The
   typical receptor protein and tumor suppressors in the TGF-β signaling
   pathway are TGF-βR II and BMPR2 (bone morphogenetic protein receptor,
   type II (serine/threonine kinase)) [137][79]. The five most common MF
   GO terms (cf. [138]Figure 4(B) ) are GO: 0030234: enzyme regulator
   (12.33%, 19/154), GO: 0000988: protein binding transcription factor
   activity (3.28%, 2/61), GO: 0005488: binding (2.64%, 45/1703),
   GO:0004872: receptor activity (1.02%, 4/391), and GO:0060089: molecular
   transducer activity (0.74%, 3/405). The corresponding percentages of
   the top five MF terms are similar to the top MF frequency, which are
   associated with the BP percentage and CC percentage and participate in
   tumorigenesis at different level.

   Directed acyclic graph (DAG) analysis of the GO children terms. To
   further understand the function of the selected GO terms, we analyzed
   the directed acyclic graph of the GO children terms. We found that the
   GO children terms clustered in several particular modules under the
   primary GO terms discussed above. In addition to cell adhesion, the
   cellular response to UV-induced DNA damage and subsequent activated
   apoptotic signaling pathway and cell cycle regulation, phosphate
   metabolism, signal transduction and some molecular complex were
   highlighted in the biological modules.

   The phosphorus utilization including phosphorylation and
   dephosphorylation catalyzed by kinases and phosphatases, respectively,
   is a key mechanism in a number of vital cellular pathways such as the
   cell cycle, cell proliferation and apoptosis. Mutations or low
   expression in certain TSGs, such as PTP (protein tyrosine phosphatase),
   should bring the phosphorylation/dephosphorylation ratio out of balance
   [139][80], [140][81].

   Cancer is a disease of aberrant signal transduction. In the functioning
   biological system, tumor suppressors keep the signaling cascades in
   balance, such as for the TGF-βR II and Smad2/3/4 in TGFβ signaling
   pathways [141][59], [142][60] and ptch1 protein (patched 1) in hedgehog
   pathway [143][82].

   In addition, some molecular complex and enzyme activity should be
   noticed. The SWI/SNF complex, which contains a subunit from the BAF
   family, mediated chromatin remodeling in cell differentiation,
   proliferation and DNA repair. Several components of the SWI/SNF
   complex, such as BAF47, function as tumor suppressors, and BRM and BRG1
   act as putative tumor suppressors, which is evidenced by frequently
   loss of heterozygosity [144][83].

Analysis of the KEGG pathways in the total optimal feature set

   Nine KEGG pathway terms in the OS, were hsa04115 (p53 signaling
   pathway), hsa00100 (steroid biosynthesis), hsa05213 (endometrial
   cancer), hsa05216 (thyroid cancer), hsa05218 (melanoma), hsa05219
   (bladder cancer), hsa05220 (chronic myeloid leukemia), hsa05221 (acute
   myeloid leukemia) and hsa05223 (non-small cell lung cancer). As
   discussed above, p53 participates in cell death regulation and cell
   cycle control as a key central element. Aberrant genetic inactivation
   or diminished expression of p53 was found in the most of KEGG cancers
   terms. In addition to Rb in bladder cancer and chronic myeloid leukemia
   [145][7], , abnormal PTEN was also found in thyroid cancer and
   endometrial cancer [146][7], . In melanoma, chronic myeloid leukemia
   and non-small cell lung cancer patients, there is reported silence or
   suppression of ink4a/arf leading to cell cycle disorder and sustained
   cellular proliferation [147][7], . Steroids and steroid metabolism have
   markedly influenced in some cancer types, such as breast cancer and
   prostate cancer, which may mediate the apoptosis network [148][87],
   [149][88].

   Unlike oncogenes, TSGs act as guardians regulating the network of cell
   cycle and apoptosis factors involved in controlling cell fate.
   Furthermore, maintaining genomic stability and balanced cell adhesion
   demand that the TSGs perform normal physiological functions.

Analysis of candidate tumor suppressors predicted based on optimal features

   We try to predict the novel TSGs based on features in the total optimal
   feature set, i.e., the key functions that defines tumor suppressor. For
   each ‘negative gene’, we counted the number of key tumor suppressor
   functions that it was annotated onto. The genes with great number of
   key tumor suppressor functions were considered as candidate tumor
   suppressors, since they shared similar functions with the known tumor
   suppressors. Since oncogene and tumor suppressor are two sides of a
   coin, their functions are difficult to distinguish. To better
   prioritize candidate tumor suppressor, we removed the 330 oncogenes
   from oncogene family of GSEA MSigDB (Molecular Signatures DATAbase,
   [150]http://www.broadinstitute.org/gsea/msigdb/gene_families.jsp) and
   251 oncogenes from HGNC (HUGO Gene Nomenclature Committee,
   [151]http://www.genenames.org/) with the oncogene as the keyword.
   MSigDB is an online database, which collected annotated genes sets for
   GSEA analyze and categorize genes into gene family to provide a
   functional overview. HGNC is a collection of unique symbols and names
   for genes, ncRNA genes and pseudogenes. Subsequently, the overlap genes
   between these genes and the ‘negative genes’ were filtered out, 17,553
   ensembl protein IDs remain in the end, which are available in
   [152]Table S5.

   Our study performs the gene enrichment and pathway enrichment analysis,
   providing a support to identify novel tumor suppressor in these
   features and pathways. In [153]Table 3 , we revealed a list of novel
   tumor suppressor genes, which shared at least 293 key annotations with
   known tumor suppressors. It has been demonstrated part of them play
   suppressive roles in tumorigenesis and more genes need verification by
   functional evidence and a larger clinical pathological characteristics
   data set. There are many tumor suppress genes proved partly, such as
   EP300 [154][89]−[155][91], GATA4 [156][92], ESR1 [157][93] and NFKBIA
   [158][94], [159][95], which still need a large clinic data validation
   and functional research.

Table 3. Top forty putative tumor suppressors based on features in the total
optimal feature set.

   Ensembl ID      Number of key tumorsuppressor functions[160]^a Gene symbol
   ENSP00000297261 353                                            SHH
   ENSP00000324806 353                                            GSK3B
   ENSP00000389184 345                                            MARK2
   ENSP00000264657 338                                            STAT3
   ENSP00000355069 338                                            PAX2
   ENSP00000293549 337                                            WNT1
   ENSP00000353483 331                                            MAPK8
   ENSP00000263253 331                                            EP300
   ENSP00000218894 327                                            SUPT20H
   ENSP00000328181 327                                            NOG
   ENSP00000228872 327                                            CDKN1B
   ENSP00000338548 325                                            FGF1
   ENSP00000250003 322                                            MYOD1
   ENSP00000206249 322                                            ESR1
   ENSP00000245451 321                                            BMP4
   ENSP00000352514 317                                            RUNX2
   ENSP00000348986 316                                            INS-IGF2
   ENSP00000263025 315                                            MAPK3
   ENSP00000354558 313                                            MTOR
   ENSP00000363822 311                                            AR
   ENSP00000361066 310                                            NCOA3
   ENSP00000339004 309                                            FOXG1
   ENSP00000320604 309                                            FAXDC2
   ENSP00000338018 308                                            HIF1A
   ENSP00000278385 308                                            CD44
   ENSP00000216797 306                                            NFKBIA
   ENSP00000222330 304                                            GSK3A
   ENSP00000255465 304                                            CCNA1
   ENSP00000222726 303                                            HOXA5
   ENSP00000334458 303                                            GATA4
   ENSP00000264498 303                                            FGF2
   ENSP00000323588 302                                            SOX2
   ENSP00000392858 299                                            TNF
   ENSP00000302665 299                                            IGF1
   ENSP00000338297 298                                            -
   ENSP00000362649 297                                            HDAC1
   ENSP00000318977 297                                            GEN1
   ENSP00000343745 296                                            DICER1
   ENSP00000265165 294                                            LEF1
   ENSP00000415481 293                                            PROM1
   [161]Open in a new tab
   ^a

   The value in this column is the number of features in the total optimal
   feature set whose values are greater than –log[10](0.05).

   Glycogen synthase kinase 3 beta (GSK3β) belongs to the glycogen
   synthase kinase subfamily. GSK2β regulated Wnt signaling and PI3K/Akt
   pathway negatively, which play key roles in cell cycle, anti-apoptosis
   and invasion [162][96], [163][97]. It has been identified suppression
   of GSK3β in many tumor types including, oral squamous cell carcinoma
   (OSCC), lung cancer, cutaneous squamous cell carcinoma and esophageal
   carcinoma [164][98]−[165][101]. Inhibition of constitutively active
   GSK3β leads to epithelial-mesenchymal transition (EMT) transition
   during tumorigenesis [166][102]. In vitro, GSK3β play a negative
   regulator of myeloid cell leukemia-1(Mcl-1), which has anti-apoptotic
   function and is correlated to the poor prognosis of breast cancer
   patients [167][98], [168][103], [169][104]. Although there are some
   controversial reports, GSK3β is a putative tumor suppressor and need
   more studies [170][105], [171][106].

   Homeobox A5 (HOXA5) is belonging to a DNA-binding transcription factor
   family, homeobox genes cluster A, and regulates organism gene
   expression, adult differentiation and embryonic development in
   organism. It has been observed a frequently increased methylation of
   the HOXA5 promoter region in various tumor tissues
   [172][107]−[173][109] and is related to decreased expression
   [174][107], [175][110]. In addition, HOXA5 up-regulates p53
   transcription through binding to a target element in its promoter
   [176][111]. These evidences document that HOXA5 is a putative tumor
   suppressor for tumorigenesis. But it still warrants further functional
   studies that how HOXA5 suppress tumorigenesis in animal model and in
   clinic.

   Holliday Junction 5′ Flap Endonuclease, previous named gen endonuclease
   homolog 1 (GEN1) is an enzyme, evolved in Holliday junctions (HJs)
   formation during homologous recombination and DNA repair. The activity
   of Yen1, the ortholog of GEN1, is inhibited by phosphorylation events
   in the G1/S transition, keep inactive through S-phase and G2, and
   activated by dephosphorylation at the later stages of mitosis
   [177][112], [178][113]. Similarly, in the early stages of the cell
   cycle, GEN1 is excluded from the nucleus, and access chromatin and HJs
   [179][113]. GEN1 participates in some specific features: cell cycle,
   DNA repair and phosphorylation/dephosphorylation, which involved in
   many tumor suppressors. In Bloom’s syndrome cells, depletion of GEN1
   results in severe chromosome abnormalities [180][114]. It has been
   identified rare recessive at-risk alleles of GEN1 in breast cancer by
   Ekatherina Sh [181][115]−[182][117], and two somatic frameshift
   mutations in breast cancer cell lines and primary tumors through exome
   sequencing [183][114]. Above all, GEN1 is a novel tumor suppressor akin
   to some other DNA repair genes, BRCA1 and BRCA2 in breast cancer,
   although there is rare study prove GEN1 make a high-appreciable
   contribution to breast cancer. In future study, it would be focus on
   the methylation or LOH level and anti-tumorigenesis mechanism to
   explore function of GEN1.

   Besides these genes discussed above, our study reveals more novel
   candidate tumor suppressors including SHH, STAT3, SUPT20H and GSK3A,
   which are highlighted and need more focus and research in future cancer
   research.

Conclusions

   This study summarizes the enrichment analysis of TSGs. The features of
   the GO and KEGG pathway enrichment scores were used to encode the
   investigated genes and some feature selection methods were employed to
   analyze these features. The analysis of the 708 GO terms and 9 KEGG
   pathways implies that they are strongly related to the determination of
   TSGs. We hope that effective methods based on these GO terms and KEGG
   pathways can be built to identify TSGs.

Supporting Information

   Table S1

   List of 615 tumor suppressor genes.

   (PDF)
   [184]Click here for additional data file.^ (9.8KB, pdf)
   Table S2

   List of the MaxRel features lists and mRMR features lists obtained by
   mRMR method for each dataset.

   (PDF)
   [185]Click here for additional data file.^ (270.2KB, pdf)
   Table S3

   List of the SNs, SPs, ACCs and MCCs obtained by IFS and Dagging for
   each dataset S[i].

   (PDF)
   [186]Click here for additional data file.^ (201KB, pdf)
   Table S4

   List of 717 Features in the final optimal feature set.

   (PDF)
   [187]Click here for additional data file.^ (10.7KB, pdf)
   Table S5

   List of the novel tumor suppressors predicted based on features in the
   total optimal feature set.

   (PDF)
   [188]Click here for additional data file.^ (495.1KB, pdf)

Data Availability

   The authors confirm that all data underlying the findings are fully
   available without restriction. All relevant data are within the paper
   and its Supporting Information files.

Funding Statement

   This work was supported by the National Basic Research Program of China
   (2011CB510102, 2011CB510101), the National Natural Science Foundation
   of China (31371335, 61202021, 61373028, 11371008, 81030015), Innovation
   Program of Shanghai Municipal Education Commission (12ZZ087, 12YZ120),
   the Shanghai Educational Development Foundation (12CG55) and the grant
   from “The First-class Discipline of Universities in Shanghai.” The
   funders had no role in study design, data collection and analysis,
   decision to publish, or preparation of the manuscript.

References