Abstract

   In the recent years, bioinformatics methods have been reported with a
   high degree of success for candidate gene identification. In this
   milieu, we have used an integrated bioinformatics approach assimilating
   information from gene ontologies (GO), protein–protein interaction
   (PPI) and network analysis to predict candidate genes related to oral
   squamous cell carcinoma (OSCC). A total of 40973 PPIs were considered
   for 4704 cancer-related genes to construct human cancer gene network
   (HCGN). The importance of each node was measured in HCGN by ten
   different centrality measures. We have shown that the top ranking genes
   are related to a significantly higher number of diseases as compared to
   other genes in HCGN. A total of 39 candidate oral cancer target genes
   were predicted by combining top ranked genes and the genes
   corresponding to significantly enriched oral cancer related GO terms.
   Initial verification using literature and available experimental data
   indicated that 29 genes were related with OSCC. A detailed pathway
   analysis led us to propose a role for the selected candidate genes in
   the invasion and metastasis in OSCC. We further validated our
   predictions using immunohistochemistry (IHC) and found that the gene
   FLNA was upregulated while the genes ARRB1 and HTT were downregulated
   in the OSCC tissue samples.

   Subject terms: Gene ontology, Network topology

Introduction

   Head and neck cancers are among the 10 most common cancers
   globally^[34]1. Oral squamous cell carcinoma (OSCC) in males is the
   most common cancer amongst all head and neck squamous cell carcinoma
   (HNSCC). It includes approximately 90% of oral malignancies and
   accounts for more than 300,000 of newly diagnosed cases every
   year^[35]2. It is estimated that the number of HNSCC cases will
   increase to 1 million by the year 2030
   ([36]http://www.who.int/entity/healthinfo/statistics/bod_dalybywhoregio
   n.xls?ua=1). Despite significant progress in cancer treatment and
   management, the mortality rate associated with OSCC remains unchanged.
   The molecular basis of aggressive OSCC growth and metastasis is still
   unclear. The OSCC often goes undetected until advanced stages which are
   related to high mortality rate, and as such it carries heavy personal
   and societal costs. The overall 5-year survival rate is estimated about
   50%^[37]3. Therefore, new targets for early diagnosis of OSCC and its
   treatment are required to reduce the mortality due to oral cancer.

   The identification of the genes associated with complex diseases has
   become an important task in recent years. Experimental approaches, e.g.
   genetic linkage association studies^[38]4, expression profiling^[39]5,
   and genome-wide association studies^[40]6 have been found to be
   successful in identifying high relative risk genes for diseases like
   cancer^[41]7, asthma^[42]8, diabetes^[43]9 and hypertension^[44]10.
   However, the disease heterogeneity, complexity in finding the gene in a
   given locus and the cost involved led to the development of various
   in-silico approaches for disease gene identification. These approaches
   usually involve data mining^[45]11, analysis of biological sequences,
   phenotype^[46]12 and expression data^[47]5, protein-protein
   interactions (PPI)^[48]11,[49]13–[50]15, gene ontology
   (GO)^[51]13,[52]15, gene regulatory networks and pathways for
   identification of candidate gene(s)^[53]5,[54]11,[55]13. Many of the
   recent publications have reported the successful use of such techniques
   for prioritizing disease-related genes and their drug
   targets^[56]5,[57]13.

   Direct PPIs are one of the strongest indications of a functional
   relation between genes and this information can be utilized for the
   identification of candidate genes^[58]12. The PPI networks can be
   constructed using physical interactions, functional and expression
   data, etc. Physical interactions provide the information about how two
   proteins are connected to each other. These interactions regulate
   various biological activities like transcription, translation, and
   degradation of RNAs and proteins. These interactions can be represented
   as a network graph where a product like protein is a node or vertex,
   and their relation e.g. physical interaction is a link or edge. Such
   networks can be of two types on the basis of edges (i) directed: where
   the relation is directional and (ii) undirected: where the direction of
   the relation is not considered. The PPI networks have been utilized for
   candidate gene identification in many studies^[59]12,[60]14,[61]15. The
   networks can also be classified by their topological properties as
   scale-free and small world networks, where the node degree distribution
   follows the Power law or Poisson distribution respectively. The
   biological networks are generally scale-free networks i.e. they are
   characterized by the presence of many hubs that have hundreds of
   interactions. The PPI data can be used in combination with GO for
   identification of disease-associated genes^[62]13,[63]15,[64]16. The GO
   term is summarized as a relationship between a gene and biological
   function. It provides a common vocabulary to describe aspects of a gene
   product’s biology, allowing genes from different species to be compared
   based on their GO annotations. There are three types of GO terms viz.
   biological process (BP), molecular function (MF) and cellular component
   (CC). The BP is defined as a complete process that contains a set of
   molecular functions like replication or cell cycle. The MF is a
   specific activity at molecular levels such as binding or catalysis,
   whereas CC represents the part of a cell or its extracellular
   environment.

   In the last few years, it has been realized that networks in natural,
   technological and social systems follow a series of basic organizing
   principles in their structure and evolution that distinguish them from
   randomly linked networks^[65]17. The biological network based
   approaches have been used to identify pathways related to diseases and
   the candidate gene(s) related to particular pathways. Gilman et al.
   reported the identification of autism affected genes network using
   network-based analysis of genetic association^[66]18. PPI network in
   combination with GO has been used to predict the genes associated with
   immunodeficiency^[67]14, systemic lupus erythematosus^[68]13, and
   cervical cancer^[69]16. Recently, Jin et al. have reported the
   identification of a twelve gene network module based on differential
   co-expression PPI network in ovarian serous cystadenocarcinoma^[70]19.

   In the current work, we have used an integrated approach that
   integrates information from various sources such as experimental data,
   web resources, and PPIs to identify candidate genes involved in OSCC.
   The identified genes were then validated using literature survey,
   online experimental data and immunohistochemistry (IHC) from patient
   samples (Fig. [71]1).

Figure 1.

   [72]Figure 1
   [73]Open in a new tab

   The flowchart of the methodology. The data sources (HIPPIE, Cancer
   candidate genes and Oral cancer genes) are indicated on the top. The
   HCGN was created from HIPPIE interactions in cancer candidate genes.
   The top ranking genes in HCGN (Set A, 227 genes) were identified using
   a combination of centralities. The genes common between oral cancer
   genes (465) and genes in HCGN (4704) were defined as oral cancer genes
   in HCGN (Set B, 297 genes). The GO enrichment was done using oral
   cancer genes and cancer candidate genes. The genes corresponding to
   significantly enriched GO terms were identified (Set C, 530 genes). The
   candidate oral cancer genes (Set D, 39 genes) were predicted using Set
   A, Set B and Set C. The Set E and F contains oral cancer genes in HCGN
   that are either top ranked or corresponding to enriched GO terms
   respectively. Set G contains the oral cancer genes in HCGN that are
   both top ranked and corresponding to enriched GO terms. A pathway and
   cluster analysis was done on predicted genes (Set D) and selected
   candidates were validated.

Results

   The top ranked 5% genes from consensus ranking list (Set A), oral
   cancer genes in HCGN (Set B) and genes corresponding to disease
   associated enriched GO terms (Set C) were analyzed to predict novel
   oral cancer related genes.

The human cancer gene network

   The cancer candidate genes were obtained from The Cancer Gene Atlas
   (TCGA)-generanker (total 5873 candidate genes)
   ([74]http://cbio.mskcc.org/tcga-generanker/sources.jsp). These were
   mapped to HIPPIE (Human Integrated Protein-Protein Interaction
   rEference) database v1.7^[75]20. This database contains 193486
   interactions among 15519 proteins (v1.7, 2014). There are 15383
   interactions in this database that are coming from literature only (no
   experimental evidence). In the current study, we have used HIPPIE score
   cutoff of 0.63 to select more reliable interactions that resulted in
   the exclusion of almost 1/4th of (44779) the PPI interactions. Among
   the remaining 148707 interactions, there were only four interactions
   that are inferred from literature only.

   The network of cancer genes was created using the remaining PPIs
   (148707). Out of 5873 cancer candidate genes, a total of 4759 genes
   were mapped to 14391 proteins containing 40977 interactions. The
   largest interconnected component of the network was then selected by
   removing parallel interactions, loop and the orphan nodes. This
   resulted in a network containing 4704 nodes and 40973 interactions that
   was defined as human cancer gene network (HCGN) (Fig. [76]1). It is
   also important to note that there is no edge in the HCGN that is coming
   solely from the literature. The topological parameters like average
   degree, graph diameter, average distance, graph density, modularity,
   global efficiency and average clustering coefficient were calculated to
   characterize the HCGN and found to be 17.420, 8.000, 3.174, 0.004,
   0.341, 0.322 and 0.180 respectively. The degree distribution of HCGN
   follows the Power law distribution with exponent value of 1.416
   indicating the network is a scale-free network (Fig. [77]2). The
   scale-free networks are usually natural networks that are dominated by
   hubs. These networks are also vulnerable to specific alterations. The
   HCGN has more than three hundred highly connected nodes with a degree
   of 50 or higher. The highly connected proteins with a central role in
   the network’s architecture are more likely to be essential than other
   proteins^[78]21,[79]22. The average degree of HCGN was found to be
   17.42, indicating that the majority of proteins are highly connected.
   Thus, there are greater chances of having conserved PPIs. It has to be
   noted that essential PPIs are more conserved than nonessential
   PPIs^[80]23. The highly connected proteins can be lethal, if by some
   alterations they become hyper/hypo functional^[81]21.

Figure 2.

   Figure 2
   [82]Open in a new tab

   The graph of degree distribution for HCGN, where K is the degree and P
   (K) is the fraction of nodes in the network with degree K. The exponent
   value of 1.4302 indicates that the network is a scale-free network.

HCGN graph centralities calculation and ranking

   A total of eleven different centralities were calculated for the nodes
   of HCGN. The genes were then ranked by individual centrality scores. A
   total number of unique rankings by each centrality are given in
   Supplementary Table [83]S1. The data indicates that some of the
   centralities e.g. centroid, degree and closeness may not be able to
   give good distinction amongst nodes as the number of unique scores
   given by them are quite less as compared to the number of nodes in the
   network (Supplementary Table [84]S1). The vulnerability gave the
   highest number of unique scores which may result in better distinction
   among nodes.

   The pair-wise rank correlation was also calculated for the ranks given
   by the centralities to the genes (Supplementary Table [85]S2). The rank
   correlation coefficient among the centralities shows that the
   centralities are highly inter-correlated. Some of the centralities have
   100% correlation (Supplementary Table [86]S2), thus only one out of
   them was taken for further study. Therefore, total ten centralities
   were considered for final analysis. It is interesting to note that
   there was a good agreement among centralities for rankings of the genes
   i.e. a gene ranked higher by one centrality was usually ranked similar
   by other centrality (Supplementary Table [87]S3).

   Oral cancer related genes were collected from two databases viz.
   Head-Neck and Oral Cancer Database (HNOCDB)
   ([88]http://gyanxet.com/hno.html) and Oral Cancer Gene Database
   (OrCGDB) v2 ([89]http://www.actrec.gov.in/OCDB/index.htm) containing
   465 oral cancer genes together. It was found that there are 297 genes
   common between 465 oral cancer genes and 4704 HCGN genes. These genes
   were defined as oral cancer genes in HCGN (Set B). The distribution of
   these genes (set B) was checked in top 2%, 5%, 7% and 10% ranked genes
   by each centrality to select a cut-off for prediction of the candidate
   genes (Table [90]1). The unique pooled genes contain the unique genes
   from all centralities combined together at a particular cutoff e.g. 2%.
   Additionally, a consensus ranking was calculated for each gene by
   summing up all ranks given by different centralities. The precision
   score for top 2%, 5%, 7%, and 10% of the consensus ranked genes was
   found to be 30.77, 24.67, 23.85 and 19.87 respectively (Table [91]1).
   This indicates that consensus ranking is clearly better than individual
   centralities or unique pooled genes (Table [92]1). The precision
   decreases as we move from top ranked 2% to top ranked 10% genes. Though
   the top ranked 2% genes comparatively showed greater fraction of oral
   cancer genes in HCGN, the total number of unique oral cancer genes was
   very less (28 out of total 91 genes, Table [93]1). Therefore, to make a
   balance between precision and number of known oral cancer genes we have
   decided for a cutoff of top 5% ranked genes (56 out of 227(Set A)
   genes, Table [94]1) for downstream analyses.

Table 1.

   The distribution of oral cancer genes in HCGN in top ranked lists.
   Centrality Degree Closeness Centroid SPB Eigen vector Page rank CFC CFB
   Stress Vulnerability Unique pooled genes Consensus Ranking
   Top 2% Oral cancer genes^# 2 22 18 24 26 13 27 21 23 21 37 28
   Total genes 4 67 46 76 90 47 90 71 74 84 152 91
   Precision* 50.00 32.84 39.13 31.58 28.89 27.66 30.00 29.58 31.08 25.00
   24.34 30.77
   Top 5% Oral cancer genes^# 5 41 31 43 46 32 51 43 44 45 71 56
   Total genes 9 170 109 193 221 121 221 181 180 211 359 227
   Precision* 55.56 24.12 28.44 22.28 20.81 26.45 23.08 23.76 24.44 21.33
   19.77 24.67
   Top 7% Oral cancer genes^# 8 55 37 59 68 42 65 56 56 58 94 57
   Total genes 13 237 157 269 309 168 308 252 251 295 517 239
   Precision* 61.54 23.21 23.57 21.93 22.01 25.00 21.10 22.22 22.31 19.66
   18.18 23.85
   Top 10% Oral cancer genes^# 8 75 52 77 85 53 83 71 73 70 118 89
   Total genes 18 354 230 383 442 239 440 359 357 423 762 448
   Precision* 44.44 21.19 22.61 20.10 19.23 22.18 18.86 19.78 20.45 16.55
   15.48 19.87
   [95]Open in a new tab

   *Precision is the percentage of oral cancer genes in total genes.
   ^#Common out of 297 oral cancer genes in HCGN.

GO enrichment analysis

   The GO enrichment analysis for oral cancer genes was performed using
   GORILLA (Gene Ontology enRIchment anaLysis and visuaLizAtion tool)
   ([96]http://cbl-gorilla.cs.technion.ac.il/) in two list mode where the
   target was oral cancer genes (465) and the background was cancer genes
   (6023). The GO enrichment analysis resulted in the identification of
   903 enriched GO terms (785 BP, 52 MF and 66 CC). However, this is a
   large number and may result in the prediction of a vast number of genes
   where many of them could be false positives. Therefore, a criterion is
   needed to reduce the total number of GO terms to increase the
   efficiency of prediction.

   In some of the reported studies, the GO terms which were associated
   with less than 3 and more than 50 genes (3 < = B < = 50, B = number of
   genes) were filtered out to remove non-specific terms^[97]13,[98]16. In
   our case, this method gave total 227 enriched GO terms including BP, MF
   and CC. We found 935 genes in HCGN corresponding to these GO terms. Out
   of these, 188 (~20%) were found to be common with oral cancer genes in
   HCGN. However, this criteria was arbitrary and may result in higher
   noise.

   In the current study, two additional criteria were tested (i) top 20%
   enriched GO terms (ii) top 20% enriched GO terms with the removal of
   redundant GO terms by REVIGO ([99]http://revigo.irb.hr/), to select
   significantly enriched GO terms (Table [100]2). The number of top
   ranked 20% enriched GO terms was found to be 172 that was reduced to
   103 GO terms (i.e. a reduction of ~40%) after removal of redundancy.
   The total number of significantly enriched GO terms (BP) were 85. The
   highly enriched terms includes GO:0031340 (positive regulation of
   vesicle fusion) and GO:0043353 (enucleate erythrocyte differentiation).
   There were nine significantly enriched GO terms (MF) which include
   GO:0008379 (thioredoxin peroxidase activity) and GO:0008191
   (metalloendopeptidase inhibitor activity). Similarly, in case of CC
   there were nine significantly enriched GO terms which includes
   GO:0001533 (cornified envelope), and GO:0045095 (keratin filament). The
   details are shown in Supplementary Table [101]S4. Corresponding to
   these 103 GO terms, there were 530 genes (including 152 oral cancer
   related genes) in HCGN. These 530 genes were used for the prediction of
   oral cancer related genes.

Table 2.

   Analysis of different criteria to select specific GO terms.
   S. No. Criteria Number of oral cancer genes in HCGN Total genes
   *Precision
   1. 3 < = B < = 50 188 935 20.10
   2. Top 20% enriched GO terms 170 684 24.85
   3. Top 20% enriched GO terms and redundancy removal 152 530 28.67
   [102]Open in a new tab

   *Percentage of oral cancer genes in total genes.

Prediction of oral cancer candidate genes

The three gene sub-sets were defined as follows

   Set A contains top 5% consensus ranking genes (227 genes), Set B
   contains oral cancer genes in HCGN (297 genes common between all oral
   cancer genes 465 and HCGN genes 4704) and Set C contains genes
   associated with significantly enriched GO terms (530 genes). These set
   of genes were compared to predict the potential candidate genes for
   oral cancer (Fig. [103]3. Briefly, the genes that were present in Set A
   and Set C but not in Set B were defined as candidate genes). Set B was
   not considered for target prediction as these genes were already given
   in known datasets for oral cancer. This method resulted in the
   prediction of 39 potential candidate genes (Table [104]3). These genes
   were taken for downstream analysis.

Figure 3.

   Figure 3
   [105]Open in a new tab

   Prediction of candidate genes using top ranked genes (Set A), oral
   cancer genes in HCGN (Set B) and genes corresponding to significantly
   enriched GO terms (Set C). The genes shared by Set A and Set C but not
   by Set B were predicted as candidate genes.

Table 3.

   The candidate oral cancer genes.
   S. No. Gene ID Gene Symbol HCGN Ranking Degree with known oral cancer
   genes Expression Atlas (mRNA) Oncomine mRNA expression*
   1 3320 HSP90AA1 1 55 Up
   2 8452 CUL3 2 59 NA
   3 2099 ESR1 3 41 Up Down
   4 3312 HSPA8 4 30 Up Up
   5 1499 CTNNB1 5 32 NA
   6 6714 SRC 6 0 NA
   7 4088 SMAD3 7 20 Up NA
   8 5071 PARK2 8 21 Down Down
   9 5970 RELA 9 23 NA
   10 207 AKT1 10 28 NA
   11 408 ARRB1 11 18 NA
   12 7186 TRAF2 12 21 NA
   13 1457 CSNK2A1 13 25 Up
   14 7013 TERF1 14 15 Up
   15 5591 PRKDC 15 25 Up
   16 3064 HTT 15 15 NA
   17 23411 SIRT1 17 20 NA
   18 5518 PPP2R1A 18 14 NA
   19 10014 HDAC5 19 21 NA
   20 4176 MCM7 20 21 Up
   21 3676 ITGA4 21 38 Up
   22 1432 MAPK14 22 15 NA
   23 808 CALM3 23 12 NA
   24 805 CALM2 23 12 NA
   25 801 CALM1 23 12 NA
   26 4851 NOTCH1 26 10 NA
   27 5515 PPP2CA 27 16 Up
   28 2316 FLNA 28 14 Up Up
   29 5894 RAF1 29 19 NA
   30 8841 HDAC3 30 11 NA
   31 8826 IQGAP1 31 12 Down NA
   32 5747 PTK2 32 23 Up
   33 6657 SOX2 33 7 Down NA
   34 6195 RPS6KA1 34 9 NA
   35 5580 PRKCD 35 15 NA
   36 3717 JAK2 36 15 NA
   37 998 CDC42 37 5 Up
   38 6709 SPTAN1 38 12 NA
   39 3688 ITGB1 36 14 Down NA
   [106]Open in a new tab

   *The mRNA expression status (oral cancer vs normal) of a gene as
   reported in the Oncomine database and expression ATLAS. NA = Not
   reported.

Evidence in literature/databases for predicted candidate genes

   It is always pertinent to verify the in-silico predictions. In this
   study, a validation can be done by looking at differential gene
   expression in normal vs. oral cancer samples.

   The predicted genes (total 39) were checked for their differential mRNA
   expression in oral cancer samples in the Oncomine database
   ([107]https://www.oncomine.org) and the Expression Atlas
   ([108]https://www.ebi.ac.uk/gxa/home). The carcinomas of tongue, floor
   of the mouth and oral cavity were considered for the current study. A
   gene was considered differentially expressed, if the fold change was
   greater than 2 and the p-value was less than 0.01. In the Oncomine
   database, a total of thirteen predicted genes were found to have
   differential expression in oral cancer tissues (Table [109]3 and
   Supplementary Table [110]S5). Among them, eleven genes were found to
   have higher expression in oral cancer samples. The genes PRKCD and MCM7
   were found to be highly upregulated as evidenced by fold change of 7.29
   and 5.94 respectively. On the other hand, two genes (PARK2 in tongue
   and ESR1 in oral cavity) were found as downregulated with the fold
   change of 2.98 and 3.59 respectively. The Expression Atlas showed 8 of
   the 39 genes to be differentially expressed in OSCC samples (4
   upregulated and 4 downregulated) (Table [111]3 and Supplementary
   Table [112]S5).

   Further, a detailed literature survey was carried out for the 39
   predicted genes and it was found that 24 genes have literature related
   to oral cancer. The genes CTNNB1 and MCM7 have been well studied for
   oral cancer but were not present in the databases which were taken
   initially for this study. It was interesting to see these genes in the
   predictions. Taken together Oncomine, Expression Atlas and literature,
   29 out of predicted 39 genes have shown association with oral cancer
   which clearly indicates the robustness of the present method. Ten genes
   (CUL3, HTT, ARRB1, TERF2, PPP2R1A, PRKCD, RPS6KA1, HDAC5, HDAC3 and
   SPTAN1) were not found to have any association with oral cancer in
   literature and oncomine data. The detailed literature information of
   all 29 genes is given in supplementary information (Supplementary
   Table [113]S6).

Pathway and functional analysis

   The pathway analysis was performed for the predicted 39 genes using
   PANTHER (Protein ANalysis THrough Evolutionary Relationships)^[114]24.
   A total of eighteen pathways were found to be significantly enriched
   (p-value < 0.01) (Table [115]4). The cholecystokinin receptor (CCKR)
   signaling and gonadotropin releasing hormone receptor pathway were
   found to have the majority of predicted genes (~36% and ~ 31%
   respectively). It is imperative to note that many of the genes are
   common in different pathways i.e. they overlap among pathways. However,
   there are no reports of involvement of these pathways in the pathology
   of OSCC. The B cell activation and heterotrimeric G-protein signaling
   pathway-rod outer segment emerged as two most highly enriched pathways
   (Fig. [116]4A). Both of these pathways have already been reported in
   the context of OSCC^[117]25,[118]26.

Table 4.

   The pathway enrichment analysis.
   Enriched Pathways Genes*
   B cell activation CALM1, CALM2, CALM3, MAPK14, PRKCD, raf1
   Heterotrimeric G-protein signaling pathway-rod outer segment
   phototransduction CALM1, CALM2, CALM3
   VEGF signaling pathway AKT1, MAPK14, PRKCD, PTK2, RAF1
   p53 pathway feedback loops 2 AKT1, CTNNB1, MAPK14, PPP2CA
   CCKR signaling map AKT1, CALM1, CALM2, CALM3, CDC42, CTNNB1, ITGB1,
   JAK2, MAPK14, PRKCD, PTK2, RAF2, RPS6KA1, SRC
   T cell activation AKT1, CALM1, CALM2, CALM3, CDC42, RAF1
   Ras Pathway AKT1, CDC42, MAPK14, RAF1, RPS6KA1
   Angiogenesis AKT1, CTNNB1, ITGA4, ITGB1, MAPK14, NOTCH1, PRKCD, PTK2,
   RAF1, SRC
   FGF signaling pathway AKT1, ITGA4, ITGB1, MAPK14, PPP2CA, PPP2R1A,
   PRKCD, RAF1
   Apoptosis signaling pathway AKT1, HSPA8, PRKCD, RELA, SP101, TRAF2
   p53 pathway AKT1, PP2CA, SIRT1, TRAF2
   Gonadotropin releasing hormone receptor pathway AKT1, CDC42, CTNNB1,
   HTT, ITGB1, MAPK14, PRKCD, PTK2, RAF1, RELA, SMAD3, SRC
   Parkinson disease CSNK2A1, HSPA8, MAPK14, PARK2, SRC
   Endothelin signaling pathway AKT1, ARRB1, PARKCD, RAF1
   EGF receptor signaling pathway AKT1, MAPK14, PPP2CA, PARKCD, RAF1
   Integrin signalling pathway CDC42, FLNA, ITGA4, ITGB1, PTK2, RAF1, SRC
   Inflammation mediated by chemokine and cytokine signaling pathway AKT1,
   ARRB1, ITGA4, ITGB1, JAK2, PARKCD, RAF1, RELA
   [119]Open in a new tab

   *The predicted genes related to the corresponding pathway.

Figure 4.

   [120]Figure 4
   [121]Open in a new tab

   The enriched pathways (A) and functions (B) corresponding to their
   number of genes and fold change.

   The predicted genes (39) were classified into eighteen different
   functional classes with the significant p-value < 0.01 using DAVID
   (Database for Annotation, Visualization and Integrated Discovery)
   (Table [122]5)^[123]27. The function class “chromatin structure and
   dynamics/secondary metabolites biosynthesis, transport and catabolism”,
   which includes genes HDAC5 and HDAC3, was found to be highly enriched.
   The majority of genes were classified in a nucleus (~51%), acetylation
   (~54%) and cytoplasm (49%) (Fig. [124]4B). Many of the genes were
   classified in cytoplasm and nucleus simultaneously such as MAPK14,
   ARRB1 and SMAD3.

Table 5.

   The functional enrichment analysis.
   Enriched Functions Genes
   Acetylation TRAF2, PPP2R1A, HSP90AA1, RELA, SMAD3, PRKDC, SIRT1, ITGB1,
   FLNA, IQGAP1, HDAC5, CDC42, PTK2, CSNK2A1, MCM7, PPP2CA, CALM3, CALM2,
   HSPA8, CALM1, SPTAN1, TERF1
   Active site:Proton acceptor AKT1, PTK2, CSNK2A1, RPS6KA1, MAPK14, RAF1,
   JAK2, SIRT1, PRKCD, SRC
   Apoptosis AKT1, TRAF2, HTT, SIRT1, CTNNB1
   ATP-binding AKT1, PTK2, CSNK2A1, HSP90AA1, MCM7, RPS6KA1, MAPK14,
   PRKDC, RAF1, JAK2, PRKCD, SRC, HSPA8
   Chromatin structure and dynamics/Secondary metabolites biosynthesis,
   transport, and catabolism HDAC5, HDAC3
   Cytoplasm TRAF2, HSP90AA1, HTT, RELA, SMAD3, PARK2, PRKCD, FLNA,
   CTNNB1, AKT1, HDAC5, ARRB1, MAPK14, PPP2CA, CALM3, CALM2, HSPA8, CALM1,
   TERF1, SPTAN1
   Cytoskeleton PPP2CA, CALM3, ITGA4, CALM2, FLNA, SPTAN1, TERF1, CALM1,
   CTNNB1
   Duplication CALM3, ITGA4, ITGB1, PRKCD, CALM2, IQGAP1, FLNA, CALM1
   Host-virus interaction RELA, SIRT1, ITGB1, SRC, HSPA8
   Kinase AKT1, PTK2, CSNK2A1, RPS6KA1, MAPK14, CALM3, PRKDC, RAF1, JAK2,
   PRKCD, CALM2, SRC, CALM1
   Nucleotide-binding AKT1, CDC42, PTK2, CSNK2A1, HSP90AA1, MCM7, RPS6KA1,
   MAPK14, PRKDC, RAF1, JAK2, PRKCD, SRC, HSPA8
   Nucleus HTT, RELA, SOX2, ESR1, SMAD3, PRKDC, PARK2, SIRT1, CTNNB1,
   CUL3, AKT1, HDAC5, HDAC3, NOTCH1, MCM7, ARRB1, MAPK14, PPP2CA, TERF1
   Phosphotransferase CSNK2A1, RPS6KA1, MAPK14, PRKDC, PRKCD, SRC
   Proto-oncogene AKT1, RAF1, JAK2, SRC
   Serine/threonine-protein kinase AKT1, CSNK2A1, RPS6KA1, MAPK14, PRKDC,
   RAF1, PRKCD
   Transcription regulation HDAC5, HDAC3, NOTCH1, MCM7, ARRB1, RELA, SOX2,
   ESR1, SMAD3, SIRT1, CTNNB1
   Transferase AKT1, PTK2, CSNK2A1, RPS6KA1, MAPK14, PRKDC, RAF1, JAK2,
   PRKCD, SRC
   Ubl conjugation TRAF2, HTT, RELA, SOX2, ESR1, PARK2, CTNNB1, CUL3,
   AKT1, HDAC5, HDAC3, ARRB1, CALM3, CALM2, CALM1, TERF1
   [125]Open in a new tab

   *The predicted genes related to the corresponding function.

Identification of clusters in the predicted genes

   The clustering is one of the first steps in computational genetics
   analysis. The cluster analysis was performed on the predicted 39 genes
   using the K-means^[126]28 clustering method in STRING (Search Tool for
   the Retrieval of Interacting Genes/Proteins)
   ([127]http://string-db.org/) to identify groups within them so as to
   prioritize genes for experimental validation. The analysis resulted in
   a network with high clustering coefficient of 0.58 and the average
   degree of ~13. The high clustering coefficient suggests that the
   network is also a community which may be involved in similar kind of
   function(s). The strategy resulted in four clusters (referred to as
   clusters A, B, C and D) with a maximum and minimum size of 16 and 3
   genes respectively (Fig. [128]5). The cluster A has 9 members and was
   found to be enriched in genes controlling the cell communication
   process through signal transduction. The cluster B has 11 genes that
   were found to be related with processes like component organization and
   blood coagulation. The cluster C was a very small with only 3 genes. We
   could not find enriched processes for this cluster however, the genes
   of this cluster involve mainly in cell communication and cellular
   component organization and nicotine pharmacodynamics. The cluster D
   with 16 members was found to be enriched in the biological processes
   like cell differentiation and stress induced cell death.

Figure 5.

   [129]Figure 5
   [130]Open in a new tab

   The cluster analysis of predicted 39 genes using STRING. The identified
   clusters are colored in red (A), green (B), yellow (C) and blue (D).
   The solid and the dotted lines indicate connection within the same and
   different cluster respectively. Different color indicates different
   type of interactions. (Cyan-from curated databases; Pink-experimentally
   determined; Blue-gene co-occurrence; Khaki-from text mining;
   Black-co-expression; Light blue-protein homology).

   Prior to validation through immunohistochemistry, the results of
   pathway analysis and the literature mining were carefully scrutinized
   in addition to clustering for the selection of the cluster
   representative. One representative from each cluster was selected
   (CALM3, ARRB1, FLNA and HTT from cluster A, B, C and D respectively) to
   represent diverse biological processes.

   Thorough literature survey and pathway analysis led us to propose a
   mechanism (Fig. [131]6) to show the role of four genes ARRB1, CALM3,
   FLNA and HTT in OSCC. N-formyl-methionyl-leucyl-phenylalanine (fMLP) is
   a formylated peptide that disassociates ARRB1–Ral-GDS protein complexes
   and release Ral-GDS which results in activation of RALA and positively
   regulates actin cytoskeletal re-organization through FLNA^[132]29.
   Alternatively RALA is also involved in calcium/calmodulin-mediated
   intracellular signaling pathways where it is activated by Ca^2+ via
   binding with CALM3^[133]30. FLNA is an effector protein of RALA which
   recruits a guanine nucleotide exchange factor (GEF) TRIO and catalyze
   the transition of Ras-like protein Rac1 from the GDP to the GTP bound
   form. Rac1-GTP then activates p21-activated kinase 1
   (PAK1)^[134]31,[135]32. FLNA also directly binds to PAK1 and gets
   phosphorylated. PAK1 promotes activation of actin polymerization by
   phosphorylating of Arp2/3 complex which ultimately helps in actin
   cytoskeletal reorganization^[136]33.

Figure 6.

   Figure 6
   [137]Open in a new tab

   The proposed role of ARRB1, CALM3, FLNA and HTT in OSCC.

   Actin cytoskeleton reorganization helps in development of protrusive
   actin structures called lamellipodia and filopodia which help in cell
   migration and invasion^[138]34,[139]35. It has been reported that PAK1
   bind to HTT and enhance the HTT-HTT interaction which leads to
   aggregation of HTT in the cytoplasm^[140]36. The HTT aggregation
   produces toxicity and causes the cell death^[141]37.

Immunohistochemistry (IHC)

   The expression of ARRB1, FLNA, CALM3 and HTT was found to be localized
   predominantly in the cytoplasm of the tumor cells.

ARRB1

   In the current study, the mean expression index of ARRB1 was found to
   be 1.4, 0.95, 0.83 and 0.4 for TANT (Tumor Adjacent Normal Tissue),
   stage 1, stage 2 and stage 3 respectively. A statistical analysis by
   Mann-Whitney U test^[142]38 revealed the mean expression was
   significantly different between TANT and stage 2 (p = 0.043) and
   between TANT and stage 3 (p = 0.012) (Fig. [143]7a). The tissue
   expression was found to be decreased significantly between TANT and
   stage 2 (p = 0.039), between TANT and stage 3 (p = 0.007) by χ^2 test
   in stage 2 and stage 3 as compared to TANT (Figs [144]7a and [145]8a).
   The grade wise comparison of its expression shows significant
   (p = 0.033) difference from TANT to grade 1 sample but not with the
   other grades (Supplementary Fig. [146]S1a and Supplementary
   Table [147]S7).

Figure 7.

   [148]Figure 7
   [149]Open in a new tab

   Box plot showing the stage wise expression of the genes on
   immunohistochemically stained TMA sections of OSCC samples. The scoring
   was done on scale of 0 to 4 [where 0: no staining, 1: 25% (mild
   staining), 2: 25–50% (medium staining), 3: 50–75% (moderate staining)
   and 4: ≥75% (strong staining)]. (a) ARRB1 expression was found to be
   significantly (p < 0.05) decreased from TANT (tumor adjacent normal
   tissue) to cancer but within the stages (stage 1 to stage 3) the
   expression did not show significant change. (b) Significantly
   (p < 0.05) upregulated expression of FLNA was found in the oral cancer
   patients as compared to the TANT samples but within the stages (stage 1
   to stage 3) the expression goes down. (c) The expression of CALM3 did
   not show a significant change from TANT to cancerous samples or among
   the different stages of cancerous samples. (d) HTT showed a decreased
   expression as compared to the TANT samples. Especially, the expression
   significantly decreased from TANT to stage 2 and stage 1 to stage 2
   (p < 0.05). The number of samples used is indicated in brackets.

Figure 8.

   [150]Figure 8
   [151]Open in a new tab

   Representative figures of immunohistochemically stained TMA sections in
   TANT (tumor adjacent normal tissue), Grade 1(G1), Grade 2 (G2) and
   Grade 3 (G3) oral cancer patient samples. Upper panel: (a) Strong
   cytoplasmic staining of ARRB1 was observed in the TANT cells as
   compared to the cancerous cells. (b) FLNA showed a strong cytoplasmic
   staining in case of cancerous samples as compared to TANT samples (c)
   CALM3 showed cytoplasmic staining with no significant change between
   TANT to grade 1 but showed significant (p = 0.04) increases from grade1
   to grade 3 cancer samples, (d) HTT showed a cytoplasmic staining which
   decreases from TANT to grade 3. All the images were captured using
   LEICA DM500 HD compound microscope at 100X magnification. The insert
   (400X) in the IHC images are the zoom in section of the same tissue
   section which provides a better visualization for the intensity of
   staining. Lower panel: table representing the corresponding score of
   each image in upper panel, which was calculated by taking the
   percentage of positive cells and intensity of staining.

FLNA

   The mean expression of FLNA was found to be 1.50, 2.55, 2.04 and 1.80
   in TANT, stage 1, stage 2 and stage 3 respectively. The Mann-Whitney U
   test indicated that the difference in mean expression was statistically
   significant between TANT and stage 1 (p = 0.0002); TANT and stage 2
   (p = 0.039); stage 1 and stage 2 (p = 0.0142) and stage 1 and stage 3
   (p = 0.049). The χ^2 test also revealed that the difference in
   expression, as compared to TANT, was significantly higher in stage 1
   (p = 0.001) and stage 2 (p = 0.032) of OSCC (Fig. [152]7b and [153]8b).
   Among different stages, the FLNA expression was found to decrease
   progressively from stage 1 to stage 3 (p = 0.010 between stage 1 and
   stage 2; p = 0.018 between stage 1 and stage 3 by χ^2 test
   (Fig. [154]7b). The grade wise association too, follows the same
   pattern (p < 0.05) (Fig. [155]8b, Supplementary Fig. [156]S1b and
   Supplementary Table [157]S7). Chi et al. also reported FLNA
   overexpression in OSCC as compared to TANT^[158]39. However, in their
   study they used only two samples that were inadequate to assess
   statistical significance.

CALM3

   The mean expression index was found to be 2.30, 2.55, 2.29 and 2.20 for
   TANT, stage 1, stage 2 and stage 3 respectively. The stage-wise
   comparison did not show any significant (p-value > 0.05) difference in
   CALM3 expression between TANT and cancerous tissue (Fig. [159]7c).
   However, a grade-wise comparison showed a significant difference
   between grade 1 and grade 3 (p = 0.04) (Fig. [160]8c, Supplementary
   Fig. [161]S1c and Supplementary Table [162]S7).

HTT

   The mean expression index of HTT was found to be 2.20, 2.50, 1.65 and
   1.80 for TANT, stage 1, stage 2 and stage 3 respectively. The HTT was
   found to be significantly downregulated between TANT and stage 1
   (p = 0.016), and between stage 1 to stage 2 (p = 0.002) by Mann-Whitney
   U test. The χ^2 test also showed that the expression of HTT is
   significantly down-regulated in stage 2 as compared to TANT
   (p = 0.009), and in stage 2 as compared to stage 1 (p = 0.002)
   (Fig. [163]7d). In other stages, the difference in expression was not
   statistically significant (p > 0.05). A grade-wise comparison did not
   show any significant difference (Fig. [164]8d, Supplementary
   Fig. [165]S1d and Supplementary Table [166]S7).

Discussion

   Understanding the causative association between diseases and genes has
   been an important goal in computational biology. A highly successful
   strategy the so-called guilt-by-association (GBA) approach has been
   used widely to predict new candidate genes through their association
   with known genes. The following observations were made during the
   course of the study.

The centrality measures can rank the genes in a meaningful way

   Similar to social networks certain genes are more important in
   biological networks as they are involved in the control of essential
   processes. These genes are frequently found among disease genes.
   However, it is not easy to determine the relative importance of a
   protein in an exceedingly complex PPI network. The centralities rank
   network components according to their importance within the network
   structure. Thus, the network centralities help to determine
   significance of a protein in a PPI network.

   To measure the relative importance of a protein in the HCGN, we
   calculated various centralities (total 11) dealing with the
   connectivity, distance, betweenness of a node, flow of information,
   contact with other important proteins etc. The ranks showed strong
   correlations with each other indicating a general agreement between
   different centralities. Many genes were found to have a high value of
   degree, betweenness, vulnerability etc. A majority of them were also
   connected with other important proteins as indicated by their high page
   rank value. The centrality values for four selected genes (ARRB1,
   CALM3, FLNA, HTT) were checked vis-a-vis with the average value of each
   centrality for all genes in the network. It can be seen from
   Supplementary Table [167]S8 that the four selected genes have much
   better centrality scores as compared to the average.

   Importantly, we wanted to evaluate the genes that are ranked higher in
   the network and also have more relevance to the diseases. To achieve
   this, we obtained information on the gene-disease associations from
   DisGeNET database^[168]40. The data relevant to all the genes in HCGN
   was extracted from “all gene disease associations” subset of DisGeNET.
   The genes, according to their consensus rank in HCGN, were pooled in
   ten equally sized bins i.e. the top ranked 10% genes were kept in the
   first bin followed by next 10% (i.e. genes ranked between top 10–20%)
   in the second bin and so on. The average number of diseases per gene in
   each bin was then calculated. Finally, the average numbers of diseases
   per genes were plotted against bin numbers to check if top ranked genes
   (that were placed in first or second bin) are actually related to
   higher number of diseases.

   Interestingly, the results show that the genes that were ranked higher
   by the centralities are in fact related with more number of diseases as
   compared to the genes which were ranked lower as indicated by a high
   coefficient of determination (R^2 = 0.72) (Supplementary Fig. [169]S2).
   The results clearly indicate that the centrality measures can rank the
   genes in a meaningful way.

   The biological relevance of the genes can be assessed through GO terms
   to which they are associated. The GO provides information for protein
   functions, processes, and localization. To identify new genes which are
   associated with the oral cancer related GO terms, we further identified
   enriched GO terms. The pruning of the identified GO terms further
   resulted in a set of more enriched terms. A majority of these GO terms
   were related to transport, vascular development, signaling or
   regulation. The results also showed enriched GO terms are related to
   the proteins which have associations with important cancer process such
   as regulation of apoptosis, vascular generation etc.

   Finally, the enriched GO terms were combined with the top ranking genes
   in HCGN to predict new candidate genes. A list of total 39 candidate
   genes was obtained which were further analyzed.

The combination of GO and network centralities is effective

   It is always prudent to verify and validate the in-silico predictions.
   In the current case, verification was done by looking at differential
   gene expression in TANT vs. oral cancer samples. It is assumed that if
   a gene is differentially expressed in a disease then it may have a role
   in the disease process. However, it is not a necessary condition; a
   gene may also greatly influence the disease process even though its
   expression is not different significantly in TANT and disease states
   e.g. by means of epigenetic modifications or post-translational
   modifications etc. However being differentially expressed gives an
   important clue that the gene in question may be involved.

   The analysis has shown that ~44% (17 out of 39) predicted genes were
   found to have differential mRNA expression in oral cancer tissues.
   There were twelve genes which were found to be upregulated while four
   genes were found to be downregulated. The gene ESR1 was found to be
   upregulated in the Expression Atlas while it was shown downregulated in
   Oncomine. A detailed literature survey further resulted in the
   identification of twenty-four genes which may be related to oral cancer
   in some way. Taken together the mRNA expression data and literature,
   twenty-nine out of the predicted thirty-nine genes have shown
   association with oral cancer. The results also cement our belief that
   combination of GO and network centralities is quite robust for the
   prediction of candidate genes. This led us to validate the remaining
   genes through experimental means.

The cluster analysis identified four distinct clusters in the network

   The cluster analysis resulted in the identification of four distinct
   clusters. It was interesting for us to see that these groups of genes
   are actually enriched in distinct processes like cell differentiation
   and stress-induced cell death, component organization and blood
   coagulation, signal transduction and nicotine pharmacodynamics.

The immunohistochemistry corroborated the predictions

   The ARRB1, also known as beta-arrestin1 is a ubiquitously expressed
   adaptor protein with a wide range of cellular and molecular functions
   and have shown to contribute to a number of diseases, including
   cancer^[170]41–[171]43. The ARRB1 also induce hypoxia and may help in
   cancer cells growth and metastasis^[172]44,[173]45. The IHC results
   clearly indicate that the ARRB1 expression gets downregulated with the
   advancing cancer stage.

   The FilaminA (FLNA) is an actin-binding protein, known to be involved
   in cytoplasmic gelation, cell contraction, and spreading^[174]46.
   Cytoplasmic full-length FLNA promotes metastasis and migration through
   its actin-binding properties^[175]47. It acts as a scaffolding molecule
   and found to be overexpressed in multiple types of cancer, including
   prostate, breast, lung, hemangiomas, colon, melanoma, neuroblastoma,
   squamous cell carcinoma and hepatic cholangiocarcinoma^[176]48–[177]50.
   Yue, et al. opined that the high level of FLNA in initial stage helps
   cancer cell detachment and migration from primary site, invasion to ECM
   (extra cellular matrix)^[178]46. Further the decreased expression of
   FLNA helps in re-establishment of the cancer cells to ECM at the
   secondary sites, thus helping in cancer metastasis^[179]46. In our
   study, the expression of FLNA showed a marked increase in the stage 1
   patient samples as compared to the TANT. It then showed a downtrend as
   the cancer progresses to later stages i.e. stage 2 and 3. The same
   pattern was also seen in grade wise analysis for FLNA expression in the
   immunohistochemistry. These results indicate the proposed role of FLNA
   in cell invasion and metastasis.

   The Calmodulin 3 (CALM3) is considered the major regulator of Ca^2+
   dependent signaling in all eukaryotes which assists in tumor growth,
   tumor-associated angiogenesis, metastasis and involved in various
   pathways associated with survival of the cancer cells^[180]51. The
   grade wise comparison showed no significant change in the CALM3
   expression between TANT v/s Grades but from grade 1 to grade 3 the
   expression increases significantly. This indicates it may have a role
   in cancer progression.

   The Huntingtin (HTT) is the protein mutated in Huntington’s disease
   (HD). Recently, it was reported that the expression of HTT is low at
   both mRNA and protein level in case of metastatic breast carcinomas. In
   the same report, it has been shown that downregulation of HTT was also
   associated with poor survival of breast cancer patients^[181]52. In our
   study also the HTT was found to be downregulated.

   The detailed information about the genes used in the current study is
   given in Supplementary Table [182]S9. The columns indicate the status
   of a gene in different datasets (HIPPIE, TCGA, Oral cancer databases,
   HCGN, set A/B/C).

Conclusion

   Though our work has been inspired by many other in-silico studies,
   there is no universally accepted protocol for the prediction of
   potential candidate genes. In this approach we have used an integrated
   approach where each of the component was optimized separately e.g. the
   cut-off for top ranking genes was done using a consensus ranking,
   further the ranking was checked for robustness by validating through
   DisGeNET database. The selection of enriched GO terms was also
   optimized using a variety of combinations and finally a trimming was
   done using REVIGO. In this way we were able to predicted 39 potential
   candidate genes.

   Initially, an investigation of the predictions was done using
   literature, Oncomine (mRNA) and Expression ATLAS (mRNA and RNA-seq).
   The investigation showed that out of 39 genes, we have 29 positive
   predictions which indicate the robustness of the method. The
   experimental validation of predicted genes indicated the robustness of
   the current approach. The clustering, pathway analysis and literature
   survey led to the generation of a hypothesis for the role of four genes
   (ARRB1, FLNA, CALM3 and HTT) in OSCC. Further experimental validation
   by immunohistochemistry indicates that these genes are related to oral
   cancer. It is hoped that the identified genes will aid future drug
   discovery efforts.

Material and Methods

   In this study we have tried to identify the candidate genes involved in
   OSCC using the below integrated approach of PPI network analysis and GO
   analysis. The flow chart of the detailed methodology is shown in
   Fig. [183]1.

Data collection

   The cancer genes were obtained from TCGA-generanker
   ([184]http://cbio.mskcc.org/tcga-generanker/sources.jsp, accessed on
   21^st April 2014). The generanker has been utilized in many such
   studies^[185]53,[186]54. The TCGA contains a total of 7658 cancer
   candidate genes derived from 39 gene lists; each list has been given a
   score of 0.25 to 1; where 1 is high confidence value of the cancer
   genes. In the current study, we have selected 5873 candidate genes with
   a cutoff score ≥ 1.

   The PPIs can be detected through different experimental approaches and
   have been collected in several expert-curated databases. We have used
   HIPPIE version 1.7 (2014) database for the present work^[187]20. This
   database is generated by pooling information from many other PPI
   databases e.g. BioGrid, DIP, HPRD, MINT, BIND, and MIPS, etc.
   Collectively, it contained 15519 proteins and 193486 interactions at
   the time of the current work. HIPPIE’s scoring scheme has been
   optimized to reflect the amount and quality of evidence for a given
   PPI. This score has been calculated from three major criteria i.e.
   number of studies in which an interaction was detected, number and
   quality of experimental techniques used to measure an interaction, and
   number of non-human organisms in which an interaction was reproduced.
   This database contains only those interactions that have been
   experimentally identified in human proteins. Further, different scores
   (0–10) have been given to various experiment types e.g. methods that
   can ascertain interactions with the highest reliability (X-ray
   crystallography) are assigned the highest scores, complementation-based
   assays, and affinity-based technologies are scored with an average
   value of 5, and co-localization or co-sedimentation based methods are
   scored lowest. The overall score is calculated in such a way that it
   gives scores between 0 (no confidence) to 1 (high confidence) for a
   given PPI. In the current study, we have used HIPPIE score cutoff of
   0.63 (corresponding to the second quartile of the HIPPIE score
   distribution) to select more reliable interactions (148707) among 14391
   proteins in the current analysis.

   Homo sapiens related GO terms for biological process, molecular
   function, and cellular component were extracted from GO database
   ([188]www.geneontology.org, version July 2015).

   The genes related to oral cancer were acquired from OrCGDBv.2 (2011)
   and HNOCDB v1(2012). The OrCGDB contains total 374 genes, and the
   HNOCDB contains 133 oral cancer genes. The HNOCDB contains information
   about genetic mutations, methylations, and polymorphism, etc. and
   expression profiles. It also contains a chromosomal map of head-neck
   and oral cancer genes and experimental validation for their inclusion
   in HNOCDB. These two databases were combined, and 465 oral cancer genes
   were obtained. These genes were subsequently used in this study as
   known oral cancer genes. All the data reported above was collected
   during the course of this work.

Construction of human cancer gene network

   The interactions between cancer genes were extracted from HIPPIE
   database and a network was constructed. The orphan nodes self-loop, and
   parallel edges were removed. The fully connected remaining sub-network
   of PPI was defined as human cancer gene network (HCGN) for the current
   study. Genes, which were common among oral cancer (465) and HCGN (4704)
   genes, were defined as oral cancer genes in HCGN (Set B 297).

Topological analysis of HCGN

   The HCGN was analyzed for type of graph and the graph properties. The
   topological parameters like average degree, degree distribution, graph
   diameter, average distance, graph density, modularity, global
   efficiency and average clustering coefficient were calculated to
   characterize the interaction network. The degree distribution is the
   probability that a randomly chosen node have the K number of
   connection. The degree distribution graph is used to investigate the
   network structure type where the degree is represented by X-axis and
   degree distribution on the Y-axis respectively. The graph diameter
   tells the maximum numbers of nodes needed to be traveled to reach from
   one node to another node in the network. In other terms, it can
   indicate the ease with which the proteins can interact and affect their
   reciprocal functions. Average degree signifies the average number of
   connection per node. The graph density indicates the status of
   completion of the graph, as maximum density for the complete graph is
   1. Modularity is a measure of the strength of division of network in
   different modules. Average clustering coefficient tells the probability
   of each node to cluster with other nodes. The graph parameters which
   are called network centralities were also calculated. The centrality
   measure of each node indicates its importance in the network.
   Centralities, for a node, are usually based on the number of its
   connections (e.g. degree) and shortest path distance (e.g. betweenness)
   with other nodes. Each centrality gives the significance of a node in a
   different way for the network under consideration. In the current study
   eleven different centralities viz. degree, closeness, radiality,
   shortest-path betweenness, current-flow betweenness, current-flow
   closeness, centroid, page rank, vulnerability, stress and eigenvector
   were calculated using software Centibin, Cytoscape 3.2.1, and in-house
   code^[189]55,[190]56. Detailed information about these centralities is
   given in the supplementary data.

   The genes were ranked according to their importance in the network
   based on the calculated centrality parameters. Initially the ranking
   was done using individual centrality i.e. each gene was ranked by each
   centrality. These ranks were then used to create a consensus ranking
   (rank-by-rank consensus scoring) and the genes were ranked based on the
   consensus score. This kind of scoring makes it easy to combine the
   ranks by two or more centralities without the need of normalization.
   This also avoids bias because of a single centrality. Such rank-by-rank
   consensus scoring has been successfully used in other areas like
   screening of chemical databases to identify candidate molecules. In the
   current study, as shown later, this scheme was found to perform better
   than other ranking schemes.

   The rank correlation analysis was performed to check inter-correlation
   among centralities. This analysis is important to prevent specific bias
   that may get introduced in consensus score because of very high
   inter-correlation between two or more centralities.

   The rank correlation between two centralities was calculated by
   spearman’s rank correlation coefficient ρ
   [MATH:
   <mi>ρ</mi><mo>=</mo><mn>1</mn><mo>−</mo><mfrac><mrow><mn>6</mn><mo>∑</m
   o><mrow><msubsup><mrow><mi>d</mi></mrow><mrow><mi>i</mi></mrow><mrow><m
   n>2</mn></mrow></msubsup></mrow></mrow><mrow><mi>n</mi><mo
   stretchy="false">(</mo><msup><mrow><mi>n</mi></mrow><mn>2</mn></msup><m
   o>−</mo><mn>1</mn><mo stretchy="false">)</mo></mrow></mfrac> :MATH]

   where d is the difference of the rank given by two different
   centralities for one gene and n is the total number of nodes or genes.
   The coefficient ρ can range from −1 to 1 where 1 stands for perfectly
   positive correlation zero (0) for no correlation and −1 perfectly
   negative correlation.

   The idea to rank the genes by different centralities was to identify
   important nodes in the network that may have control over the other
   nodes in the network. These hubs or information critical nodes are
   known to be associated with many other cancers. Therefore, the presence
   of oral cancer genes in HCGN (Set B) in top 2%, 5%, 7% and 10% of
   ranked gene list by different centralities was checked to assess the
   association of these hubs with oral cancer. Additionally, consensus
   rankings by taking different centralities were also assessed to
   identify the best combination of centralities for prediction of oral
   cancer genes.

   It was found that a combination of top 5% genes from 10 different
   centralities was having a significant number of genes known to be
   associated with oral cancer (details shown in results). This
   observation supported our assumption that the nodes with high
   importance may be associated with oral cancer by virtue of their
   control over flow of information in the network.

GO enrichment analysis

   GORILLA was used for the identification of enriched GO terms. It is a
   web-based application that identifies enriched GO terms in a target
   list of genes compared to a background list of genes. It can also
   identify enriched GO terms, in lists with ranked genes, without the
   need of explicit target and background sets. GORILLA employs a flexible
   threshold statistical approaches to discover GO terms that are
   significantly enriched as compared to background list. GORILLA’s unique
   features and advantages over other threshold free enrichment tools
   include rigorous statistics, fast running time and intuitive graphical
   representation.

   The GO enrichment analysis for oral cancer genes was performed using
   GORILLA online tool in two list mode with p-value < 0.001. The target
   was oral cancer genes (total 465) and the background was cancer genes
   (total 6023). The total of 6023 genes contain 5872 cancer candidate
   genes (5873 from TCGA list-one gene could not be mapped to gene id) and
   151 (out of total 465) oral cancer genes which is not listed in TCGA.
   The target and background were used to extract all the enriched GO
   terms specific to oral cancer.

   GO term enrichment analysis by the majority of the tools generally
   gives p-value and enrichment score. The p-value gives the significant
   level while the enrichment score gives the over/under representation of
   a particular term in a given gene list. However, it has been reported
   that highly explored GO terms are more significantly enriched as
   compared to less explored GO terms. An inspection of the enriched GO
   terms revealed that there are many terms having significant overlap
   with other enriched GO terms; additionally, there are many terms that
   were too general e.g. metabolic process (GO:0008152). Therefore, it was
   necessary to prune the identified terms to reduce redundancy and to
   make the information obtained from GO more interpretable.

   There are many tools like AmiGO, Ontolizer and RedundancyMiner that
   usually cluster the GO terms to the higher level or to more general
   term and help reduce redundancy. In the current study, we have used
   REVIGO which implements a clustering procedure conceptually similar to
   the hierarchical clustering methods such as the neighbor-joining
   approach. It selects a representative GO term out of many similar or
   overlapping terms which reduce functional redundancies using semantic
   similarity measures. The semantic similarity measures between GO terms
   rely on pre-computed information content (IC) for the GO terms. The IC
   is calculated as a negative logarithm of the GO term’s relative
   frequency in a reference database. The REVIGO online server with
   default settings was used to remove the redundant GO terms from the
   enriched GO terms. The remaining GO terms were then used to select
   associated genes in HCGN.

Prediction of candidate genes for oral cancer

   Having identified the enriched GO terms and important hubs in the
   network, the next task was to utilize the obtained information to
   predict the candidate genes for oral cancer. To this end, the top
   ranking 5% genes from consensus ranking list (defined as Set A), oral
   cancer genes in HCGN (defined as Set B) and genes related to disease
   associated enriched GO terms (defined as Set C) were analyzed. A gene
   was predicted to be a candidate oral cancer associated gene if it meets
   all of the following three criteria:
    1. It should be highly ranked by centralities (member of Set A).
    2. It should be associated with enriched GO terms (member of Set C).
    3. It should not be hitherto known for association with oral cancer
       (not a member of Set B).

   Thus a gene common between Set A and Set C but not present in Set B was
   defined as candidate oral cancer gene (Fig. [191]1).

Pathway and Functional analysis

   The pathway analysis was performed on the predicted potential candidate
   genes. A pathway is considered to be enriched in a gene list based on
   an ‘expected’ value. It is calculated by the fraction of the genes in a
   list related to the pathway (observed value) compared with fraction of
   the genes in a pathway to that of the whole gene database (expected
   value). If the observed value is greater than expected value, the
   pathway is considered enriched^[192]24.

   The DAVID version 6.7 was used for the functional enrichment analysis
   in the current studies^[193]27. The DAVID has various tools for gene
   annotation, GO enrichment analysis, PPI analysis, pathway analysis and
   visualization etc. It was used to identify significantly enriched
   (p-value < 0.01) functions amongst a set of predicted candidate genes.

   Additionally, The PANTHER classification system designed for the
   classification of proteins/genes to facilitate high-throughput analysis
   was used for the pathway enrichment analysis for predicted candidate
   genes using a p-value cutoff of 0.01^[194]24.

Evidence in literature/databases for predicted candidate genes

Oncomine database

   The Oncomine is a cancer microarray database and web-based data-mining
   platform aimed at facilitating discovery from genome-wide expression
   analyses. It can give information on differential gene expression in
   various types of cancer. The pathological and clinical data for genes
   is also available for better understanding of different stages of
   cancer.

Expression Atlas

   It provides information on gene expression patterns under different
   biological conditions. It contains both the mRNA and RNA-seq data
   related with gene expression.

   In the current study we have used these databases to identify
   differentially expressed genes in oral cancer with the assumption that
   if a gene is differentially expressed in oral cancer, it may have an
   association with oral cancer.

STRING database

   The STRING v10 is a database which provides the interactions of
   proteins by integrating information from high-throughput experimental
   data, from the mining of databases and literature, and from predictions
   based on genomic context analysis. It also contains information about
   protein domains and structures. It gives an association score for two
   interacting proteins, calculated using various parameters like
   neighborhood score, fusion, co-occurrence, homology, co-expression,
   experimental, database and text mining score etc. A high score
   indicates greater association confidence. String also provide tools to
   analyze the network in many ways e.g. enrichment of GO, pathways, and
   clustering etc. In the current study, the k-means clustering as implied
   in STRING was used to cluster the genes^[195]28. The clustering was
   done on the basis of evidence score and connection cutoff was kept at
   0.40.

Immunohistochemistry (IHC)

   Four human Tissue microarray slides (TMA) were procured from US Biomax
   ([196]http://www.biomax.us/tissue-arrays/Oral_Cavity/). One slide/gene
   was used for the immunohistochemical analysis. Each slide includes 50
   samples of OSCC and 10 TANT. The stage wise information for 50 OSCC
   samples were distributed as: stage 1 = 20, stage 2 = 24 and stage 3 = 5
   tissues samples. We did not considered 1 tissue spot (A2) which was of
   stage 4a. The grade wise information for 50 OSCC samples were
   distributed as: grade 1 = 38, grade 2 = 6 and grade 3 = 5 tissue
   samples. The E5 spot was considered in both grade 1 as well as in grade
   2 as per the data sheet provided, where the grade information was not
   provided for 2 samples. Standard immunohistochemical staining procedure
   was performed to check the expression of the selected genes with slight
   modifications as described earlier^[197]57. The slides were
   deparaffinized and processed through down-gradation of ethanol (from
   100% to 70%). Then the slides were hydrated with deionized water, and
   antigen unmasking was done by boiling with antigen unmasking solution
   (Cat No. H-3301, Vector lab) in pressurized condition. Slides were
   blocked for 1hr at room temperature with blocking solution and antibody
   blocking was done overnight at 4 °C in a moist chamber. Then VECTASTAIN
   Elite ABC Kit (cat No. PK-7200, Vector lab) was used for processing as
   per the user’s manual provided and the slides were developed by DAB
   Peroxidase Substrate Kit (Cat No. SK-4100, Vector lab). The TMA slides
   were blocked with respective antibodies (ARRB1, FLNA, CALM3 and HTT)
   and IHC staining of TMA tissue spots was scored by inter-observers
   (histopathologist) manually to avoid bias. The score from 0 to 4 was
   given on the basis of percentage of positive cells and intensity of
   staining as follows: 0%, no staining = 0; 25%, mild staining = 1;
   25–50%, medium staining = 2; 50–75%, moderate staining = 3 and ≥ 75%,
   strong staining = 4) had been described in earlier studies^[198]58. All
   sections were scored in a semi-quantitative manner according to the
   method described previously, which reflects both the intensity and
   percentage of cells staining. The corresponding score for each sample
   in each TMA slide for the four genes was calculated separately.

Antibodies used

   Arrestin beta 1 Antibody (ARRB1), was obtained from Novus biochemical
   (Cat No. NB110-55485). FILAMIN A (FLNA) and Anti–Huntingtin Protein
   (HTT) a.a. 181–810, clone 1 HU–4C8 were obtained from Millipore (Cat
   No. MAB1680 and MAB2166 respectively). CaM (FL-149) was obtained from
   Santa Cruz biotechnology (Cat No. SC-5537).

Statistical Analysis

   Statistical analysis for the immunohistochemistry data was performed
   using GrapPad Prism 5 software. The categorical data was compared for
   difference using the χ^2 test. Fisher’s exact test was used to detect
   association of a gene with oral cancer. A nonparametric unpaired
   Mann-Whitney test^[199]38 was used to compare the mean of two
   independent groups e.g. normal and cancer’s stage and grade. The
   p-value < 0.05 was considered significant for each statistical
   test^[200]59,[201]60.

Electronic supplementary material

   [202]Supplementry data^ (907.5KB, doc)
   [203]Supplementary Table S9^ (2.3MB, xls)

Acknowledgements