Abstract

Background

   The identification of genes and uncovering the role they play in
   diseases is an important and complex challenge. Genome-wide linkage and
   association studies have made advancements in identifying genetic
   variants that underpin human disease. An important challenge now is to
   identify meaningful disease-associated genes from a long list of
   candidate genes implicated by these analyses. The application of gene
   prioritization can enhance our understanding of disease mechanisms and
   aid in the discovery of drug targets. The integration of
   protein-protein interaction networks along with disease datasets and
   contextual information is an important tool in unraveling the molecular
   basis of diseases.

Results

   In this paper we propose a computational pipeline for the
   prioritization of disease-gene candidates. Diverse heterogeneous data
   including: gene-expression, protein-protein interaction network,
   ontology-based similarity and topological measures and tissue-specific
   are integrated. The pipeline was applied to prioritize Alzheimer's
   Disease (AD) genes, whereby a list of 32 prioritized genes was
   generated. This approach correctly identified key AD susceptible genes:
   PSEN1 and TRAF1. Biological process enrichment analysis revealed the
   prioritized genes are modulated in AD pathogenesis including:
   regulation of neurogenesis and generation of neurons. Relatively high
   predictive performance (AUC: 0.70) was observed when classifying AD and
   normal gene expression profiles from individuals using leave-one-out
   cross validation.

Conclusions

   This work provides a foundation for future investigation of diverse
   heterogeneous data integration for disease-gene prioritization.

Background

   The rapid accumulation of high-throughput data along with advances in
   network biology have been fundamental in improving our knowledge of
   biological systems and complex disease. The emergence of network
   medicine [[29]1] has explored disease complexity through the systematic
   identification of disease pathways and modules. Via the analysis of
   network topology and dynamics, key discoveries have been made including
   identification of novel disease genes and pathways, biomarkers and drug
   targets for disease [[30]2]. Network theory is making important
   contributions to the topological study of biological networks, such as
   Protein-Protein Interaction Networks (PPIN) [[31]3]. The study by Xu et
   al. [[32]4] analyzed topological features of a PPIN and observed that
   hereditary disease-genes from the Online Mendelian Inheritance in Man
   (OMIM) database [[33]5] have a larger degree and tendency to interact
   with other disease-genes in literature curated networks. Both Chuang et
   al. [[34]6] and Taylor et al. [[35]7] have indicated that the
   alterations in the physical interaction network may be a indicator of
   breast cancer prognosis. Goh et al. [[36]8] demonstrated that the
   majority of disease genes are nonessential and located in the periphery
   of functional networks. Research by [[37]9] discovered that genes
   connected to diseases with similar phenotypes are more likely to
   interact directly with each other.

   Identification of candidate genes associated with physiological
   disorders are a fundamental task in the analysis of complex diseases
   [[38]10]. Genome-wide association studies and linkage analysis have
   been pivotal in the identification of candidate genes, however, the
   large list of resultant genes returned are time-consuming and expensive
   to analyze [[39]11]. The availability of high-throughput molecular
   interaction network provides and the application of network analysis
   tools such as clustering or graph partitioning have proved valuable in
   disease gene prioritization [[40]12]. For instance, PPIN data
   integrated with genome-wide expression profiles using DNA arrays and/or
   next generation sequencing enabling the modeling of networks have aided
   our understanding of how biological networks operate. A number of
   computational approaches to prioritize candidate genes have been
   proposed including: ToppGene [[41]13] and GeneWanderer [[42]14] which
   rank candidate genes based on known associations with disease genes
   using diverse data sources and methodology. The study by Vanunu et al.
   [[43]15] applied a diffusion- based method named PRINCE to prioritize
   genes in prostate cancer, AD and type 2 diabetes. Wu et al. proposed
   the resource AlignPI [[44]16] which applied a network alignment
   approach predict disease genes. The algorithm VAVIEN [[45]17] was also
   developed to prioritize disease genes based on topological features of
   PPINs.

   These diverse studies confirm the importance and need of improving
   methods to integrate diverse 'omic' sources to uncover candidate
   disease genes in biological systems. To address this need, we have
   developed a prioritization pipeline, which integrates diverse
   heterogeneous information. We illustrate the implementation of this
   framework using Alzheimer's Disease (AD) as a Case Study. AD is the
   most common neurodegenerative disease which is both genetically complex
   and heterogeneous. Pathological characteristics of AD include presence
   of amyloid peptide plaques, mature senile plaques and neurofibrillary
   tangles and loss of neurons in conjunction with the presence of
   oxidative stress [[46]18]. AD can be divided into two categories early
   onset AD (EOAD) (patients < 65) and late onset AD (LOAD) (patients >
   65). A set of gene mutations including APP, PSEN1 and PSEN2 involved in
   the amyloid beta and tau pathways have been associated with hereditary
   AD. Using genome-wide association studies, Lambert et al. [[47]19]
   identified the gene encoding APOE in LOAD as a risk factor along with
   11 new loci. Furthermore, studies have suggested that AD is a
   multifactorial disease in which many pathways are involved. This
   highlights the progress, which has been made in determining the genetic
   underpinnings of AD. However, there is further need for an
   understanding of AD mechanisms to develop more specific diagnostic
   tests and novel drug therapies to target this disease.

   Our proposed pipeline integrates AD gene-expression and network data
   along with ontology-based semantic similarity, topological information
   and tissue information. The integration of biological data such as
   semantic similarity which is independent of the gene expression
   profiles and PPIN used to obtain the significant hubs is advantageous
   in providing objective prioritization criteria. To evaluate the
   perfromance of our proposed method we (1) train an Random Forest
   classifier using features generated using the prioritized gene
   candidates to predict AD sample outcome using leave-one out
   cross-validation (LOOCV); (2) perform enrichment analysis and (3)
   compare the candidate gene list to a manually curated reference dataset
   of verified known and susceptible AD disease genes. Furthermore, we
   investigate the tissues in which AD candidate disease-genes are
   expressed through incorporation of tissue-specific expression data.

   The remainder of the paper is organized as follows, in Section 2 the
   integrative framework is described along with details on datasets and
   PPINs used in the analysis. Section 3 provides a summary of the results
   obtained and conclusions along with future work is presented in Section
   4.

Methods

Prioritization Workflow

   The schematic workflow of our computational integrative approach for
   disease gene prioritization is shown in Figure [48]1. Firstly, PPIN and
   gene-expression are integrated to provide an initial list of
   prioritized genes. Secondly, we incorporate additional data in the form
   of network topology and ontology-based semantic similarity to provide
   further prioritization. This list of genes is then evaluated using
   enrichment analysis and measurement of tissue specificity.

Figure 1.

   Figure 1
   [49]Open in a new tab

   Overview of Computational Framework to Prioritize Disease Gene
   Candidates.

AD gene expression data

   Human AD gene expression data was obtained from the Gene Expression
   Omnibus (GEO) database. ([50]http://www.ncbi.nlm.nih.gov/geo/). The
   selected profile [51]GSE4757 was generated using the platform
   [52]GPL570: Affymetrix Human Genome U133 Plus 2.0 Array. The study by
   Dunckley et al. [[53]20] examined the transcriptome of entorhinal
   neurons from six cortical areas with or without neurofibrillary tangles
   (a histopathology feature of AD) using Laser capture microdissection.
   The dataset consists of gene expression profiles of NFT-bearing
   entorhinal cortex neurons from 10 mid-stage AD patients (Disease)
   compared with 10 histopathologically normal neurons (Control) from the
   same patients and brain region. These represent the different stages of
   AD according to the pattern of disease spread. Using the MAS5.0
   function in R the CEL files were firstly normalized. Probes in
   expression profile were then mapped to corresponding NCBI Gene IDs. The
   average expression value was calculated in cases where the Gene ID
   related to more than one probe resulting in 20,539 unique Gene IDs.

   A total of 10,106 significant genes were obtained using the
   significance analysis of microarrays (SAM) [[54]21] technique, a
   regularized t-test approach, using the false discovery rate (FDR =
   0.98). Differentially expressed (DE) genes are genes whose expression
   levels are significantly different between two groups of experiments.

Human protein-protein interaction network

   The PPIN was constucted using data from literature curated sources
   along with the recent Y2H screening by Rolland et al. [[55]22]. The
   literature based dataset consists of 11,045 binary human protein pairs
   extracted from seven publically available databases including BioGRID
   [[56]23], DIP [[57]24], Biomolecular Interaction Network Database
   (BIND) [[58]25], HPRD [[59]26], InACT [[60]27], Protein Data Bank (PDB)
   [[61]28] and Molecular INTeraction database (MINT) [[62]29]. The
   protein pairs were filtered on evidence where pairs that are only
   supported by two or more pieces of evidence are included. Protein pairs
   were also obtained from Y2H experimentation whereby 15,517 opening read
   frames (ORFs) were systematically screened using the platform hORFeome
   v5.1 (Space II) resulting in 13,944 pairwise interactions [[63]22].
   Both these data are integrated to produce a PPIN consisting of 7,783
   nodes and 22,621 protein pairs.

Toplogical analysis

   AD gene expression data was mapped to the PPIN via NCBI geneIDs using
   Cytoscape version 3.2.1 [[64]30]. This resulted in an AD disease
   specific network consisting of 5,457 nodes and 10,852 protein pairs.
   Hub genes were defined based on the measurement of network topological
   features (1) degree and (2) betweenness centrality using Cytoscape
   version 3.2.1 [[65]30].

Degree connectivity

   Degree is a measure of the number of edges that connects a node. Genes
   with a high degree of connectivity within a network have large numbers
   of interacting partners. In PPINs it has been observed that genes with
   high degrees of connectivity are more likely to be essential as genes.
   Furthermore, many interacting partners in a network tend to be involved
   in important cellular processes [[66]1]. Using this assumption, the top
   genes with the highest degree distribution were selected as hub
   proteins. This approach has previously been applied by Taylor et al.
   [[67]7]. The degree cut-off threshold for selecting hubs is defined as:
   [MATH: <mi>A</mi><mi>V</mi><mi>G</mi><mo
   class="MathClass-bin">+</mo><mn>2</mn><mo
   class="MathClass-bin">*</mo><msub><mrow><mrow><mo
   class="MathClass-open">(</mo><mrow><mi>S</mi><mi>t</mi><mi>d</mi></mrow
   ><mo class="MathClass-close">)</mo></mrow></mrow><mrow></mrow></msub>
   :MATH]
   (1)

   where AVG is the average degree across all DE genes in the PPINs and
   Std, the standard deviation [[68]31].

Betweenness centrality

   Betweenness is a topological feature of a network measuring information
   flow through the network. In biological networks, betweenness measures
   the paths through which signals can pass through the interaction
   network. Yu et. al. [[69]32]. identified betweenness as an important
   topological property of a network where nodes with high betweenness
   control most of the information flow. The betweenness centrality
   C[b](n) of a node n is computed as follows:
   [MATH:
   <msub><mrow><mi>C</mi></mrow><mrow><mi>b</mi></mrow></msub><mfenced
   close=")" open="("><mrow><mi>n</mi></mrow></mfenced><mo
   class="MathClass-rel">=</mo><mstyle
   displaystyle="true"><munder><mrow><mo class="MathClass-op">
   ∑</mo></mrow><mrow><mi>s</mi><mo
   class="MathClass-rel">≠</mo><mi>n</mi><mo
   class="MathClass-rel">≠</mo><mi>t</mi></mrow></munder></mstyle><mrow><m
   o
   class="MathClass-open">(</mo><mrow><msub><mrow><mi>θ</mi></mrow><mrow><
   mi>s</mi><mi>t</mi></mrow></msub><mrow><mo
   class="MathClass-open">(</mo><mrow><mi>n</mi></mrow><mo
   class="MathClass-close">)</mo></mrow><mo
   class="MathClass-bin">/</mo><msub><mrow><mi>θ</mi></mrow><mrow><mi>s</m
   i><mi>t</mi></mrow></msub></mrow><mo
   class="MathClass-close">)</mo></mrow> :MATH]
   (2)

   where s and t are nodes in the network different from n, θ[st ]denoting
   the number of shortest paths from s to t, and θ[st](n) is the number of
   shortest paths from s to t that n lies on. Betweenness centraility is
   calculated in Cytoscape. Using the node betweenness distribution, genes
   located in the top 50% are selected as hub genes.

Network variation of hub genes

   For each hub protein in the PPIN the average of Pearson correlation
   coefficients (PCC) between the hub and each of its respective partners
   was calculated for both disease and control groups. This method has
   previously been applied by Taylor et al. [[70]7] to measure network
   variations among candidate genes and their interacting genes. To
   determine if interactions are varied, the difference of AD gene
   expression correlations of PPIs in disease and control samples is
   calculated. The average hub difference (AvgPCC) off correlation
   (Pearson's correlation co-efficient (PCC)) values between the disease
   and control groups was calculated as follows:
   [MATH: <mi>A</mi><mi>v</mi><mi>g</mi><mstyle class="text"><mtext
   class="textsf"
   mathvariant="sans-serif">P</mtext></mstyle><mi>C</mi><mi>C</mi><mo
   class="MathClass-rel">=</mo><mfrac><mrow><msubsup><mrow><mo
   class="MathClass-op">∑</mo></mrow><mrow><mi>i</mi><mo
   class="MathClass-rel">=</mo><mn>1</mn></mrow><mrow><mi>n</mi></mrow></m
   subsup><mrow><mo
   class="MathClass-open">(</mo><mrow><msub><mrow><mi>D</mi></mrow><mrow><
   mi>i</mi></mrow></msub><mo
   class="MathClass-bin">-</mo><msub><mrow><mi>C</mi></mrow><mrow><mi>i</m
   i></mrow></msub></mrow><mo
   class="MathClass-close">)</mo></mrow></mrow><mrow><mi>n</mi></mrow></mf
   rac> :MATH]
   (3)

   where D[i ]and C[i ]represent the correlations of a hub and its
   interactors for the disease and control groups respectively and n the
   number of i interactors for a given hub.

   Genes that are significantly different between the disease and control
   groups were selected as follows: (1) labels from the AD expression data
   were randomly assigned to either the disease or control group. (2) The
   AvgPCC was recalculated as RandomPCC by repeating the analysis defined
   in equation × 1000 times in order to calculate random distribution
   values. (3) P values for each hub was calculated by:
   [MATH: <mfrac><mrow><msup><mrow><mo
   class="MathClass-op">∑</mo></mrow><mrow><mi>A</mi></mrow></msup><mi>v</
   mi><mi>g</mi><mi>P</mi><mi>C</mi><mi>C</mi><mo
   class="MathClass-rel">≥</mo><mi>R</mi><mi>a</mi><mi>n</mi><mi>d</mi><mi
   >o</mi><mi>m</mi><mi>P</mi><mi>C</mi><mi>C</mi></mrow><mrow><mn>1000</m
   n></mrow></mfrac> :MATH]
   (4)

   A network of significant hub genes was generated using significant
   cut-off threshold of P <= 0.05. P values are adjusted using Bonferroni
   correction. Random reassignment of the expression data was taken by
   randomly shuffling the expression data gene labels. This method of
   random reassignment retains the topological network structure of the
   interactome during the randomization.

Ontology based semantic similarity

   Genes involved in phenotypically similar diseases are often
   functionally related on the molecular level [[71]33]. Based on this
   observation, the semantic similarity between hub genes and their
   interactors has been selected to analyze hub genes based on the Gene
   Ontology (GO) [[72]34]. The GO is a controlled vocabulary describing
   the charateristics of gene products. Semantic similarity measures
   evaluate information two genes share. The functional similarity between
   two proteins is estimated using encoded information in the GO
   hierarchies. In this study Wang's measure of similarity [[73]35] is
   applied to the Biological Process hierarchy. This measure determines
   the semantic similarity of two GO terms based on the locations of terms
   in the GO graph and their semantic relations with their ancestor terms.
   Given a GO term A, T[A ]denotes the set of all its ancestor terms
   including term A itself. S[A](t) can be defined as the contribution of
   a term
   [MATH: <mi>t</mi><mo
   class="MathClass-rel">∈</mo><msub><mrow><mi>T</mi></mrow><mrow><mi>A</m
   i></mrow></msub> :MATH]
   to the semantics of A based on the relative locations of t and A in the
   graph. Given GO terms A and B respectively, the semantic similarity
   between these two terms, S[GO](A,B), is defined as:
   [MATH: <mrow><msub><mrow><mstyle class="text"><mtext class="textsf"
   mathvariant="sans-serif">S</mtext></mstyle></mrow><mrow><mi>G</mi><mi>O
   </mi></mrow></msub><mrow><mo
   class="MathClass-open">(</mo><mrow><mi>A</mi><mo
   class="MathClass-punc">,</mo><mi>B</mi></mrow><mo
   class="MathClass-close">)</mo></mrow><mo
   class="MathClass-rel">=</mo><mfrac><mrow><msub><mrow><mo>∑</mo></mrow><
   mrow><mi>t</mi><mo
   class="MathClass-rel">∈</mo><msub><mrow><mi>T</mi></mrow><mrow><mi>A</m
   i></mrow></msub><mo
   class="MathClass-bin">∩</mo><msub><mrow><mi>T</mi></mrow><mrow><mi>B</m
   i></mrow></msub></mrow></msub><mrow><mo
   class="MathClass-open">(</mo><mrow><msub><mrow><mi>S</mi></mrow><mrow><
   mi>A</mi></mrow></msub><mrow><mo
   class="MathClass-open">(</mo><mrow><mi>t</mi></mrow><mo
   class="MathClass-close">)</mo></mrow><mo
   class="MathClass-bin">+</mo><msub><mrow><mi>S</mi></mrow><mrow><mi>B</m
   i></mrow></msub><mrow><mo
   class="MathClass-open">(</mo><mrow><mi>t</mi></mrow><mo
   class="MathClass-close">)</mo></mrow></mrow><mo
   class="MathClass-close">)</mo></mrow></mrow><mrow><msub><mrow><mo>∑</mo
   ></mrow><mrow><mi>t</mi><mo
   class="MathClass-rel">∈</mo><msub><mrow><mi>T</mi></mrow><mrow><mi>A</m
   i></mrow></msub></mrow></msub><msub><mrow><mi>S</mi></mrow><mrow><mi>A<
   /mi></mrow></msub><mrow><mo
   class="MathClass-open">(</mo><mrow><mi>t</mi></mrow><mo
   class="MathClass-close">)</mo></mrow><mo
   class="MathClass-bin">+</mo><msub><mrow><mo>
   ∑</mo></mrow><mrow><mi>t</mi><mo
   class="MathClass-rel">∈</mo><msub><mrow><mi>T</mi></mrow><mrow><mi>B</m
   i></mrow></msub></mrow></msub><msub><mrow><mi>S</mi></mrow><mrow><mi>B<
   /mi></mrow></msub><mrow><mo
   class="MathClass-open">(</mo><mrow><mi>t</mi></mrow><mo
   class="MathClass-close">)</mo></mrow></mrow></mfrac></mrow> :MATH]
   (5)

   As one gene may be annotated by many GO terms, similarity between two
   genes Sim(G[1],G[2]), is then calculated by taking the average semantic
   similarity scores for all pairs of their associated terms. The
   similarity score can range between (0,1), whereby a value closer to 1
   indicates close relatedness of the two genes in biological process.
   Wang's measure was implemented using the GOSemSim package in R
   [[74]36], taking the median semantic similarity between a hub protein
   and its interactors.

Construction of AD reference dataset

   A reference dataset containing known and susceptible AD genes was
   constructed using the OMIM 'morbid map' table [[75]5]. Known and
   recently discovered AD susceptibility genes in detailed in the study by
   Lamberet et al. [[76]19] were also included. This resulted in a list of
   52 AD related genes.

Tissue specific gene expression data

   Candidate genes were filtered using tissue-specific gene expression
   data retrieved from BioGPS [[77]37] to determine if these genes are
   expressed in tissues where AD is observed including the tissue
   locations: whole brain and prefrontal cortex. This dataset contains the
   transcription levels of 84 human tissues and cell lines and was
   processed using the method described by Lopes et al. [[78]38]. A list
   of 570 housekeeping genes were also included, obtained from [[79]39].

Classification of disease outcome

   Construction of feature sets for individual samples were generated
   using the prioritized hub genes in order to classify sample outcomes
   applying the approach proposed by Taylor et al. [[80]17]. Using the
   gene co-expression values for each sample in the AD dataset, the
   absolute difference in gene co-expression CoeDiff values of the
   prioritized genes and their interactors was calculated by:
   [MATH:
   <mi>C</mi><mi>o</mi><mi>e</mi><mi>D</mi><mi>i</mi><mi>f</mi><mi>f</mi><
   mo class="MathClass-rel">=</mo><mstyle displaystyle="true"><munderover
   accent="false" accentunder="false"><mrow><mo class="MathClass-op">
   ∑</mo></mrow><mrow><mi>i</mi><mo
   class="MathClass-rel">=</mo><mn>1</mn></mrow><mrow><mi>n</mi></mrow></m
   underover></mstyle><mrow><mo
   class="MathClass-open">(</mo><mrow><msub><mrow><mi>I</mi></mrow><mrow><
   mi>m</mi><mi>i</mi></mrow></msub><mo
   class="MathClass-bin">-</mo><msub><mrow><mi>G</mi></mrow><mrow><mi>i</m
   i></mrow></msub></mrow><mo
   class="MathClass-close">)</mo></mrow><mi>i</mi><mo
   class="MathClass-rel">=</mo><mn>1</mn> :MATH]
   (6)

   where the difference between the gene expression values of each
   prioritized gene G and its interactors Im is calculated. This was
   evaluated for all prioritized proteins across each sample in the AD
   dataset.

Results and discussion

   Using an integrative approach defined in the Method section, diverse
   heteregenous data obtained from AD gene expression, network topological
   analysis, along with semantic similarity and tissue-specific
   information are combined to to generate and analyze candidate AD genes.

Selecting differentially expressed genes

   The microarray [81]GSE4757 was analyzed using Significance Analysis of
   Microarrays (SAM) to identify significantly expressed AD related genes
   in R using the SAM 5.0 package from [[82]40]. A total of 10,107
   significantly positive differentially expressed (DE) genes were
   observed from 20,539 genes in the AD microarray dataset based on the
   two class (disease and control) unpaired t-test, using the false
   discovery rate (FDR = 0.98). These DE genes were selected for the
   construction of the AD specific PPIN. An overview of the top 10 DE
   genes is presented in Table [83]1 along with the SAM score based on the
   t-statistic value.

Table 1.

   Overview of the top 10 differentially expressed genes obtained from the
   Alzheimer's microarray dataset using SAM analysis.
   Gene Symbol Gene Description Score
   Scd stearoyl-CoA desaturase (delta-9-desaturase) 3.41
   CCDC79 coiled-coil domain containing 79 3.10
   STAU2 staufen, RNA binding protein, homolog 2 (Drosophila) 2.78
   Hist1h2bc histone cluster 1, H2bi; histone cluster 1, H2bg; histone
   cluster 1, H2be; histone cluster 1, H2bf; histone cluster 1, H2bc 2.72
   HIST1H2BE histone cluster 1, H2bi; histone cluster 1, H2bg; histone
   cluster 1, H2be; histone cluster 1, H2bf; histone cluster 1, H2bc 2.69
   HIST1H2BF histone cluster 1, H2bi; histone cluster 1, H2bg; histone
   cluster 1, H2be; histone cluster 1, H2bf; histone cluster 1, H2bc 2.56
     __________________________________________________________________

   HIST1H2BI histone cluster 1, H2bi; histone cluster 1, H2bg; histone
   cluster 1, H2be; histone cluster 1, H2bf; histone cluster 1, H2bc 2.53
   HIST1H2BG histone cluster 1, H2bi; histone cluster 1, H2bg; histone
   cluster 1, H2be; histone cluster 1, H2bf; histone cluster 1, H2bc 2.52
   DSTYK dual serine/threonine and tyrosine protein kinase 2.50
   Lrrc16a leucine rich repeat containing 16A 2.47
   [84]Open in a new tab

Construction of AD PPIN

   The protein pairs in the human PPIN applied in this study have been
   derived from both small-scale studies described in literature and
   large-scale high-throughput Y2H experimentation. We integrate both
   these data in the construction of the PPIN to improve interactome
   coverage. Protein pairs obtained from small-scale studies are
   considered high quality, however, it has been observed by Rolland et
   al. [[85]22] that their coverage of the interactome is limited to a
   narrow dense zone. Protein pairs obtained via high-throughput Y2H
   experimentation have demonstrated distributed homogeneously across the
   interactome [[86]22].

   AD gene expression data was firstly mapped to the PPIN via NCBI geneIDs
   using Cytoscape version 3.2.1 [[87]30]. This network was further
   filtered to include only the 10,107 DE genes identified from the SAM
   analysis. The mapping of DE genes with the PPIN resulted in an AD
   disease specific network consisting of 3,795 nodes and 5,410 protein
   pairs.

Identification of hubs through network topology analysis

   Hub genes were firstly defined based on network topological features
   using Cytoscape version 3.2.1 [[88]30]. The AD PPIN is represented as
   an undirected graphs, G = (V,E), whereby V represents a set of nodes
   (proteins) and
   [MATH: <mi>E</mi><mo class="MathClass-rel">=</mo><mrow><mo
   class="MathClass-open">{</mo><mrow><mrow><mo
   class="MathClass-open">(</mo><mrow><mi>u</mi><mo
   class="MathClass-punc">,</mo><mi>v</mi></mrow><mo
   class="MathClass-close">)</mo></mrow><mo
   class="MathClass-rel">|</mo><mi>u</mi><mo
   class="MathClass-punc">,</mo><mi>v</mi><mo
   class="MathClass-rel">∈</mo><mi>V</mi></mrow><mo
   class="MathClass-close">}</mo></mrow> :MATH]
   , the set of edges connecting the nodes. An overview of the global
   properties of the network is presented in Table [89]2. Using Cytoscape,
   a total of 174 hub genes were selecting based upon the topological
   analysis measures degree distribution and betweenness centrality. Genes
   with a high degree of connectivity and genes with low connectivity but
   high betweenness were selected using the cut-off thresholds defined in
   the Methods section. Table [90]3 presents the number of hubs obtained
   using these measures. These two measures have been selected as studies
   of model organisms have observed that proteins with high degree of
   connectivity tend to be encoded by essential genes [[91]41].
   Furthermore, detection of these genes leads to larger numbers of
   phenotypic outcomes compared to genes with lower connectivity [[92]42].
   However, not all disease genes in humans are essential genes. Goh et
   al. [[93]8] found that non-essential disease genes tend to be tissue
   specific located at the functional periphery of the interactome and do
   not necessarily encode hubs. Both Yu et al. [[94]32] and Joy et al.
   [[95]43] demonstrated how nodes with a low degree of centrality but
   high betweenness are important in a network. Taking this into
   consideration, we also include betweenness as an indicator of
   centrality.

Table 2.

   Overview of the global properties of the AD specific PPIN.
   Nodes Edges Clustering Co-Efficient Avgerage Degree Average Betweeness
   Centraility
   3,795 5410 0.026 2.85 0.009
   [96]Open in a new tab

Table 3.

   Analysis of AD PPIN Topological Features to Identify Hub Genes.
                                    Number of Nodes
   High Betweenness
   (Bottlenecks) 128
                 Selected Hub Genes 175
   [97]Open in a new tab

Prioritization of hubs

Step 1: Integration of co-expression and PPIN

   To further filter the list of 174 identified hub genes and select
   significant hubs we integrated the AD gene expression data with the
   PPIN. The network variation of the hub genes and their interactors was
   calculated using the AvgPCC equation defined in Methods. PCC values
   between the hubs and their interactors were calculated for both the
   disease and control groups. Significant hub genes were selecting using
   the Bonferroni corrected cut-off threshold of P < 0:05. A total of 22
   genes were identified.

Step 2: Gene ontology semantic similarity analysis

   The semantic similarity between hub genes and their interactors has
   also been selected to prioritize hub genes based on the Gene Ontology
   (GO). It has been observed that genes involved in phenotypically
   similar diseases are often functionally related on the molecular level
   [[98]33]. In this study, the semantic similarity between a gene hub and
   it's interacting partners was performed using Wang's [[99]35] measure
   of similarity detailed in equation 4. This was applied to the GO
   Biological Process hierarchy as a quantitative measure of functionality
   similarity between gene pairs using the R package GOSemSim [[100]36].
   To obtain the similarity value for the hub and all its interactors, the
   median similarity was taken across all protein pairs. The semantic
   similarity values obtained ranged between 0 and 1. The hub gene
   semantic similarity distribution was sorted in ascending order and hubs
   with GO semantic similarity greater than 0.5 selected. This resulted in
   a total of 10 prioritized hubs. Combining candidate genes output from
   analysis in Step 1 and Step 2 resulted in a list of 32 prioritized hub
   genes summarized in Table [101]4.

Table 4.

   List of significant hubs obtained from gene co-expression network
   analysis.
                       Approach                     Number of Prioritized Hubs
   Network Variation using PPIN and Gene Expression             22
   Gene Ontology                                                10
   [102]Open in a new tab

Evaluation

Functional annotation enrichment

   GO and pathway enrichment analysis using the DAVID resource [[103]44]
   was applied to investigate the biological implications of the
   prioritized hub gene list. Functional annotation was obtained by
   extracting the most over-representative GO terms (Biological Process,
   Cellular Component and Molecular Function) for the groups of genes
   under observation with respect to the whole genome taken as the
   reference background set (p-value <0.05). We applied this approach to
   measure if the prioritized hub genes are more enriched in GO terms or
   involved in pathways than what would be expected by chance. The number
   of GO terms and KEGG pathways are summarized in Table [104]5.

Table 5.

   Number of Enriched Gene Ontology Terms and KEGG Pathways.
                         Number of Terms
   GO Biological Process 135
   GO Molecular Function 19
   GO Cellular Component 17
   KEGG Pathways         5
   [105]Open in a new tab

   Enrichment analysis of the prioritized hub genes identified significant
   biological processes modulated in AD pathogenesis including: neuron
   differentiation, neuron projection morphogenesis, neuron projection
   development and regulation of neuron apoptosis. Furthermore,
   significant KEGG pathways including: the Wnt signaling pathway and the
   TGF-beta signaling pathway both of which have been implicated in
   neurodegenerative diseases [[106]45,[107]46] These results highlight
   the potential of this approach in using prioritized hubs for the
   prediction of AD biomarkers.

Reference dataset comparison

   The list of prioritized hub genes were compared to the reference
   dataset consisting of 52 AD related genes. A total of 9 AD susceptible
   genes from the list of hub genes identified were identified summarized
   in Table [108]6. Mutations in PSEN1 are the most common cause of early
   onset of AD. TRAF1, is a critical regulator of cerebral
   ischemia-reperfusion injury and neuronal death [[109]47]. LZTS2 has
   shown associated with late onset AD [[110]48].

Table 6.

   List of AD Susceptible Genes Idenfied from the Prioritized List of Gene
   Hubs.
   Gene Symbol                         Gene Name
      LZTS2    leucine zipper, putative tumor suppressor 2
      MTUS2    KIAA0774
      TRAF1    TNF receptor-associated factor 1
      FHL3     four and a half LIM domains 3
       REL     v-rel reticuloendotheliosis viral oncogene homolog (avian)
      CARD9    caspase recruitment domain family, member 9
      TFCP2    transcription factor CP2
      MID2     midline 2
      KRT38    keratin 38
   [111]Open in a new tab

Tissue analysis

   The prioritized gene hubs were analyzed to determine if gene hubs are
   expressed in tissues in whereby by AD is observed including the
   prefrontal cortex and whole brain. Tissue specificity is an important
   component of network analysis as genetic diseases often target specific
   tissue(s). Therefore, perturbations of pathways or proteins may have
   differential effects among diverse tissues [[112]49]. Using tissue
   specific expression data from BioGPS [[113]37] along with housekeeping
   genes obtained from Lopes et al. [[114]38], we identified that 4 of the
   prioritized gene hubs are located in the whole brain and/or the
   prefrontal cortex tissues.

Disease outcome classification

   Using the prioritized gene-candidate list along with gene expression
   values from the GEO dataset [115]GSE4757, we constructed a feature
   vector applying the approach described in Methods using Equation 6.
   These features measure the differences in absolute co-expression values
   between the prioritized genes and their interactors across each sample
   in the dataset. The aim of this approach is to determine if measured
   differences in co-expression values of prioritized genes and their
   interactors can predict sample outcome. A total of 32 feature values
   (obtained using each of the genes in the prioritized list) were
   generated for each instance in the AD dataset. The AD dataset has a
   total of 20 individual samples, 10 AD and 10 normal. We use these as
   labels when measuring the classification performance. Classification of
   sample outcome using this vector was performed using the Random Forest
   classifier [[116]50] in the WEKA toolbox [[117]51]. Leave one out cross
   validation (LOOCV) was implemented, whereby, for a dataset with n
   samples, n experiments are performed. Each experiment uses n-1 samples
   for training and the remaining sample for testing. The performance was
   assessed by a receiver operating characteristics (ROC) curve which
   plots the true positive rate against the false positive rate at various
   threshold settings. The area under the ROC curve (AUC) was estimated
   for each of the prioritization approaches described in Table [118]7.
   Reasonable AUC values were observed for the various prioritization
   approaches ranging from 0.73-0.70 as presented in Table [119]7 and
   illustrated using ROC curves in Figure [120]2. This analysis suggests
   that the integration of diverse data in the prioritization of protein
   hubs are indeed useful for sample outcome prediction. Interestingly, we
   can see a slight decrease in classification performance when gene hubs
   prioritized using topological features are integrated with gene hubs
   prioritized using GO semantic similarity. This is compared to using
   only the topological or GO semantic similarity prioritization approach
   only.

Table 7.

   Summary of Disease Outcome Classification using Prioritized Genes.
   Priortization Approach Description Hubs AUC Value
   Topology Hubs prioirtized using network topology methods defined in
   Step 1 22 0.74
   GOSim Hubs prioirtized using GO Semantic Similarity defined in Step 2
   10 0.72
   Toplogy_GoSim Union of priortized hubs from Step 1 and Step 2 32 0.7
   [121]Open in a new tab

Figure 2.

   Figure 2
   [122]Open in a new tab

   ROC Curves illustrating the predictive performance of the Gene Hub
   Prioritization Approches.

Conclusions

   In this study, we have proposed a novel computational framework that
   integrates diverse heterogeneous including gene expression, network
   topological features: degree and betweenness along with GO semantic
   similarity and tissue specificity information to prioritize and analyze
   disease-gene candidates. Our proposed pipeline provides the flexibility
   to integrate other heterogeneous data sources. To illustrate our
   approach, AD was applied as a Case Study. AD is a chronic
   neurodegenerative disorder characterized by a progressive decline in
   memory and cognitive abilities. It is estimated that 4-8% of the
   population over 65 years of age suffers from this disease with the rate
   of incidence increasing with age. Currently, only the symptoms of AD
   are treated with no effective therapeutic strategy available. Using our
   pipeline, 22 prioritized candidate AD disease genes were generated.
   Enrichment analysis revealed GO Biological Process terms were enriched
   by the prioritized gene-list such as regulation of neurogenesis and
   generation of neurons which are linked to AD pathogenesis. Analysis of
   KEGG pathways identified prioritized genes involved in pathways
   including Wnt signaling pathway and the TGF-beta signaling pathway. AD
   susceptible genes: PSEN1 and TRAF1 extracted from OMIN [[123]5] and a
   recent study by Lambert et al. [[124]19] were identified using the
   prioritization approach. A reasonable predictive performance (AUC:
   0.70) was achieved when classifying AD and normal gene expression
   profiles from individuals using a feature set generated from the
   prioritized gene-list along with supervised classification using Random
   forest and LOOCV. These results demonstrate that the integration of
   PPINs along with disease datasets and contextual information is an
   important tool in unraveling the molecular basis of diseases. It is
   important to note that network based approaches for candidate disease
   gene priortization need to be viewed with care as the map of the binary
   human PPIN is still incomplete. However, coverage of the human
   interactome is increasing through systematic Y2H and transcriptional
   interaction screens. This increased coverage, quality, and diversity of
   human PPIN data will provide further opportunities for the molecular
   characterization and understanding of human disease [[125]1].

Competing interests

   The authors declare that they have no competing interests.

Authors' contributions

   FB, HW and HZ conceived and designed the framework. FB performed the
   experiments and analyzed the data. FB, HW and HZ wrote the paper.

Contributor Information

   Fiona Browne, Email: f.browne@ulster.ac.uk.

   Haiying Wang, Email: hy.wang@ulster.ac.uk.

   Huiru Zheng, Email: h.zheng@ulster.ac.uk.

Acknowledgements