Abstract

   Leukemia is a leading cause of cancer deaths in the developed
   countries. Great efforts have been undertaken in search of diagnostic
   biomarkers of leukemia. However, leukemia is highly complex and
   heterogeneous, involving interaction among multiple molecular
   components. Individual molecules are not necessarily sensitive
   diagnostic indicators. Network biomarkers are considered to outperform
   individual molecules in disease characterization. We applied an
   integrative approach that identifies active network modules as putative
   biomarkers for leukemia diagnosis. We first reconstructed the
   leukemia-specific PPI network using protein-protein interactions from
   the Protein Interaction Network Analysis (PINA) and protein annotations
   from GeneGo. The network was further integrated with gene expression
   profiles to identify active modules with leukemia relevance. Finally,
   the candidate network-based biomarker was evaluated for the diagnosing
   performance. A network of 97 genes and 400 interactions was identified
   for accurate diagnosis of leukemia. Functional enrichment analysis
   revealed that the network biomarkers were enriched in pathways in
   cancer. The network biomarkers could discriminate leukemia samples from
   the normal controls more effectively than the known biomarkers. The
   network biomarkers provide a useful tool to diagnose leukemia and also
   aids in further understanding the molecular basis of leukemia.

   Keywords: network biomarker, integrative analysis, leukemia.

Introduction

   Leukemia is a prevalent hematologic malignancy and one of the most
   common causes of cancer deaths in the developed countries[39]^1^,
   [40]^2. The overall incidence of leukemia is 14 per 100000 people in
   the United States in 2015 and is projected to continue rising. Based on
   the origin, leukemia can be classified into myeloid leukemia or
   lymphoid leukemia, which can be subdivided into acute or chronic
   according to the degree of cellular differentiation[41]^3^, [42]^4.

   Many of the symptoms of leukemia are non-specific and vague, which
   could not be diagnosed by conventional blood tests and bone marrow
   examination[43]^5^, [44]^6. Plenty of efforts have been devoted to
   investigate the molecular alterations in leukemogenesis. Next
   generation sequencing of human genomes and exomes has revealed somatic
   mutations, aberrantly expressed genes, microRNAs and DNA methylations
   with putative roles in leukemia[45]^7^-[46]^9. However, most of the
   individual molecules suffer from low reproducibility and high
   false-positive rates. Few of them have been translated to the clinic
   for diagnostic application.

   It is well recognized that cancer is a complex disease caused not by
   the malfunction of single molecules but their collective behavior in
   the network [47]^10^-[48]^15. Therefore, network biomarkers are
   considered to better characterize leukemia than individual molecules
   and have recently attracted much attention. A number of protein
   interaction sub-networks have been proposed for early diagnosis,
   prognosis and efficacy prediction of cancers[49]^16^-[50]^19.

   In this study, we proposed a framework (Figure [51]1) that integrates
   protein-protein interaction (PPI) data and microarray-based gene
   expression profiles to construct network biomarkers for accurate
   prediction of leukemia. The network biomarkers prove to be effective in
   distinguishing leukemia from normal samples.

Figure 1.

   [52]Figure 1
   [53]Open in a new tab

   The flowchart of network biomarkers identification for leukemia
   diagnosis.

Materials and Methods

Data collection

   We used two different types of datasets, protein-protein interaction
   data and disease annotation of the protein-coding genes to reconstruct
   the leukemia-specific PPI network. PPI data was extracted from the
   Protein Interaction Network Analysis (PINA) v2.0 platform [54]^20. PINA
   is a unified database of protein-protein interaction

   that collects 14454 genes and 108470 interactions from six manually
   curated public databases (listed in Table [55]1). The
   leukemia-associated genes were extracted from the commercial knowledge
   database Metacore^TM, which is developed by GeneGo.

Table 1.

   Source databases of PINA.
   Original database Version Ref. Link
   IntAc Oct 4,2012 [21] [56]http://www.ebi.ac.uk/intact/
   BioGRID 3.1.93 [22] [57]http://thebiogrid.org/
   MINT Dec 21,2010 [23] [58]http://mint.bio.uniroma2.it/mint/Welcome.do
   DIP June 14,2010 [24] [59]http://dip.doe-mbi.ucla.edu/dip/Stat.cgi
   HPRD April 13,2010 [25] [60]http://www.hprd.org/download
   MIPS/Mpact Oct 1,2008 [26] [61]http://mips.helmholtz-muenchen.de/
   [62]Open in a new tab

   The public gene expression data were downloaded from the Gene
   Expression Omnibus (GEO) database. All the gene expression data were
   obtained using Affymetrix Human Genome arrays. The samples in each GEO
   datasets are divided into three categories: Leukemia (including AML,
   CLL, T-PLL and B-CLL), others and Normal. The others samples are
   filtered out in this study since they are not associated with leukemia.
   Detailed information for GEO datasets is summarized in Table [63]2. The
   six groups of expression datasets were analyzed to get statistics
   values. Additional three sets of expression datasets were used for
   further verification (Table [64]3).

Table 2.

   Leukemia-associated gene expression datasets used for analysis.
      Series     Platform  No. Samples      Leukemia      Others Normal Ref.
                                       AML CLL T-PLL B-CLL
   [65]GSE9476  [66]GPL96  64          26                        38     [27]
   [67]GSE6691  [68]GPL96  56              11             32     13     [28]
   [69]GSE5788  [70]GPL96  14                  6                 8      [29]
   [71]GSE22529 [72]GPL96  52              41                    11     [30]
   [73]GSE26725 [74]GPL570 17                        12          5      [31]
   [75]GSE23293 [76]GPL570 41              7              18     16     [32]
   [77]Open in a new tab

Table 3.

   Leukemia-associated gene expression datasets used for validation.
      Series     Platform  No. of samples CML CLL Normal Ref.
   [78]GSE8835  [79]GPL96  66                 42  24     [33]
   [80]GSE24739 [81]GPL570 24             16      8      [34]
   [82]GSE39411 [83]GPL570 152                104 48     [35]
   [84]Open in a new tab

Reconstruction of leukemia-specific PPI network

   Human leukemia-specific protein-protein interaction network was first
   downloaded from PINA and then refined with the 1495 leukemia-associated
   gene from GeneGo. Only the interactions formed between
   leukemia-associated genes were selected to form a leukemia-specific PPI
   network.

Integration with gene expressing profiles

   The statistical analysis was invoked through the limma (Linear Models
   for Microarray Data) R package [85]^36 and the affy(Methods for
   Affymetrix Oligonucleotide Arrays) R package in R software platform
   [86]^37. Student t-test was used to identify the significant difference
   level (P-value) of each considerable gene in each dataset. To enhance
   the accuracy, we applied the empirical Bayesian statistical method to
   moderate the standard errors and then utilized the method proposed by
   Benjamini et al. to adjust the multi-testing [87]^38, and got the
   adjusted P-values simultaneously.

   To integrate the gene expression data and leukemia-specific PPI
   network, the adjusted P-value of each gene was mapped onto its
   corresponding gene in the leukemia-specific PPI network to obtain a
   dataset-specific weighted PPI network, with adjusted P-value as weight
   factor.

Active module subtraction

   In general, the network integration analysis was performed in 3 steps.
   At the first step, we converted the adjusted P-values to Z score
   through using the inverse normal cumulative distribution. Higher Z
   score indicates more important role in leukemogenesis. Given the Z
   score, we performed a greedy search to identify the modules with a
   locally maximal Z score. The candidate modules were seeded with a
   single gene and then a neighbor within a distanced=3 from the seed were
   iteratively added. If the neighbor added to the Z score, it was
   incorporated into the module. The search terminated when no addition
   increased the Z score over the improvement rate r. The parameter r was
   set as 0.05 to avoid over fitting. At last the top 10 modules with the
   highest Z-score identified from each run were merged and iteratively
   searched for 3-5 times, until the module reached the optimal size of
   70-80 nodes. We used jActiveModules [88]^39 to select active modules
   from the weighted PPI network since it is a fashionable method for this
   kind of investigation. jAM is a plug-in of Cytoscape which evaluated
   module activity with Z score.

Network-based biomarkers construction

   At last, as 6 optimized modules include 290 genes in total, which are
   too large and loosely interconnected for further analysis, we carried
   out the overlapping analysis to find out the number of enriched genes
   shared by each optimized modules. We overlapped the six modules and
   selected the genes shared by at least two networks to construct the
   final network-based biomarker.

Pathway enrichment analysis

   We performed pathway enrichment analysis in Ingenuity Pathway Analysis
   (IPA) and Kyoto Encyclopedia of Genes and Genomes (KEGG) [89]^40to
   provide functional insight into the identified network marker. The
   statistical significance of the enrichment was calculated using
   hypergeometric test and adjusted by FDR method (P-value < 0.05).

Statistical significance assessment

   Hypergeometric test was used to test whether the network biomarkers are
   significantly enriched with leukemia-related genes. Known mutation
   genes related to cancers were obtained from the Catalogue of Somatic
   Mutations in Cancer (COSMIC), which is a cancer gene census [90]^41.
   213 of the COSMIC genes are found in common with the GeneGO database.
   An empirical P-value was calculated to evaluate the statistical
   significance. P-value was obtained according to the following equation:
   graphic file with name jcav08p0278i001.jpg

   Where, N represents the number of genes in the leukemia-specific PPI
   network; M is the number of known leukemia related genes in COSMIC; n
   denotes the number of genes in the final network biomarkers; k
   represents the known leukemia related genes in the final network
   biomarkers.

Performance evaluation

   We employed the receiver-operating characteristic (ROC) analysis to
   evaluate the prediction performance of the network biomarkers in
   distinguishing leukemia samples from the normal controls. The epicalc R
   package ([91]http://CRAN.R-project.org/package=epicalc) was used to
   produce the ROC curves. A 5-fold cross validation was performed on
   three gene expression dataset listed in Table [92]3. Normal samples
   were set as 0 and cancer samples were set as 1. The classification
   performance was represented as the area under curve (AUC). We also
   provided sensitivity, specificity and accuracy for the network
   biomarkers.

Results and Discussion

Sub-network involved in leukemogenesis

   The leukemia-specific PPI network was reconstructed by integrating PPI
   from PINA and 1495 leukemia-associated genes from GeneGo. As a result,
   the leukemia-specific PPI network consists of 4136 interactions among
   978 genes.

   As is described in Methods section, gene expression profiles of 6
   independent GEO datasets were overlaid to the reconstructed
   leukemia-specific PPI network and 6 correspondent dataset-specific
   sub-networks were obtained (marked orderly as PPI_GSE9476_raw,
   PPI_GSE6691_raw, PPI_GSE9476_raw, PPI_GSE22529_raw, PPI_GSE23293_raw
   and PPI_GSE26725_raw in Figure [93]1). Greedy search was performed for
   6 sub-networks respectively. After 3 iterations for [94]GSE5788,
   [95]GSE9476 and [96]GSE22529, 4 iterations for [97]GSE23293 and
   [98]GSE26725, 5 iterations for [99]GSE6691, finally we obtained 6
   active modules with a locally maximal Z score by jActiveModules (marked
   orderly as PPI_GSE9476_TR, PPI_GSE6691_ TR, PPI_GSE9476_TR,
   PPI_GSE22529_TR, PPI_GSE23293_TR and PPI_GSE26725_TR in Figure [100]1).
   The number of nodes and edges in each module is summarized in Table
   [101]4.

Table 4.

   Detailed information of the active modules.
   PPI_GSE5788_TR PPI_GSE6691_ TR PPI_GSE9476_ TR PPI_GSE22529_ TR
   PPI_GSE23293_ TR PPI_GSE23293_ TR
   Nodes 77 71 75 73 71 75
   Edges 205 186 166 193 126 188
   [102]Open in a new tab

   After overlap analysis, a total of 97 genes along with their
   interactions were incorporated into the final network-based biomarkers,
   as illustrated in Figure [103]2. Genes with previous evidence in
   leukemia are marked yellow.

Figure 2.

   [104]Figure 2
   [105]Open in a new tab

   The final network-based biomarker for leukemia. The known cancer
   related genes in final network are marked yellow.

Functional analysis of candidate network biomarkers

   The network biomarkers were most enriched for molecular mechanisms of
   cancer (IPA) and pathways in cancer (KEGG). Leukemia-specific pathways
   such as Chronic Myeloid Leukemia (KEGG) and Acute Myeloid Leukemia
   Signaling (both IPA and KEGG) were also enriched and showed high
   statistical significance. It indicates that genes in the biomarker
   network are closely associated with the development of different types
   of leukemia. Besides, in He's study, P13K/AKT Signaling (IPA) was also
   proved to be involved in chronic myeloid leukemia[106]^42. Irwin et al.
   found that ErbB inhibitors played important roles in Philadelphia
   chromosome-positive acute lymphoblastic leukemia (Ph(+)ALL) and ErbB
   signaling (KEGG) was a complementary molecular target in Ph(+)ALL
   [107]^43. The top-ranked pathways in both IPA and KEGG displayed
   apparent correlation between leukemia and the network biomarkers, which
   implied the potential accuracy of our result. Figure [108]3 shows the
   top 10 most significantly enriched IPA and KEGG pathway respectively.

Figure 3.

   [109]Figure 3
   [110]Open in a new tab

   IPA and KEGG pathway enrichment analysis for network biomarkers. The
   top 10 most significantly enriched IPA and KEGG pathway are shown in
   panel (A) and (B) respectively.

   We used the Database for Annotation, Visualization, and Integrated
   Discovery (DAVID) [111]^44 for the Gene ontology (GO) annotation in
   three domains: molecular function, biological process, and cellular
   component. The top 10 most significantly enriched items for each domain
   are shown in Figure [112]4. These results indicate that genes in the
   network are closely associated with the biological processes in the
   development of different types of leukemia, such as cell death [113]^45
   and apoptosis [114]^46. This indicated the accuracy of the predicted
   network biomarkers to a certain extent.

Figure 4.

   [115]Figure 4
   [116]Open in a new tab

   Gene ontology annotation for the network biomarkers. The network
   biomarkers identified by our method were annotated with DAVID tools at
   three levels of gene ontology: Molecular Function, Biological Process,
   and Cellular Component. The top 10 most significantly enriched items
   for each level are shown.

Network biomarkers are significantly associated with leukemia

   We further investigated whether the genes in the network biomarkers
   were randomly obtained. The statistical significance was checked using
   hypergeometric test and a significant p-value of 0.008987933 was
   obtained. This indicates that the candidate network biomarkers are
   enriched with known leukemia-related genes and could not be obtained
   randomly.

   As illustrated in Figure [117]5(A), the blue circle represents the 978
   genes in the leukemia-specific PPI network; the red circle includes the
   522 known leukemia-related genes in COSMIC. The leukemia-specific PPI
   contains 195 known leukemia-related genes in COSMIC. The purple circle
   represents 97 genes in final network biomarkers, among which 29 genes
   belong to the known leukemia-related category.

Figure 5.

   [118]Figure 5
   [119]Open in a new tab

   Validation of the network biomarkers. (A) Distribution of the
   leukemia-associated genes in the network. (B-D) ROC curves obtained
   with the network biomarkers tested by three gene expression datasets.
   Panel (B), (C) and (D) represent respectively the results of the gene
   expression datasets in series of [120]GSE8835, [121]GSE24739 and
   [122]GSE39411.

Sub-network marker with higher classification accuracy

   To evaluate the performance of network biomarkers in classifying
   leukemia and normal gene expression profiles, we used three independent
   gene expression datasets listed in Table [123]3 as tested datasets to
   produce the ROC curves. We compared the network biomarker with three
   reported gene biomarkers: CD38[124]^47, BCL2 [125]^48 and IGFBP7
   [126]^49. The reasons we chose these three markers for comparison are
   as follows, 1) these biomarkers are all well-studied and all of them
   have been validated by clinical experiments. 2) The marker CD38 is a
   member of our network whereas the remaining two are not. We included
   two others for fair evaluating the performance of our network
   biomarker. Figure [127]5 shows the ROC curves for network biomarkers
   and 3 known biomarkers. Network-based biomarker has higher AUC than any
   of the single markers which means network-based biomarker could more
   effectively discriminate the leukemia from the normal controls. The
   sensitivity, specificity and accuracy of each dataset are given in
   Table [128]5.

Table 5.

   Detailed information of ROC curves.
   Series Biomarker Sensitivity Precision Specificity Accuracy AUC
   [129]GSE8835 CD38 0.913 0.700 0.212 0.658 0.629
   BCL2 0.885 0.650 0.166 0.623 0.604
   IGFBP7 0.965 0.665 0.148 0.668 0.584
   Network biomarkers 0.851 0.686 0.316 0.657 0.698
   [130]GSE24739 CD38 0.893 0.662 0.088 0.625 0.431
   BCL2 0.943 0.654 0.004 0.630 0.237
   IGFBP7 0.938 0.652 0.001 0.625 0.197
   Network biomarkers 0.874 0.986 0.976 0.908 0.966
   [131]GSE39411 CD38 0.999 0.725 0.177 0.740 0.981
   BCL2 0.915 0.931 0.853 0.895 0.966
   IGFBP7 0.886 0.797 0.513 0.768 0.880
   Network biomarkers 0.996 0.999 0.998 0.997 0.999
   [132]Open in a new tab

   It is worth noting that for network biomarkers from [133]GSE8835 has a
   relatively lower AUC than the other two datasets. This may be caused by
   the difference of platform and method between [134]GSE8835 and the
   training datasets. Anyhow, the accuracy of the network biomarkers is
   still higher than other three single biomarkers.

   The result indicates that the putative network biomarkers could
   diagnose leukemia samples more accurately and could be used as putative
   biomarker to aid in early diagnosis of leukemia.

Conclusions

   In conclusion, we developed a network approach for molecular
   investigation and diagnosis of leukemia. The constructed network
   biomarkers not only achieve higher accuracy rate of diagnosis compared
   to known single biomarkers but also provide systematic insights into
   the leukemogenesis process. We noticed that we only considered the
   combination of genes (or the nodes) in the network for the prediction
   of leukemia. The interactions among genes can also provide valuable
   biological signatures for diagnosis of diseases. We will take the
   edge-variation in the network into the account for the further
   improving of the leukemia prediction.

Availability of Data and Materials

   PINA: [135]http://cbg.garvan.unsw.edu.au/pina/interactome.stat.do.

   GeneGo: [136]https://portal.genego.com/cgi/data_manager.cgi.

   [137]GSE9476:
   [138]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9476.

   [139]GSE6691:
   [140]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6691.

   [141]GSE5788:
   [142]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5578.

   [143]GSE22529:
   [144]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22529.

   [145]GSE26725:
   [146]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26725.

   [147]GSE23293:
   [148]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE23293.

   [149]GSE8835:
   [150]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE8835.

   [151]GSE24739:
   [152]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24739.

   [153]GSE39411:
   [154]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39411.

   COSMIC: [155]http://cancer.sanger.ac.uk/cosmic/download

   IPA: [156]http://www.ingenuity.com/products/login

   DAVID: [157]https://david.ncifcrf.gov/.

Acknowledgments