Abstract Leukemia is a leading cause of cancer deaths in the developed countries. Great efforts have been undertaken in search of diagnostic biomarkers of leukemia. However, leukemia is highly complex and heterogeneous, involving interaction among multiple molecular components. Individual molecules are not necessarily sensitive diagnostic indicators. Network biomarkers are considered to outperform individual molecules in disease characterization. We applied an integrative approach that identifies active network modules as putative biomarkers for leukemia diagnosis. We first reconstructed the leukemia-specific PPI network using protein-protein interactions from the Protein Interaction Network Analysis (PINA) and protein annotations from GeneGo. The network was further integrated with gene expression profiles to identify active modules with leukemia relevance. Finally, the candidate network-based biomarker was evaluated for the diagnosing performance. A network of 97 genes and 400 interactions was identified for accurate diagnosis of leukemia. Functional enrichment analysis revealed that the network biomarkers were enriched in pathways in cancer. The network biomarkers could discriminate leukemia samples from the normal controls more effectively than the known biomarkers. The network biomarkers provide a useful tool to diagnose leukemia and also aids in further understanding the molecular basis of leukemia. Keywords: network biomarker, integrative analysis, leukemia. Introduction Leukemia is a prevalent hematologic malignancy and one of the most common causes of cancer deaths in the developed countries[39]^1^, [40]^2. The overall incidence of leukemia is 14 per 100000 people in the United States in 2015 and is projected to continue rising. Based on the origin, leukemia can be classified into myeloid leukemia or lymphoid leukemia, which can be subdivided into acute or chronic according to the degree of cellular differentiation[41]^3^, [42]^4. Many of the symptoms of leukemia are non-specific and vague, which could not be diagnosed by conventional blood tests and bone marrow examination[43]^5^, [44]^6. Plenty of efforts have been devoted to investigate the molecular alterations in leukemogenesis. Next generation sequencing of human genomes and exomes has revealed somatic mutations, aberrantly expressed genes, microRNAs and DNA methylations with putative roles in leukemia[45]^7^-[46]^9. However, most of the individual molecules suffer from low reproducibility and high false-positive rates. Few of them have been translated to the clinic for diagnostic application. It is well recognized that cancer is a complex disease caused not by the malfunction of single molecules but their collective behavior in the network [47]^10^-[48]^15. Therefore, network biomarkers are considered to better characterize leukemia than individual molecules and have recently attracted much attention. A number of protein interaction sub-networks have been proposed for early diagnosis, prognosis and efficacy prediction of cancers[49]^16^-[50]^19. In this study, we proposed a framework (Figure [51]1) that integrates protein-protein interaction (PPI) data and microarray-based gene expression profiles to construct network biomarkers for accurate prediction of leukemia. The network biomarkers prove to be effective in distinguishing leukemia from normal samples. Figure 1. [52]Figure 1 [53]Open in a new tab The flowchart of network biomarkers identification for leukemia diagnosis. Materials and Methods Data collection We used two different types of datasets, protein-protein interaction data and disease annotation of the protein-coding genes to reconstruct the leukemia-specific PPI network. PPI data was extracted from the Protein Interaction Network Analysis (PINA) v2.0 platform [54]^20. PINA is a unified database of protein-protein interaction that collects 14454 genes and 108470 interactions from six manually curated public databases (listed in Table [55]1). The leukemia-associated genes were extracted from the commercial knowledge database Metacore^TM, which is developed by GeneGo. Table 1. Source databases of PINA. Original database Version Ref. Link IntAc Oct 4,2012 [21] [56]http://www.ebi.ac.uk/intact/ BioGRID 3.1.93 [22] [57]http://thebiogrid.org/ MINT Dec 21,2010 [23] [58]http://mint.bio.uniroma2.it/mint/Welcome.do DIP June 14,2010 [24] [59]http://dip.doe-mbi.ucla.edu/dip/Stat.cgi HPRD April 13,2010 [25] [60]http://www.hprd.org/download MIPS/Mpact Oct 1,2008 [26] [61]http://mips.helmholtz-muenchen.de/ [62]Open in a new tab The public gene expression data were downloaded from the Gene Expression Omnibus (GEO) database. All the gene expression data were obtained using Affymetrix Human Genome arrays. The samples in each GEO datasets are divided into three categories: Leukemia (including AML, CLL, T-PLL and B-CLL), others and Normal. The others samples are filtered out in this study since they are not associated with leukemia. Detailed information for GEO datasets is summarized in Table [63]2. The six groups of expression datasets were analyzed to get statistics values. Additional three sets of expression datasets were used for further verification (Table [64]3). Table 2. Leukemia-associated gene expression datasets used for analysis. Series Platform No. Samples Leukemia Others Normal Ref. AML CLL T-PLL B-CLL [65]GSE9476 [66]GPL96 64 26 38 [27] [67]GSE6691 [68]GPL96 56 11 32 13 [28] [69]GSE5788 [70]GPL96 14 6 8 [29] [71]GSE22529 [72]GPL96 52 41 11 [30] [73]GSE26725 [74]GPL570 17 12 5 [31] [75]GSE23293 [76]GPL570 41 7 18 16 [32] [77]Open in a new tab Table 3. Leukemia-associated gene expression datasets used for validation. Series Platform No. of samples CML CLL Normal Ref. [78]GSE8835 [79]GPL96 66 42 24 [33] [80]GSE24739 [81]GPL570 24 16 8 [34] [82]GSE39411 [83]GPL570 152 104 48 [35] [84]Open in a new tab Reconstruction of leukemia-specific PPI network Human leukemia-specific protein-protein interaction network was first downloaded from PINA and then refined with the 1495 leukemia-associated gene from GeneGo. Only the interactions formed between leukemia-associated genes were selected to form a leukemia-specific PPI network. Integration with gene expressing profiles The statistical analysis was invoked through the limma (Linear Models for Microarray Data) R package [85]^36 and the affy(Methods for Affymetrix Oligonucleotide Arrays) R package in R software platform [86]^37. Student t-test was used to identify the significant difference level (P-value) of each considerable gene in each dataset. To enhance the accuracy, we applied the empirical Bayesian statistical method to moderate the standard errors and then utilized the method proposed by Benjamini et al. to adjust the multi-testing [87]^38, and got the adjusted P-values simultaneously. To integrate the gene expression data and leukemia-specific PPI network, the adjusted P-value of each gene was mapped onto its corresponding gene in the leukemia-specific PPI network to obtain a dataset-specific weighted PPI network, with adjusted P-value as weight factor. Active module subtraction In general, the network integration analysis was performed in 3 steps. At the first step, we converted the adjusted P-values to Z score through using the inverse normal cumulative distribution. Higher Z score indicates more important role in leukemogenesis. Given the Z score, we performed a greedy search to identify the modules with a locally maximal Z score. The candidate modules were seeded with a single gene and then a neighbor within a distanced=3 from the seed were iteratively added. If the neighbor added to the Z score, it was incorporated into the module. The search terminated when no addition increased the Z score over the improvement rate r. The parameter r was set as 0.05 to avoid over fitting. At last the top 10 modules with the highest Z-score identified from each run were merged and iteratively searched for 3-5 times, until the module reached the optimal size of 70-80 nodes. We used jActiveModules [88]^39 to select active modules from the weighted PPI network since it is a fashionable method for this kind of investigation. jAM is a plug-in of Cytoscape which evaluated module activity with Z score. Network-based biomarkers construction At last, as 6 optimized modules include 290 genes in total, which are too large and loosely interconnected for further analysis, we carried out the overlapping analysis to find out the number of enriched genes shared by each optimized modules. We overlapped the six modules and selected the genes shared by at least two networks to construct the final network-based biomarker. Pathway enrichment analysis We performed pathway enrichment analysis in Ingenuity Pathway Analysis (IPA) and Kyoto Encyclopedia of Genes and Genomes (KEGG) [89]^40to provide functional insight into the identified network marker. The statistical significance of the enrichment was calculated using hypergeometric test and adjusted by FDR method (P-value < 0.05). Statistical significance assessment Hypergeometric test was used to test whether the network biomarkers are significantly enriched with leukemia-related genes. Known mutation genes related to cancers were obtained from the Catalogue of Somatic Mutations in Cancer (COSMIC), which is a cancer gene census [90]^41. 213 of the COSMIC genes are found in common with the GeneGO database. An empirical P-value was calculated to evaluate the statistical significance. P-value was obtained according to the following equation: graphic file with name jcav08p0278i001.jpg Where, N represents the number of genes in the leukemia-specific PPI network; M is the number of known leukemia related genes in COSMIC; n denotes the number of genes in the final network biomarkers; k represents the known leukemia related genes in the final network biomarkers. Performance evaluation We employed the receiver-operating characteristic (ROC) analysis to evaluate the prediction performance of the network biomarkers in distinguishing leukemia samples from the normal controls. The epicalc R package ([91]http://CRAN.R-project.org/package=epicalc) was used to produce the ROC curves. A 5-fold cross validation was performed on three gene expression dataset listed in Table [92]3. Normal samples were set as 0 and cancer samples were set as 1. The classification performance was represented as the area under curve (AUC). We also provided sensitivity, specificity and accuracy for the network biomarkers. Results and Discussion Sub-network involved in leukemogenesis The leukemia-specific PPI network was reconstructed by integrating PPI from PINA and 1495 leukemia-associated genes from GeneGo. As a result, the leukemia-specific PPI network consists of 4136 interactions among 978 genes. As is described in Methods section, gene expression profiles of 6 independent GEO datasets were overlaid to the reconstructed leukemia-specific PPI network and 6 correspondent dataset-specific sub-networks were obtained (marked orderly as PPI_GSE9476_raw, PPI_GSE6691_raw, PPI_GSE9476_raw, PPI_GSE22529_raw, PPI_GSE23293_raw and PPI_GSE26725_raw in Figure [93]1). Greedy search was performed for 6 sub-networks respectively. After 3 iterations for [94]GSE5788, [95]GSE9476 and [96]GSE22529, 4 iterations for [97]GSE23293 and [98]GSE26725, 5 iterations for [99]GSE6691, finally we obtained 6 active modules with a locally maximal Z score by jActiveModules (marked orderly as PPI_GSE9476_TR, PPI_GSE6691_ TR, PPI_GSE9476_TR, PPI_GSE22529_TR, PPI_GSE23293_TR and PPI_GSE26725_TR in Figure [100]1). The number of nodes and edges in each module is summarized in Table [101]4. Table 4. Detailed information of the active modules. PPI_GSE5788_TR PPI_GSE6691_ TR PPI_GSE9476_ TR PPI_GSE22529_ TR PPI_GSE23293_ TR PPI_GSE23293_ TR Nodes 77 71 75 73 71 75 Edges 205 186 166 193 126 188 [102]Open in a new tab After overlap analysis, a total of 97 genes along with their interactions were incorporated into the final network-based biomarkers, as illustrated in Figure [103]2. Genes with previous evidence in leukemia are marked yellow. Figure 2. [104]Figure 2 [105]Open in a new tab The final network-based biomarker for leukemia. The known cancer related genes in final network are marked yellow. Functional analysis of candidate network biomarkers The network biomarkers were most enriched for molecular mechanisms of cancer (IPA) and pathways in cancer (KEGG). Leukemia-specific pathways such as Chronic Myeloid Leukemia (KEGG) and Acute Myeloid Leukemia Signaling (both IPA and KEGG) were also enriched and showed high statistical significance. It indicates that genes in the biomarker network are closely associated with the development of different types of leukemia. Besides, in He's study, P13K/AKT Signaling (IPA) was also proved to be involved in chronic myeloid leukemia[106]^42. Irwin et al. found that ErbB inhibitors played important roles in Philadelphia chromosome-positive acute lymphoblastic leukemia (Ph(+)ALL) and ErbB signaling (KEGG) was a complementary molecular target in Ph(+)ALL [107]^43. The top-ranked pathways in both IPA and KEGG displayed apparent correlation between leukemia and the network biomarkers, which implied the potential accuracy of our result. Figure [108]3 shows the top 10 most significantly enriched IPA and KEGG pathway respectively. Figure 3. [109]Figure 3 [110]Open in a new tab IPA and KEGG pathway enrichment analysis for network biomarkers. The top 10 most significantly enriched IPA and KEGG pathway are shown in panel (A) and (B) respectively. We used the Database for Annotation, Visualization, and Integrated Discovery (DAVID) [111]^44 for the Gene ontology (GO) annotation in three domains: molecular function, biological process, and cellular component. The top 10 most significantly enriched items for each domain are shown in Figure [112]4. These results indicate that genes in the network are closely associated with the biological processes in the development of different types of leukemia, such as cell death [113]^45 and apoptosis [114]^46. This indicated the accuracy of the predicted network biomarkers to a certain extent. Figure 4. [115]Figure 4 [116]Open in a new tab Gene ontology annotation for the network biomarkers. The network biomarkers identified by our method were annotated with DAVID tools at three levels of gene ontology: Molecular Function, Biological Process, and Cellular Component. The top 10 most significantly enriched items for each level are shown. Network biomarkers are significantly associated with leukemia We further investigated whether the genes in the network biomarkers were randomly obtained. The statistical significance was checked using hypergeometric test and a significant p-value of 0.008987933 was obtained. This indicates that the candidate network biomarkers are enriched with known leukemia-related genes and could not be obtained randomly. As illustrated in Figure [117]5(A), the blue circle represents the 978 genes in the leukemia-specific PPI network; the red circle includes the 522 known leukemia-related genes in COSMIC. The leukemia-specific PPI contains 195 known leukemia-related genes in COSMIC. The purple circle represents 97 genes in final network biomarkers, among which 29 genes belong to the known leukemia-related category. Figure 5. [118]Figure 5 [119]Open in a new tab Validation of the network biomarkers. (A) Distribution of the leukemia-associated genes in the network. (B-D) ROC curves obtained with the network biomarkers tested by three gene expression datasets. Panel (B), (C) and (D) represent respectively the results of the gene expression datasets in series of [120]GSE8835, [121]GSE24739 and [122]GSE39411. Sub-network marker with higher classification accuracy To evaluate the performance of network biomarkers in classifying leukemia and normal gene expression profiles, we used three independent gene expression datasets listed in Table [123]3 as tested datasets to produce the ROC curves. We compared the network biomarker with three reported gene biomarkers: CD38[124]^47, BCL2 [125]^48 and IGFBP7 [126]^49. The reasons we chose these three markers for comparison are as follows, 1) these biomarkers are all well-studied and all of them have been validated by clinical experiments. 2) The marker CD38 is a member of our network whereas the remaining two are not. We included two others for fair evaluating the performance of our network biomarker. Figure [127]5 shows the ROC curves for network biomarkers and 3 known biomarkers. Network-based biomarker has higher AUC than any of the single markers which means network-based biomarker could more effectively discriminate the leukemia from the normal controls. The sensitivity, specificity and accuracy of each dataset are given in Table [128]5. Table 5. Detailed information of ROC curves. Series Biomarker Sensitivity Precision Specificity Accuracy AUC [129]GSE8835 CD38 0.913 0.700 0.212 0.658 0.629 BCL2 0.885 0.650 0.166 0.623 0.604 IGFBP7 0.965 0.665 0.148 0.668 0.584 Network biomarkers 0.851 0.686 0.316 0.657 0.698 [130]GSE24739 CD38 0.893 0.662 0.088 0.625 0.431 BCL2 0.943 0.654 0.004 0.630 0.237 IGFBP7 0.938 0.652 0.001 0.625 0.197 Network biomarkers 0.874 0.986 0.976 0.908 0.966 [131]GSE39411 CD38 0.999 0.725 0.177 0.740 0.981 BCL2 0.915 0.931 0.853 0.895 0.966 IGFBP7 0.886 0.797 0.513 0.768 0.880 Network biomarkers 0.996 0.999 0.998 0.997 0.999 [132]Open in a new tab It is worth noting that for network biomarkers from [133]GSE8835 has a relatively lower AUC than the other two datasets. This may be caused by the difference of platform and method between [134]GSE8835 and the training datasets. Anyhow, the accuracy of the network biomarkers is still higher than other three single biomarkers. The result indicates that the putative network biomarkers could diagnose leukemia samples more accurately and could be used as putative biomarker to aid in early diagnosis of leukemia. Conclusions In conclusion, we developed a network approach for molecular investigation and diagnosis of leukemia. The constructed network biomarkers not only achieve higher accuracy rate of diagnosis compared to known single biomarkers but also provide systematic insights into the leukemogenesis process. We noticed that we only considered the combination of genes (or the nodes) in the network for the prediction of leukemia. The interactions among genes can also provide valuable biological signatures for diagnosis of diseases. We will take the edge-variation in the network into the account for the further improving of the leukemia prediction. Availability of Data and Materials PINA: [135]http://cbg.garvan.unsw.edu.au/pina/interactome.stat.do. GeneGo: [136]https://portal.genego.com/cgi/data_manager.cgi. [137]GSE9476: [138]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE9476. [139]GSE6691: [140]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6691. [141]GSE5788: [142]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5578. [143]GSE22529: [144]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22529. [145]GSE26725: [146]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE26725. [147]GSE23293: [148]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE23293. [149]GSE8835: [150]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE8835. [151]GSE24739: [152]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE24739. [153]GSE39411: [154]http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE39411. COSMIC: [155]http://cancer.sanger.ac.uk/cosmic/download IPA: [156]http://www.ingenuity.com/products/login DAVID: [157]https://david.ncifcrf.gov/. Acknowledgments