Abstract A disease molecular signature is a set of biomolecular features that are prognostic of clinical phenotypes and indicative of underlying pathology. It is of great importance to develop computational approaches for finding more relevant molecular signatures. Based upon the hypothesis that various components in a molecular signature are more likely to share similar patterns, we introduced a novel three step network based approach (TSNBA) to identify the molecular signature and key pathological regulators. Protein-protein interaction (PPI) network and ranking algorithm were integrated in the first step to find pathology related proteins with high accuracy. It was followed by the second step to further screen with co-expression patterns for better pathology enrichment. Context likelihood of relatedness (CLR) algorithm was used in the third step to infer gene regulatory networks and identify key transcription regulators. We applied this approach to study IL-1 (interleukin-1) and TNF-alpha (tumor necrosis factor-alpha) stimulated inflammation. TSNBA identified inflammatory signature with high accuracy and outperformed 5 competing methods namely fold change, degree, interconnectivity, neighborhood score and network propagation based approaches. The best molecular signature, with 80% (40/50) confirmed inflammatory genes, was used to predict inflammation related genes. As a result, 8 out of 10 predicted inflammation genes that were not included in the benchmark Entrez Gene database were validated by literature evidence. Furthermore, 23 of the 32 predicted inflammation regulators were validated by literature evidence. The rest 9 were also validated with TF (transcription factor) binding site analysis. In conclusion, we developed an efficient strategy for disease molecular signature finding and key pathological regulator identification. Introduction Molecular signature is defined as a set of biomolecular features that can be used as markers for a particular phenotype and underlying condition-related biological mechanisms. They can be a set of genes, proteins, metabolites, genetic variants and microRNAs. Molecular signatures have been derived and applied for various purposes [29][1], [30][2] including disease diagnosis and risk assessment [31][3]–[32][7], prediction of physiological toxicity [33][8], [34][9] and response to therapeutic drugs [35][10], [36][11]. In addition, molecular signatures are also indicative of underlying molecular pathology and have been used for investigating disease progression [37][12], [38][13] and discovering the underlying mechanisms [39][14], [40][15]. Molecular signature can be obtained via a variety of approaches. Dimension reduction techniques [41][16], [42][17], differential expression analysis [43][18], and prioritization approaches [44][19], [45][20] are commonly used for this purpose. However, signature components obtained from principal component analysis (PCA) and partial least squares (PLS) are often difficult for interpretation. In addition, reproducibility and accuracy are still two challenges for current methods. “Omics” technologies have produced a lot of high throughput data, which provides tremendously rich information to discover molecular signature for better understanding diseases. In addition, diverse types of data can be integrated in network based approaches, which advantageously incorporate complex interactions and rich disease information. Methods integrating multiple data sets, multiple data types with network-based approaches have been shown to find accurate and robust molecular signatures [46][1]. Another major challenge still exists regarding the lack of robustness for the algorithms with overly optimistic result for certain data sets and poor performance on other data sets. Different stimulations may lead to similar clinical phenotype by perturbing very different underlying molecular mechanisms. Therefore, it is important to improve current discovery process for identifying perturbation responsive signatures. Moreover, considering the experimental validation of a signature, it is more important to reduce hundreds of signature proteins/genes to a refined and manageable number of key regulators. Therefore, it is useful to develop an approach for accurate molecular signature and pathological regulators discovery at the same time. It is well recognized that interacted genes or proteins are likely involved in the same or similar biochemical process [47][21], [48][22]. Therefore, similar expression patterns are more likely to be shared by components involved in the same molecular signature for a specific pathological process. Based on this understanding, a new approach was developed for finding disease molecular signature and key regulators by integrating PPI network, gene co-expression network and context likelihood of relatedness (CLR) algorithm as shown in [49]Figure 1. In the first step, TSNBA (three step network based approach) uniquely combined gene expression data with PPI network to find pathology related proteins through a novel ranking algorithm incorporating perturbation responsive gene expression data. In the second step, the top ranking genes were further screened with co-expression network for a more enriched signature finding. Finally, CLR algorithm was used for inferring gene regulatory networks, followed by identification of key regulators based upon three screening criteria. The performance of TSNBA was tested on IL-1 (interleukin-1) and TNF-alpha (tumor necrosis factor-alpha) stimulated inflammation. As a result, TSNBA outperformed 5 competing methods namely fold change, degree, interconnectivity, neighborhood score and network propagation based approaches. The predicted pathological regulators were validated with literature evidence and provided potential new insights into the underlying molecular mechanisms of inflammation. Figure 1. Framework of TSNBA. PPI network and gene expression data are integrated in the interaction activity matrix to rank genes for their relevancy to the perturbation. [50]Figure 1 [51]Open in a new tab The top K ranking genes are further filtered with co-expression network for better pathology enrichment. Context likelihood of relatedness (CLR) algorithm is used in the third step to infer gene regulatory networks and identify key transcription regulators. Node in gray represents for known pathology related genes, white represents for predicted ones, and black represents for predicted key regulators. Materials and Methods Data Preparation PPI data The PPI data was derived from HPRD [52][23] (Release 9) with self-interactions removed. TNF-alpha and IL-1 stimulated inflammation data The former, [53]GSE2639 [54][24], contains the gene expression profiles of 4 TNF-stimulated samples and 4 normal samples. Stimulated samples were treated with 2 ng/ml TNF for 5 hours, and normal samples were left-untreated. The latter, [55]GSE973 [56][25], contains 4 IL-1 stimulated samples and 1 normal sample. Stimulated samples were treated with 100 U/ml human IL-1 for 0, 0.5, 1, 2.5 and 6 hours, and the normal sample was untreated. Two groups of data sets were both obtained by using human U133A GeneChips (Affymetrix, Santa Clara, CA) from human umbilical vein endothelial cells (HUVEC). Raw data was stored in ArrayTrack 3.5.0 [57][26]. MAS 5 [58][27], which has been suggested to be the best normalization procedure to reconstruct cellular network [59][28], was used for normalization. Expression data was summarized to the gene level by averaging all probes mapped to the same gene. Only the genes included in the PPI network were selected for further analysis. PPI data Benchmark human inflammatory genes The data was collected from Entrez Gene database [60][29]. We queried “((“inflammatory” OR “inflammation”) AND “[Homo sapiens (human)]”)” and found 2210 related genes as of 5th September 2013. Only 1462 genes were involved in PPI network and those were used for further analysis. Human TF (transcription factor) data Human transcription factor data was derived from AnimalTFDB [61][30], and only TFs involved in PPI network were considered for further analysis. The First Step: Gene ranking with PPI interaction network Generation of an interaction activity matrix An adjacency matrix Adj was constructed for PPI network. The Adj(i,j) = 1 when node i and j interact with each other and Adj(i,j) = 0 otherwise. The activity of each interaction was computed by a weighting function [62][31], [63][32]: graphic file with name pone.0094360.e001.jpg (1) Here, f[k] is the log2-fold change value of the gene k. The shape of the multivariate logistic distribution is controlled by parameters C and K (C = 1 and K = 5 by default), and the shifting parameter T (0.5 by default) is added to produce zero when f[i] and f[j] are both zeros. The weighting function includes two multivariate logistic functions, with the first term capturing co-activation of genes and the second term capturing co-suppression. These activities of interactions replace “1” in Adj and generate an interaction activity matrix. Gene ranking The influence of each node (P[i]) is the sum of the influence it receives from each interacted nodes. graphic file with name pone.0094360.e002.jpg (2) Where N is the number of nodes in the interaction activity matrix, and A[it] is the interaction activity of node t with node i. The final ranked list was obtained according to descending order of P[i]. The Second Step: Filtering with co-expression network for better enrichment Gene co-expression has been widely used for finding co-regulated genes [64][33]–[65][36] and co-regulatory relationships [66][37]–[67][41]. Here, co-expression analysis was applied to the top ranking genes obtained from the first step to further screen for better pathology enrichment. Pearson correlation coefficient p was computed for each pair of genes: graphic file with name pone.0094360.e003.jpg (3) with w[i] as the vector containing expression values of gene i under all conditions. Correlation coefficients below a certain threshold will be filtered out, and eligible connections are included in the final co-expression network and deemed as the disease molecular signature. The Third Step: Searching for key regulators Ranking putative interactions by CLR A lot of approaches have been developed to identify regulatory networks, such as CLR [68][42], Bayesian network [69][43] and ARACNe [70][44]. As an extension of relevance networks approach [71][45], [72][46], CLR detects regulatory interactions via important mutual information (MI). An adaptive background correction step is further applied to eliminate false correlations and indirect influences. The statistical likelihood of MI value for each gene is computed within its network context, and the MI value for each TF-target pair is compared to the context likelihood of both the TF and the target gene, followed by z-score normalization. graphic file with name pone.0094360.e004.jpg (4) graphic file with name pone.0094360.e005.jpg (5) graphic file with name pone.0094360.e006.jpg (6) graphic file with name pone.0094360.e007.jpg (7) graphic file with name pone.0094360.e008.jpg (8) Where C is the correlation coefficient matrix for top K genes of ranking list. C[diag] is the diagonal matrix of C. MI[K*M] is derived from MI[K*K] for M genes coding for transcription factors in top K genes. Finally, putative regulatory interactions are then ranked by decreasing z-score. Identifying key transcription factors of inflammation Key regulators should be closely associated with the pathology and play pivotal roles in the regulatory network. In the present study, three screening criteria were used for searching key inflammatory regulators. All interactions with z-score less than 3 are filtered out, the number of targets from Entrez Gene inflammation signature must be greater than 3, and the percentage of inflammatory gens in all targets should be more than 60%. Competing approaches Fold change based approach Gene expression change quantifies the response to a pathological condition for a gene. Therefore, differentially expressed genes using a fold change cutoff has been used very often as the signature under the condition. In this study, absolute fold change cutoff was set to be 1.2, 1.5, 2.0, 3.0 and 4.0, eligible genes were collected in corresponding sets, namely “FC_1.2”, “FC_1.5”, “FC_2”, “FC_3”, and “FC_4”. Moreover, top 50 and 300 genes with biggest absolute fold change were collected and corresponding enrichment ratios were calculated. Degree based approach Degree is an important topological parameter, e.g. hub genes are the genes with highest degrees. Deletion of these genes has been shown leading to more severe phenotypic outcomes than other genes [73][47]. Although the importance of hub genes are still in debate, they have been used widely as a measure of biological importance. Moreover, our first step ranks genes via the weight calculated by [74]equation (2), which may inadvertently capture the degree information. Thus, it is necessary to compare with degree based approach. According to the PPI network, genes were ranked by decreasing degrees. In the present study, degree cutoff was set to be 50, 100, 150, 200, 250 and 300 and corresponding enrichment ratios were calculated. Neighborhood Scoring Neighborhood scoring is a local measure for prioritizing candidates based on the expression of the gene itself and its direct neighbors in the network [75][48], we implemented the adapted method as described in Dorothea Emig's work [76][49]. Genes were ranked by their scores, which were calculated as follows: graphic file with name pone.0094360.e009.jpg (9) Fold change (FC) of gene i and average fold change of its neighbors N equally contribute to the score, where N(i) includes all neighboring genes of i. To note, score 0 is assigned to genes that are neither differentially expressed or have any differently expressed genes in the direct neighborhood. Interconnectivity Interconnectivity is also a local measurement for prioritizing candidates, which is based on genes' overall connectivity to differentially expressed genes [77][50]. An adapted method [78][49] is used in the present study. First, interconnectivity scores for interactors of differentially expressed genes are calculated based on their direct interactions and their shared neighborhood as follows: graphic file with name pone.0094360.e010.jpg (10) e(i,j) describes whether an edge exists between gene i and j, 1 represents for edge exists and 0 otherwise. Both direct interaction and shared neighborhood N are taken into account, which are then normalized by the overall degrees of the two genes. Then, final score of each gene is based on the interconnectivity to all differentially expressed genes: graphic file with name pone.0094360.e011.jpg (11) where DEG is the set of all differentially expressed genes and d represents one differentially expressed gene. Network Propagation Different from interconnectivity and neighborhood score, network propagation is a global method, which takes the complete network topology into account for prioritizing candidates [79][49], [80][51]. First, differentially expressed genes in the network are assigned to a score of 1, while the remaining 0. These scores represent the prior information on genes for disease and are regarded as the starting propagating flow. Then, the flow is further smoothed over the network in each iteration until a steady state is reached. Finally, each gene receives its final score according to the final flow and is ranked in the whole gene list. In each iteration, the flow for the genes is updated as follows: graphic file with name pone.0094360.e012.jpg (12) F^t is a vector containing the flow for each gene at time point t. α is diffusion parameter. A is the adjacency matrix of the network, where each entry is normalized by the degrees of the source gene and target gene. The normalization compensates for the fact that high degree genes have a higher chance of picking up flow by chance and are thus ranking higher in the prioritization. F^0 represents the starting propagating flow. The steady state is reached when the L[1] norm of the difference between F^t and F^t-1 is below 10^−6. Enrichment ratio and statistical test for inflammation Enrichment ratio and statistical test Enrichment ratio in the present work was defined as the percentage of inflammatory genes in top K genes of ranking list overlapping with the benchmark Entrez Gene inflammation set. K started from 50 with an increment of 50 at each step. For each ratio, a hypergeometric test was used to evaluate the enrichment of inflammatory genes in each selected list, and p-value was obtained. Enrichment ratio and statistical test for signature genes by TSNBA Top K genes were chosen from the ranked list for screening signature proteins for better pathology enrichment. Different thresholds of correlation coefficient were tested, ranging from 0.6 to 0.945 with an increment of 0.005. For each threshold, genes involved in the co-regulatory relationships were selected to calculate enrichment ratio, and hypergeometric test was used for enrichment analysis. Results TSNBA identified better inflammation enriched signature The final PPI network used in this study consisted of 7633 genes (nodes) and 30995 interactions. 1469 human TFs derived from AnimalTFDB database and 1462 inflammatory genes extracted from Entrez Gene database were included in the network. The background ratio for inflammatory gene was 19.2% (1462/7633). Endothelial cells play critical roles during the inflammatory response [81][25], and TNF-alpha [82][52] and IL-1 [83][53] are well known important mediators of the process. Therefore, gene expression data collected from TNF-alpha and IL-1 stimulated HUVEC were used in the present study. For IL-1 stimulated inflammation, HUVEC were treated with IL-1 for 0, 0.5, 1, 2.5 and 6 hours, 4 sets of data were constructed to calculate fold change of genes, namely “IL1_0.5h”, “IL1_1h”, “IL1_2.5h” and “IL1_6h”. In TNF-alpha stimulated inflammation, HUVEC were left untreated or stimulated for 5 h with TNF-alpha, and both were repeated for 4 times, thus constructing another 4 sets of data, namely “TNF1”, “TNF2”, “TNF3” and “TNF4”. A total of 8 sets of genes were used for following analysis. The ranked gene list was first obtained for each data set. The enrichment ratio and p-value were calculated by comparing top K ranking genes to the benchmark Entrez inflammation gene set. As shown in [84]Figure 2A, the enrichment ratio (p-value<0.0001) decreased with the increase of K, indicating higher probability of finding inflammation gene in the higher ranked genes. The highest enriched ratio was 0.72 in top 50 ranking genes from IL1_6h data set. Figure 2. Enrichment analysis of inflammation signature by TSNBA. [85]Figure 2 [86]Open in a new tab (A): Enrichment ratio of top K genes for different data sets. IL1_0.5h, IL1_1h, IL1_2.5h, and IL1_6h represent for the perturbation with 0.5, 1.0, 2.5, and 6.0 hours IL-1 stimulation, respectively. TNF1-4 represent for the perturbation with 5 hours TNF stimulation. (B) Comparison of ranking algorithm with TSNBA, rank_50 represents for top 50 ranking genes. As shown in [87]Figure 2A, there were some fluctuations when the number of selected genes ranged from 50 to 250. Taking “TNF1” data set for an example, enrichment ratio of top 100 ranking genes was higher than that of top 50, and enrichment ratio of top 300 was higher than that of top 250. These fluctuations were largely due to the small size of selected genes. Moreover, top 300 ranking genes got better pathology enrichment (which will be detailed in discussion). Given these results, K, the number of top ranking genes, was set to be 300. These genes were further used to construct co-expression networks. When the correlation threshold ranged from 0.6 to 0.84, little change in enrichment ratio was observed (see details in [88]Figure S1 in File S1). However, when threshold was larger than 0.84, enrichment ratio increased. We then adjusted threshold to constrain the number of genes in co-expression network to about 50, detailed information was shown in [89]Table 1. Except for “TNF1” data set, TSNBA significantly enriched inflammatory signature further from the top 300 ranking genes. Furthermore, we compared enrichment ratio of TSNBA with that of top 50 ranking genes from step 1. TSNBA outperformed ranking algorithm on 7 out of 8 (87.5%) cases for identifying inflammation enriched signature (as shown in [90]Figure 2B). Table 1. Inflammation enrichment by TSNBA. Data set Size of signature Enrichment ratio Threshold P-value IL1_0.5h 52 0.65 0.91 0.006131 IL1_1h 56 0.73 0.93 0.000115 IL1_2.5h 50 0.80 0.92 2.59E-05 IL1_6h 50 0.74 0.93 0.001221 TNF1 55 0.49 0.94 0.220595 TNF2 52 0.77 0.935 5.80E-05 TNF3 51 0.71 0.935 0.014408 TNF4 54 0.69 0.94 0.000223 [91]Open in a new tab TSNBA predicted new inflammation related signature genes The number of genes in co-expression network was set to be about 50, but no less than 50. Under such a restricted condition, the highest enrichment ratio was found in “IL1_2.5h” data set with the threshold of 0.92 ([92]Figure 2B). In the constructed co-expression network, 80% (40/50, p-value<0.0001) genes were confirmed by Entrez Gene database to be human inflammatory genes. Hence, the rest 10 unconfirmed genes were predicted to be inflammatory genes. According to evidence collected from literature, 8 out of the 10 genes were reported to be associated with inflammation (see [93]Table 2 with more details in [94]Table S1 in File S1). Therefore, TSNBA was able to accurately predict pathology related genes by integrating gene expression and PPI network. Table 2. Literature evaluation of predicted inflammation genes. Potential inflammatory gene Whether inflammation related gene Pubmed BCL3 √ 19270711 CALCOCO2 √ 23820297 CD22 √ 22806142 DLG3 Not known ERBB3 √ 22157714 MAGI1 √ 22806142 POU1F1 Not known SMURF2 √ 22843012 SSTR2 √ 15806094 TANK √ 16698233 [95]Open in a new tab “√” represents for “Yes”. TSNBA predicted key inflammatory regulators Starting from the top 300 ranking genes, CLR was used to infer regulatory relationships. Potential regulators were predicted for each set and a final union set was obtained. A total of 32 transcription factors were predicted as potential regulators of inflammation. Among them, 21 regulators were included in inflammatory signature from Entrez Gene database and all of them were validated by literature to be important regulators of inflammation, e.g. nuclear factor kappa B (NFκB) a known master regulator of inflammation. In the rest 11 predicted regulators, serum response factor (SRF) was reported to regulate type I interferon-signaling in macrophage, thus is suggested as important regulator for regulating innate immunity [96][54]. Moreover, SRF was proposed to be required in acrolein activation of NFκB [97][55]. Macrophage (retinoid X receptor alpha) RXRA could upregulate the expression of chemokines, such as CCL6 and CCL9, and control innate inflammatory responses [98][56]. Therefore, RXRA and SRF may play important role in the regulation of inflammation. In summary, a total of 71.9% (23/32) regulators predicted by TSNBA were validated by literature as inflammatory regulators (see [99]Table 3 and details in [100]Table S2 in File S1). Table 3. Literature evaluation of predicted inflammatory regulator. Potential Key Regulator Whether inflammatory regulator Whether in benchmark Entrez Gene inflammation set PubMed ATF3 √ √ 18794337 BCL6 √ √ 22465074 CEBPB √ √ 22074460 EGR1 √ √ 11100120 ESR2 √ √ 20045727 FOS √ √ 19995753 HES1 √ √ 20832754 JUNB √ √ 19933155 JUND √ √ 19933155 NFKB1 √ √ 18927578 NFKB2 √ √ 18927578 NR2C2 √ √ 16675448 POU2F1 √ √ 21059098 RELA √ √ 12509469 SMAD2 √ √ 20667820 SMAD7 √ √ 19352540 STAT5A √ √ 15749913 TP53 √ √ 21779518 TP73 √ √ 10716451 VDR √ √ 17224129 VTN √ √ 17982099 RXRA √ 20498053 SRF √ 23705899; 23893683 DLX5 Not Known GTF2I Not Known HEY1 Not Known MSX1 Not Known NFE2 Not Known NR5A1 Not Known PITX1 Not Known PRRX1 Not Known RARA Not Known [101]Open in a new tab “√” represents for “Yes”. In addition to the validation from the literature evidence, bioinformatics approaches were also taken to explore the relationship between predicted TFs and inflammation. Potential TF binding sites within promoter regions of the 32 TFs were assessed via TRANSFAC component of GATHER [102][57]. The result showed strong evidence that nuclear factor kappa B (NFκB) binding motifs were contained in the proximal promoter regions of all genes (see details in [103]Table S4 in File S1). Given the critical role of NFκB in regulating inflammation, it is very likely that these regulators are all involved in the process of inflammation. On the other hand, TFactS database [104][58] was used to find target genes for the rest TFs that were not validated by literature. As a result, 5 of the 6 TFs included in TFactS were reported to target inflammatory genes included in benchmark Entrez Gene database (see details in [105]Table S5 in File S1). Furthermore, these TFs were all shown to interact with confirmed inflammatory proteins via PPIs collected from STRING database (Release 9.1) (see details in [106]Table S6 in File S1) [107][59]. In summary, both literature and bioinformatics analysis suggested that our predicted TFs were closely related to inflammation, thus indicating the applicability of TSNBA for finding key inflammatory regulators. Methods comparison Statistical significance test was performed on the methods comparison part. According to the working flow of our approach, paired t-test was performed to evaluate statistical significance for results of the first and second step, and enrichment ratios of different sets via different methods were used for comparison. In the first step, top 300 ranking genes were extracted and enrichment ratio was calculated for each data set. As shown in [108]Figure 3A, the first step of TSNBA outperformed another 4 methods (see details in [109]Table S8 in File S1). Figure 3. Statistical significance test for methods comparison. [110]Figure 3 [111]Open in a new tab Fold change based approach (red), network propagation (green), interconnectivity (purple), and neighborhood scoring (cyan) are compared with first step of TSNBA (A) and full TSNBA (B). In order to perform statistical significance test for the second step of TSNBA, molecular signatures needed to be identified by given methods. When molecular signature was identified by fold change based approach, enrichment ratios were calculated for different data sets under different fold change cutoff. As shown in [112]Figure 4, enrichment ratios increased with increasing cutoff, which indicated that genes with large fold change were more likely to be included in the pathological process. Given this, top 50 ranking genes with highest absolute fold change was set to be molecular signature. As for degree based approach, enrichment ratios for different degree cutoffs were studied. The highest enrichment ratio was obtained under top 50 genes (as shown in [113]Figure 5). Therefore, these 50 genes were identified as the molecular signature. Given results from degree based and fold change based approach, absolute fold change was set to be 4 to identify differentially expressed genes. For the convenience of comparison, the size of signature was set to be 50 for 3 candidate gene prioritization methods. Moreover, diffusion parameter α was investigated (see [114]Table S7 in File S1), and finally set to be 0.1 for best performance. As shown in [115]Figure 3B and [116]Figure 6, TSNBA outperformed fold change based approach, interconnectivity, neighborhood scoring and network propagation in all data sets (see details in [117]Table S9 in File S1). Besides, our approach also outperformed degree based approach in 7 data sets (except for “TNF1” data set). Figure 4. Performance of fold change based approach. [118]Figure 4 [119]Open in a new tab Enrichment ratios are calculated under 5 absolute fold change cutoffs, namely 1.2, 1.5, 2.0, 3.0 and 4.0. Different colors represent for different data sets. Figure 5. Performance of network degree based approach. [120]Figure 5 [121]Open in a new tab Enrichment ratios are calculated for top K genes. The size of genes, K, is set to be 50, 100, 150, 200, 250 and 300. Figure 6. Methods comparison. TSNBA (blue), fold change based approach (red), interconnectivity (purple), neighborhood scoring (cyan) and network propagation (green) are tested on different data sets. [122]Figure 6 [123]Open in a new tab To note, the enrichment ratio for degree based approach only depend on the selected network. Discussion Proteins play a central role in activities of living cells and they are interconnected in PPI network. Systematic PPI network exploration could lead to a better understanding of protein function and biological processes [124][60]. Combining PPI network with gene expression profiles provides two advantages. First, microarray gene expression enables genome wide screening of all genes at once and thus guarantees a comprehensive coverage. Second, gene expression data captures perturbation responsive information and perturbation related PPIs are more likely to be activated. Our ranking algorithm takes both aspects into consideration and could find more relevant gene/protein sets. It has long been known that genes involved in the same process often share similar expression patterns. It is thus the motivation of the second step to further reveal the underlying biological process by constructing co-expression network. Co-expression network was used as an integrative filter to find functionally related signature genes. In addition, CLR provided more detailed regulatory information beyond co-expression and identified key regulators of the pathological process. In summary, TSNBA utilized PPI network, co-expression network and regulatory network to progressively find pathology relevant signature genes and regulators. TSNBA is able to find perturbation responsive molecular signatures. Due to the differences of experimental settings and biological samples, signatures may be different even for the same perturbation. Taking TNF-alpha case for an example, we identified 4 different molecular signatures. Wikipathway enrichment analysis was carried out for each signature by WebGestalt [125][61], [126][62]. “hsapiens_entrezgene_protein-coding” was set as the reference set and hypergeometric p-value was calculated. As a result, TNF-alpha signaling pathway was significantly enriched for each signature with adjusted p-value less than 0.01 (see details in [127]Table S3 in File S1). Similarly, IL-1 signaling pathway was also significantly enriched for signatures derived from IL-1 stimulated data sets with adjusted p-value less than 0.05 (see details in [128]Table S3 in File S1). These results provided a good support for the ability of TSNBA in finding perturbation responsive signatures. Moreover, many genes were shared by signatures of the same stimulation. Nine genes (NFKBIA, CALCOO2, PLAU, TNFAIP3, SQSTM1, EGFR, BCL3, BIRC3, BMP2) were shared by 4 TNF-alpha signatures and 13 genes (NFKBIA, JUNB, FASLG, NFKB2, PLAU, CCL5, ATF3, BCL6, CTNNB1, PRTN3, BCL2A1, BMP2, ERBB3) by 4 IL-1 signatures. As shown in [129]Figure 5, it should be noted that enrichment ratio generally decreased with the size of top ranking genes, which suggested that degree information was associated with enrichment. However, ranking algorithm, which guaranteed the improvement in enrichment for our approach, inadvertently captured degree information (see the method). Is improvement in enrichment largely attributed to degree information and do these ranking lists worth further study? Therefore, in order to further evaluate our algorithm's ability in finding perturbation responsive signature, wiki pathway enrichment analysis was performed for top 300 ranking genes by degree based approach and our approach, respectively. As shown in [130]Table 4, our algorithm outperformed degree based approach in all data sets by ranking perturbation responsive pathway higher. Moreover, in “TNF1” and “TNF2” data sets, our approach even ranked the “TNF alpha Signaling Pathway” the number one, which was a strong indication of the reliability of our approach. Besides, top 300 highest degree genes were compared with molecular signatures identified by TSNBA, less than 33% of genes were overlapped, suggesting that network degree was not the major contributor of the high enrichment of TSNBA. Table 4. Comparison of TSNBA and Degree based approach for top 300 ranking genes. Data set Pathway Name Rank (Selected data set vs Degree) IL1_0.5h IL-1 signaling pathway 27 vs 43 IL1_1h IL-1 signaling pathway 9 vs 43 IL1_2.5h IL-1 signaling pathway 34 vs 43 IL1_6h IL-1 signaling pathway 18 vs 43 TNF1 TNF alpha Signaling Pathway 1 vs 18 TNF2 TNF alpha Signaling Pathway 1 vs 18 TNF3 TNF alpha Signaling Pathway 6 vs 18 TNF4 TNF alpha Signaling Pathway 14 vs 18 [131]Open in a new tab An inspection of the results presented in [132]Figure 2B and [133]Figure 6 showed that degree and fold change methods showed the results on TNF2, TNF3 and TNF4 to be similar with TNF1 being particularly less. TSNBA also showed the lowest enrichment result on TNF1 data set. Our approach took the fold change into consideration in the first step, and fold change was an important factor to rank genes. Besides, as shown in [134]Figure 4, fold change based approach also performed worst in TNF1 data set, which indicated that the worst performance in “TNF1” data set was likely due to the data itself. In addition, we took union set of signatures of different conditions by fold change based approach, interconnectivity, neighborhood scoring, network propagation, and TSNBA. The size of these sets was 242, 247, 233, 265 and 192 (see details in [135]Table S10 in File S1). This indicated that more genes were overlapped in signatures determined by TSNBA, and our method was more likely to find reproducible signatures. TNFRSF9 and TRAF1 were shared by all sets, and 19 genes (PLAU, ALOX12, ATF3, POU1F1, FOS, CSF1, CCL8, JUNB, TNFRSF11B, ICAM1, SELE, VCAM1, IL8, NFKBIA, RND1, TNFAIP3, BCL2A1, CSF2, and BIRC3) were shared by 4 sets except for the sets determined by interconnectivity, while 4 genes (OCM2, CLEC2D, MATN3, and IFIT3) were shared by other 4 sets but not the set of our approach. A recent report had pointed out the regulatory role of MATN3 in inducing the IL-1Ra and raised the possibility of recombinant human MATN3 protein in anti-inflammatory therapy [136][63]. Therefore, there were still some important inflammation related genes that may be missed by our approach, and common signature of several methods was worth of follow up investigation. Gene expression profile has been used widely to represent indirectly the protein activity. It has its limitations in quantifying actual protein abundance and incapable of reflecting many ‘switches’ in PPI behavior, such as ligand binding and posttranslational modification [137][64]. Therefore, other data types, such as protein and microRNA expression profiles should be integrated to further reveal these missing actions. At the same time, the network should be updated. Ever-increasing amount of PPIs shall continuously be incorporated into the network. In addition, many other interaction types, such as DNA-protein interaction, transcription factor-target interaction and microRNA-target interaction, shall also be included. It could be envisioned that a comprehensive network with biologically relevant profiles will lead us to more accurate disease molecular signature finding. Conclusion In this study, TSNBA was proposed to identify the molecular signature and key pathological regulators. In the case study of IL-1 and TNF-alpha stimulated inflammation, TSNBA identified inflammatory signature with high enrichment of pathology related genes and outperformed 5 methods in prioritizing candidates, including fold change based approach, degree based approach, interconnectivity, neighborhood score and network propagation. In conclusion, we developed an efficient strategy for disease molecular signature finding and key pathological regulator identification. Supporting Information File S1 Including the following: (1) Enrichment ratios under different thresholds in the second step (Figure S1); (2) literature evaluation for 10 predicted inflammatory genes (Table S1); (3) literature evaluation for 32 predicted regulators of inflammation (Table S2); (4) wikipathway enrichment analysis of molecular signatures via TSNBA (Table S3); (5) TF binding sites analysis of 32 predicted regulators of inflammation (Table S4); (6) transcription factor-inflammatory gene relationships from TFactS for 5 transcription factors, namely DLX5, MSX1, NR5A1, PRRX1, and RXRA (Table S5); (7) protein-protein interactions collected from STRING database for 9 transcription factors that are confirmed by literature to be inflammatory regulators (Table S6); (8) investigation for diffusion parameter α of network propagation (Table S7); (9) methods comparison for the first step of TSNBA (Table S8); (10) methods comparison for TSNBA (Table S9); (11) union sets of signatures from different methods. (XLSX) [138]Click here for additional data file.^ (101.5KB, xlsx) Funding Statement This work was financially supported by the National S&T Major Project (No. 2012ZX09503001) and the Natural Science Foundation of China (No. 81373893). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. References