Abstract Identifying differentially expressed (DE) genes between cancer and normal tissues is of basic importance for studying cancer mechanisms. However, current methods, such as the commonly used Significance Analysis of Microarrays (SAM), are biased to genes with low expression levels. Recently, we proposed an algorithm, named the pairwise difference (PD) algorithm, to identify highly expressed DE genes based on reproducibility evaluation of top-ranked expression differences between paired technical replicates of cells under two experimental conditions. In this study, we extended the application of the algorithm to the identification of DE genes between two types of tissue samples (biological replicates) based on several independent datasets or sub-datasets of a dataset, by constructing multiple paired average gene expression profiles for the two types of samples. Using multiple datasets for lung and esophageal cancers, we demonstrated that PD could identify many DE genes highly expressed in both cancer and normal tissues that tended to be missed by the commonly used SAM. These highly expressed DE genes, including many housekeeping genes, were significantly enriched in many conservative pathways, such as ribosome, proteasome, phagosome and TNF signaling pathways with important functional significances in oncogenesis. __________________________________________________________________ The high-throughput gene expression profiling technologies facilitate screening expression levels for thousands of genes simultaneously. One of the main objectives for analyzing gene expression profiles is to identify genes differentially expressed (DE) in cancer compared with normal control[40]^1. Many methods have been proposed to identify DE genes[41]^2,[42]^3,[43]^4,[44]^5 and a popular choice is Significance Analysis of Microarrays (SAM) based on t-test statistic[45]^6. It has been reported that the t-test is biased to genes with low expression levels[46]^3,[47]^6 because a gene with low expression level may have a large absolute t-statistic due to its small variance, even when its mean difference between two conditions is small[48]^7. SAM was proposed to correct this bias. However, due to logarithmic transformation of data in SAM, the differences of log-scaled expression levels between two conditions are actually the logarithms of their fold change (FC) ratios. Because genes with low expression levels are more likely to reach large FCs than genes with high expression levels, SAM is also biased to genes with low expression levels[49]^8. Compared with genes expressed at low levels, genes expressed at high levels are more likely to be involved in some functionally conserved pathways such as oxidative phosphorylation[50]^9, glutathione metabolism[51]^10,[52]^11,[53]^12 and proteasome[54]^13 with important biological significances. In a recent study[55]^8, we have proposed an algorithm, named the pairwise difference (PD) algorithm, to identify DE genes in small-scale cell line experiments, which typically measure only two or three technical replicates for each of two different experimental conditions, respectively. Briefly, by pairing technical replicates under two conditions, the algorithm identifies DE genes with top-ranked absolute expression differences between the two types of cells which are significantly reproducible in independent paired technical replicates[56]^8. Compared with SAM and other commonly used methods, PD can exclusively identify many DE genes with high expression levels in both two types of cells[57]^8. However, this algorithm cannot be used directly to identify DE genes between two types of tissue samples (e.g., cancer and normal control) because the biological replicates of each type of tissue may have large between-individual differences. In this study, in consideration that tissue samples are biological replicates with between-individual differences, we averaged the gene expression profiles separately for two types of samples in a dataset to construct a cancer-normal pair, and then applied PD to identify DE genes using multiple cancer-normal pairs separately constructed from several independent datasets or sub-datasets of a dataset. Using datasets for lung cancer and esophagus cancer, we demonstrated the applicability and power of this strategy in finding functionally important DE genes highly expressed in both cancer and normal tissues that tend to be missed by SAM. Results The applicability of the PD algorithm to multiple datasets Firstly, for each of the three datasets for lung cancer and normal samples (see [58]Table 1), we separately averaged the gene expression profiles for cancer and normal samples in each datasets to construct a paired average gene expression profiles, referred to as a cancer-normal pair. Then, for every cancer-normal pair, all genes were ranked according to their absolute average differences (AD) of expression levels between cancer and normal samples in descending order. As shown in [59]Fig. 1a, the consistency scores of the deregulation directions of the top n (n = 1000, 2000, 3000, 4000, 5000) genes between every two cancer-normal pairs were all higher than 91.8%, which were all significantly higher than what expected by chance (binomial test, all p < 2.2 × 10^−16) (see Methods for details). We did similar analyses in two datasets for esophagus cancer ([60]Table 1) and found that the consistency scores of the deregulation directions of the top n (n = 1000, 2000, 3000, 4000, 5000) genes between the two datasets were all higher than 96.42%, as shown in [61]Fig. 1b. These results suggested that the differential expression signals between every two independent cancer-normal pairs for a particular cancer were significantly reproducible. Table 1. Description of the datasets used in this study. Tissue GEO accession Platform Cancer Normal Reference Lung [62]GSE19188 [63]GPL570 91 65 Hou, J. et al.[64]^14 [65]GSE19804 60 60 Lu, T.P. et al.[66]^15 [67]GSE27262 25 25 Wei, T.Y. et al.[68]^16 Esophagus [69]GSE29001 [70]GPL571 21 24 Yan, W. et al.[71]^17 [72]GSE20347 17 17 Hu, N. et al.[73]^18 [74]Open in a new tab Figure 1. Consistency scores between two datasets for a cancer. Figure 1 [75]Open in a new tab The consistency scores between the top n (n = 1000, 2000, 3000, 4000, 5000) genes ranked by absolute average expression differences for every two cancer-normal pairs were evaluated in (a) three datasets for lung cancer ([76]GSE19188, [77]GSE19804 and [78]GSE27262). and (b) two datasets for esophagus cancer ([79]GSE20347 and [80]GSE29001). We further did a random experiment to show that the differential expression signals were irreproducible when there were no phenotype differences between two groups of samples. Using the [81]GSE19804 dataset with 60 lung cancer samples and 60 normal samples, we randomly permuted sample labels two times to produce two datasets of artificial “cancer” and “normal” samples, and then calculated the consistency score of the deregulation directions of the top 1000 genes sorted by the average expressions difference between the two artificial cancer-normal pairs. The random experiment was repeated 1000 times. As expected, the average of the 1000 consistency scores was 49.83% with 0.1954 of standard deviation. These results suggested that the differential expression signals were irreproducible when there were no phenotype differences between two groups of samples. Then, regarding every cancer-normal pair as an independent pair of technical replicates, we used the PD algorithm to identify reproducible DE genes between the lung cancer and normal control of three datasets. The two parameters of the algorithm, the initial step and the consistency threshold, were set as 300 and 95%, respectively, as suggested previously[82]^8. With the above two parameters, PD identified a list of 6,092 DE genes for lung cancer, and this list of DE genes was denoted as C3. In comparison, 10,865, 12,287 and 10,945 DE genes were identified by SAM with 5% FDR control in the [83]GSE19188, [84]GSE19804 and [85]GSE27262 datasets, respectively. The consistency scores of the overlapped DE genes between C3 and the DE genes identified by SAM in the three datasets were 99.83%, 100% and 100%, respectively ([86]Table 2). Similarly, PD identified 3,498 DE genes based on the two datasets of esophagus cancer, denoted as C2, and the consistency scores with DE genes identified by SAM in the two datasets were both 100% ([87]Table 2). Table 2. The consistency scores of the DE genes identified by both PD and SAM. Tissue Dataset PD SAM Overlap Consistency Consistency score P Lung [88]GSE19188 6092 10865 4744 4736 99.83% <2.2 × 10^−16 [89]GSE19804 12287 5488 5488 100.00% <2.2 × 10^−16 [90]GSE27262 10945 5524 5524 100.00% <2.2 × 10^−16 Esophagus [91]GSE20347 3498 6311 3057 3057 100.00% <2.2 × 10^−16 [92]GSE29001 5882 2785 2785 100.00% <2.2 × 10^−16 [93]Open in a new tab Note: Overlap, the number of the DE genes identified by both PD and SAM; Consistency, the number of the DE genes with the same deregulation directions (either up-regulation or down-regulation); P, the significance level of reproducibility. On the other hand, approximately 9.3–22.1% of the DE genes in C3 identified by PD were not identified by SAM. As shown in [94]Fig. 2, the average expression levels of the DE genes exclusively identified by PD were rather high in both cancer and normal samples of the three datasets, while the average expression levels of most DE genes exclusively identified by SAM were quite low in cancer and/or normal samples. Similar results were observed based on the two datasets for esophagus cancer ([95]Supplementary Figure S1). Thus, the PD algorithm can identify DE genes expressed highly in both cancer and normal tissues, which tend to be missed by SAM. Figure 2. The distributions of the average expression levels for DE genes identified exclusively by PD or SAM for lung cancer. [96]Figure 2 [97]Open in a new tab Red crosses represent the DE genes exclusively identified by PD in C3, and black dots represent the DE genes exclusively identified by SAM in datasets (a) [98]GSE19188, (b) [99]GSE19804, (c) [100]GSE27262, respectively. The average expression levels of DE genes in normal samples (x-axis) and cancer samples (y-axis) were plotted. The average expression levels above 5,000 were set to 5,000. The applicability of the PD algorithm to a single dataset We used the dataset [101]GSE27262 for 25 lung cancer samples and 25 normal samples to exemplify the feasibility of the PD algorithm in analyzing a single dataset. Firstly, we divided this dataset evenly into two sub-datasets according to the GSM series numbers of samples: set 1 and set 2 with 12 and 13 pairs of cancer and normal samples, respectively ([102]Table 3). Then, we transformed the two sub-datasets into two independent cancer-normal pairs of averaged gene expression profiles. With the same parameter setting as above, PD identified 3,789 DE genes, denoted as S2. 3,386 of these 3,789 DE genes overlapped with C3 and the consistency score between S2 and C3 was 100%. When dividing the [103]GSE27262 into four small sub-datasets ([104]Table 3), PD identified 4,157 DE genes, denoted as S4. The consistency score between S4 and C3 was 99.94%. Table 3. The consistency scores of DE genes identified by PD from sub-datasets of [105]GSE27262 and three datasets for lung cancer. List Sub-dataset Cancer Normal DE genes Overlap Consistency score P S2 set 1 12 12 3789 3386 100.00% <2.2 × 10^−16 set 2 13 13 S4 set 1 6 6 4157 3119 99.94% <2.2×10^−16 set 2 6 6 set 3 6 6 set 4 7 7 [106]Open in a new tab Note: S2 and S4 separately represent two (set 1 and set 2) and four (set 1, set 2, set 3 and set 4) small sub-datasets with cancer and normal samples by evenly dividing [107]GSE27262 according to the GSM series numbers of samples. Cancer/Normal, the number of cancer/normal samples in each sub-dataset; DE genes, the number of genes identified by PD in S2 and S4; Overlap, the number of the DE genes shared by S2 (or S4) and C3; P, the significance level of reproducibility. Similarly, when dividing the dataset [108]GSE29001 evenly into two and four small sub-datasets, respectively, PD identified 1,738 and 2,298 DE genes for esophagus cancer. The consistency scores between the two lists of DE genes with the DE genes in C2 were 100% and 99.88% ([109]Table 4), respectively. Table 4. The consistency scores of DE genes identified by PD from sub-datasets of [110]GSE29001 and two datasets for esophagus cancer. List Sub-datasets Cancer Normal DE genes Overlaps Consistency score P S2 Set 1 10 12 1738 1651 100.00% <2.2 × 10–16 Set 2 11 12 S4 Set 1 5 6 2298 1724 99.88% <2.2 × 10–16 Set 2 5 6 Set 3 5 6 Set 4 6 6 [111]Open in a new tab Note: See Note for [112]Table 3. Taking together, the above results demonstrated that PD can work well by dividing a dataset evenly into several sub-datasets with sample sizes as small as about six for each type of samples. Significant functional pathways detected by the PD algorithm Here, we used the above dataset [113]GSE27262 for lung cancer and the dataset [114]GSE29001 for esophagus cancer to demonstrate that most of the pathways significantly enriched with DE genes found by PD tend to be missed by SAM. With 10% FDR control, the DE genes in S2 found by PD for lung cancer were significantly enriched in 14 pathways ([115]Fig. 3a). However, none of these pathways was identified as significant by enrichment analysis with the same FDR control for the 10,945 DE genes found by SAM with 5% FDR control. When focusing on the most significant DE genes found by SAM, with the same number of DE genes in S2, 13 of the 14 significant pathways were still unfound ([116]Fig. 3a). Besides, the DNA replication pathway[117]^19 commonly found by PD and SAM, the other 13 significant pathways are mainly associated with lung cancer, including pentose phosphate pathway[118]^20, oxidative phosphorylation[119]^9,[120]^21, cysteine and methionine metabolism[121]^22, glutathione metabolism[122]^10,[123]^11,[124]^12, biosynthesis of amino acids[125]^23, ribosome[126]^24, proteasome[127]^13, protein processing in endoplasmic reticulum[128]^25,[129]^26, phagosome[130]^27 and TNF signaling pathway[131]^28. These conservative pathways included many DE genes highly expressed in both cancer and normal tissues, which tended to be missed by SAM. For example, among the 16 DE genes found exclusively by PD in the TNF signaling pathway, the average expression level of CCL2 was ranked at the top 3.2% and 1% of all the measured genes in the cancer and normal samples, respectively. The difference between the average expression level of this gene in the cancer samples and its average expression level in the normal samples was as large as 1678.72, whereas the average of the corresponding differences for all the DE genes identified by SAM was only 245.03. It has been reported that this gene may play an important role in the development of lung cancer[132]^29. For another example, the average expression level of TNFAIP3 was ranked at the top 7.4% and 3.2% of all the measured genes in the cancer and normal samples, respectively. The difference between the average expression level of this gene in the cancer samples and its average expression level in the normal samples was 625.43. This gene has been reported as a negative regulator of NF-kappa B activation as well as TNF-mediated apoptosis[133]^30 and its underexpression can promote the progression of lung cancer[134]^31. The detailed information about these 16 DE genes was shown in [135]Supplementary Table S1. Figure 3. The comparison of functional pathways enriched with DE genes separately identified by PD and SAM. [136]Figure 3 [137]Open in a new tab The biological pathways significantly enriched with DE genes identified by PD (using two subsets of each dataset, S2) and by SAM in (a) [138]GSE27262 for lung cancer, (b) [139]GSE29001 for esophagus cancer. The most significant DE genes identified by SAM, with the same number of the DE genes found by PD, were used for pathway enrichment analyses. The p values of the KEGG pathways were adjusted by Benjamini and Hochberg (FDR = 10%), and −log10(p) was used to generate the heat map. Similarly for esophagus cancer, the four pathways significantly enriched with DE genes in S2 identified by PD were all missed by SAM ([140]Fig. 3b). These significant pathways included pathways for oxidative phosphorylation[141]^32, glutathione metabolism[142]^33, ribosome[143]^34,[144]^35 and proteasome[145]^36. The above pathway enrichment analyses demonstrated that the PD algorithm can capture important cancer-associated pathways with highly expressed DE genes, including many housekeeping genes (see Discussion), which might play important roles in oncogenesis, whereas most of these pathways tend to be missed by SAM. The results also provided extra evidence supporting the reliability of the DE genes found by PD because a list of DE genes can be significantly enriched in pathways only when it contains sufficient real DE genes[146]^37,[147]^38. Discussion In this paper, we extended the application of the PD algorithm to the identification of DE genes between cancer and normal tissue samples based on several independent datasets or sub-datasets of a dataset. The application results for lung and esophageal cancer showed that PD can exclusively identify many DE genes with high expression levels in both cancer and normal samples, which tend to be missed by the commonly used SAM. Functional enrichment analyses of DE genes identified by PD showed that it can exclusively identify many significant biological pathways related to the development of cancers. Especially, the results demonstrated that the PD algorithm could efficiently identify DE genes by dividing a dataset evenly into several sub-datasets with sample sizes as small as about six for each type of samples. In general, for researchers with their own experimental data, we would recommend them making use of independent datasets in public data sources, in cases that such data exist, in order to increase the power and accuracy of biological discovery. Notably, in our functional analysis examples for lung cancer and esophagus cancer, four pathways were commonly identified by PD but missed by SAM. These four pathways were well known cancer-related pathways for oxidative phosphorylation, glutathione metabolism, ribosome and proteasome. These biological pathways are related to two important cancer hallmarks, the metabolic network (the oxidative phosphorylation and glutathione metabolism pathways) and genome duplication network (ribosome) according to the cancer hallmarks network framework proposed by Wang et al.[148]^39. Reprogramming of metabolism is an important mechanism supporting the growth and division of cancer cell[149]^40. Genome duplication plays an important role on tumor formation and can activate several cancer hallmarks network[150]^41,[151]^42. These conservative cancer hallmarks or pathways all included many highly expressed housekeeping genes playing essential roles in the pathogenesis of cancer. For example, in the ribosome pathway, among the 45 DE genes found exclusively by PD in the [152]GSE27262 dataset for lung cancer ([153]Supplementary Table S2), 35 genes were housekeeping genes reported by Zhu et al.[154]^43. The average expression levels of these 35 housekeeping genes were all ranked among the top 20% of all the measured genes in both the cancer and normal samples. It is known that housekeeping genes maintain the basic needs for a cell to survive[155]^44,[156]^45,[157]^46, and thus their deregulations tend to induce human diseases including cancer[158]^47,[159]^48. For examples, the overexpression of RPSA may be positively correlated with the angiogenesis of lung cancer[160]^49,[161]^50, the overexpression of RPL19 promotes malignant proliferation of lung cancer cells[162]^51, and the underexpression of RPS3, a critical regulator of DNA repair and apoptosis[163]^52, might accelerate the development of lung cancer. Such cancer-related housekeeping genes tend to be evolutionarily conserved and play critical roles in carcinogenesis together with tissue-specific less-conservative cancer-related genes[164]^53. Although PD can exclusively identify many important cancer-associated genes with high expression levels which play important functional roles in carcinogenesis, it has its own shortcomings. A major limitation is that it still cannot obtain DE genes with FDR control. Obviously, the higher the consistency threshold was set, the lower the rate of false positives of DE genes identified between two independent sample pairs. However, the FDR has a complex relationship with the parameter of consistency threshold. Besides, some DE genes and pathways identified by SAM were missed by PD which is biased to genes with high expressions. For example, DE genes identified by SAM from the dataset [165]GSE27262 for lung cancer were enriched in the fanconi anemia pathway related with risk of lung adenocarcinoma[166]^54,[167]^55. However, this pathway was missed by DE genes identified by PD. In this pathway, 13 DE genes were identified by SAM but not by PD. The average expression levels of the 13 genes were among the bottom 70% and 61% of all the measured genes of all the cancer and normal samples, respectively. These results demonstrate that, different from SAM, PD tends to miss DE genes with low expression levels. Therefore, the PD algorithm is not a substitution but an effective complement to current approaches for analyzing DE genes of tissue datasets with biological replicates. Methods Data and data pre-processing Multiple gene expression datasets for lung cancer and esophageal cancer were collected from Gene Expression Omnibus (GEO)[168]^56. Detail information about these datasets used in this study were described in [169]Table 1. For each dataset, the raw data (.CEL files) was pre-processed using the robust average (RMA) algorithm[170]^57,[171]^58. Then each probe-set ID was matched to its Entrez gene ID. If multiple probesets were matched to the same gene, the expression value for the gene was referred to as the arithmetic mean of the values of the multiple probesets (on the log2 scale). Identification of reproducible DE genes The pairwise difference (PD) algorithm[172]^8 was originally designed for analyzing small-scale cell line data with two or three technical replicates for each of two different cell lines. Since technical replicates for a cell line have no biological difference, every two independent pairs of technical replicates for two different cell lines can be regarded as independent experiments to identify DE genes through reproducibility evaluation. However, because tissue samples from different individuals are biological replicates with large biological variations among individuals, every two paired samples for two types of tissues cannot be regarded as reproducible independent experiments. In order to reduce the influence of biological variations among samples with the same phenotype, we used several independent datasets to construct multiple cancer-normal pairs by averaging a set of gene expression profiles separately for each of the two phenotypes. Specifically, for each dataset, we calculated the mean non-log-transformed expression values of each gene in the normal samples (type N) and cancer samples (type C), respectively, to form a paired average gene expression profiles for cancer and normal tissues. For a given pair j consisting of one type N sample and one type C sample, the mean values of gene i in the type N sample and type C sample, denoted as Inline graphic and Inline graphic , respectively, were calculated as following: graphic file with name srep36227-m3.jpg graphic file with name srep36227-m4.jpg where n[1] and n[2] were the numbers of samples in type N and type C, respectively. x[ik] was the expression value of gene i in a type N or type C sample. Then, for gene i, the average expression difference between two phenotypes of a given cancer-normal pair j, denoted as D[ij], was calculated as following: graphic file with name srep36227-m5.jpg If the value was larger (or smaller) than zero, then gene i was defined as up-regulation (or down-regulation) in type C sample. Regarding multiple cancer-normal pairs constructed from independent datasets as independent experiments, we could identify DE genes through reproducibility evaluation with the same PD algorithm descried in details in our original paper[173]^8. Briefly, all genes in each cancer-normal pair were sorted in descending order by their absolute pairwise expression differences between two phenotypes and divided into blocks by the initial step of 300. The significantly reproducible DE gene lists between the decreasingly ranked blocks of each two independent pairs were identified if their consistency scores were higher than a pre-settled consistency threshold (here, 95%). Reproducibility evaluation of two DE gene lists For two DE gene lists from two different datasets sharing k DE genes, of which s genes had the consistent directions (either up-regulation or down-regulation) in type C samples, the consistency score was calculated as s/k. The cumulative binomial distribution model[174]^59 was used to estimate the probability of observing at least s of k DE genes with the consistent directions by chance: graphic file with name srep36227-m6.jpg in which p[e] is the probability of one gene having the consistent direction in two DE gene lists by random chance (here, p[e] = 0.5). A DE genes list is considered significantly reproducible if the p value of the consistency score is <0.01. Pathway enrichment analysis Functional enrichment analysis was done based on the Kyoto Encyclopaedia of Genes and Genomes[175]^60. The hypergeometric distribution model was used to identify biological pathways that were significantly enriched with DE genes[176]^61, the probability of observing at least k genes in a pathway by chance can be computed as follow: graphic file with name srep36227-m7.jpg n is the number of DE genes identified from N genes in a dataset and k of them are annotated in a pathway with m genes. The p values were adjusted using the Benjamini and Hochberg procedure[177]^62, controlling the False Discovery Rate (FDR) at the 10% level. Additional Information How to cite this article: Huang, H. et al. Identifying reproducible cancer-associated highly expressed genes with important functional significances using multiple datasets. Sci. Rep. 6, 36227; doi: 10.1038/srep36227 (2016). Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. Supplementary Material Supplementary Information [178]srep36227-s1.pdf^ (118.3KB, pdf) Acknowledgments