Abstract

   Identifying differentially expressed (DE) genes between cancer and
   normal tissues is of basic importance for studying cancer mechanisms.
   However, current methods, such as the commonly used Significance
   Analysis of Microarrays (SAM), are biased to genes with low expression
   levels. Recently, we proposed an algorithm, named the pairwise
   difference (PD) algorithm, to identify highly expressed DE genes based
   on reproducibility evaluation of top-ranked expression differences
   between paired technical replicates of cells under two experimental
   conditions. In this study, we extended the application of the algorithm
   to the identification of DE genes between two types of tissue samples
   (biological replicates) based on several independent datasets or
   sub-datasets of a dataset, by constructing multiple paired average gene
   expression profiles for the two types of samples. Using multiple
   datasets for lung and esophageal cancers, we demonstrated that PD could
   identify many DE genes highly expressed in both cancer and normal
   tissues that tended to be missed by the commonly used SAM. These highly
   expressed DE genes, including many housekeeping genes, were
   significantly enriched in many conservative pathways, such as ribosome,
   proteasome, phagosome and TNF signaling pathways with important
   functional significances in oncogenesis.
     __________________________________________________________________

   The high-throughput gene expression profiling technologies facilitate
   screening expression levels for thousands of genes simultaneously. One
   of the main objectives for analyzing gene expression profiles is to
   identify genes differentially expressed (DE) in cancer compared with
   normal control[40]^1. Many methods have been proposed to identify DE
   genes[41]^2,[42]^3,[43]^4,[44]^5 and a popular choice is Significance
   Analysis of Microarrays (SAM) based on t-test statistic[45]^6. It has
   been reported that the t-test is biased to genes with low expression
   levels[46]^3,[47]^6 because a gene with low expression level may have a
   large absolute t-statistic due to its small variance, even when its
   mean difference between two conditions is small[48]^7. SAM was proposed
   to correct this bias. However, due to logarithmic transformation of
   data in SAM, the differences of log-scaled expression levels between
   two conditions are actually the logarithms of their fold change (FC)
   ratios. Because genes with low expression levels are more likely to
   reach large FCs than genes with high expression levels, SAM is also
   biased to genes with low expression levels[49]^8. Compared with genes
   expressed at low levels, genes expressed at high levels are more likely
   to be involved in some functionally conserved pathways such as
   oxidative phosphorylation[50]^9, glutathione
   metabolism[51]^10,[52]^11,[53]^12 and proteasome[54]^13 with important
   biological significances.

   In a recent study[55]^8, we have proposed an algorithm, named the
   pairwise difference (PD) algorithm, to identify DE genes in small-scale
   cell line experiments, which typically measure only two or three
   technical replicates for each of two different experimental conditions,
   respectively. Briefly, by pairing technical replicates under two
   conditions, the algorithm identifies DE genes with top-ranked absolute
   expression differences between the two types of cells which are
   significantly reproducible in independent paired technical
   replicates[56]^8. Compared with SAM and other commonly used methods, PD
   can exclusively identify many DE genes with high expression levels in
   both two types of cells[57]^8. However, this algorithm cannot be used
   directly to identify DE genes between two types of tissue samples
   (e.g., cancer and normal control) because the biological replicates of
   each type of tissue may have large between-individual differences.

   In this study, in consideration that tissue samples are biological
   replicates with between-individual differences, we averaged the gene
   expression profiles separately for two types of samples in a dataset to
   construct a cancer-normal pair, and then applied PD to identify DE
   genes using multiple cancer-normal pairs separately constructed from
   several independent datasets or sub-datasets of a dataset. Using
   datasets for lung cancer and esophagus cancer, we demonstrated the
   applicability and power of this strategy in finding functionally
   important DE genes highly expressed in both cancer and normal tissues
   that tend to be missed by SAM.

Results

The applicability of the PD algorithm to multiple datasets

   Firstly, for each of the three datasets for lung cancer and normal
   samples (see [58]Table 1), we separately averaged the gene expression
   profiles for cancer and normal samples in each datasets to construct a
   paired average gene expression profiles, referred to as a cancer-normal
   pair. Then, for every cancer-normal pair, all genes were ranked
   according to their absolute average differences (AD) of expression
   levels between cancer and normal samples in descending order. As shown
   in [59]Fig. 1a, the consistency scores of the deregulation directions
   of the top n (n = 1000, 2000, 3000, 4000, 5000) genes between every two
   cancer-normal pairs were all higher than 91.8%, which were all
   significantly higher than what expected by chance (binomial test, all
   p < 2.2 × 10^−16) (see Methods for details). We did similar analyses in
   two datasets for esophagus cancer ([60]Table 1) and found that the
   consistency scores of the deregulation directions of the top n
   (n = 1000, 2000, 3000, 4000, 5000) genes between the two datasets were
   all higher than 96.42%, as shown in [61]Fig. 1b. These results
   suggested that the differential expression signals between every two
   independent cancer-normal pairs for a particular cancer were
   significantly reproducible.

Table 1. Description of the datasets used in this study.

   Tissue    GEO accession  Platform  Cancer Normal        Reference
   Lung      [62]GSE19188  [63]GPL570   91     65    Hou, J. et al.[64]^14
             [65]GSE19804               60     60   Lu, T.P. et al.[66]^15
             [67]GSE27262               25     25   Wei, T.Y. et al.[68]^16
   Esophagus [69]GSE29001  [70]GPL571   21     24    Yan, W. et al.[71]^17
             [72]GSE20347               17     17    Hu, N. et al.[73]^18
   [74]Open in a new tab

Figure 1. Consistency scores between two datasets for a cancer.

   Figure 1
   [75]Open in a new tab

   The consistency scores between the top n (n = 1000, 2000, 3000, 4000,
   5000) genes ranked by absolute average expression differences for every
   two cancer-normal pairs were evaluated in (a) three datasets for lung
   cancer ([76]GSE19188, [77]GSE19804 and [78]GSE27262). and (b) two
   datasets for esophagus cancer ([79]GSE20347 and [80]GSE29001).

   We further did a random experiment to show that the differential
   expression signals were irreproducible when there were no phenotype
   differences between two groups of samples. Using the [81]GSE19804
   dataset with 60 lung cancer samples and 60 normal samples, we randomly
   permuted sample labels two times to produce two datasets of artificial
   “cancer” and “normal” samples, and then calculated the consistency
   score of the deregulation directions of the top 1000 genes sorted by
   the average expressions difference between the two artificial
   cancer-normal pairs. The random experiment was repeated 1000 times. As
   expected, the average of the 1000 consistency scores was 49.83% with
   0.1954 of standard deviation. These results suggested that the
   differential expression signals were irreproducible when there were no
   phenotype differences between two groups of samples.

   Then, regarding every cancer-normal pair as an independent pair of
   technical replicates, we used the PD algorithm to identify reproducible
   DE genes between the lung cancer and normal control of three datasets.
   The two parameters of the algorithm, the initial step and the
   consistency threshold, were set as 300 and 95%, respectively, as
   suggested previously[82]^8. With the above two parameters, PD
   identified a list of 6,092 DE genes for lung cancer, and this list of
   DE genes was denoted as C3. In comparison, 10,865, 12,287 and 10,945 DE
   genes were identified by SAM with 5% FDR control in the [83]GSE19188,
   [84]GSE19804 and [85]GSE27262 datasets, respectively. The consistency
   scores of the overlapped DE genes between C3 and the DE genes
   identified by SAM in the three datasets were 99.83%, 100% and 100%,
   respectively ([86]Table 2). Similarly, PD identified 3,498 DE genes
   based on the two datasets of esophagus cancer, denoted as C2, and the
   consistency scores with DE genes identified by SAM in the two datasets
   were both 100% ([87]Table 2).

Table 2. The consistency scores of the DE genes identified by both PD and
SAM.

   Tissue Dataset PD SAM Overlap Consistency Consistency score P
   Lung [88]GSE19188 6092 10865 4744 4736 99.83% <2.2 × 10^−16
   [89]GSE19804 12287 5488 5488 100.00% <2.2 × 10^−16
   [90]GSE27262 10945 5524 5524 100.00% <2.2 × 10^−16
   Esophagus [91]GSE20347 3498 6311 3057 3057 100.00% <2.2 × 10^−16
   [92]GSE29001 5882 2785 2785 100.00% <2.2 × 10^−16
   [93]Open in a new tab

   Note: Overlap, the number of the DE genes identified by both PD and
   SAM; Consistency, the number of the DE genes with the same deregulation
   directions (either up-regulation or down-regulation); P, the
   significance level of reproducibility.

   On the other hand, approximately 9.3–22.1% of the DE genes in C3
   identified by PD were not identified by SAM. As shown in [94]Fig. 2,
   the average expression levels of the DE genes exclusively identified by
   PD were rather high in both cancer and normal samples of the three
   datasets, while the average expression levels of most DE genes
   exclusively identified by SAM were quite low in cancer and/or normal
   samples. Similar results were observed based on the two datasets for
   esophagus cancer ([95]Supplementary Figure S1). Thus, the PD algorithm
   can identify DE genes expressed highly in both cancer and normal
   tissues, which tend to be missed by SAM.

Figure 2. The distributions of the average expression levels for DE genes
identified exclusively by PD or SAM for lung cancer.

   [96]Figure 2
   [97]Open in a new tab

   Red crosses represent the DE genes exclusively identified by PD in C3,
   and black dots represent the DE genes exclusively identified by SAM in
   datasets (a) [98]GSE19188, (b) [99]GSE19804, (c) [100]GSE27262,
   respectively. The average expression levels of DE genes in normal
   samples (x-axis) and cancer samples (y-axis) were plotted. The average
   expression levels above 5,000 were set to 5,000.

The applicability of the PD algorithm to a single dataset

   We used the dataset [101]GSE27262 for 25 lung cancer samples and 25
   normal samples to exemplify the feasibility of the PD algorithm in
   analyzing a single dataset. Firstly, we divided this dataset evenly
   into two sub-datasets according to the GSM series numbers of samples:
   set 1 and set 2 with 12 and 13 pairs of cancer and normal samples,
   respectively ([102]Table 3). Then, we transformed the two sub-datasets
   into two independent cancer-normal pairs of averaged gene expression
   profiles. With the same parameter setting as above, PD identified 3,789
   DE genes, denoted as S2. 3,386 of these 3,789 DE genes overlapped with
   C3 and the consistency score between S2 and C3 was 100%. When dividing
   the [103]GSE27262 into four small sub-datasets ([104]Table 3), PD
   identified 4,157 DE genes, denoted as S4. The consistency score between
   S4 and C3 was 99.94%.

Table 3. The consistency scores of DE genes identified by PD from
sub-datasets of [105]GSE27262 and three datasets for lung cancer.

 List Sub-dataset Cancer Normal DE genes Overlap Consistency score       P
 S2      set 1      12     12     3789    3386        100.00%      <2.2 × 10^−16
         set 2      13     13
 S4      set 1      6      6      4157    3119        99.94%        <2.2×10^−16
         set 2      6      6
         set 3      6      6
         set 4      7      7
   [106]Open in a new tab

   Note: S2 and S4 separately represent two (set 1 and set 2) and four
   (set 1, set 2, set 3 and set 4) small sub-datasets with cancer and
   normal samples by evenly dividing [107]GSE27262 according to the GSM
   series numbers of samples. Cancer/Normal, the number of cancer/normal
   samples in each sub-dataset; DE genes, the number of genes identified
   by PD in S2 and S4; Overlap, the number of the DE genes shared by S2
   (or S4) and C3; P, the significance level of reproducibility.

   Similarly, when dividing the dataset [108]GSE29001 evenly into two and
   four small sub-datasets, respectively, PD identified 1,738 and 2,298 DE
   genes for esophagus cancer. The consistency scores between the two
   lists of DE genes with the DE genes in C2 were 100% and 99.88%
   ([109]Table 4), respectively.

Table 4. The consistency scores of DE genes identified by PD from
sub-datasets of [110]GSE29001 and two datasets for esophagus cancer.

List Sub-datasets Cancer Normal DE genes Overlaps Consistency score      P
S2      Set 1       10     12     1738     1651        100.00%      <2.2 × 10–16
        Set 2       11     12
S4      Set 1       5      6      2298     1724        99.88%       <2.2 × 10–16
        Set 2       5      6
        Set 3       5      6
        Set 4       6      6
   [111]Open in a new tab

   Note: See Note for [112]Table 3.

   Taking together, the above results demonstrated that PD can work well
   by dividing a dataset evenly into several sub-datasets with sample
   sizes as small as about six for each type of samples.

Significant functional pathways detected by the PD algorithm

   Here, we used the above dataset [113]GSE27262 for lung cancer and the
   dataset [114]GSE29001 for esophagus cancer to demonstrate that most of
   the pathways significantly enriched with DE genes found by PD tend to
   be missed by SAM.

   With 10% FDR control, the DE genes in S2 found by PD for lung cancer
   were significantly enriched in 14 pathways ([115]Fig. 3a). However,
   none of these pathways was identified as significant by enrichment
   analysis with the same FDR control for the 10,945 DE genes found by SAM
   with 5% FDR control. When focusing on the most significant DE genes
   found by SAM, with the same number of DE genes in S2, 13 of the 14
   significant pathways were still unfound ([116]Fig. 3a). Besides, the
   DNA replication pathway[117]^19 commonly found by PD and SAM, the other
   13 significant pathways are mainly associated with lung cancer,
   including pentose phosphate pathway[118]^20, oxidative
   phosphorylation[119]^9,[120]^21, cysteine and methionine
   metabolism[121]^22, glutathione metabolism[122]^10,[123]^11,[124]^12,
   biosynthesis of amino acids[125]^23, ribosome[126]^24,
   proteasome[127]^13, protein processing in endoplasmic
   reticulum[128]^25,[129]^26, phagosome[130]^27 and TNF signaling
   pathway[131]^28. These conservative pathways included many DE genes
   highly expressed in both cancer and normal tissues, which tended to be
   missed by SAM. For example, among the 16 DE genes found exclusively by
   PD in the TNF signaling pathway, the average expression level of CCL2
   was ranked at the top 3.2% and 1% of all the measured genes in the
   cancer and normal samples, respectively. The difference between the
   average expression level of this gene in the cancer samples and its
   average expression level in the normal samples was as large as 1678.72,
   whereas the average of the corresponding differences for all the DE
   genes identified by SAM was only 245.03. It has been reported that this
   gene may play an important role in the development of lung
   cancer[132]^29. For another example, the average expression level of
   TNFAIP3 was ranked at the top 7.4% and 3.2% of all the measured genes
   in the cancer and normal samples, respectively. The difference between
   the average expression level of this gene in the cancer samples and its
   average expression level in the normal samples was 625.43. This gene
   has been reported as a negative regulator of NF-kappa B activation as
   well as TNF-mediated apoptosis[133]^30 and its underexpression can
   promote the progression of lung cancer[134]^31. The detailed
   information about these 16 DE genes was shown in [135]Supplementary
   Table S1.

Figure 3. The comparison of functional pathways enriched with DE genes
separately identified by PD and SAM.

   [136]Figure 3
   [137]Open in a new tab

   The biological pathways significantly enriched with DE genes identified
   by PD (using two subsets of each dataset, S2) and by SAM in (a)
   [138]GSE27262 for lung cancer, (b) [139]GSE29001 for esophagus cancer.
   The most significant DE genes identified by SAM, with the same number
   of the DE genes found by PD, were used for pathway enrichment analyses.
   The p values of the KEGG pathways were adjusted by Benjamini and
   Hochberg (FDR = 10%), and −log10(p) was used to generate the heat map.

   Similarly for esophagus cancer, the four pathways significantly
   enriched with DE genes in S2 identified by PD were all missed by SAM
   ([140]Fig. 3b). These significant pathways included pathways for
   oxidative phosphorylation[141]^32, glutathione metabolism[142]^33,
   ribosome[143]^34,[144]^35 and proteasome[145]^36.

   The above pathway enrichment analyses demonstrated that the PD
   algorithm can capture important cancer-associated pathways with highly
   expressed DE genes, including many housekeeping genes (see Discussion),
   which might play important roles in oncogenesis, whereas most of these
   pathways tend to be missed by SAM. The results also provided extra
   evidence supporting the reliability of the DE genes found by PD because
   a list of DE genes can be significantly enriched in pathways only when
   it contains sufficient real DE genes[146]^37,[147]^38.

Discussion

   In this paper, we extended the application of the PD algorithm to the
   identification of DE genes between cancer and normal tissue samples
   based on several independent datasets or sub-datasets of a dataset. The
   application results for lung and esophageal cancer showed that PD can
   exclusively identify many DE genes with high expression levels in both
   cancer and normal samples, which tend to be missed by the commonly used
   SAM. Functional enrichment analyses of DE genes identified by PD showed
   that it can exclusively identify many significant biological pathways
   related to the development of cancers. Especially, the results
   demonstrated that the PD algorithm could efficiently identify DE genes
   by dividing a dataset evenly into several sub-datasets with sample
   sizes as small as about six for each type of samples. In general, for
   researchers with their own experimental data, we would recommend them
   making use of independent datasets in public data sources, in cases
   that such data exist, in order to increase the power and accuracy of
   biological discovery.

   Notably, in our functional analysis examples for lung cancer and
   esophagus cancer, four pathways were commonly identified by PD but
   missed by SAM. These four pathways were well known cancer-related
   pathways for oxidative phosphorylation, glutathione metabolism,
   ribosome and proteasome. These biological pathways are related to two
   important cancer hallmarks, the metabolic network (the oxidative
   phosphorylation and glutathione metabolism pathways) and genome
   duplication network (ribosome) according to the cancer hallmarks
   network framework proposed by Wang et al.[148]^39. Reprogramming of
   metabolism is an important mechanism supporting the growth and division
   of cancer cell[149]^40. Genome duplication plays an important role on
   tumor formation and can activate several cancer hallmarks
   network[150]^41,[151]^42. These conservative cancer hallmarks or
   pathways all included many highly expressed housekeeping genes playing
   essential roles in the pathogenesis of cancer. For example, in the
   ribosome pathway, among the 45 DE genes found exclusively by PD in the
   [152]GSE27262 dataset for lung cancer ([153]Supplementary Table S2), 35
   genes were housekeeping genes reported by Zhu et al.[154]^43. The
   average expression levels of these 35 housekeeping genes were all
   ranked among the top 20% of all the measured genes in both the cancer
   and normal samples. It is known that housekeeping genes maintain the
   basic needs for a cell to survive[155]^44,[156]^45,[157]^46, and thus
   their deregulations tend to induce human diseases including
   cancer[158]^47,[159]^48. For examples, the overexpression of RPSA may
   be positively correlated with the angiogenesis of lung
   cancer[160]^49,[161]^50, the overexpression of RPL19 promotes malignant
   proliferation of lung cancer cells[162]^51, and the underexpression of
   RPS3, a critical regulator of DNA repair and apoptosis[163]^52, might
   accelerate the development of lung cancer. Such cancer-related
   housekeeping genes tend to be evolutionarily conserved and play
   critical roles in carcinogenesis together with tissue-specific
   less-conservative cancer-related genes[164]^53.

   Although PD can exclusively identify many important cancer-associated
   genes with high expression levels which play important functional roles
   in carcinogenesis, it has its own shortcomings. A major limitation is
   that it still cannot obtain DE genes with FDR control. Obviously, the
   higher the consistency threshold was set, the lower the rate of false
   positives of DE genes identified between two independent sample pairs.
   However, the FDR has a complex relationship with the parameter of
   consistency threshold. Besides, some DE genes and pathways identified
   by SAM were missed by PD which is biased to genes with high
   expressions. For example, DE genes identified by SAM from the dataset
   [165]GSE27262 for lung cancer were enriched in the fanconi anemia
   pathway related with risk of lung adenocarcinoma[166]^54,[167]^55.
   However, this pathway was missed by DE genes identified by PD. In this
   pathway, 13 DE genes were identified by SAM but not by PD. The average
   expression levels of the 13 genes were among the bottom 70% and 61% of
   all the measured genes of all the cancer and normal samples,
   respectively. These results demonstrate that, different from SAM, PD
   tends to miss DE genes with low expression levels. Therefore, the PD
   algorithm is not a substitution but an effective complement to current
   approaches for analyzing DE genes of tissue datasets with biological
   replicates.

Methods

Data and data pre-processing

   Multiple gene expression datasets for lung cancer and esophageal cancer
   were collected from Gene Expression Omnibus (GEO)[168]^56. Detail
   information about these datasets used in this study were described in
   [169]Table 1. For each dataset, the raw data (.CEL files) was
   pre-processed using the robust average (RMA)
   algorithm[170]^57,[171]^58. Then each probe-set ID was matched to its
   Entrez gene ID. If multiple probesets were matched to the same gene,
   the expression value for the gene was referred to as the arithmetic
   mean of the values of the multiple probesets (on the log2 scale).

Identification of reproducible DE genes

   The pairwise difference (PD) algorithm[172]^8 was originally designed
   for analyzing small-scale cell line data with two or three technical
   replicates for each of two different cell lines. Since technical
   replicates for a cell line have no biological difference, every two
   independent pairs of technical replicates for two different cell lines
   can be regarded as independent experiments to identify DE genes through
   reproducibility evaluation. However, because tissue samples from
   different individuals are biological replicates with large biological
   variations among individuals, every two paired samples for two types of
   tissues cannot be regarded as reproducible independent experiments. In
   order to reduce the influence of biological variations among samples
   with the same phenotype, we used several independent datasets to
   construct multiple cancer-normal pairs by averaging a set of gene
   expression profiles separately for each of the two phenotypes.
   Specifically, for each dataset, we calculated the mean
   non-log-transformed expression values of each gene in the normal
   samples (type N) and cancer samples (type C), respectively, to form a
   paired average gene expression profiles for cancer and normal tissues.
   For a given pair j consisting of one type N sample and one type C
   sample, the mean values of gene i in the type N sample and type C
   sample, denoted as Inline graphic and Inline graphic , respectively,
   were calculated as following:
   graphic file with name srep36227-m3.jpg
   graphic file with name srep36227-m4.jpg

   where n[1] and n[2] were the numbers of samples in type N and type C,
   respectively. x[ik] was the expression value of gene i in a type N or
   type C sample.

   Then, for gene i, the average expression difference between two
   phenotypes of a given cancer-normal pair j, denoted as D[ij], was
   calculated as following:
   graphic file with name srep36227-m5.jpg

   If the value was larger (or smaller) than zero, then gene i was defined
   as up-regulation (or down-regulation) in type C sample. Regarding
   multiple cancer-normal pairs constructed from independent datasets as
   independent experiments, we could identify DE genes through
   reproducibility evaluation with the same PD algorithm descried in
   details in our original paper[173]^8. Briefly, all genes in each
   cancer-normal pair were sorted in descending order by their absolute
   pairwise expression differences between two phenotypes and divided into
   blocks by the initial step of 300. The significantly reproducible DE
   gene lists between the decreasingly ranked blocks of each two
   independent pairs were identified if their consistency scores were
   higher than a pre-settled consistency threshold (here, 95%).

Reproducibility evaluation of two DE gene lists

   For two DE gene lists from two different datasets sharing k DE genes,
   of which s genes had the consistent directions (either up-regulation or
   down-regulation) in type C samples, the consistency score was
   calculated as s/k. The cumulative binomial distribution model[174]^59
   was used to estimate the probability of observing at least s of k DE
   genes with the consistent directions by chance:
   graphic file with name srep36227-m6.jpg

   in which p[e] is the probability of one gene having the consistent
   direction in two DE gene lists by random chance (here, p[e] = 0.5). A
   DE genes list is considered significantly reproducible if the p value
   of the consistency score is <0.01.

Pathway enrichment analysis

   Functional enrichment analysis was done based on the Kyoto
   Encyclopaedia of Genes and Genomes[175]^60. The hypergeometric
   distribution model was used to identify biological pathways that were
   significantly enriched with DE genes[176]^61, the probability of
   observing at least k genes in a pathway by chance can be computed as
   follow:
   graphic file with name srep36227-m7.jpg

   n is the number of DE genes identified from N genes in a dataset and k
   of them are annotated in a pathway with m genes.

   The p values were adjusted using the Benjamini and Hochberg
   procedure[177]^62, controlling the False Discovery Rate (FDR) at the
   10% level.

Additional Information

   How to cite this article: Huang, H. et al. Identifying reproducible
   cancer-associated highly expressed genes with important functional
   significances using multiple datasets. Sci. Rep. 6, 36227; doi:
   10.1038/srep36227 (2016).

   Publisher’s note: Springer Nature remains neutral with regard to
   jurisdictional claims in published maps and institutional affiliations.

Supplementary Material

   Supplementary Information
   [178]srep36227-s1.pdf^ (118.3KB, pdf)

Acknowledgments