Abstract

   Due to the invasiveness nature of tissue biopsy, it is common that
   investigators cannot collect sufficient normal controls for comparison
   with diseased samples. We developed a pathway enrichment tool, DRFunc,
   to detect significantly disease-disrupted pathways by incorporating
   normal controls from other experiments. The method was validated using
   both microarray and RNA-seq expression data for different cancers. The
   high concordant differentially ranked (DR) gene pairs were identified
   between cases and controls from different independent datasets. The DR
   gene pairs were used in the DRFunc algorithm to detect significantly
   disrupted pathways in one-phenotype expression data by combing controls
   from other studies. The DRFunc algorithm was exemplified by the
   detection of significant pathways in glioblastoma samples. The
   algorithm can also be used to detect altered pathways in the datasets
   with weak expression signals, as shown by the analysis on the
   expression data of chemotherapy-treated breast cancer samples.

Introduction

   High-throughput biotechnologies such as microarrays and RNA sequencing
   (RNA_seq) are generating a large volume of genetic data. Such massive
   data have promoted the development of various pathway enrichment
   tools^[34]1, which can be divided into three categories: singular
   enrichment analysis (SEA), gene set enrichment analysis (GSEA) and
   modular enrichment analysis (MEA)^[35]2, [36]3. SEA usually calculates
   the enrichment p-value for a pathway based on a list of preselected
   differentially expressed genes (DEGs) using statistical methods such as
   Student’s t-test^[37]4, [38]5. GSEA identifies a significant pathway by
   determining whether the genes of the pathway are ranked at the top or
   the bottom among all the genes according to their expression
   differences between two phenotypes^[39]6. The enrichment calculation in
   MEA is similar to that in SEA, but the network topology information is
   integrated^[40]7. These pathway enrichment tools are effective in
   identifying disease-associated genes with important pathophysiologic
   roles.

   Tissue biopsy is a conventional method to collect samples for cancer
   diagnosis, monitoring and pathologic analysis^[41]8. However, biopsy is
   frequently very difficult for patients with brain cancer or metastatic
   cancers^[42]9, [43]10, and more challenging for healthy controls. As a
   consequence, studies for such diseases typically include very few or
   even no normal controls^[44]11. This situation poses a serious
   challenge to the common pathway enrichment tools discussed above, as
   they all compare quantitative expression levels of pathway genes
   between two phenotypes^[45]2, [46]3. Hereafter we refer to a dataset
   consisting of samples with only one phenotype (disease) as a one-sided
   dataset. The control samples for the same disease available in other
   datasets cannot be incorporated into a one-sided dataset because the
   quantitative expression values are sensitive to the so-called batch
   effects between different experiments^[47]12, [48]13. Datasets from the
   Cancer Genome Atlas database (TCGA) database^[49]14 should also be
   considered as one-sided, since TCGA samples were derived from different
   institutions and processed in different batches. Therefore, the DEGs
   detected directly between tumor samples and normal controls from TCGA
   are questionable without appropriate batch adjustment^[50]15. However,
   batch adjustments may be biased if study groups are not evenly
   distributed across batches^[51]15.

   To tackle the above problem, some studies have used the within-sample
   relative expression orderings (REOs) instead of the quantitative
   expression values for disease screening^[52]16, [53]17 and gene
   signaling network analysis^[54]18. We previously developed a tool,
   individPath, to identify patient-specific dysregulated pathways based
   on reversal REOs in an individual sample compared with the highly
   stable REOs identified from a large cohort of normal samples which were
   accumulated previously from various sources^[55]19. Compared with the
   algorithms based on the quantitative expression values, the REO-based
   algorithms have some unique advantages, including insensitive to batch
   effects, free of between-sample data normalization, reproducible across
   independent data^[56]17, [57]20 and reuse of accumulated data^[58]21,
   [59]22. Therefore, for a one-sided disease dataset, it is intuitive to
   compare the differences between the REOs in diseased samples and the
   REOs in control samples which may come from an independent dataset, to
   identify whether a pathway is altered by the disease or not.

   We developed a tool, DRFunc, to identify the pathways which are
   significantly enriched with differential REOs of the pathway member
   genes. Using two independent microarray datasets for gastric cancer,
   lung cancer and breast cancer, respectively, we demonstrated that
   differential REOs between diseased samples and control samples were
   reproducible for independent datasets. These differential REOs were
   preserved even after the control or case samples were changed with the
   corresponding control or case samples from the other dataset for the
   same cancer. Using two RNA-seq datasets from TCGA, we showed that
   differential REOs identified from the sequence-based data are also
   highly reproducible in the array-based data. The usage of this tool was
   further exemplified by applying to a one-sided glioblastoma dataset to
   detect significantly altered pathways. For two expression datasets
   collected for patients with breast cancer receiving chemotherapy,
   DRFunc could detect significant pathways which were elusive for the
   traditional tools which depend on the pre-selected DEGs, in particular
   when few DEGs could be identified.

Materials and Methods

Data source and data preprocessing

   We collected 11 microarray datasets from the Gene Expression Omnibus
   (GEO) database ([60]http://www.ncbi.nlm.nih.gov/geo/), as shown in
   Table [61]1. All of the datasets were measured by the Affymetrix
   platforms. The raw data were preprocessed by the Robust Multi-array
   Analysis algorithm^[62]23. The SOURCE database^[63]24 was used for
   mapping CloneIDs to GeneIDs. From the Cancer Genome Atlas database
   (TCGA), two RNA-seq datasets were downloaded (see Table [64]1). The
   RNA-seq datasets were measured by the Illumina HiSeq platform. The raw
   data were normalized^[65]25 using the edgeR BioConductor
   package^[66]26.

Table 1.

   Datasets used in this study.
        Dataset^a      Case Control Data source   Platform
   GC[38-31]           38   31      [67]GSE13911 [68]GPL570
   GC[12-15]           12   15      [69]GSE19826 [70]GPL570
   LC[91-65]           91   65      [71]GSE19188 [72]GPL570
   LC[60-60]           60   60      [73]GSE19804 [74]GPL570
   BC[12-27] ^ER       12   27      [75]GSE10810 [76]GPL570
   BC[34-17] ^ER       34   17      [77]GSE42568 [78]GPL570
   GBM[34-13]          34   13      [79]GSE50161 [80]GPL570
   GBM[70-0]           70   0       [81]GSE53733 [82]GPL570
   BC[68-46] ^Response 68   46      [83]GSE20194 [84]GPL96
   BC[61-19] ^Response 61   19      [85]GSE20271 [86]GPL96
   LUAD[125-37]        125  37      TCGA         HiSeq2000
   CRC[32-32]          32   32      [87]GSE8671  [88]GPL570
   COAD[285-41]        285  41      TCGA         HiSeq2000
   [89]Open in a new tab

   Denotes: ^aGC denotes gastric cancer, LC denotes lung cancer, BC
   denotes breast cancer, ER denotes estrogen receptor, GBM denotes
   glioblastoma, LUAD denotes lung adenocarcinoma, CRC denotes colorectal
   cancer, and COAD denotes colon adenocarcinoma. We referred to each
   dataset using the following nomenclature: cancer type followed by the
   number of case and control samples separated by a hyphen sign.

Pathway databases

   The gene ontology (GO), Kyoto encyclopedia of genes and genomes (KEGG)
   and the Molecular Signatures Database (MSigDB) were used for enrichment
   analysis in DRFunc. Taking the C2 gene sets of MSigDB as an example,
   1330 canonical pathways (as of 16 February 2016) were download from the
   GSEA website. For a given dataset, all of the measured genes which were
   annotated in the 1330 pathways were considered as the background genes.
   In total, there were 8039, 6825 and 8548 genes for the [90]GPL570,
   [91]GPL96 and Illumina HiSeq2000 platforms, respectively.

Identification of differential REOs between two phenotypes

   Given that the expression values of a gene pair (i, j) are denoted as
   (G [i], G [j]), R [ij], which is 1 if G [i] > G [j] and 0 if G [i] < G
   [j] within one sample, is defined as the REO of the gene pair. If two
   genes have the same expression value, the pair is excluded from
   analysis. For a dataset with n cases and m controls, differential REOs
   are identified through the following steps. (1) Calculate the values of
   R [ij] (0 or 1) for all pairs in each sample. (2) Count the frequencies
   of the binary values (1 or 0) of R [ij] for each pair (i, j) in each
   phenotype. For example, there are n [1] samples with R [ij] = 1 and n
   [2] samples with R [ij] = 0 in the case group (n [1] + n [2] = n), and
   m [1] samples with R [ij] = 1 and m [2] samples with R [ij] = 0 in the
   control group (m [1] + m [2] = m). (3) Test the null hypothesis that
   the frequencies have no association with phenotype (case or control)
   using the Fisher’s exact test. (4) Select differentially ranked (DR)
   gene pairs. After the Fisher’s exact test is done for all the pairs,
   the p-values are corrected to control the false discovery rate
   (FDR)^[92]27. A gene pair is considered as a DR gene pair if the
   adjusted p-value is less than 5%. Furthermore, for a DR gene pair,
   there are two possible patterns. If n [1]/n [2] > m [1]/m [2], the pair
   is called as Pattern 1, otherwise it is called as Pattern 2.

Reproducibility of DR gene pairs

   The binomial test is employed to evaluate the reproducibility between
   the two lists of DR gene pairs. If a gene pair has the same pattern of
   reversal REO in the two lists, this gene pair is considered as a
   concordant gene pair. If two lists of DR gene pairs have M common
   pairs, the probability of observing at least M [1] concordant gene
   pairs by chance is calculated by the following cumulative binomial
   distribution model,
   [MATH:
   <mi>P</mi><mo>=</mo><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><ms
   ub><mrow><mi>M</mi></mrow><mrow><mn>1</mn></mrow></msub></mrow><mi>M</m
   i></munderover><mrow><mo stretchy="true">(</mo><mrow><mtable
   columnalign="left"><mtr><mtd><mi>M</mi></mtd></mtr><mtr><mtd><mi>i</mi>
   </mtd></mtr></mtable></mrow><mo
   stretchy="true">)</mo></mrow><msubsup><mrow><mi>p</mi></mrow><mrow><mn>
   0</mn></mrow><mrow><mi>i</mi></mrow></msubsup><msup><mrow><mo
   stretchy="false">(</mo><mrow><mn>1</mn><mo>−</mo><msub><mrow><mi>p</mi>
   </mrow><mrow><mn>0</mn></mrow></msub></mrow><mo
   stretchy="false">)</mo></mrow><mrow><mi>M</mi><mo>−</mo><mi>i</mi></mro
   w></msup> :MATH]
   1

   where p [0] is the probability for a random gene pair to be a
   concordant gene pair by chance between two lists (here p [0 = ]0.5
   since there are only two mutual-exclusive outcomes, Pattern 1 or
   Pattern 2, of a DR gene pair). The concordant ratio of these two lists
   of DR gene pairs is defined as M [1]/M. The two lists of DR gene pairs
   are considered significantly reproducible if P < 0.05.

Pathway enrichment analysis based on DR gene pairs

   If k gene pairs are DR gene pairs from n background gene pairs, the
   probability of observing at least x DR gene pairs in a pathway with a
   total of m background gene pairs by chance is given by the cumulative
   hypergeometric distribution function as follows,
   [MATH: <mi
   mathvariant="normal">P</mi><mo>=</mo><mn>1</mn><mo>−</mo><munderover><m
   o>∑</mo><mrow><mi
   mathvariant="normal">i</mi><mo>=</mo><mn>0</mn></mrow><mrow><mi
   mathvariant="normal">x</mi><mo>-</mo><mn>1</mn></mrow></munderover><mro
   w><mfrac><mrow><mrow><mo stretchy="true">(</mo><mrow><mtable
   columnalign="left"><mtr><mtd><mi>m</mi></mtd></mtr><mtr><mtd><mi>i</mi>
   </mtd></mtr></mtable></mrow><mo stretchy="true">)</mo></mrow><mrow><mo
   stretchy="true">(</mo><mrow><mtable
   columnalign="left"><mtr><mtd><mi>n</mi><mo>−</mo><mi>m</mi></mtd></mtr>
   <mtr><mtd><mi>k</mi><mo>−</mo><mi>i</mi></mtd></mtr></mtable></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mrow><mo
   stretchy="true">(</mo><mrow><mtable
   columnalign="left"><mtr><mtd><mi>n</mi></mtd></mtr><mtr><mtd><mi>k</mi>
   </mtd></mtr></mtable></mrow><mo
   stretchy="true">)</mo></mrow></mfrac></mrow> :MATH]
   2

   The number of the background gene pairs (n) is equal to N(N − 1)/2,
   where N represents the number of the background genes. The pathways
   significantly enriched with DR gene pairs were identified after
   multiple testing adjustments with FDR < 5%^[93]27.

   Figure [94]1 shows the flowchart of DRFunc. The identification of DR
   gene pairs and detection of significant pathways were implemented in an
   open-source R package which is available at
   [95]https://github.com/keyougu/DRFunc.git.

Figure 1.

   Figure 1
   [96]Open in a new tab

   Flowchart of DRFunc. The DRFunc algorithm includes three steps: input
   of expression profiles for case and control samples (from the same or
   different experiments), DR gene pair identification, annotation and
   detection of significant pathways.

Results

Reproducible DR gene pairs identified between tumor and normal samples

   The datasets of gastric cancer, lung cancer and ER^− breast cancer
   which have large sample size were first used to test whether DR gene
   pairs could be reproducibly identified in different subsets of the same
   parent dataset. For each dataset, the tumor samples and control samples
   were randomly divided into two subsets with approximately equal sample
   size respectively. For example, the 38 tumor samples in GC[38-31] were
   divided into two groups with 19 samples each, while the 31 normal
   samples were divided into two groups with 15 samples and 16 samples
   respectively. They formed two subsets, one with 19 tumor samples and 15
   normal samples and the other with 19 tumor samples and 16 normal
   samples. From these two subsets, DR gene pairs were identified and
   compared. This procedure was repeated 100 times. The result showed that
   the identified DR gene pairs were highly reproducible, with an average
   concordant ratio of 99.99% for the dataset of GC[38-31] (see
   Table [97]2). Similar results were observed for LC[91-65] and BC[34-17]
   ^ER (see Table [98]2). These results show that the identified DR gene
   pairs are highly reproducibly within one dataset.

Table 2.

   Mean and standard deviation of the number of DR gene pairs identified
   from random subsets.
   Dataset #DR pair #Overlapped pair #Concordant pair Concordant ratio
   GC[38-31] 1054900 ± 237429
   1169868 ± 271089 586201 ± 36373 586198 ± 36373 0.9999 ± 8.42 × 10^−6
   LC[91-65] 5211347 ± 236859
   4983364 ± 256758 4078924 ± 69845 4078880 ± 69861 0.9999 ± 6.74 × 10^−6
   BC[34-17] ^ER 1199844 ± 328353
   1046124 ± 308752 595768 ± 86284 595768 ± 86284 0.9999 ± 2.2 × 10^−16
   [99]Open in a new tab

   Next, the reproducibility was analyzed for the DR gene pairs identified
   from different experimental datasets for the same cancer. As shown in
   Table [100]3, in the dataset GC[12-15], 249,379 DR gene pairs were
   identified between gastric tumor samples and normal controls, among
   which 75.67% were also detected as DR gene pairs in dataset GC[38-31].
   Among the overlapped DR gene pairs, 99.97% showed the concordant REOs
   in the two gastric datasets, which could not happen by random chance
   (p < 2.2 × 10^−16, binomial test). Similar result was observed in the
   two datasets for lung cancer. In the dataset LC[60-60], the one with
   smaller sample size of the two datasets, 75.18% of the detected DR gene
   pairs were also identified in the dataset LC[91-45] which has larger
   sample size than LC[60-60], and 98.39% of the overlapped DR gene pairs
   had the concordant REOs in the two datasets, which could not happen by
   random chance (p < 2.2 × 10^−16, binomial test). In the two datasets
   for ER^− breast cancer, the concordant ratio was 99.84%. These results
   indicate that extensive disruptions of gene REOs existed in tumor
   samples and such disrupted REOs were reproducible in different
   datasets. The number of genes in each DR gene pair list were provided
   in Supplementary file, Table [101]S1.

Table 3.

   Concordance of DR gene pairs identified for each cancer dataset.
      Dataset    #DR pair #Overlapped pair #Concordant pair Concordant ratio
   GC[12-15]     249379   188706           186655           0.9997
   GC[38-31]     3060133
   LC[60-60]     5035285  3785548          3724663          0.9839
   LC[91-65]     7977878
   BC[12-27] ^ER 2527003  1406505          1404282          0.9984
   BC[34-17] ^ER 3087813
   [102]Open in a new tab

   A further test on reproducibility was carried out to exchange the case
   and/or control samples between two datasets for the same cancer type.
   The DR gene pairs identified from the newly exchanged datasets were
   compared with the DR gene pairs identified from the original datasets.
   As shown in Table [103]4, 3,870,438 DR gene pairs were identified in
   the merged dataset GC[12-31] by integrating the normal samples from
   GC[38-31] and the tumor samples from GC[12-15], among which 163,670
   were included in the DR gene pairs identified from the original dataset
   GC[12-15]. Similarly, 4,523,783 DR gene pairs were identified in the
   merged dataset GC[38-15], among which 1,560,772 were found in the
   original dataset GC[38-31]. With only the control samples exchanged,
   the concordant ratios of DR gene pairs between the new datasets and
   their respective original datasets were 99.19% and 92.41%
   (Table [104]4), which were comparable to the concordant ratio between
   the two original datasets (99.97%) and could not happen by random
   chance (p < 2.2 × 10^−16, binomial test). For lung cancer, the
   concordant ratios between the control-exchanged datasets and the
   original datasets were 95.41% and 95.19% respectively, which were also
   comparable to the concordant ratio between the two original datasets
   (98.39%). For the two control-exchanged datasets for ER^− breast
   cancer, the concordant ratios were 98.88% and 97.47% respectively, also
   comparable to the concordant ratio between the two original datasets
   (99.84%). The detected DR gene pairs were also highly reproducible in
   the case-exchanged datasets: the minimum concordant ratio was as high
   as 97.06% (see Supplementary file, Table [105]S2). These analyses
   further indicate that differential REOs for a specific tumor type could
   be reproducibly detected from independent datasets of different
   sources. Therefore, when focusing on the REOs of genes, tumor samples
   and normal samples measured by different studies can be directly
   compared.

Table 4.

   Concordance of DR gene pairs identified from datasets with the same
   case samples but different control samples.
      Dataset    #DR pair #Overlapped pair #Concordant pair Concordant ratio
   GC[12-31]     3870438  163670           162305           0.9919
   GC[12-15]     249379
   GC[38-15]     4523783  1560772          1442242          0.9241
   GC[38-31]     3060133
   LC[60-65]     7387229  3982182          3799350          0.9541
   LC[60-60]     5035285
   LC[91-60]     8935664  6335001          6030374          0.9519
   LC[91-65]     7977878
   BC[12-17] ^ER 2649823  1130216          1117603          0.9888
   BC[12-27] ^ER 2527003
   BC[34-27] ^ER 6630077  2393323          2332764          0.9747
   BC[34-17] ^ER 3087813
   [106]Open in a new tab

Performance of DRFunc in detecting significant pathways

   Significant pathways were detected from the 1330 MSigDB C2 collection
   by employing the cumulative hypergeometric distribution test
   implemented in DRFunc. With FDR < 5%, 73 and 239 pathways were
   detected, respectively, to be significantly enriched with the DR gene
   pairs identified from GC[12-15] and GC[38-31]. For lung cancer, 255 and
   380 pathways were detected in LC[60-60] and LC[91-65], respectively.
   For ER^− breast cancer, 363 and 366 pathways were detected for
   BC[12-27]^ER and BC[34-17] ^ER, respectively. The overlapped pathways
   were shown in Fig. [107]2. The pathway names were listed in
   Supplementary file. Notably, there were 17 pathways commonly detected
   in the six datasets for the three caner types, including the ECM
   receptor interaction, focal adhesion pathways in KEGG, the RB1 and
   integrin related pathways in PID. There were 290 pathways commonly
   detected for at least two cancer types, indicating that different
   cancers may have many similar enriched pathways. These results suggest
   that the REOs of genes in many pathways were significantly disrupted
   under cancerous conditions, and DRFunc could capture such disruptions.

Figure 2.

   Figure 2
   [108]Open in a new tab

   Overlaps of significant pathways detected for the three cancer types.
   The bar plot shows the number of significant pathways (y-axis) shared
   by at least two, three, four, five and six datasets (x-axis) for BC, LC
   and GC.

   With FDR < 5%, 183 and 307 significant pathways were detected,
   respectively, in the control-exchanged datasets GC[12-31] and GC[38-15]
   for gastric cancer. In the two control-exchanged datasets for lung
   cancer, 440 and 383 pathways were detected, respectively, and in the
   two control-exchanged datasets for ER^− breast cancer, 428 and 529
   pathways were detected, respectively. Figure [109]3 show the number of
   overlapped significant pathways detected for the original datasets and
   for the control-exchanged datasets. This indicate that integration of
   cancer samples and control samples from different datasets is feasible
   using the DR gene pairs in order to detect significant pathways.

Figure 3.

   Figure 3
   [110]Open in a new tab

   Numbers of significant MSigDB pathways detected for GC, LC, and BC.

   One RNA-seq dataset with 125 adenocarcinoma samples and 37 normal
   controls from 20 batches, denoted as LUAD[125-37] were downloaded from
   TCGA (see Table [111]1). By applying DRFunc, 7,661,296 DR gene pairs
   were identified and they were significantly enriched in 530 pathways.
   These results were compared with those obtained from the microarray
   datasets. There were 2,656,494 overlapped gene pairs between the
   5,035,285 DR gene pairs identified in LC[60-60] and the 7,661,296 DR
   gene pairs identified in LUAD[125-37]. Among these overlapped DR gene
   pairs, 97.48% showed the concordant REOs between the microarray-based
   and sequence-based results, which could not happen by random chance
   (p < 2.2 × 10^−16, binomial test). Among the 255 pathways detected for
   LC[60-60], 215 pathways were also detected for LUAD[125-37]. There were
   3,611,571 pairs overlapped between the DR gene pairs identified in
   LC[91-65] and in LUAD[125-37] with a concordant ratio of 95.58%
   (p < 2.2 × 10^−16, binomial test). Of the 380 pathways detected in
   LC[91-65] 291 were also detected in the sequence-based dataset
   LUAD[125-37]. Similar results were observed for colorectal cancer, for
   which one microarray dataset and one RNA-seq dataset were collected
   (see Table [112]1). The concordant ratio of the DR gene pairs between
   CRC[32-32] and COAD[285-41] was as high as 97.81%. Of the 283
   significant pathways detected in CRC[32-32], 205 were also detected in
   COAD[285-41]. These results suggest that differential REOs between case
   and control samples identified by DRFunc had cross-platform
   reproducibility.

   To address whether pathways detected by using DRFunc is robust, random
   experiments were performed by adding different proportions of
   arbitrarily chosen gene pairs from the background into the real DR gene
   pairs identified between cases and controls in each dataset. The result
   showed that the pathways detected by DRFunc were robust (see
   Supplementary file). This conclusion is consistent with the viewpoint
   that functional categories are robust to different levels of
   noises^[113]28.

Application of DRFunc to one-sided GBM data

   To demonstrate the applicability of DRFunc in one-sided data, two
   datasets, GBM[70-0] with 70 samples of primary GBM samples^[114]10 and
   GBM[34-13] with 34 primary GBM samples and 13 normal brain tissue
   samples^[115]29, were collected. By integrating the GBM samples in
   GBM[70-0] and the normal samples in GBM[34-13] (denoted as integrated
   GBM[70-13] dataset) 5,756,553 DR gene pairs were identified. In the
   dataset GBM[34-13] itself, 3,659,102 DR gene pairs were identified,
   among which 80.84% overlapped with the former DR gene pairs. In
   particular, 99.85% of the overlapped gene pairs had the concordant REOs
   in the two groups of GBM patients. With FDR < 5%, 363 pathways were
   detected to be significantly enriched with the DR gene pairs identified
   in the integrated GBM[70-13] dataset. Meanwhile, 324 pathways were
   identified in GBM34-13, among which 266 were also detected in the
   integrated GBM[70-13] dataset. They were listed in the supplementary
   material. As the 1330 MSigDB C2 pathways integrated several online
   pathway databases with redundancy, the number of the detected pathways
   was also showed in Fig. [116]4, grouped by the pathway database source.
   Many of the pathways were found to be associated with GBM in
   literature, including the BioCarta EGF pathway and MTOR
   pathway^[117]30, the KEGG P53 signaling pathway and the PID TGF-β and
   Ras signaling pathway^[118]31.

Figure 4.

   Figure 4
   [119]Open in a new tab

   Venn diagrams for the number of significant MSigDB pathways detected
   for GBM. The 1330 significant MSigDB pathways were divided into five
   groups according to the source databases, including Biocarta, KEGG,
   PID, Reactom and the others.

   Since the number of GBM samples in GBM[70-0] was approximately twice of
   that in GBM[34-13], we performed resampling experiments to evaluate the
   effect of sample size. A subset of 34 GBM samples were randomly
   extracted from GBM[70-0] and integrated with the normal samples from
   GBM[34-13] for DR gene pair identification and significant pathway
   detection. This random experiment was repeated 100 times. The average
   number of DR gene pairs identified in the resampling experiments was
   5,046,700, and the average concordant ratio was 99.91% with the
   3,659,102 DR gene pairs identified in GBM[34-13]. The average number of
   significant pathways detected in the 100 resampling experiments was
   374.80, and the average of overlapped pathways were 271.40 with the 324
   pathways detected in GBM[34-13]. These results suggest that DRFunc
   could be used in one-sided data to detect the underlying dysregulated
   and disease associated biological pathways.

   The DRFunc algorithm was compared with two pathway analysis algorithms,
   SEA^[120]5 and GSEA^[121]6. As no controls were collected in GBM[70-0],
   the traditional SEA analysis could not be applied^[122]5. In
   GBM[34-13], 8,731 DEGs were identified (FDR < 5%, Student’s t-test).
   Using SEA, these DEGs were significantly enriched in 41 MSigDB
   pathways^[123]5, much fewer than the number of pathways detected by
   DRFunc in GBM[34-13]. Notably, the above mentioned EGR, MTOR, P53,
   TGF-β and Ras signaling pathways detected by DRFunc were not included
   in these 41 pathways. In contrast, 32 of these 41 pathways were also
   detected by DRFunc. When using GSEA, even with FDR < 25%, no
   significant pathways were detected in GBM[34-13]. These results suggest
   that the rank-based tool DRFunc could identify much more biologically
   meaningful pathways than the traditional enrichment analysis.

   DRFunc can detect pathways with only a few DEGs, since a dysregulated
   gene with a large change in quantitative expression level may result in
   many DR gene pairs. For example, the BioCarta EIF4 pathway, which
   mainly describes the regulation of eIF4E and p70 S6 kinase, contained
   24 genes measured in the GBM[34-13] dataset, among which only nine
   genes were identified as DEGs using Student’s t-test. The percentage of
   DEGs was only 37.50% in this pathway, while the percentage of DEGs was
   47.54% in the background. Thus this pathway was not detected as
   significant by SEA. In contrast, these 24 genes formed 276 gene pairs,
   among which 53 were identified as DR gene pairs in GBM[34-13].
   Therefore the pathway was detected to be significant by DRFunc. It has
   been reported that the overexpression of eIF4E could cause oncogenic
   transformation and elevated eIF4E protein levels were found in many
   human cancers including GBM^[124]32, [125]33. Interestingly, PRKCB in
   this pathway involved in 20 DR gene pairs, and its average expression
   level was higher than the expression levels of all its 20 partner genes
   in the normal samples but became lower than the expression levels of
   all its 20 partner genes in the GBM samples. That is to say that PRKCB
   was down-regulated greatly in GBM. This was consistent with the
   expression level changes as observed in GBM[34-13] and literature
   results reported for GBM^[126]34 as well as for other cancer
   types^[127]35. Similarly, WIF1 in KEGG WNT signaling pathway was found
   to be down-regulated greatly in GBM by comparing its expression level
   with those of 127 partner genes. This was consistent with the result
   reported previously^[128]36. These two examples suggest that such
   strongly dysregulated genes could lead to a high appearance frequency
   in DR gene pairs and make the associated pathways detectable by DRFunc.

Application of DRFunc to preoperative chemotherapy response data of breast
cancer

   A pathway with only a few DEGs cannot be detected by SEA but it may be
   significantly enriched with DR gene pairs. This hinted us that DRFunc
   might be able to capture functional disruptions in data with weak
   expression signals. Breast cancer patients with the pathological
   complete response (pCR) have a favorable prognosis compared to patients
   with residual disease (RD) and our previous analysis has shown that
   expression differences between these two conditions could be
   weak^[129]37. Two gene expression datasets were collected for
   preoperative chemotherapy response of breast cancer (see Table [130]1)
   to test whether DRFunc could identify such weak expression signals.
   Using Student’s t-test with FDR < 5%, one gene was identified as DEG
   between 61 RD patients and 19 pCR patients in the dataset BC[61-19]
   ^Response, indicating that expression differences between these two
   conditions were small. Because there was only one DEG, it was unable to
   detect significantly dysregulated pathways using SEA. However, using
   DRFunc, 9,569 DR gene pairs were identified with FDR < 5%, which
   significantly enriched in 38 MSigDB pathways (Supplementary file). In
   BC[68-46] ^Response, 321 genes were identified as DEGs between 68 RD
   patients and 46 pCR patients, significantly enriched in 28 MSigDB
   pathways as detected by SEA. With FDR < 5%, DRFunc identified 90,561 DR
   gene pairs, which significantly enriched in 84 MSigDB pathways
   (Supplementary file). When using GSEA, with FDR < 5%, no significant
   pathways were detected in either of the two datasets. These results
   suggested that the rank-based algorithm DRFunc could identify more
   biological pathways than the traditional enrichment analysis,
   especially when the expression differences were not significant.

Discussion

   Gene expression profiling for only one phenotype is frequently seen in
   experimental design when sampling of normal control tissues is
   difficult due to the invasive nature of biopsy^[131]9. For such
   one-sided data, current functional enrichment analysis tools which
   focus on quantitative expression differences between two phenotypes
   have difficulty in finding phenotype-related functional pathways. The
   within-sample REOs have been found robust against systematic batch
   effects and transferable among independent datasets which enables the
   reuse of accumulated samples^[132]19, [133]21, [134]38. In the present
   work, we proposed an REO-based algorithm DRFunc, which could robustly
   identify the underlying disturbed pathways from such one-sided dataset
   by integrating control samples of the same tissue measured by other
   independent experiments.

   Our analyses showed that the DR gene pairs identified by DRFunc for
   gastric cancer, lung cancer and ER^− breast cancer were highly
   reproducible among independent datasets and among datasets with
   case-control samples integrated from different studies. The comparison
   between microarray-based and sequence-based data for lung cancer and
   colorectal cancer also suggested the high cross-platform
   reproducibility of DR gene pairs identified by DRFunc. Such consistent
   DR gene pairs were previously observed among datasets generated by
   different microarray platforms^[135]39.

   The power of DRFunc may be influenced by the sample size in detecting
   DR gene pairs. For example, with FDR < 5%, 249,379 DR gene pairs were
   identified from the smaller-size dataset of GC[12-15], which was less
   than ten-fold of the number of DR gene pairs (3,060,113) identified
   from the larger-size dataset of GC[38-31] (Table [136]3). The
   insufficient sample size for any of the datasets will reduce the number
   of overlapped DR gene pairs^[137]40. Although the numbers of DR gene
   pairs in GC[12-31] and GC[38-15] were almost the same, the overlapped
   DR gene pairs between GC[12-31] and GC[12-15] was ten-fold less than
   the number of overlapped DR gene pairs between GC[38-15] and GC[38-31]
   (Table [138]3). The reduced power of DR gene pair identification will
   ultimately reduce the power of significant pathway detection. As shown
   in Fig. [139]3, only 73 pathways were significantly enriched with the
   DR gene pairs identified in GC[12-15], whereas 239 significant pathways
   were detected in GC[38-31].

   Some DR gene pairs may not overlap between different experiments. This
   is probably due to the fact that an experiment cannot capture all
   disease-associated differential signals, thus different experiments for
   the same disease may capture only partial DEGs each^[140]40. For
   example, among the top 100 genes with the highest appearance
   frequencies in the DR gene pairs identified only in GC[38-31], not in
   GC[12-15], 65 were identified as DEGs in GC[38-31], not in GC[12-15]
   (Student’s t-test, FDR < 5%). Non-overlapped DEGs would result in
   non-overlapped DR gene pairs between different experiments.

   Due to the above mentioned reasons, ultimately, some significant
   pathways cannot overlap between different datasets for the same
   disease. The problem of pathway overlaps has been discussed, and it has
   been suggested that the significant pathways could be rather
   functionally similar by reducing their corresponding statistical
   significance levels^[141]5, [142]41.

   It has been reported that many confounding factors such as gender and
   ethnicity may lead to gene expression differences among
   individuals^[143]42–[144]45. Therefore, the two datasets (LC[91-65] and
   LC[60-60]) for lung cancer with larger sample sizes were used to
   evaluate whether heterogeneous gene expression exists among normal
   samples. Information on the normal samples of the two lung cancer
   datasets was available in Supplementary file, Table [145]S3. The normal
   samples in LC[91-65] were obtained from 41 males, 11 females and 13
   samples without gender information. Comparing the gene expression
   profiles of the 41 males and 11 females, only 0.03% of the background
   gene pairs could be identified as DR gene pairs. However, when
   comparing the 11 normal female samples of Caucasian in LC[91-65] to the
   60 normal female samples of Chinese in LC[60-60], about 3.50% of the
   background gene pairs were found as DR gene pairs. This result
   indicates that ethnicity might be a confounding factor, which might
   introduce some disease-irrelevant DR gene pairs. Consequently, when
   applying DRFunc to detect DR gene pairs for significant pathway
   detection, some disease-irrelevant pathways may creep in.

   In spite of this, comparing to the traditional pathway enrichment
   analysis methods based on quantitative gene expression levels, which
   have limited usage with one-sided data, DRFunc has superiority in
   providing candidate pathways. To evaluate whether a significant pathway
   detected by DRFunc have specific biological implications or not, it is
   required to generate some biological hypotheses for wet lab experiment
   (such as Q-PCR) validation^[146]5. In this paper, we firstly showed
   that, in the one-sided GBM data, 266 of the 363 pathways detected in
   GBM[70-13] could be reproducibly detected in the other dataset
   GBM[34-13], which shares the same normal samples. Then, in the two
   application examples, besides the pathways already discussed in the
   Result section, we have additionally found evidence from published
   literature for the top 10 most significant pathways to support their
   association with the corresponding phenotype (Supplementary file).
   Further, to show that the significant pathways detected by DRFunc could
   not be detected if no phenotype differences exist, we have additionally
   performed random experiments by randomly reassigning labels to the
   disease and normal samples. By independently permuting the 70 GBM
   samples from GBM[70-0] and 13 normal samples from GBM[34-13] for 100
   times, only 10.70 significant pathways were detected on average. When
   applying the same randomization procedure to BC[68-46] ^Response
   dataset, only 7.02 pathways were detected on average. All these results
   support the ability of DRFunc in providing candidate disease-associated
   significant pathways using gene expression data even the one-sided
   data. Finally, if only a limited (or insufficient) number of normal
   control samples for a tissue were obtained in a study, normal samples
   from other independent datasets should be integrated for DR gene pair
   identification to reduce disease-irrelevant DR gene pairs introduced by
   population variations.

   In conclusion, through detection of DR gene pairs between diseased
   samples and normal controls collected from different experiments,
   disease-relevant pathways can be identified, which provide functional
   insights into the disease mechanism. The usage of the DR gene pairs
   instead of the DEGs enables us to make adequate use of the large
   one-sided disease samples and the samples with weak expression signals
   available in public data archives. This may facilitate many downstream
   analyses such as survival prediction. Our algorithm also provides a new
   tool for comparing transcriptional expression profiling of genes
   between two groups of samples from the same or different experiments.

Electronic supplementary material

   [147]Supplementary file^ (378.2KB, pdf)

Acknowledgements