Abstract Due to the invasiveness nature of tissue biopsy, it is common that investigators cannot collect sufficient normal controls for comparison with diseased samples. We developed a pathway enrichment tool, DRFunc, to detect significantly disease-disrupted pathways by incorporating normal controls from other experiments. The method was validated using both microarray and RNA-seq expression data for different cancers. The high concordant differentially ranked (DR) gene pairs were identified between cases and controls from different independent datasets. The DR gene pairs were used in the DRFunc algorithm to detect significantly disrupted pathways in one-phenotype expression data by combing controls from other studies. The DRFunc algorithm was exemplified by the detection of significant pathways in glioblastoma samples. The algorithm can also be used to detect altered pathways in the datasets with weak expression signals, as shown by the analysis on the expression data of chemotherapy-treated breast cancer samples. Introduction High-throughput biotechnologies such as microarrays and RNA sequencing (RNA_seq) are generating a large volume of genetic data. Such massive data have promoted the development of various pathway enrichment tools^[34]1, which can be divided into three categories: singular enrichment analysis (SEA), gene set enrichment analysis (GSEA) and modular enrichment analysis (MEA)^[35]2, [36]3. SEA usually calculates the enrichment p-value for a pathway based on a list of preselected differentially expressed genes (DEGs) using statistical methods such as Student’s t-test^[37]4, [38]5. GSEA identifies a significant pathway by determining whether the genes of the pathway are ranked at the top or the bottom among all the genes according to their expression differences between two phenotypes^[39]6. The enrichment calculation in MEA is similar to that in SEA, but the network topology information is integrated^[40]7. These pathway enrichment tools are effective in identifying disease-associated genes with important pathophysiologic roles. Tissue biopsy is a conventional method to collect samples for cancer diagnosis, monitoring and pathologic analysis^[41]8. However, biopsy is frequently very difficult for patients with brain cancer or metastatic cancers^[42]9, [43]10, and more challenging for healthy controls. As a consequence, studies for such diseases typically include very few or even no normal controls^[44]11. This situation poses a serious challenge to the common pathway enrichment tools discussed above, as they all compare quantitative expression levels of pathway genes between two phenotypes^[45]2, [46]3. Hereafter we refer to a dataset consisting of samples with only one phenotype (disease) as a one-sided dataset. The control samples for the same disease available in other datasets cannot be incorporated into a one-sided dataset because the quantitative expression values are sensitive to the so-called batch effects between different experiments^[47]12, [48]13. Datasets from the Cancer Genome Atlas database (TCGA) database^[49]14 should also be considered as one-sided, since TCGA samples were derived from different institutions and processed in different batches. Therefore, the DEGs detected directly between tumor samples and normal controls from TCGA are questionable without appropriate batch adjustment^[50]15. However, batch adjustments may be biased if study groups are not evenly distributed across batches^[51]15. To tackle the above problem, some studies have used the within-sample relative expression orderings (REOs) instead of the quantitative expression values for disease screening^[52]16, [53]17 and gene signaling network analysis^[54]18. We previously developed a tool, individPath, to identify patient-specific dysregulated pathways based on reversal REOs in an individual sample compared with the highly stable REOs identified from a large cohort of normal samples which were accumulated previously from various sources^[55]19. Compared with the algorithms based on the quantitative expression values, the REO-based algorithms have some unique advantages, including insensitive to batch effects, free of between-sample data normalization, reproducible across independent data^[56]17, [57]20 and reuse of accumulated data^[58]21, [59]22. Therefore, for a one-sided disease dataset, it is intuitive to compare the differences between the REOs in diseased samples and the REOs in control samples which may come from an independent dataset, to identify whether a pathway is altered by the disease or not. We developed a tool, DRFunc, to identify the pathways which are significantly enriched with differential REOs of the pathway member genes. Using two independent microarray datasets for gastric cancer, lung cancer and breast cancer, respectively, we demonstrated that differential REOs between diseased samples and control samples were reproducible for independent datasets. These differential REOs were preserved even after the control or case samples were changed with the corresponding control or case samples from the other dataset for the same cancer. Using two RNA-seq datasets from TCGA, we showed that differential REOs identified from the sequence-based data are also highly reproducible in the array-based data. The usage of this tool was further exemplified by applying to a one-sided glioblastoma dataset to detect significantly altered pathways. For two expression datasets collected for patients with breast cancer receiving chemotherapy, DRFunc could detect significant pathways which were elusive for the traditional tools which depend on the pre-selected DEGs, in particular when few DEGs could be identified. Materials and Methods Data source and data preprocessing We collected 11 microarray datasets from the Gene Expression Omnibus (GEO) database ([60]http://www.ncbi.nlm.nih.gov/geo/), as shown in Table [61]1. All of the datasets were measured by the Affymetrix platforms. The raw data were preprocessed by the Robust Multi-array Analysis algorithm^[62]23. The SOURCE database^[63]24 was used for mapping CloneIDs to GeneIDs. From the Cancer Genome Atlas database (TCGA), two RNA-seq datasets were downloaded (see Table [64]1). The RNA-seq datasets were measured by the Illumina HiSeq platform. The raw data were normalized^[65]25 using the edgeR BioConductor package^[66]26. Table 1. Datasets used in this study. Dataset^a Case Control Data source Platform GC[38-31] 38 31 [67]GSE13911 [68]GPL570 GC[12-15] 12 15 [69]GSE19826 [70]GPL570 LC[91-65] 91 65 [71]GSE19188 [72]GPL570 LC[60-60] 60 60 [73]GSE19804 [74]GPL570 BC[12-27] ^ER 12 27 [75]GSE10810 [76]GPL570 BC[34-17] ^ER 34 17 [77]GSE42568 [78]GPL570 GBM[34-13] 34 13 [79]GSE50161 [80]GPL570 GBM[70-0] 70 0 [81]GSE53733 [82]GPL570 BC[68-46] ^Response 68 46 [83]GSE20194 [84]GPL96 BC[61-19] ^Response 61 19 [85]GSE20271 [86]GPL96 LUAD[125-37] 125 37 TCGA HiSeq2000 CRC[32-32] 32 32 [87]GSE8671 [88]GPL570 COAD[285-41] 285 41 TCGA HiSeq2000 [89]Open in a new tab Denotes: ^aGC denotes gastric cancer, LC denotes lung cancer, BC denotes breast cancer, ER denotes estrogen receptor, GBM denotes glioblastoma, LUAD denotes lung adenocarcinoma, CRC denotes colorectal cancer, and COAD denotes colon adenocarcinoma. We referred to each dataset using the following nomenclature: cancer type followed by the number of case and control samples separated by a hyphen sign. Pathway databases The gene ontology (GO), Kyoto encyclopedia of genes and genomes (KEGG) and the Molecular Signatures Database (MSigDB) were used for enrichment analysis in DRFunc. Taking the C2 gene sets of MSigDB as an example, 1330 canonical pathways (as of 16 February 2016) were download from the GSEA website. For a given dataset, all of the measured genes which were annotated in the 1330 pathways were considered as the background genes. In total, there were 8039, 6825 and 8548 genes for the [90]GPL570, [91]GPL96 and Illumina HiSeq2000 platforms, respectively. Identification of differential REOs between two phenotypes Given that the expression values of a gene pair (i, j) are denoted as (G [i], G [j]), R [ij], which is 1 if G [i] > G [j] and 0 if G [i] < G [j] within one sample, is defined as the REO of the gene pair. If two genes have the same expression value, the pair is excluded from analysis. For a dataset with n cases and m controls, differential REOs are identified through the following steps. (1) Calculate the values of R [ij] (0 or 1) for all pairs in each sample. (2) Count the frequencies of the binary values (1 or 0) of R [ij] for each pair (i, j) in each phenotype. For example, there are n [1] samples with R [ij] = 1 and n [2] samples with R [ij] = 0 in the case group (n [1] + n [2] = n), and m [1] samples with R [ij] = 1 and m [2] samples with R [ij] = 0 in the control group (m [1] + m [2] = m). (3) Test the null hypothesis that the frequencies have no association with phenotype (case or control) using the Fisher’s exact test. (4) Select differentially ranked (DR) gene pairs. After the Fisher’s exact test is done for all the pairs, the p-values are corrected to control the false discovery rate (FDR)^[92]27. A gene pair is considered as a DR gene pair if the adjusted p-value is less than 5%. Furthermore, for a DR gene pair, there are two possible patterns. If n [1]/n [2] > m [1]/m [2], the pair is called as Pattern 1, otherwise it is called as Pattern 2. Reproducibility of DR gene pairs The binomial test is employed to evaluate the reproducibility between the two lists of DR gene pairs. If a gene pair has the same pattern of reversal REO in the two lists, this gene pair is considered as a concordant gene pair. If two lists of DR gene pairs have M common pairs, the probability of observing at least M [1] concordant gene pairs by chance is calculated by the following cumulative binomial distribution model, [MATH: P=i=M1M(Mi )p 0i(1p 0)Mi :MATH] 1 where p [0] is the probability for a random gene pair to be a concordant gene pair by chance between two lists (here p [0 = ]0.5 since there are only two mutual-exclusive outcomes, Pattern 1 or Pattern 2, of a DR gene pair). The concordant ratio of these two lists of DR gene pairs is defined as M [1]/M. The two lists of DR gene pairs are considered significantly reproducible if P < 0.05. Pathway enrichment analysis based on DR gene pairs If k gene pairs are DR gene pairs from n background gene pairs, the probability of observing at least x DR gene pairs in a pathway with a total of m background gene pairs by chance is given by the cumulative hypergeometric distribution function as follows, [MATH: P=1i=0x-1(mi )(nm ki)(nk ) :MATH] 2 The number of the background gene pairs (n) is equal to N(N − 1)/2, where N represents the number of the background genes. The pathways significantly enriched with DR gene pairs were identified after multiple testing adjustments with FDR < 5%^[93]27. Figure [94]1 shows the flowchart of DRFunc. The identification of DR gene pairs and detection of significant pathways were implemented in an open-source R package which is available at [95]https://github.com/keyougu/DRFunc.git. Figure 1. Figure 1 [96]Open in a new tab Flowchart of DRFunc. The DRFunc algorithm includes three steps: input of expression profiles for case and control samples (from the same or different experiments), DR gene pair identification, annotation and detection of significant pathways. Results Reproducible DR gene pairs identified between tumor and normal samples The datasets of gastric cancer, lung cancer and ER^− breast cancer which have large sample size were first used to test whether DR gene pairs could be reproducibly identified in different subsets of the same parent dataset. For each dataset, the tumor samples and control samples were randomly divided into two subsets with approximately equal sample size respectively. For example, the 38 tumor samples in GC[38-31] were divided into two groups with 19 samples each, while the 31 normal samples were divided into two groups with 15 samples and 16 samples respectively. They formed two subsets, one with 19 tumor samples and 15 normal samples and the other with 19 tumor samples and 16 normal samples. From these two subsets, DR gene pairs were identified and compared. This procedure was repeated 100 times. The result showed that the identified DR gene pairs were highly reproducible, with an average concordant ratio of 99.99% for the dataset of GC[38-31] (see Table [97]2). Similar results were observed for LC[91-65] and BC[34-17] ^ER (see Table [98]2). These results show that the identified DR gene pairs are highly reproducibly within one dataset. Table 2. Mean and standard deviation of the number of DR gene pairs identified from random subsets. Dataset #DR pair #Overlapped pair #Concordant pair Concordant ratio GC[38-31] 1054900 ± 237429 1169868 ± 271089 586201 ± 36373 586198 ± 36373 0.9999 ± 8.42 × 10^−6 LC[91-65] 5211347 ± 236859 4983364 ± 256758 4078924 ± 69845 4078880 ± 69861 0.9999 ± 6.74 × 10^−6 BC[34-17] ^ER 1199844 ± 328353 1046124 ± 308752 595768 ± 86284 595768 ± 86284 0.9999 ± 2.2 × 10^−16 [99]Open in a new tab Next, the reproducibility was analyzed for the DR gene pairs identified from different experimental datasets for the same cancer. As shown in Table [100]3, in the dataset GC[12-15], 249,379 DR gene pairs were identified between gastric tumor samples and normal controls, among which 75.67% were also detected as DR gene pairs in dataset GC[38-31]. Among the overlapped DR gene pairs, 99.97% showed the concordant REOs in the two gastric datasets, which could not happen by random chance (p < 2.2 × 10^−16, binomial test). Similar result was observed in the two datasets for lung cancer. In the dataset LC[60-60], the one with smaller sample size of the two datasets, 75.18% of the detected DR gene pairs were also identified in the dataset LC[91-45] which has larger sample size than LC[60-60], and 98.39% of the overlapped DR gene pairs had the concordant REOs in the two datasets, which could not happen by random chance (p < 2.2 × 10^−16, binomial test). In the two datasets for ER^− breast cancer, the concordant ratio was 99.84%. These results indicate that extensive disruptions of gene REOs existed in tumor samples and such disrupted REOs were reproducible in different datasets. The number of genes in each DR gene pair list were provided in Supplementary file, Table [101]S1. Table 3. Concordance of DR gene pairs identified for each cancer dataset. Dataset #DR pair #Overlapped pair #Concordant pair Concordant ratio GC[12-15] 249379 188706 186655 0.9997 GC[38-31] 3060133 LC[60-60] 5035285 3785548 3724663 0.9839 LC[91-65] 7977878 BC[12-27] ^ER 2527003 1406505 1404282 0.9984 BC[34-17] ^ER 3087813 [102]Open in a new tab A further test on reproducibility was carried out to exchange the case and/or control samples between two datasets for the same cancer type. The DR gene pairs identified from the newly exchanged datasets were compared with the DR gene pairs identified from the original datasets. As shown in Table [103]4, 3,870,438 DR gene pairs were identified in the merged dataset GC[12-31] by integrating the normal samples from GC[38-31] and the tumor samples from GC[12-15], among which 163,670 were included in the DR gene pairs identified from the original dataset GC[12-15]. Similarly, 4,523,783 DR gene pairs were identified in the merged dataset GC[38-15], among which 1,560,772 were found in the original dataset GC[38-31]. With only the control samples exchanged, the concordant ratios of DR gene pairs between the new datasets and their respective original datasets were 99.19% and 92.41% (Table [104]4), which were comparable to the concordant ratio between the two original datasets (99.97%) and could not happen by random chance (p < 2.2 × 10^−16, binomial test). For lung cancer, the concordant ratios between the control-exchanged datasets and the original datasets were 95.41% and 95.19% respectively, which were also comparable to the concordant ratio between the two original datasets (98.39%). For the two control-exchanged datasets for ER^− breast cancer, the concordant ratios were 98.88% and 97.47% respectively, also comparable to the concordant ratio between the two original datasets (99.84%). The detected DR gene pairs were also highly reproducible in the case-exchanged datasets: the minimum concordant ratio was as high as 97.06% (see Supplementary file, Table [105]S2). These analyses further indicate that differential REOs for a specific tumor type could be reproducibly detected from independent datasets of different sources. Therefore, when focusing on the REOs of genes, tumor samples and normal samples measured by different studies can be directly compared. Table 4. Concordance of DR gene pairs identified from datasets with the same case samples but different control samples. Dataset #DR pair #Overlapped pair #Concordant pair Concordant ratio GC[12-31] 3870438 163670 162305 0.9919 GC[12-15] 249379 GC[38-15] 4523783 1560772 1442242 0.9241 GC[38-31] 3060133 LC[60-65] 7387229 3982182 3799350 0.9541 LC[60-60] 5035285 LC[91-60] 8935664 6335001 6030374 0.9519 LC[91-65] 7977878 BC[12-17] ^ER 2649823 1130216 1117603 0.9888 BC[12-27] ^ER 2527003 BC[34-27] ^ER 6630077 2393323 2332764 0.9747 BC[34-17] ^ER 3087813 [106]Open in a new tab Performance of DRFunc in detecting significant pathways Significant pathways were detected from the 1330 MSigDB C2 collection by employing the cumulative hypergeometric distribution test implemented in DRFunc. With FDR < 5%, 73 and 239 pathways were detected, respectively, to be significantly enriched with the DR gene pairs identified from GC[12-15] and GC[38-31]. For lung cancer, 255 and 380 pathways were detected in LC[60-60] and LC[91-65], respectively. For ER^− breast cancer, 363 and 366 pathways were detected for BC[12-27]^ER and BC[34-17] ^ER, respectively. The overlapped pathways were shown in Fig. [107]2. The pathway names were listed in Supplementary file. Notably, there were 17 pathways commonly detected in the six datasets for the three caner types, including the ECM receptor interaction, focal adhesion pathways in KEGG, the RB1 and integrin related pathways in PID. There were 290 pathways commonly detected for at least two cancer types, indicating that different cancers may have many similar enriched pathways. These results suggest that the REOs of genes in many pathways were significantly disrupted under cancerous conditions, and DRFunc could capture such disruptions. Figure 2. Figure 2 [108]Open in a new tab Overlaps of significant pathways detected for the three cancer types. The bar plot shows the number of significant pathways (y-axis) shared by at least two, three, four, five and six datasets (x-axis) for BC, LC and GC. With FDR < 5%, 183 and 307 significant pathways were detected, respectively, in the control-exchanged datasets GC[12-31] and GC[38-15] for gastric cancer. In the two control-exchanged datasets for lung cancer, 440 and 383 pathways were detected, respectively, and in the two control-exchanged datasets for ER^− breast cancer, 428 and 529 pathways were detected, respectively. Figure [109]3 show the number of overlapped significant pathways detected for the original datasets and for the control-exchanged datasets. This indicate that integration of cancer samples and control samples from different datasets is feasible using the DR gene pairs in order to detect significant pathways. Figure 3. Figure 3 [110]Open in a new tab Numbers of significant MSigDB pathways detected for GC, LC, and BC. One RNA-seq dataset with 125 adenocarcinoma samples and 37 normal controls from 20 batches, denoted as LUAD[125-37] were downloaded from TCGA (see Table [111]1). By applying DRFunc, 7,661,296 DR gene pairs were identified and they were significantly enriched in 530 pathways. These results were compared with those obtained from the microarray datasets. There were 2,656,494 overlapped gene pairs between the 5,035,285 DR gene pairs identified in LC[60-60] and the 7,661,296 DR gene pairs identified in LUAD[125-37]. Among these overlapped DR gene pairs, 97.48% showed the concordant REOs between the microarray-based and sequence-based results, which could not happen by random chance (p < 2.2 × 10^−16, binomial test). Among the 255 pathways detected for LC[60-60], 215 pathways were also detected for LUAD[125-37]. There were 3,611,571 pairs overlapped between the DR gene pairs identified in LC[91-65] and in LUAD[125-37] with a concordant ratio of 95.58% (p < 2.2 × 10^−16, binomial test). Of the 380 pathways detected in LC[91-65] 291 were also detected in the sequence-based dataset LUAD[125-37]. Similar results were observed for colorectal cancer, for which one microarray dataset and one RNA-seq dataset were collected (see Table [112]1). The concordant ratio of the DR gene pairs between CRC[32-32] and COAD[285-41] was as high as 97.81%. Of the 283 significant pathways detected in CRC[32-32], 205 were also detected in COAD[285-41]. These results suggest that differential REOs between case and control samples identified by DRFunc had cross-platform reproducibility. To address whether pathways detected by using DRFunc is robust, random experiments were performed by adding different proportions of arbitrarily chosen gene pairs from the background into the real DR gene pairs identified between cases and controls in each dataset. The result showed that the pathways detected by DRFunc were robust (see Supplementary file). This conclusion is consistent with the viewpoint that functional categories are robust to different levels of noises^[113]28. Application of DRFunc to one-sided GBM data To demonstrate the applicability of DRFunc in one-sided data, two datasets, GBM[70-0] with 70 samples of primary GBM samples^[114]10 and GBM[34-13] with 34 primary GBM samples and 13 normal brain tissue samples^[115]29, were collected. By integrating the GBM samples in GBM[70-0] and the normal samples in GBM[34-13] (denoted as integrated GBM[70-13] dataset) 5,756,553 DR gene pairs were identified. In the dataset GBM[34-13] itself, 3,659,102 DR gene pairs were identified, among which 80.84% overlapped with the former DR gene pairs. In particular, 99.85% of the overlapped gene pairs had the concordant REOs in the two groups of GBM patients. With FDR < 5%, 363 pathways were detected to be significantly enriched with the DR gene pairs identified in the integrated GBM[70-13] dataset. Meanwhile, 324 pathways were identified in GBM34-13, among which 266 were also detected in the integrated GBM[70-13] dataset. They were listed in the supplementary material. As the 1330 MSigDB C2 pathways integrated several online pathway databases with redundancy, the number of the detected pathways was also showed in Fig. [116]4, grouped by the pathway database source. Many of the pathways were found to be associated with GBM in literature, including the BioCarta EGF pathway and MTOR pathway^[117]30, the KEGG P53 signaling pathway and the PID TGF-β and Ras signaling pathway^[118]31. Figure 4. Figure 4 [119]Open in a new tab Venn diagrams for the number of significant MSigDB pathways detected for GBM. The 1330 significant MSigDB pathways were divided into five groups according to the source databases, including Biocarta, KEGG, PID, Reactom and the others. Since the number of GBM samples in GBM[70-0] was approximately twice of that in GBM[34-13], we performed resampling experiments to evaluate the effect of sample size. A subset of 34 GBM samples were randomly extracted from GBM[70-0] and integrated with the normal samples from GBM[34-13] for DR gene pair identification and significant pathway detection. This random experiment was repeated 100 times. The average number of DR gene pairs identified in the resampling experiments was 5,046,700, and the average concordant ratio was 99.91% with the 3,659,102 DR gene pairs identified in GBM[34-13]. The average number of significant pathways detected in the 100 resampling experiments was 374.80, and the average of overlapped pathways were 271.40 with the 324 pathways detected in GBM[34-13]. These results suggest that DRFunc could be used in one-sided data to detect the underlying dysregulated and disease associated biological pathways. The DRFunc algorithm was compared with two pathway analysis algorithms, SEA^[120]5 and GSEA^[121]6. As no controls were collected in GBM[70-0], the traditional SEA analysis could not be applied^[122]5. In GBM[34-13], 8,731 DEGs were identified (FDR < 5%, Student’s t-test). Using SEA, these DEGs were significantly enriched in 41 MSigDB pathways^[123]5, much fewer than the number of pathways detected by DRFunc in GBM[34-13]. Notably, the above mentioned EGR, MTOR, P53, TGF-β and Ras signaling pathways detected by DRFunc were not included in these 41 pathways. In contrast, 32 of these 41 pathways were also detected by DRFunc. When using GSEA, even with FDR < 25%, no significant pathways were detected in GBM[34-13]. These results suggest that the rank-based tool DRFunc could identify much more biologically meaningful pathways than the traditional enrichment analysis. DRFunc can detect pathways with only a few DEGs, since a dysregulated gene with a large change in quantitative expression level may result in many DR gene pairs. For example, the BioCarta EIF4 pathway, which mainly describes the regulation of eIF4E and p70 S6 kinase, contained 24 genes measured in the GBM[34-13] dataset, among which only nine genes were identified as DEGs using Student’s t-test. The percentage of DEGs was only 37.50% in this pathway, while the percentage of DEGs was 47.54% in the background. Thus this pathway was not detected as significant by SEA. In contrast, these 24 genes formed 276 gene pairs, among which 53 were identified as DR gene pairs in GBM[34-13]. Therefore the pathway was detected to be significant by DRFunc. It has been reported that the overexpression of eIF4E could cause oncogenic transformation and elevated eIF4E protein levels were found in many human cancers including GBM^[124]32, [125]33. Interestingly, PRKCB in this pathway involved in 20 DR gene pairs, and its average expression level was higher than the expression levels of all its 20 partner genes in the normal samples but became lower than the expression levels of all its 20 partner genes in the GBM samples. That is to say that PRKCB was down-regulated greatly in GBM. This was consistent with the expression level changes as observed in GBM[34-13] and literature results reported for GBM^[126]34 as well as for other cancer types^[127]35. Similarly, WIF1 in KEGG WNT signaling pathway was found to be down-regulated greatly in GBM by comparing its expression level with those of 127 partner genes. This was consistent with the result reported previously^[128]36. These two examples suggest that such strongly dysregulated genes could lead to a high appearance frequency in DR gene pairs and make the associated pathways detectable by DRFunc. Application of DRFunc to preoperative chemotherapy response data of breast cancer A pathway with only a few DEGs cannot be detected by SEA but it may be significantly enriched with DR gene pairs. This hinted us that DRFunc might be able to capture functional disruptions in data with weak expression signals. Breast cancer patients with the pathological complete response (pCR) have a favorable prognosis compared to patients with residual disease (RD) and our previous analysis has shown that expression differences between these two conditions could be weak^[129]37. Two gene expression datasets were collected for preoperative chemotherapy response of breast cancer (see Table [130]1) to test whether DRFunc could identify such weak expression signals. Using Student’s t-test with FDR < 5%, one gene was identified as DEG between 61 RD patients and 19 pCR patients in the dataset BC[61-19] ^Response, indicating that expression differences between these two conditions were small. Because there was only one DEG, it was unable to detect significantly dysregulated pathways using SEA. However, using DRFunc, 9,569 DR gene pairs were identified with FDR < 5%, which significantly enriched in 38 MSigDB pathways (Supplementary file). In BC[68-46] ^Response, 321 genes were identified as DEGs between 68 RD patients and 46 pCR patients, significantly enriched in 28 MSigDB pathways as detected by SEA. With FDR < 5%, DRFunc identified 90,561 DR gene pairs, which significantly enriched in 84 MSigDB pathways (Supplementary file). When using GSEA, with FDR < 5%, no significant pathways were detected in either of the two datasets. These results suggested that the rank-based algorithm DRFunc could identify more biological pathways than the traditional enrichment analysis, especially when the expression differences were not significant. Discussion Gene expression profiling for only one phenotype is frequently seen in experimental design when sampling of normal control tissues is difficult due to the invasive nature of biopsy^[131]9. For such one-sided data, current functional enrichment analysis tools which focus on quantitative expression differences between two phenotypes have difficulty in finding phenotype-related functional pathways. The within-sample REOs have been found robust against systematic batch effects and transferable among independent datasets which enables the reuse of accumulated samples^[132]19, [133]21, [134]38. In the present work, we proposed an REO-based algorithm DRFunc, which could robustly identify the underlying disturbed pathways from such one-sided dataset by integrating control samples of the same tissue measured by other independent experiments. Our analyses showed that the DR gene pairs identified by DRFunc for gastric cancer, lung cancer and ER^− breast cancer were highly reproducible among independent datasets and among datasets with case-control samples integrated from different studies. The comparison between microarray-based and sequence-based data for lung cancer and colorectal cancer also suggested the high cross-platform reproducibility of DR gene pairs identified by DRFunc. Such consistent DR gene pairs were previously observed among datasets generated by different microarray platforms^[135]39. The power of DRFunc may be influenced by the sample size in detecting DR gene pairs. For example, with FDR < 5%, 249,379 DR gene pairs were identified from the smaller-size dataset of GC[12-15], which was less than ten-fold of the number of DR gene pairs (3,060,113) identified from the larger-size dataset of GC[38-31] (Table [136]3). The insufficient sample size for any of the datasets will reduce the number of overlapped DR gene pairs^[137]40. Although the numbers of DR gene pairs in GC[12-31] and GC[38-15] were almost the same, the overlapped DR gene pairs between GC[12-31] and GC[12-15] was ten-fold less than the number of overlapped DR gene pairs between GC[38-15] and GC[38-31] (Table [138]3). The reduced power of DR gene pair identification will ultimately reduce the power of significant pathway detection. As shown in Fig. [139]3, only 73 pathways were significantly enriched with the DR gene pairs identified in GC[12-15], whereas 239 significant pathways were detected in GC[38-31]. Some DR gene pairs may not overlap between different experiments. This is probably due to the fact that an experiment cannot capture all disease-associated differential signals, thus different experiments for the same disease may capture only partial DEGs each^[140]40. For example, among the top 100 genes with the highest appearance frequencies in the DR gene pairs identified only in GC[38-31], not in GC[12-15], 65 were identified as DEGs in GC[38-31], not in GC[12-15] (Student’s t-test, FDR < 5%). Non-overlapped DEGs would result in non-overlapped DR gene pairs between different experiments. Due to the above mentioned reasons, ultimately, some significant pathways cannot overlap between different datasets for the same disease. The problem of pathway overlaps has been discussed, and it has been suggested that the significant pathways could be rather functionally similar by reducing their corresponding statistical significance levels^[141]5, [142]41. It has been reported that many confounding factors such as gender and ethnicity may lead to gene expression differences among individuals^[143]42–[144]45. Therefore, the two datasets (LC[91-65] and LC[60-60]) for lung cancer with larger sample sizes were used to evaluate whether heterogeneous gene expression exists among normal samples. Information on the normal samples of the two lung cancer datasets was available in Supplementary file, Table [145]S3. The normal samples in LC[91-65] were obtained from 41 males, 11 females and 13 samples without gender information. Comparing the gene expression profiles of the 41 males and 11 females, only 0.03% of the background gene pairs could be identified as DR gene pairs. However, when comparing the 11 normal female samples of Caucasian in LC[91-65] to the 60 normal female samples of Chinese in LC[60-60], about 3.50% of the background gene pairs were found as DR gene pairs. This result indicates that ethnicity might be a confounding factor, which might introduce some disease-irrelevant DR gene pairs. Consequently, when applying DRFunc to detect DR gene pairs for significant pathway detection, some disease-irrelevant pathways may creep in. In spite of this, comparing to the traditional pathway enrichment analysis methods based on quantitative gene expression levels, which have limited usage with one-sided data, DRFunc has superiority in providing candidate pathways. To evaluate whether a significant pathway detected by DRFunc have specific biological implications or not, it is required to generate some biological hypotheses for wet lab experiment (such as Q-PCR) validation^[146]5. In this paper, we firstly showed that, in the one-sided GBM data, 266 of the 363 pathways detected in GBM[70-13] could be reproducibly detected in the other dataset GBM[34-13], which shares the same normal samples. Then, in the two application examples, besides the pathways already discussed in the Result section, we have additionally found evidence from published literature for the top 10 most significant pathways to support their association with the corresponding phenotype (Supplementary file). Further, to show that the significant pathways detected by DRFunc could not be detected if no phenotype differences exist, we have additionally performed random experiments by randomly reassigning labels to the disease and normal samples. By independently permuting the 70 GBM samples from GBM[70-0] and 13 normal samples from GBM[34-13] for 100 times, only 10.70 significant pathways were detected on average. When applying the same randomization procedure to BC[68-46] ^Response dataset, only 7.02 pathways were detected on average. All these results support the ability of DRFunc in providing candidate disease-associated significant pathways using gene expression data even the one-sided data. Finally, if only a limited (or insufficient) number of normal control samples for a tissue were obtained in a study, normal samples from other independent datasets should be integrated for DR gene pair identification to reduce disease-irrelevant DR gene pairs introduced by population variations. In conclusion, through detection of DR gene pairs between diseased samples and normal controls collected from different experiments, disease-relevant pathways can be identified, which provide functional insights into the disease mechanism. The usage of the DR gene pairs instead of the DEGs enables us to make adequate use of the large one-sided disease samples and the samples with weak expression signals available in public data archives. This may facilitate many downstream analyses such as survival prediction. Our algorithm also provides a new tool for comparing transcriptional expression profiling of genes between two groups of samples from the same or different experiments. Electronic supplementary material [147]Supplementary file^ (378.2KB, pdf) Acknowledgements