Abstract Researchers usually measure only a few technical replicates of two types of cell line, resistant or sensitive to a drug, and use a fold-change (FC) cut-off value to detect differentially expressed (DE) genes. However, the FC cut-off lacks statistical control and is biased towards the identification of genes with low expression levels in both cell lines. Here, viewing every pair of resistant-sensitive technical replicates as an experiment, we proposed an algorithm to identify DE genes by evaluating the reproducibility of the expression difference or FC between every two independent experiments without overlapping samples. Using four small datasets of cancer cell line resistant or sensitive to a drug, we demonstrated that this algorithm could efficiently capture reproducible DE genes significantly enriched in biological pathways relevant to the corresponding drugs, whereas many of them could not be found by the FC and other commonly used methods. Therefore, the proposed algorithm is an effective complement to current approaches for analysing small cancer cell line data. __________________________________________________________________ In experiments for comparing gene expression profiles between two types of cancer cell lines which are respectively resistant and sensitive to a particular drug, researchers usually generate only two or three technical replicates for each type given that there is no biological difference between technical replicates for a particular clonally derived cell line. Typically, in these small datasets, genes are selected as differentially expressed (DE) genes if their fold changes (FCs), computed as the ratios of the average gene expressions of the genes between the two types of cell lines, are larger than a arbitrarily determined cut-off value. Obviously, genes that are highly expressed in both cell lines can hardly reach large FCs, and thus the FC approach tend to miss them even if their absolute expression level differences between the two types of cell lines are rather large[42]^1. Besides, genes with low expression levels in both cell lines may reach large FCs simply due to large measurement variations, whereas the FC approach lacks of statistical control to reduce such false discoveries. Significance Analysis of Microarrays[43]^1 (SAM) and the Rank Product[44]^2 (RP) methods have also been applied to analyse small datasets from cell line experiments[45]^3,[46]^4. However, similar to FC, they are both biased towards the identification of genes with large FCs between two types[47]^5,[48]^6,[49]^7,[50]^8,[51]^9. Compared with genes expressed at low levels, genes with high expression levels are more likely to participate in some more conserved pathways with important biological significance, such as RNA processing, metabolism, and membrane trafficking[52]^10. To increase the power of detecting genes with high expression levels, the Average Difference (AD) method have been proposed, which ranks genes by the difference in average expression levels between two groups of samples[53]^3. In contrast to FC, SAM and RP, AD tends to identify genes that are highly expressed in both cell lines with large absolute differences in expression levels and may miss genes expressed at low levels in both cell lines, even if their FCs are truly large. In response, the Weighted Average Difference (WAD) method[54]^3 is proposed to rank genes using the relative average signal intensity as the weight to compute the gene expression ratio between two types. However, the WAD method is limited by uncertainties in weighting. Furthermore, both AD and WAD lack statistical control. In the present study, for small datasets that include several technical replicates for each cell line, we considered every pair of resistant-sensitive technical replicates as an experiment and every two experiments without overlapping samples as independent experiments. Then, we proposed an algorithm to rank genes according to their absolute expression differences or FCs in each pair of resistant-sensitive cell line experiment and identified significantly reproducible DE genes between every two independent experiments through reproducibility evaluation[55]^11. The algorithm was comprehensively evaluated using four microarray datasets for drug-resistant and drug-sensitive cancer cell lines. Results The reproducibility-based PD ranking method In each of the four datasets, referred to as CP70/A2780, MDA-MB-231, LCC2/MCF7 and HCT116, we ranked genes according to the Pairwise Difference (PD) of expression levels in every pair of resistant and sensitive replicates, and then evaluated the dysregulation direction consistency score of the top n (n = 300, 600,……3000) genes between every two independent pairs without overlapping samples (see Methods for details). In the CP70/A2780, MDA-MB-231, LCC2/MCF7 datasets, the consistency scores of the top n (n = 300, 600,……3000) genes sorted from every two independent pairs were all significantly higher than expected by chance (binomial test, all p < 2.2E-16)([56]Fig. 1a–c). In the HCT116 dataset, the top n (n = 300, 600,……3000) genes between pair 1 (MEXP:179508 VS MEXP:179487) and pair 2 (MEXP:179509 VS MEXP:179486) were significantly consistent (binomial test, all p < 2.2E-16), suggesting that there were reproducible differential expression signals between the two independent pairs ([57]Fig. 1d). However, the top n (n = 300, 600,……3000) genes in another independent pair, pair 3 (MEXP:179507 and MEXP:179488), were inconsistent with the top n (n = 300, 600,……3000) genes ranked by PD in either pair 1 or pair 2, with all the consistency scores below 33%. The results suggested that unreliable measurements may be present in one or two samples of pair 3. Therefore, we excluded pair 3 and used only pair 1 and pair 2 for the subsequent analyses in this study. Detailed information about the independent pairs selected in each dataset is provided in [58]Table 1. Figure 1. The evaluation of consistency scores between every two independent pairs. [59]Figure 1 [60]Open in a new tab The consistency scores of top n (n = 300, 600,……3000) genes ranked by PD between every two independent pairs were evaluated in the CP70/A2780 dataset (a), the MDA-MB-231dataset (b), the LCC2/MCF7 dataset (c) and the HCT116 dataset (d). Consistency(%) ranged from 0%-100%, top 300 (top 600, etc.) were the top number of genes of two independent pairs ranked by PD. pair1-2, pair1-3, pair2-3 were the comparisons of reproducibility between pair 1 and pair 2, pair 1 and pair 3, pair 2 and pair 3 in each dataset, respectively. Table 1. The reproducibility of the top 300 genes ranked by PD between every two independent pairs selected in the four datasets. Dataset Pair Sample No. Comparison K S S/K(%) p CP70/A2780 pair 1 [61]GSM709781 VS [62]GSM709780 pair1–2 267 267 100.00% <2.2E-16 pair 2 [63]GSM709782 VS [64]GSM709778 pair1–3 151 142 94.04% <2.2E-16 pair 3 [65]GSM709783 VS [66]GSM709779 pair2–3 155 146 94.19% <2.2E-16 MDA-MB-231 pair 1 [67]GSM712688 VS [68]GSM712682 pair1–2 200 194 97.00% <2.2E-16 pair 2 [69]GSM712689 VS [70]GSM712683 pair1–3 186 174 93.55% <2.2E-16 pair 3 [71]GSM712690 VS [72]GSM712684 pair2–3 179 167 93.30% <2.2E-16 LCC2/MCF7 pair 1 [73]GSM1326254 VS [74]GSM1326258 pair 2 [75]GSM1326255 VS [76]GSM1326259 pair1–2 196 190 96.94% <2.2E-16 HCT116 pair 1 MEXPl: 179508 VS MEXP: 179487 pair 2 MEXP: 179509 VS MEXP: 179486 pair1–2 147 136 92.52% <2.2E-16 [77]Open in a new tab Note. pair1-2, pair1-3, pair2-3, the comparison of reproducibility between pair 1 and pair 2, pair1 and pair3, pair2 and pair3 in each dataset; K, the number of genes shared by two gene lists; S, the number of genes with same dysregulation directions; S/K(%), the consistency score; p, the probability of significant reproducibility. The consistency score between two gene lists with reproducible differential expression signals gradually decreased with increasing n, indicating the importance of optimising the length of the initial step. For example, in pair 1 and pair 2 of the HCT116 dataset, when n was set to 300 (i.e., the top 1 to 300 genes were included), the overlapping number was 147. Out of these 147 genes, 136 genes had the same dysregulation directions in the drug-resistant cell compared with the drug-sensitive cell in the two independent experiments; the consistency score was 92.52%. When n = 1200, although the consistency score for the overlapping 750 genes reached 90.53%, which was above the acceptability threshold of 90%, the consistency score between the genes in the 4th block (that is, the top 901 to 1,200 genes) was only 89.59%. Thus, at a certain consistency threshold, the consistency score between genes at the top of one block of the two gene lists was higher than the consistency score between genes at the bottom of the same block at the same step. Thus, when applying the reproducibility-based PD ranking algorithm to identify DE genes, a large initial step would allow genes with a low consistency score at the bottom of the block exceed the consistency threshold to become candidate DE genes and increase the probability of false positives. By contrast, if the initial step is too short, the number of overlapping genes between two gene lists could be too small to reach stable consistency scores, and the search process will likely terminate prematurely. As a result, an initial step that included 300 genes was adopted in the subsequent analyses. In the last but one section , we will further analyse the influence of parameter selection on the performance of the algorithm in detail. Using the reproducibility-based PD with an initial step of 300 and a consistency threshold of 90%, 6,366, 2,031, 2,928 and 1,057 DE genes were detected for the CP70/A2780, MDA-MB-231, LCC2/MCF7and HCT116 datasets, respectively. These DE genes were enriched in 9, 11, 19 and 12 KEGG pathways with an FDR less than 5%, respectively ([78]Supplementary Table S1–S4). Because a DE genes list can be significantly enriched in pathways only when it contains sufficient real DE genes[79]^12,[80]^13, these results suggest that the reproducibility-based PD algorithm can detect reliably DE genes in small datasets including only two or three technical replicates. To prove this, we did random experiments based on the dataset CP70/A2780 as an example. From the background genes (i.e., all the measured genes) in this dataset, we randomly selected a set of genes with the same number of DE genes detected by the reproducibility-based PD method with an initial step of 300 and a consistency threshold of 90%, and then did the enrichment analysis to count the number of pathways that were significantly enriched with the randomly selected genes at the same FDR level of 5%. This random experiment was repeated 1,000 times. The result showed that 97% of the 1,000 random experiments could not identify any significant pathway and the average of the number of significant pathways of the 1,000 random experiments was 0.043, whereas the number of the pathways significantly enriched with DE genes detected by the reproducibility-based PD method was 9. None of the 1,000 random experiments detected more significant pathways than the reproducibility-based PD method, indicating that the random probability of observing no less than the number of significant pathways detected by the reproducibility-based PD method was less than 0.001. In addition, we randomly replaced a certain proportion of the DE genes detected by the reproducibility-based PD method with the same number of genes randomly selected from the background genes (excluding genes in the DE genes list), and then did the enrichment analysis. With a mixture of 75% genes from the DE genes list and 25% genes from the background, averagely only 4.32 significant pathways could be detected in 1,000 random experiments and the number decreased to 0.62 when the proportion of random genes increased to 50%. The results showed that the number of pathways enriched by genes decreased greatly as the proportion of random genes (i.e., false DE genes) in the DE genes lists increased. It suggested that most of the DE genes detected by the reproducibility-based PD method were reliable. We also need to notice that the drug-resistant cancer cell lines in the four datasets were induced to resistance by repeated exposure to increasing concentrations of particular drugs, and thus some genes differentially expressed between drug-sensitive cancer cell lines and drug-induced resistant cancer cell lines may just represent drug-induced transcriptional changes that may be irrelevant to drug resistance mechanisms[81]^14,[82]^15. Nevertheless, for each dataset, many of the biological pathways significantly enriched with the DE genes detected by the reproducibility-based PD method are related to drug sensitivity. For example, the DE genes detected by the reproducibility-based PD method for the CP70/A2780 dataset of cisplatin resistant-sensitive cell lines were significantly enriched in biological pathways which are known to be related to drug cisplatin sensitivity, such as cell cycle, P53 signalling pathway, ubiquitin mediated proteolysis and protein processing in endoplasmic reticulum (see [83]Supplementary Table S1). The target of cisplatin is thought to be DNA and the formation of cisplatin-DNA adducts block DNA replication and transcription, leading to cell cycle arrest and apoptosis[84]^14. P53 is involved in DNA damage signaling and many studies[85]^14,[86]^16,[87]^17,[88]^18 have reported that the activity of P53 is related with the sensitivity of cancer cell to cisplatin. The relationship between ubiquitin mediated proteolysis and DNA damage response is close[89]^19 and cisplatin can activate the pathway. It has been suggested that ubiquitin- proteasome genes as targets for modulating cisplatin sensitivity[90]^20. Cisplatin is also shown to cause apoptotic caspases through the endoplasmic reticulum stress pathway[91]^21,[92]^22. In [93]supplementary Table S2–S4, we have listed some references which provided evidence of the corresponding