Abstract

   Researchers usually measure only a few technical replicates of two
   types of cell line, resistant or sensitive to a drug, and use a
   fold-change (FC) cut-off value to detect differentially expressed (DE)
   genes. However, the FC cut-off lacks statistical control and is biased
   towards the identification of genes with low expression levels in both
   cell lines. Here, viewing every pair of resistant-sensitive technical
   replicates as an experiment, we proposed an algorithm to identify DE
   genes by evaluating the reproducibility of the expression difference or
   FC between every two independent experiments without overlapping
   samples. Using four small datasets of cancer cell line resistant or
   sensitive to a drug, we demonstrated that this algorithm could
   efficiently capture reproducible DE genes significantly enriched in
   biological pathways relevant to the corresponding drugs, whereas many
   of them could not be found by the FC and other commonly used methods.
   Therefore, the proposed algorithm is an effective complement to current
   approaches for analysing small cancer cell line data.
     __________________________________________________________________

   In experiments for comparing gene expression profiles between two types
   of cancer cell lines which are respectively resistant and sensitive to
   a particular drug, researchers usually generate only two or three
   technical replicates for each type given that there is no biological
   difference between technical replicates for a particular clonally
   derived cell line. Typically, in these small datasets, genes are
   selected as differentially expressed (DE) genes if their fold changes
   (FCs), computed as the ratios of the average gene expressions of the
   genes between the two types of cell lines, are larger than a
   arbitrarily determined cut-off value. Obviously, genes that are highly
   expressed in both cell lines can hardly reach large FCs, and thus the
   FC approach tend to miss them even if their absolute expression level
   differences between the two types of cell lines are rather large[42]^1.
   Besides, genes with low expression levels in both cell lines may reach
   large FCs simply due to large measurement variations, whereas the FC
   approach lacks of statistical control to reduce such false discoveries.
   Significance Analysis of Microarrays[43]^1 (SAM) and the Rank
   Product[44]^2 (RP) methods have also been applied to analyse small
   datasets from cell line experiments[45]^3,[46]^4. However, similar to
   FC, they are both biased towards the identification of genes with large
   FCs between two types[47]^5,[48]^6,[49]^7,[50]^8,[51]^9.

   Compared with genes expressed at low levels, genes with high expression
   levels are more likely to participate in some more conserved pathways
   with important biological significance, such as RNA processing,
   metabolism, and membrane trafficking[52]^10. To increase the power of
   detecting genes with high expression levels, the Average Difference
   (AD) method have been proposed, which ranks genes by the difference in
   average expression levels between two groups of samples[53]^3. In
   contrast to FC, SAM and RP, AD tends to identify genes that are highly
   expressed in both cell lines with large absolute differences in
   expression levels and may miss genes expressed at low levels in both
   cell lines, even if their FCs are truly large. In response, the
   Weighted Average Difference (WAD) method[54]^3 is proposed to rank
   genes using the relative average signal intensity as the weight to
   compute the gene expression ratio between two types. However, the WAD
   method is limited by uncertainties in weighting. Furthermore, both AD
   and WAD lack statistical control.

   In the present study, for small datasets that include several technical
   replicates for each cell line, we considered every pair of
   resistant-sensitive technical replicates as an experiment and every two
   experiments without overlapping samples as independent experiments.
   Then, we proposed an algorithm to rank genes according to their
   absolute expression differences or FCs in each pair of
   resistant-sensitive cell line experiment and identified significantly
   reproducible DE genes between every two independent experiments through
   reproducibility evaluation[55]^11. The algorithm was comprehensively
   evaluated using four microarray datasets for drug-resistant and
   drug-sensitive cancer cell lines.

Results

The reproducibility-based PD ranking method

   In each of the four datasets, referred to as CP70/A2780, MDA-MB-231,
   LCC2/MCF7 and HCT116, we ranked genes according to the Pairwise
   Difference (PD) of expression levels in every pair of resistant and
   sensitive replicates, and then evaluated the dysregulation direction
   consistency score of the top n (n = 300, 600,……3000) genes between
   every two independent pairs without overlapping samples (see Methods
   for details). In the CP70/A2780, MDA-MB-231, LCC2/MCF7 datasets, the
   consistency scores of the top n (n = 300, 600,……3000) genes sorted from
   every two independent pairs were all significantly higher than expected
   by chance (binomial test, all p < 2.2E-16)([56]Fig. 1a–c). In the
   HCT116 dataset, the top n (n = 300, 600,……3000) genes between pair 1
   (MEXP:179508 VS MEXP:179487) and pair 2 (MEXP:179509 VS MEXP:179486)
   were significantly consistent (binomial test, all p < 2.2E-16),
   suggesting that there were reproducible differential expression signals
   between the two independent pairs ([57]Fig. 1d). However, the top n
   (n = 300, 600,……3000) genes in another independent pair, pair 3
   (MEXP:179507 and MEXP:179488), were inconsistent with the top n
   (n = 300, 600,……3000) genes ranked by PD in either pair 1 or pair 2,
   with all the consistency scores below 33%. The results suggested that
   unreliable measurements may be present in one or two samples of pair 3.
   Therefore, we excluded pair 3 and used only pair 1 and pair 2 for the
   subsequent analyses in this study. Detailed information about the
   independent pairs selected in each dataset is provided in [58]Table 1.

Figure 1. The evaluation of consistency scores between every two independent
pairs.

   [59]Figure 1
   [60]Open in a new tab

   The consistency scores of top n (n = 300, 600,……3000) genes ranked by
   PD between every two independent pairs were evaluated in the CP70/A2780
   dataset (a), the MDA-MB-231dataset (b), the LCC2/MCF7 dataset (c) and
   the HCT116 dataset (d). Consistency(%) ranged from 0%-100%, top 300
   (top 600, etc.) were the top number of genes of two independent pairs
   ranked by PD. pair1-2, pair1-3, pair2-3 were the comparisons of
   reproducibility between pair 1 and pair 2, pair 1 and pair 3, pair 2
   and pair 3 in each dataset, respectively.

Table 1. The reproducibility of the top 300 genes ranked by PD between every
two independent pairs selected in the four datasets.

   Dataset Pair Sample No. Comparison K S S/K(%) p
   CP70/A2780 pair 1 [61]GSM709781 VS [62]GSM709780 pair1–2 267 267
   100.00% <2.2E-16
   pair 2 [63]GSM709782 VS [64]GSM709778 pair1–3 151 142 94.04% <2.2E-16
   pair 3 [65]GSM709783 VS [66]GSM709779 pair2–3 155 146 94.19% <2.2E-16
   MDA-MB-231 pair 1 [67]GSM712688 VS [68]GSM712682 pair1–2 200 194 97.00%
   <2.2E-16
   pair 2 [69]GSM712689 VS [70]GSM712683 pair1–3 186 174 93.55% <2.2E-16
   pair 3 [71]GSM712690 VS [72]GSM712684 pair2–3 179 167 93.30% <2.2E-16
   LCC2/MCF7 pair 1 [73]GSM1326254 VS [74]GSM1326258
   pair 2 [75]GSM1326255 VS [76]GSM1326259 pair1–2 196 190 96.94% <2.2E-16
   HCT116 pair 1 MEXPl: 179508 VS MEXP: 179487
   pair 2 MEXP: 179509 VS MEXP: 179486 pair1–2 147 136 92.52% <2.2E-16
   [77]Open in a new tab

   Note. pair1-2, pair1-3, pair2-3, the comparison of reproducibility
   between pair 1 and pair 2, pair1 and pair3, pair2 and pair3 in each
   dataset; K, the number of genes shared by two gene lists; S, the number
   of genes with same dysregulation directions; S/K(%), the consistency
   score; p, the probability of significant reproducibility.

   The consistency score between two gene lists with reproducible
   differential expression signals gradually decreased with increasing n,
   indicating the importance of optimising the length of the initial step.
   For example, in pair 1 and pair 2 of the HCT116 dataset, when n was set
   to 300 (i.e., the top 1 to 300 genes were included), the overlapping
   number was 147. Out of these 147 genes, 136 genes had the same
   dysregulation directions in the drug-resistant cell compared with the
   drug-sensitive cell in the two independent experiments; the consistency
   score was 92.52%. When n = 1200, although the consistency score for the
   overlapping 750 genes reached 90.53%, which was above the acceptability
   threshold of 90%, the consistency score between the genes in the 4th
   block (that is, the top 901 to 1,200 genes) was only 89.59%. Thus, at a
   certain consistency threshold, the consistency score between genes at
   the top of one block of the two gene lists was higher than the
   consistency score between genes at the bottom of the same block at the
   same step. Thus, when applying the reproducibility-based PD ranking
   algorithm to identify DE genes, a large initial step would allow genes
   with a low consistency score at the bottom of the block exceed the
   consistency threshold to become candidate DE genes and increase the
   probability of false positives. By contrast, if the initial step is too
   short, the number of overlapping genes between two gene lists could be
   too small to reach stable consistency scores, and the search process
   will likely terminate prematurely. As a result, an initial step that
   included 300 genes was adopted in the subsequent analyses. In the last
   but one section , we will further analyse the influence of parameter
   selection on the performance of the algorithm in detail.

   Using the reproducibility-based PD with an initial step of 300 and a
   consistency threshold of 90%, 6,366, 2,031, 2,928 and 1,057 DE genes
   were detected for the CP70/A2780, MDA-MB-231, LCC2/MCF7and HCT116
   datasets, respectively. These DE genes were enriched in 9, 11, 19 and
   12 KEGG pathways with an FDR less than 5%, respectively
   ([78]Supplementary Table S1–S4). Because a DE genes list can be
   significantly enriched in pathways only when it contains sufficient
   real DE genes[79]^12,[80]^13, these results suggest that the
   reproducibility-based PD algorithm can detect reliably DE genes in
   small datasets including only two or three technical replicates.

   To prove this, we did random experiments based on the dataset
   CP70/A2780 as an example. From the background genes (i.e., all the
   measured genes) in this dataset, we randomly selected a set of genes
   with the same number of DE genes detected by the reproducibility-based
   PD method with an initial step of 300 and a consistency threshold of
   90%, and then did the enrichment analysis to count the number of
   pathways that were significantly enriched with the randomly selected
   genes at the same FDR level of 5%. This random experiment was repeated
   1,000 times. The result showed that 97% of the 1,000 random experiments
   could not identify any significant pathway and the average of the
   number of significant pathways of the 1,000 random experiments was
   0.043, whereas the number of the pathways significantly enriched with
   DE genes detected by the reproducibility-based PD method was 9. None of
   the 1,000 random experiments detected more significant pathways than
   the reproducibility-based PD method, indicating that the random
   probability of observing no less than the number of significant
   pathways detected by the reproducibility-based PD method was less than
   0.001. In addition, we randomly replaced a certain proportion of the DE
   genes detected by the reproducibility-based PD method with the same
   number of genes randomly selected from the background genes (excluding
   genes in the DE genes list), and then did the enrichment analysis. With
   a mixture of 75% genes from the DE genes list and 25% genes from the
   background, averagely only 4.32 significant pathways could be detected
   in 1,000 random experiments and the number decreased to 0.62 when the
   proportion of random genes increased to 50%. The results showed that
   the number of pathways enriched by genes decreased greatly as the
   proportion of random genes (i.e., false DE genes) in the DE genes lists
   increased. It suggested that most of the DE genes detected by the
   reproducibility-based PD method were reliable.

   We also need to notice that the drug-resistant cancer cell lines in the
   four datasets were induced to resistance by repeated exposure to
   increasing concentrations of particular drugs, and thus some genes
   differentially expressed between drug-sensitive cancer cell lines and
   drug-induced resistant cancer cell lines may just represent
   drug-induced transcriptional changes that may be irrelevant to drug
   resistance mechanisms[81]^14,[82]^15. Nevertheless, for each dataset,
   many of the biological pathways significantly enriched with the DE
   genes detected by the reproducibility-based PD method are related to
   drug sensitivity. For example, the DE genes detected by the
   reproducibility-based PD method for the CP70/A2780 dataset of cisplatin
   resistant-sensitive cell lines were significantly enriched in
   biological pathways which are known to be related to drug cisplatin
   sensitivity, such as cell cycle, P53 signalling pathway, ubiquitin
   mediated proteolysis and protein processing in endoplasmic reticulum
   (see [83]Supplementary Table S1). The target of cisplatin is thought to
   be DNA and the formation of cisplatin-DNA adducts block DNA replication
   and transcription, leading to cell cycle arrest and apoptosis[84]^14.
   P53 is involved in DNA damage signaling and many
   studies[85]^14,[86]^16,[87]^17,[88]^18 have reported that the activity
   of P53 is related with the sensitivity of cancer cell to cisplatin. The
   relationship between ubiquitin mediated proteolysis and DNA damage
   response is close[89]^19 and cisplatin can activate the pathway. It has
   been suggested that ubiquitin- proteasome genes as targets for
   modulating cisplatin sensitivity[90]^20. Cisplatin is also shown to
   cause apoptotic caspases through the endoplasmic reticulum stress
   pathway[91]^21,[92]^22. In [93]supplementary Table S2–S4, we have
   listed some references which provided evidence of the corresponding