Abstract

Background

   The amount of RNA per cell, namely the transcriptome size, may vary
   under many biological conditions including tumor. If the transcriptome
   size of two cells is different, direct comparison of the expression
   measurements on the same amount of total RNA for two samples can only
   identify genes with changes in the relative mRNA abundances, i.e.,
   cellular mRNA concentration, rather than genes with changes in the
   absolute mRNA abundances.

Results

   Our recently proposed RankCompV2 algorithm identify differentially
   expressed genes (DEGs) through comparing the relative expression
   orderings (REOs) of disease samples with that of normal samples. We
   reasoned that both the mRNA concentration and the absolute abundances
   of these DEGs must have changes in disease samples. In simulation
   experiments, this method showed excellent performance for identifying
   DEGs between normal and disease samples with different transcriptome
   sizes. Through analyzing data for ten cancer types, we found that a
   significantly higher proportion of the DEGs with absolute mRNA
   abundance changes overlapped or directly interacted with known cancer
   driver genes and anti-cancer drug targets than that of the DEGs only
   with mRNA concentration changes alone identified by the traditional
   methods. The DEGs with increased absolute mRNA abundances were enriched
   in DNA damage-related pathways, while DEGs with decreased absolute mRNA
   abundances were enriched in immune and metabolism associated pathways.

Conclusions

   Both the mRNA concentration and the absolute abundances of the DEGs
   identified through REOs comparison change in disease samples in
   comparison with normal samples. In cancers these genes might play more
   important upstream roles in carcinogenesis.

Electronic supplementary material

   The online version of this article (10.1186/s12864-019-5502-y) contains
   supplementary material, which is available to authorized users.

   Keywords: Differentially expressed gene, Relative expression ordering,
   Cellular mRNA concentration, Absolute mRNA abundances

Background

   It is a common practice to identify differentially expressed genes
   (DEGs) between two phenotypes through comparing the gene expression
   profiles measured with the same amount of RNA (or mRNA) extracted from
   two-phenotype samples, based on the assumption that different types of
   cells have approximately the same amount of total RNA per cell
   (transcriptome size) [[45]1]. However, this assumption does not hold
   under many biological conditions. For example, high expression level of
   c-Myc can induce global transcriptional amplification of cancer cells
   [[46]2] and many cancer cells are aneuploid and/or polyploid [[47]3],
   both of which may cause a change in the transcriptome size [[48]4,
   [49]5]. Consequently, if two samples being compared are different in
   transcriptome sizes but still the same amounts of RNA are used which
   will result in different numbers of cells between the two measured
   samples, a direct comparison of the measurement values of the two
   samples can only identify transcripts with changed cellular
   concentration which might have no changes in absolute mRNA abundances
   [[50]6]. Although it could be argued that the concentrations of
   macromolecules are relevant parameters governing biochemical reactions
   inside cells, inappropriate interpretation of mRNA concentration
   changes might lead to incorrect conclusions for a range of biological
   questions, including the transcriptional characteristics of cancer
   cells [[51]7].

   Experimental methods have been proposed to identify genes with
   differential absolute mRNA abundances between two cells with different
   transcriptome sizes [[52]1, [53]7–[54]9]. However, none has been
   commonly accepted as a reliable standard approach. For example, if the
   number of cells used for RNA extraction can be determined, such as in
   experiments for cell lines or microdissected solid tumor tissues,
   external control RNA could be spiked into each RNA sample in proportion
   to the numbers of cells for data normalization [[55]1]. However, large
   technical variations in spike-in control enrichment and amplification
   during library preparation challenge the use of spike-in controls
   [[56]10, [57]11]. The biological scaling normalization approach
   proposed by Aanes et al. [[58]8] for adjusting the variation in
   transcriptome sizes between samples cannot be employed if the number of
   cells is unknown, which is often the case in experiments for
   undissected solid tissues. Besides the above-mentioned difficulties, it
   must be pointed out that all of these strategies are not useful for
   previously published data, where none of the information about external
   RNA controls or the cell numbers could be available.

   Ranking all genes according to their measured expression levels in a
   descending (or ascending) order in a sample, the within-sample relative
   expression ordering (REO) of a gene pair (G[A] and G[B]) represents
   whether the expression level of G[A] is higher or lower than that of
   G[B] in the sample. Previously, we have found that the within-sample
   REOs of gene pairs are highly stable in a particular type of normal
   tissues but widely disturbed in tumor tissues. Based on this finding,
   an algorithm, RankComp [[59]12], was proposed to detect DEGs through
   analyzing reversal REOs pattern in an individual disease sample, taking
   the highly stable REOs in normal samples as the background. Recently,
   we adjusted this algorithm slightly to fit case-control cohort data,
   named RankCompV2 [[60]13, [61]14]. The RankComp and RankCompV2
   algorithm detect the genes with expression changes that disrupt the
   gene correlation structures and change the REOs of the gene pairs from
   one phenotype to the other. Here, we reasoned that DEGs identified
   through REOs comparison must change in both mRNA concentration and
   absolute abundances through theoretical reasoning and simulation
   experiment. Then, RankCompV2 was applied to ten cancer datasets.
   Finally, we provided preliminary evidence that the DEGs with changes in
   both absolute mRNA abundances and concentration are more likely to be
   closely related with cancer driver genes and drug targets than the DEGs
   which may change only in mRNA concentration exclusively identified by
   the popular SAM or edgeR algorithm. RankCompV2 is implemented in C
   language on Linux and is available on GitHub
   ([62]https://github.com/pathint/reoa).

Methods

Data and processing

   All expression datasets, as summarized in Table [63]1, were collected
   from the Gene Expression Omnibus (GEO) database. For microarray and
   beadchip datasets, quantile normalized values were used in both SAM
   [[64]15] and RankCompV2. For the RNAseq data, edgeR uses raw counts as
   input to identify DEGs [[65]16]. When applying the edgeR package, we
   employed the default TMM (trimmed mean of M-values) [[66]17] to
   normalize the raw count for sequencing depth and RNA composition.
   Because TMM does not deal with the transcript length bias of sequencing
   data, the data normalized with this algorithm are not suitable to rank
   expression levels of genes with different transcript lengths. Thus,
   RankCompV2 uses log[2] RPKM data where the transcript length bias has
   been normalized, as input to identify DEGs. See supplementary method
   for details (Additional file [67]1).

Table 1.

   Twenty datasets for ten cancers analyzed in this study
   Cancer Type GEO series Platform Normal Cancer # of Genes
   LIHC [68]GSE57957 [69]GPL10558 Illumina beadchip 39 39 30,500
   [70]GSE45267 [71]GPL570 Affymetrix array 39 48 20,486
   KIRC [72]GSE46699 [73]GPL570 Affymetrix array 42 42 20,486
   [74]GSE53757 [75]GPL570 Affymetrix array 72 72 20,486
   HNSC [76]GSE33205 [77]GPL5175 Affymetrix array 25 44 14,963
   [78]GSE6631 [79]GPL8300 Affymetrix array 22 22 8592
   LUSC [80]GSE19188 [81]GPL570 Affymetrix array 65 27 20,486
   [82]GSE18842 [83]GPL570 Affymetrix array 32 32 20,486
   STAD [84]GSE13911 [85]GPL570 Affymetrix array 31 31 20,486
   [86]GSE29998 [87]GPL6947 Illumina beadchip 49 50 24,384
   COAD [88]GSE23878 [89]GPL570 Affymetrix array 24 35 20,486
   [90]GSE44076 [91]GPL13667 Affymetrix array 98 98 19,040
   LUAD [92]GSE27262 [93]GPL570 Affymetrix array 25 25 20,486
   [94]GSE87340 [95]GPL11154 Illumina HiSeq 27 27 19,471
   BRCA [96]GSE10780 [97]GPL570 Affymetrix array 70 30 20,486
   [98]GSE10810 [99]GPL570 Affymetrix array 27 31 20,486
   PAAD [100]GSE15471 [101]GPL570 Affymetrix array 36 36 20,486
   [102]GSE16515 [103]GPL570 Affymetrix array 16 36 20,486
   ESCA [104]GSE23400 [105]GPL96 Affymetrix array 53 53 12,432
   [106]GSE38129 [107]GPL571 Affymetrix array 30 30 12,432
   [108]Open in a new tab

   Abbreviation: LIHC Liver hepatocellular carcinoma, KIRC Kidney renal
   clear cell carcinoma, HNSC Head and Neck squamous cell carcinoma, LUSC
   Lung squamous cell carcinoma, STAD Stomach adenocarcinoma, COAD Colon
   adenocarcinoma, LUAD Lung adenocarcinoma, BRCA Breast invasive
   carcinoma, PAAD Pancreatic adenocarcinoma, ESCA Esophageal carcinoma

RankCompV2 algorithm

   The RankCompV2 algorithm was proposed for identifying DEGs with large
   expression changes lead REOs within diseased samples reversed,
   comparing with the stable REOs within the normal samples [[109]13].

   First, gene pairs with significantly stable REOs are identified in the
   normal samples. Stable gene pairs, defined as gene pairs with identical
   REO pattern in significantly more samples for one phenotype than
   expected by chance, were identified by a binomial test. For a given
   gene pair (G[i], G[j]), let s denote the number of samples in which
   gene i has a higher (or lower) expression level than gene j in a total
   of n samples, the significance of the REO pattern is determined by a
   binomial test as follows,
   [MATH: <mi>P</mi><mo>=</mo><mn>1</mn><mo>−</mo><munderover><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>0</mn></mrow>
   <mrow><mi>s</mi><mo>−</mo><mn>1</mn></mrow></munderover><mrow><mo>(</mo
   ><mtable columnspacing="1em"
   rowspacing="4pt"><mtr><mtd><mi>n</mi></mtd></mtr><mtr><mtd><mrow></mrow
   ><mi>i</mi></mtd></mtr></mtable><mo>)</mo></mrow><msup><mrow><mo
   stretchy="false">(</mo><msub><mi>p</mi><mrow><mn>0</mn></mrow></msub><m
   o
   stretchy="false">)</mo></mrow><mi>i</mi></msup><msup><mrow><mrow><mo>(<
   /mo><mn>1</mn><mo>−</mo><msub><mi>p</mi><mrow><mn>0</mn></mrow></msub><
   mo>)</mo></mrow></mrow><mrow><mi>n</mi><mo>−</mo><mi>i</mi></mrow></msu
   p> :MATH]
   1

   where p[0] is the probability of observing a certain REO pattern
   (G[i] > G[j] or G[i] < G[j]) in a sample by chance (p[0] = 0.5).
   Current approaches for adjusting the p-values in discrete statistics
   are still arguable [[110]18–[111]23]. Here, we used the Benjamini and
   Hochberg method [[112]24] for this purpose, though the method tends to
   have insufficient power for discrete data [[113]25].

   Similarly, gene pairs with significantly stable REOs in the disease
   samples are identified. Focusing on the overlaps of the two lists of
   gene pairs, the gene pairs with stable REOs in the normal samples are
   defined as the normal background stable REOs while the gene pairs with
   reversely stable REOs in the disease samples compared with the normal
   samples are defined as the reversal REOs of the disease group. For a
   given gene G, we counted the numbers of gene pairs with G > G[i] and
   gene pairs with G < G[i] in normal and disease samples, respectively,
   and listed the four-cell contingency table through analyzing the
   G-background REOs and the reversal REOs. Then the Fisher’s exact test
   is used to test whether gene G expresses differentially in the disease
   group. After the identification of all candidate DEGs, the gene pairs
   including candidate DEGs as partner genes are excluded from the
   construction of the contingency table. And the Fisher’s exact test is
   performed again to minimize the confound effects of the expression
   changes of the partner genes. This filtering process is conducted
   iteratively until the number of DEGs keeps stable in two successive
   iterations. The details of the RankCompV2 algorithm were described in
   our previous work [[114]13].

   RankCompV2 is an empirical algorithm, where the default FDR parameter
   (FDR < 0.05) for the determination of significantly stable REOs can
   control false discoveries in simulation experiments, as demonstrated on
   nine datasets in our previous study [[115]13] and on the twenty
   datasets in this study (Additional file [116]1, Additional file [117]2:
   Figure S1 and Additional file [118]3: Table S1).

Reproducibility evaluation of DEGs

   We used the POG (Percentage of Overlapping Genes) score [[119]26] and
   the concordance score to evaluate the reproducibility of DEGs
   identified from two independent datasets. If two lists of DEGs with
   length L[1] and L[2,] have n overlaps, among which s have the same
   dysregulation directions (up- or down-regulation), then the POG score
   from list 1 (or 2) to list 2 (or 1), denoted as POG[12] (or POG[21]),
   is calculated as s/L[1] (or s/L[2]), and the concordance score is
   calculated as s/n. We evaluated whether a concordance score is higher
   than what expected by chance using the binomial distribution as
   described above, where p[0] is the probability of a gene having the
   concordant dysregulation direction in the two lists by chance.

Enrichment analysis

   The hypergeometric distribution was used to determine the biological
   pathways significantly enriched with up- and down-regulated DEGs
   [[120]27], respectively, based on the Kyoto Encyclopedia of Genes and
   Genomes database (downloaded on May 16, 2016) [[121]28].

Results

Theoretical basis for identifying DEGs with changes in absolute mRNA
abundances

   The absolute mRNA abundance of a given gene in a cell is defined as the
   transcript number of the gene in the cell, and the mRNA concentration
   is defined as the proportion of mRNA of a given gene in the total mRNA
   of the cell. Because the mRNA concentration of a gene in a cell is
   equal to the mRNA concentration of the gene in the corresponding sample
   including many identical cells, a direct comparison of the measurement
   values of two samples can identify DEGs with changes in cellular mRNA
   concentration. However, when the transcriptome size of a tumor cell is
   different from that of a normal cell, direct comparison of the
   measurement values of two samples cannot identify genes with changes in
   absolute mRNA abundances at the single cell level. Here, we reasoned
   that DEGs identified through REOs comparison must change in both mRNA
   concentration and absolute abundances at the single cell level.

   Let T[k] represent the amount of total mRNA in a cell of sample k
   (k = 1, 2) and S represent the same amount of total mRNA extracted from
   the two samples. Then the number of cells in sample k can be
   represented as
   [MATH:
   <msub><mi>n</mi><mi>k</mi></msub><mo>=</mo><mi>S</mi><mo>/</mo><msub><m
   i>T</mi><mi>k</mi></msub> :MATH]
   2

   Under the ideal condition that mRNA is extracted from the pure normal
   epithelial cells (sample 1) and pure tumor epithelial cells (sample 2),
   the measured expression level of gene i in sample k is,
   [MATH: <msub><mi>M</mi><mi
   mathvariant="italic">ki</mi></msub><mo>=</mo><msup><msub><mi>r</mi><mi>
   k</mi></msub><mo>∗</mo></msup><msup><msub><mi>N</mi><mi
   mathvariant="italic">ki</mi></msub><mo>∗</mo></msup><msub><mi>n</mi><mi
   >k</mi></msub><mo>=</mo><msup><msub><mi>r</mi><mi>k</mi></msub><mo>∗</m
   o></msup><msup><msub><mi>N</mi><mi
   mathvariant="italic">ki</mi></msub><mo>∗</mo></msup><mi>S</mi><mo>/</mo
   ><msub><mi>T</mi><mi>k</mi></msub><mo>=</mo><msup><msub><mi>r</mi><mi>k
   </mi></msub><mo>∗</mo></msup><msub><mi>N</mi><mi
   mathvariant="italic">ki</mi></msub><mo>/</mo><msup><msub><mi>T</mi><mi>
   k</mi></msub><mo>∗</mo></msup><mi>S</mi><mo>=</mo><msup><msub><mi>r</mi
   ><mi>k</mi></msub><mo>∗</mo></msup><msup><msub><mi>C</mi><mi
   mathvariant="italic">ki</mi></msub><mo>∗</mo></msup><mi>S</mi> :MATH]
   3

   where r[k] is the linear correlation coefficient between the measured
   expression value and the transcript number of gene i in sample k with
   n[k] cells, N[ki] represent the transcript number of gene i
   (i = 1,...,m) in a cell of sample k. C[ki] = N[ki]/T[k] is proportional
   to the cellular concentration of the transcript of gene i in sample k.
   Here, we assume that the normalized count values of RNA-sequencing
   platforms and the fluorescence intensity values of microarray platforms
   are approximately linearly correlated with the transcript number in a
   sample within a certain range of gene expression level
   [[122]29–[123]32].

   Since S are the same for two samples and r[k] are comparable between
   two samples after data normalization, direct comparison of the
   normalized measurements (M[ki]) between the two samples is equivalent
   to the comparison of the concentrations (C[ki]) between the two
   samples. Consequently, the DEGs identified by using traditional
   methods, such as SAM for microarray data or edgeR for RNA-sequencing
   data, are the genes with changes in mRNA concentration (C[ki]) between
   the two samples.

   Because both T[k] and S in eq. ([124]3) are constant for a particular
   sample k, the within-sample REOs ranked according to the concentration
   (C[ki]) are the same with the REOs ranked according to the transcript
   number (N[ki]). Therefore, the observed reversal REOs in sample 2
   compared with sample 1 must be the reversal REOs of both the mRNA
   concentration and the transcript number (absolute mRNA abundances).
   Thus, the DEGs identified by the REO-based RankCompV2 algorithm must
   have changes in both mRNA concentration and absolute abundances.
   Consequently, they should be included in the DEGs with concentration
   changes detected by traditional quantitative-based methods such as SAM
   or edgeR, given that the later can achieve sufficient power in the data
   under analysis.

Evaluation of performance

   We assumed that the number of reads mapping to a transcript sequence is
   roughly proportional to the RNA amount of the transcript and the sum of
   the read counts of all transcripts (total mapped reads) was used to
   represent the total RNA amount of a sample. Thus, we performed a
   simulation experiment based on the RNA-sequencing data of the
   [125]GSE87340 dataset to evaluate the performance of RankCompV2 in data
   with global transcriptome size changes. For the 19,471 genes measured
   for the 27 normal samples, after removing genes with a count of 0 in
   more than 75% of the samples, we simulated disease samples by randomly
   selecting 6000 genes to produce 3000, 4000 and 5000 up-regulated DEGs
   and correspondingly 3000, 2000 and 1000 down-regulated DEGs,
   respectively. For each simulation experiment, the up-regulated genes
   were equally divided into four groups and the fold change (FC) levels
   of the genes in the four groups were assigned as 2, 3, 4 and 5,
   respectively. Similarly, the down-regulated genes were equally divided
   into four groups and the FC levels of the genes in the four groups were
   assigned as 1/2, 1/3, 1/4 and 1/5, respectively.

   When simulating more up-regulated DEGs than down-regulated DEGs, the
   simulated disease samples tend to have more total transcript counts
   than the normal samples, which mean that the transcriptome size of a
   disease cell is larger than that of a normal cell. In order to simulate
   the same amount of total RNA extracted from two samples, the read
   counts of each transcript in simulated disease samples were multiplied
   by a transcriptome size factor to make the total counts of simulated
   disease samples keep the same with that of normal samples. The factor
   is the fold change of the transcriptome size (the amount of total RNA
   per cell) between the tumor cell and the normal cell.

   Read counts were used in edgeR and the RPKM values calculated from the
   counts were used in RankCompV2 to identify DEGs. Each simulation
   experiment was repeated 100 times. The sensitivity (the ratio of
   correctly identified DEGs to all true DEGs), the specificity (the ratio
   of correctly identified non-DEGs to all true non-DEGs), the F-score (a
   harmonic mean of the sensitivity and the specificity) and the false
   discovery rate (FDR, the ratio of true non-DEGs to all identified DEGs)
   were employed to evaluate the performance of different algorithms.

   As shown in Table [126]2, when the number of up-regulated DEGs
   increased from 3000 to 5000, the ratio of the total counts of the
   normal sample to that of the simulated disease samples decreased from
   0.7570 to 0.5972. The TMM normalization [[127]17] can estimate a scale
   factor to adjust the different total RNA output between samples. When
   the numbers of up- and down-regulated DEGs were equal, edgeR which can
   incorporate these factors into DEGs analysis exhibited higher average
   sensitivity and F-score than RankCompV2. However, the FDR of edgeR was
   up to 54.89% when the up-regulated DEGs were more than the
   down-regulated DEGs. Instead, RankCompV2 exhibited rather good
   performance with sensitivity > 95%, specificity > 99% and FDR < 0.15%.
   For each simulation experiment the DEGs identified by RankCompV2 were
   completely included in the DEGs identified by edgeR. The simulation
   results confirmed the above mathematical reasoning and demonstrated
   RankCompV2 can identify genes with expression change in both mRNA
   concentration and absolute abundances.

Table 2.

   Simulation evaluation with different transcriptome sizes for edgeR and
   RankCompV2
    Up/Down  Factor             edgeR                     RankCompV2
                     Sen     Spe   F-score  FDR    Sen     Spe   F-score  FDR
   3000/3000 0.7570 99.31% 100.00% 99.65%  0.00%  95.24% 100.00% 98.53%
   0.00%
   4000/2000 0.6682 99.55% 89.40%  94.18%  18.95% 95.47% 99.98%  98.58%  0.05%
   5000/1000 0.5972 99.41% 45.62%  62.51%  54.89% 95.96% 99.94%  98.71%  0.13%
   [128]Open in a new tab

   Note: Up (or Down) indexes the number of simulated up-regulated (or
   down-regulated) DEGs; Factor is defined as the fold change of the
   transcriptome sizes between the simulated tumor cell and the normal
   cell; Sen represents sensitivity defined as the ratio of correctly
   identified DEGs to all true DEGs; Spe represents specificity defined as
   the ratio of correctly identified non-DEGs to all true non-DEGs);
   F-score is a harmonic mean of the sensitivity and the specificity; FDR
   is the abbreviation of false discovery rate defined as the ratio of
   true non-DEGs to all identified DEGs

   To assess the strength of the methodology, we performed another
   simulation experiment based on the RNA-sequencing data of the 27 normal
   samples in the [129]GSE87340 dataset. We randomly generated 4000
   up-regulated and 2000 down-regulated genes by changing their measured
   values in each samples with FC levels of 1.5 to 3.5 to produce 27
   disease samples. In each simulation experiment, all the selected genes
   were set at the same FC level. Each simulation experiment was repeated
   100 times. The simulated disease samples were also multiplied by a
   transcriptome size factor so as to simulate the same amount of total
   RNA extracted from two samples. The average transcriptome size factors
   for each 100 simulated experiments were listed in Table [130]3. Then
   edge R and RankCompV2 were performed to identify DEGs. The sensitivity,
   specificity, F-score and FDR were calculated.

Table 3.

   Simulation evaluation with different FC levels for edgeR and RankCompV2
   FC level Factor             edgeR                     RankCompV2
                    Sen     Spe   F-score  FDR    Sen     Spe   F-score  FDR
   1.5      0.9360 85.83% 100.00% 92.37%  0.00%  56.10% 100.00% 71.88%  0.00%
   2        0.8647 98.76% 99.76%  99.26%  0.51%  82.50% 100.00% 90.41%  0.00%
   2.5      0.7996 99.52% 97.82%  98.63%  4.23%  93.50% 100.00% 96.64%  0.00%
   3        0.7448 99.78% 94.85%  97.21%  9.83%  97.79% 100.00% 98.88%  0.01%
   3.5      0.6965 99.88% 92.61%  96.08%  13.87% 98.84% 99.98%  99.40%  0.05%
   [131]Open in a new tab

   Note: FC is the abbreviation of Fold change; Factor is defined as the
   fold change of the transcriptome sizes between the simulated tumor cell
   and the normal cell; Sen represents sensitivity defined as the ratio of
   correctly identified DEGs to all true DEGs; Spe represents specificity
   defined as the ratio of correctly identified non-DEGs to all true
   non-DEGs); F-score is a harmonic mean of the sensitivity and the
   specificity; FDR is the abbreviation of false discovery rate defined as
   the ratio of true non-DEGs to all identified DEGs

   As shown in Table [132]3, the average sensitivity of RankCompV2 was
   only 56.10% for DEGs with FC of 1.5 and up to 98.84% with FC of 3.5,
   which suggested that RankCompV2 performed well for DEGs with large
   expression changes. In general, RankCompV2 showed a very high
   specificity and a very low FDR when the FC level increased from 1.5 to
   3.5. Notably, the FDR of edgeR rises as the FC level increases. The
   underlying reason is as follows. When up-regulated DEGs with a larger
   FC level was introduced in the simulation, it leads to a bigger global
   transcriptome size of a tumor cell than that of a normal cell, as shown
   by the decreased transcriptome size factor (Table [133]3), the ratio
   between the normal transcriptome size and the simulated tumor cell
   size. The edgeR algorithm identifies DEGs through comparing the read
   counts of a gene between the two samples. Given the same amount of
   total input RNA for the two samples, many genes without differences in
   mRNA absolute abundances would have lower read counts in the tumor
   sample than in the normal sample, thus be identified as down-regulated
   genes, which leads to a higher FDR of edgeR if taking DEGs with
   absolute mRNA abundance changes as the reference.

   We also performed a simulation experiment on the genome size changes
   leading to the global transcriptome size variations. The simulation
   experiment also demonstrated that RankCompV2 could identify genes with
   expression changes in absolute abundances and performed well for DEGs
   with large expression changes (Additional file [134]4).

Reproducible DEGs with changes in absolute mRNA abundances in ten cancers

   We collected two datasets of gene expression profiles for each of ten
   cancer types (Table [135]1). For each dataset, we compared the DEGs
   between the normal and cancer samples identified by RankCompV2 with the
   DEGs identified by SAM for microarray data or by edgeR for
   RNA-sequencing data. In [136]GSE57957 measured by microarray for liver
   hepatocellular carcinoma (LIHC), SAM identified 11,497 DEGs with false
   discovery rate (FDR) < 0.05, which included 3603 of the 3715 RankCompV2
   DEGs. In [137]GSE45267 for LIHC, the 14,192 DEGs selected by SAM
   included 4712 of the 4728 DEGs detected by RankCompV2. The concordance
   scores of the dysregulation directions of the overlaps between DEGs
   detected by RankCompV2 and DEGs detected by SAM in the two datasets
   were all 100%. Similar results were observed in the 18 datasets for
   other nine cancer types (Fig. [138]1a and Additional file [139]5: Table
   S2). As expected, POG[21] is lower than POG[12] because the DEGs
   identified by RankCompV2 should be included in the DEGs identified by
   SAM or edgeR. The results confirmed the above mathematical reasoning,
   which also provided circumstantial evidence of the high accuracy of the
   RankCompV2 method.

Fig. 1.

   [140]Fig. 1
   [141]Open in a new tab

   Reproducibility evaluation of DEGs identified by RankCompV2. a
   Comparison of DEGs identified by RankCompV2 with DEGs identified by SAM
   or edgeR in the same dataset. POG[12] (or POG[21]) denotes the ratio of
   consistently detected DEGs by the two methods to DEGs identified by
   RankCompV2 (or by SAM or edgeR) for the same dataset. b Comparison of
   RankCompV2 DEGs identified from different datasets. POG[12] (or
   POG[21]) denotes the ratio of consistently detected DEGs from two
   datasets to DEGs detected from the first dataset (or the second
   dataset) for each cancer type. The concordance score denotes the
   percentage of consistently detected DEGs that display the same
   dysregulation directions to the overlapped DEGs between two DEG lists

   The DEGs identified by RankCompV2 were highly reproducible in
   independent datasets. For LIHC, RankCompV2 identified 3715 DEGs from
   the [142]GSE57957 dataset and 51.71% of them were included in the 4728
   DEGs detected from the [143]GSE45267 dataset. The concordance score of
   the overlapped 1946 DEGs was 98.72% which was unlikely to be observed
   by chance (binomial test, p < 1.0E-16). The highly reproducibility of
   RankCompV2 DEGs identified from independent datasets were also observed
   in the other nine cancer types (Fig. [144]1b).

Enrichment of cancer driver genes and drug targets in DEGs with absolute
abundance changes

   For each cancer type, the two lists of DEGs identified from the two
   datasets by SAM for the microarray data or by edgeR for the
   RNA-sequencing data were combined, excluding those with contradictory
   dysregulation directions. Similar combination processes were performed
   on DEGs selected by RankCompV2. The DEGs identified by RankCompV2, with
   expression change in both mRNA concentration and absolute abundances,
   were termed as absolute DEGs, while the DEGs solely detected by SAM or
   edgeR were termed as relative DEGs with changes in concentration only.
   The numbers of the absolute DEGs and relative DEGs for the ten cancer
   types were listed in Table [145]4. Then, we explored the biological
   significance of the absolute DEGs.

Table 4.

   Numbers of identified absolute and relative DEGs in ten cancers
   Cancer type  Absolute DEGs  Relative DEGs
                Up     Down     Up    Down
   LIHC        3525 2947       5424 6625
   KIRC        5295 4420       2882 6279
   HNSC        841  1000       2142 1790
   LUSC        3523 4795       7067 2363
   STAD        2946 2003       4611 6042
   COAD        3922 4692       4799 2877
   LUAD        2905 3868       6186 2121
   BRCA        2185 3251       6458 1771
   PAAD        4261 2294       3658 8408
   ESCA        1901 1334       2459 4438
   [146]Open in a new tab

   Abbreviation: LIHC Liver hepatocellular carcinoma, KIRC Kidney renal
   clear cell carcinoma, HNSC Head and Neck squamous cell carcinoma, LUSC
   Lung squamous cell carcinoma, STAD Stomach adenocarcinoma, COAD Colon
   adenocarcinoma, LUAD Lung adenocarcinoma, BRCA Breast invasive
   carcinoma, PAAD Pancreatic adenocarcinoma, ESCA Esophageal carcinoma

   With the 616 cancer driver genes downloaded from the Catalogue Of
   Somatic Mutations (COSMIC, version81, updated 9th May 2017) database
   [[147]33], we found 25.79% of the 6472 absolute DEGs of LIHC overlapped
   or directly interacted with known cancer driver genes based on the
   protein–protein interaction data downloaded from the STRING v10
   database [[148]34], which was significantly higher than the
   corresponding ratio (18.35%) for the 12,049 relative DEGs (Fisher’s
   exact test, p < 1.0E-16). Similar results were observed for the
   remained nine cancer types (Fig. [149]2a). Based on the cancer driver
   genes downloaded from the DriverDBv2 database [[150]35], where the
   driver genes for each cancer type were identified by at least two
   algorithms from the mutation data in the TCGA database, we also
   observed significantly higher ratios of cancer driver genes and
   interaction genes in the absolute DEGs than in the relative DEGs for
   all the ten cancer types (Fig. [151]2b). The results indicate that DEGs
   changing in absolute abundances are more likely to be related with
   upstream events of carcinogenesis than DEGs changing in concentration
   only.

Fig. 2.

   Fig. 2
   [152]Open in a new tab

   Association between the identified absolute and relative DEGs with
   known cancer driver genes or drug targets. a 616 cancer driver genes
   from COSMIC; b cancer driver genes from DriverDBv2; c anti-cancer drug
   targets. Statistically significant differences (p < 0.05) were found in
   all the ratios between absolute DEGs and relative DEGs

   With 116 targets of 148 anti-cancer drugs documented in CancerDR
   [[153]36], we found that 16.56% of the 6472 absolute DEGs for LIHC
   overlapped or directly interacted with known anti-cancer drug targets
   in the STRING network, which was significantly higher than the
   corresponding ratio 10.91% for the 12,049 relative DEGs (Fisher’s exact
   test, p < 1.0E-16). Similar results were observed for the remained nine
   cancer types (Fig. [154]2c).

Functional analysis of DEGs with absolute abundance changes

   Pathway enrichment analysis was performed for the absolute and relative
   DEGs, respectively. As shown in Additional file [155]6: Table S3, for
   each of the ten cancers, the pathways enriched with the absolute DEGs
   were much more than and quite different from the pathways enriched with
   the relative DEGs. As summarized in Fig. [156]3, the pathways enriched
   with absolute DEGs for at least five cancer types were very different
   from the pathways enriched with relative DEGs for at least five cancer
   types. The up-regulated absolute DEGs were enriched in many pathways
   related with response to DNA damages, including “mismatch repair”,
   “base excision repair”, “nucleotide excision repair”, “homologous
   recombination” and “Fanconi anemia pathway” [[157]37]. These genes were
   also enriched in “p53 signaling”, “cell cycle”, “DNA replication”,
   “pyrimidine metabolism” and “purine metabolism”. The pathways enriched
   by relative DEGs included “proteasome” and “protein processing in
   endoplasmic reticulum” besides “RNA transport” and “spliceosome” which
   were also enriched by up-regulated absolute DEGs.

Fig. 3.

   [158]Fig. 3
   [159]Open in a new tab

   The pathways significantly enriched with absolute and relative DEGs,
   respectively, in each cancer type and commonly enriched in at least
   five cancer types

   The down-regulated absolute DEGs were commonly enriched in many
   metabolism pathways, including amino acid, carbohydrate and lipid
   metabolism, and in immune associated pathways, including “chemokine
   signaling”, “complement and coagulation cascades” and “leukocyte
   transendothelial migration”. In contrast, the down-regulated relative
   DEGs were enriched in signaling pathways including “calcium signaling”,
   “cAMP signaling”, “neuroactive ligand-receptor interaction” and two
   sensory system pathways of “olfactory transduction” and “taste
   transduction”. The difference of pathways enriched by two type DEGs is
   an issue worth further analysis.

Discussion

   We demonstrated that the REO-based RankCompV2 algorithm can identify
   DEGs with changes in both mRNA concentration and absolute abundances
   (the absolute DEGs), while the quantitative-based algorithms can
   identify only those with changes in mRNA concentration (the relative
   DEGs). Through studies for all the ten cancers, the absolute DEGs have
   a higher probability associated with both known cancer driver genes and
   drug targets than the relative DEGs. Thus, we speculate that the
   absolute DEGs might play a more important upstream role in
   carcinogenesis. In addition pathway enrichment analysis showed that
   up-regulated absolute DEGs are significantly enriched in DNA
   damage-related pathways and down-regulated absolute DEGs are
   significantly enriched in immune and metabolism associated pathways.
   The genome instability including DNA damages and tumor-promoting
   inflammation driven by immune cells are two enabling characteristics of
   tumor and instrumental for tumorigenesis and progression [[160]38], and
   energy metabolism dysregulation is a fundamental hallmark to fuel
   cancer cell growth and division [[161]38].

   The reasoning for that the REOs-based RankCompV2 algorithm can identify
   DEGs with changes in absolute mRNA abundances is based on the ideal
   condition that the gene expression measurements are well correlated
   with the transcript numbers. For tumor tissues, the ideal condition
   could be violated due to variations of the tumor epithelial cell
   proportions in tissues sampled from different sites of a tumor and
   partial RNA degradation during sample preparation. However, the
   qualitative nature of REOs lends them the advantage being robust
   against the above-mentioned confounding factors [[162]39–[163]41]. As
   demonstrated in our recent study, the stromal cells in tumor tissues
   have similar REOs with those of epithelial cells in tumor tissues
   [[164]39, [165]41]. More than 96% REOs in the tumor tissues with above
   70% of proportion of epithelial cells are consistent with the REOs in
   tumor epithelial cells, and about 90% REOs in tumor epithelial cells
   are kept in tumor tissues even when the proportion of epithelial cells
   decreases to 30% [[166]41]. Therefore, the REOs-based RankCompV2
   algorithm would be largely applicable to real tumor data of
   macro-dissected cancer tissues.

   In order to simplify the reasoning, we assumed that genes have similar
   measurement biases and the correlation coefficients (r) between the
   measured expression values and the transcript numbers are similar for
   different genes in eq. ([167]3). However, different measurement biases
   exist among different samples and among different genes within samples
   in actual measurement [[168]42–[169]44]. Thus, it seems unreasonable to
   compare the measured expression levels of different genes within a
   sample. However, it has been shown that the within-sample REOs are
   robust against systematic biases of measurements, experimental batch
   effects and data normalization [[170]45]. Moreover, the gene pairs with
   large rank difference tend to retain the same REO patterns in samples
   measured with different platforms [[171]46]. The high robustness of
   within-sample REOs indicate that the influence of measurement biases on
   the REOs is small.

   Here, we further analyzed the influence of measurement biases on the
   within-sample REO-based algorithm. Considering two genes, i and j which
   have the expression levels E[i] and E[j], respectively, their measured
   values can be written as M[i] = E[i]r[i] and M[j] = E[j]r[j]. The
   ordering between the two measured values can be judged by the ratio of
   M[i]/M[j] = (r[i]/r[j]) (E[i]/E[j]). If there is no bias, r[i] = r[j],
   then M[i]/M[j] will be the same as E[i]/E[j]. If r[i] and r[j] are not
   the same but remain constant in the measurement range, M[i]/M[j] is
   proportional to the ground truth ratio, therefore the observed REOs may
   not reflect the true REOs, which will reduce the statistical power of
   the RankCompV2 algorithm but will not introduce false discoveries
   (Additional file [172]7). Furthermore, if the bias is not systematic,
   the misjudged REOs will distribute randomly in the four cells of the
   contingency table of the counts of numbers of gene pairs with
   M[i] > M[j] or M[i] < M[j]. Therefore, the detection power is expected
   to be reduced. If the dynamic range is not linear, the situation will
   be more complicated. But we also expect that the main influence to our
   method is to reduce the detection power slightly. It means that the
   RankCompV2 algorithm can detect at least a part of the DEGs with
   absolute mRNA abundances changes, which still have biological
   significances. We believe that the power of the RankCompV2 algorithm
   will increase along with the improvement of gene expression measurement
   technologies.

   Genomic copy number aberrations (CNAs) could also account for a
   substantial portion of gene expression changes. Currently, CNAs in
   cancer genomes are determined by comparing the measurements for the
   same amounts of DNA extracted from cancer and normal tissues, based on
   the assumption that the overall yields of DNA per cell (genome sizes)
   of different cell types are approximately the same. However, this
   assumption is least likely to hold because many cancers are aneuploid
   and/or polyploid [[173]3]. The wrong assumption might lead to a serious
   consequence because CNAs are often used to determine cancer driver
   genes [[174]47, [175]48]. Although several methods such as FREEC
   [[176]49] and CNAnorm [[177]50] have been proposed to correct the
   issues associated with cancer genome sizes in estimating copy number
   alterations based on deep sequencing data, there still exist some
   limitations. For example, FREEC cannot deal with patients’ tumor
   samples due to the need of providing the ploidy of the most abundant
   copy number; CNAnorm is based on the assumption that tumor cells are
   largely monoclonal or polyclonal in a similar way, which could produce
   misleading results in tumors with large clonal variations [[178]49,
   [179]50]. Furthermore, these methods cannot analyze the vast amount of
   microarray CNA data. The REOs comparison algorithm cannot be used to
   detect CNAs because, theoretically, the DNA intensity signals in normal
   cells should be equal. Due to the same problem of cancer cell aneuploid
   and/or polyploid [[180]3], DNA methylation analyses using bisulfite-Seq
   data based on the same amount of DNA are also problematic [[181]9].
   Notably, the average beta value of a given locus measured by the
   Illumina bead-array can be interpreted as an estimate on the proportion
   of methylated cells to all measured cells [[182]51–[183]53]. Thus, when
   comparing two samples with different average DNA yields per cell, the
   differentially methylated loci will not be affected when similar
   amounts of DNA are extracted from different number of cells.

   The RankCompV2 algorithm can only identify DEGs with sufficiently large
   expression changes that widely change the REOs of the genes from one
   phenotype to the other. On one hand, such DEGs might be of special
   biological significance, because functionally related genes tend to
   express coordinately in a stable state of physiological or pathological
   condition [[184]54]. On the other hand, many genes with small absolute
   abundance changes would be determined as relative DEGs, which would
   blur the differences between the absolute DEGs and the relative DEGs.
   To learn more about the DEGs with changes in absolute mRNA abundances,
   it needs to develop new biological techniques and/or bioinformatics
   algorithms. Finally, we note that the studies on expression comparisons
   of microRNA and long non-coding RNA between two phenotypes are also
   based on the wrong assumption of similar overall yields of RNA
   molecules among different cells [[185]9, [186]55], where the REO-based
   methods are applicable [[187]56, [188]57]. It would be also interesting
   to study whether these RNA molecules with absolute abundances changes
   might have specific biological significances.

Conclusions

   REO-based algorithm identified DEGs with changes in both mRNA
   concentration and absolute abundances. Through studies for all the ten
   cancers, we found DEGs with absolute mRNA abundance changes are more
   likely to be closely related with cancer driver genes and drug targets
   and enriched in DNA damage, metabolism and immune associated pathways.
   The genes with absolute mRNA abundances changes in cancers might play
   more important upstream roles in carcinogenesis.

Additional files

   [189]Additional file 1:^ (28.9KB, docx)

   Supplementary Method including Data and pre-processing, The SAM and
   edgeR algorithms and simulation experiments on null datasets. (DOCX 28
   kb)
   [190]Additional file 2:^ (1.2MB, tif)

   Figure S1. Stacked bar chart for the distribution of the numbers of
   DEGs identified from the simulated null datasets among 100 repeated
   experiments. (TIF 1234 kb)
   [191]Additional file 3:^ (17.9KB, xlsx)

   Table S1. Numbers of DEGs identified from the null datasets. (XLSX 17
   kb)
   [192]Additional file 4:^ (18.9KB, docx)

   The simulation experiments in data with global transcriptome size
   changes due to the genome size changes; Table S4. Simulation evaluation
   with different levels of copy number variations for RankCompV2. (DOCX
   18 kb)
   [193]Additional file 5:^ (18.2KB, docx)

   Table S2. The numbers of DEGs identified by RankCompV2 and SAM or edgeR
   for each dataset. (DOCX 18 kb)
   [194]Additional file 6:^ (58.7KB, xlsx)

   Table S3. Pathway enrichment analysis of the absolute DEGs and relative
   DEGs for ten cancers. (XLSX 58 kb)
   [195]Additional file 7:^ (17.3KB, docx)

   The influence of measurement biases on RankCompV2. (DOCX 17 kb)

Acknowledgements