Abstract

   The vast amount of RNA-seq data deposited in Gene Expression Omnibus
   (GEO) and Sequence Read Archive (SRA) is still a grossly underutilized
   resource for biomedical research. To remove technical roadblocks for
   reusing these data, we have developed a web-application GREIN (GEO
   RNA-seq Experiments Interactive Navigator) which provides user-friendly
   interfaces to manipulate and analyze GEO RNA-seq data. GREIN is powered
   by the back-end computational pipeline for uniform processing of
   RNA-seq data and the large number (>6,500) of already processed
   datasets. The front-end user interfaces provide a wealth of
   user-analytics options including sub-setting and downloading processed
   data, interactive visualization, statistical power analyses,
   construction of differential gene expression signatures and their
   comprehensive functional characterization, and connectivity analysis
   with LINCS L1000 data. The combination of the massive amount of
   back-end data and front-end analytics options driven by user-friendly
   interfaces makes GREIN a unique open-source resource for re-using GEO
   RNA-seq data. GREIN is accessible at:
   [32]https://shiny.ilincs.org/grein, the source code at:
   [33]https://github.com/uc-bd2k/grein, and the Docker container at:
   [34]https://hub.docker.com/r/ucbd2k/grein.

   Subject terms: Gene expression profiling, Transcriptomics, Software

Introduction

   Depositing RNA-seq datasets in Gene Expression Omnibus (GEO)^[35]1 and
   Sequence Read Archive (SRA)^[36]2 repositories ensures the
   reproducibility of published studies and facilitates its reuse.
   Re-analysis of these data can lead to novel scientific insights^[37]3
   and it has been routinely used to inform the design of new
   studies^[38]4. However, reuse of GEO RNA-seq data is made difficult by
   the complexity of the processing protocols^[39]5 and analytical
   tools^[40]6 which are often inaccessible to biomedical scientists not
   specializing in bioinformatics.

   Recent efforts at re-processing GEO/SRA RNA-seq data alleviate this
   problem by providing access to large number of processed and
   per-transcript summarized RNA-seq datasets which significantly
   simplifies its use^[41]7–[42]10. Other resources provide access and
   analysis tools for specific datasets^[43]11,[44]12. While these
   platforms are extremely useful, they do not support additional
   functionalities for downstream analyses. For example, exploratory data
   analysis, differential expression analysis with batch effect
   adjustment, or statistical power analysis (Table [45]1). Therefore,
   open-source user-friendly tools with comprehensive analytical toolbox
   for re-analysis of public RNA-seq data are still lacking. We address
   this problem by developing and deploying GEO RNA-seq Experiments
   Interactive Navigator (GREIN) web tool for analysis of GEO RNA-seq
   data. In addition to the rich repertoire of analysis tools, GREIN
   provides access to more than 6,500 uniformly processed human, mouse,
   and rat GEO RNA-seq datasets with >400,000 samples that are ready for
   analysis. These datasets were retrieved from GEO and uniformly
   reprocessed by the back-end GEO RNA-seq experiments processing pipeline
   (GREP2). The pipeline also curates metadata for each of the datasets
   and annotates each sample with biomedical ontologies provided by
   MetaSRA^[46]13. As the number of new studies are included in GEO, more
   datasets are processed and added to GREIN on a regular basis. Apart
   from the preprocessed datasets, GREIN also facilitates user requested
   processing of GEO RNA-seq datasets on-the-fly (Table [47]1). We also
   release GREP2 as an R^[48]14 package and GREIN as a Docker^[49]15
   container for easy local deployment of the complete infrastructure
   which can be used to reproduce GREIN results off-line.

Table 1.

   Comparison of different RNA-seq data resources.
   Comparison RNA-seq processed data resources
   GREIN Recount2^[50]7 ARCHS4^[51]8 Toil^[52]9 Skymap^[53]10 Expression
   Atlas^[54]11
   Data availability Gene and transcript read counts Gene and transcript
   read counts Gene and transcript read counts Transcript read counts
   Allelic and transcript read counts Gene read counts
   Quantification type Read mapping (Salmon) Read alignment (Rail-RNA)
   Read mapping (Kallisto) Read mapping (Kallisto) Read alignment
   (Bowtie2) Read alignment (Tophat)
   Number of samples >400,000^* >80,000 >300,000 >20,000 >400,000 >120,000
   Interactive exploratory analysis Yes No No No No No
   QC report Yes No No Yes No Yes
   Power analysis Yes No No No No No
   Differential expression analysis on-the-fly Yes No No No No No
   Enrichment analysis for differential expression signature Yes No No No
   No Yes
   Pipeline availability R package and Docker container R package and
   Docker container Docker container Docker container Python scripts NA
   Process data on-the-fly Yes No No No No No
   Ontological annotations of samples Yes No No No No No
   [55]Open in a new tab

   All these resources provide access to processed RNA-seq data, however
   most of them do not provide interactive interfaces for further
   manipulation and analyses of the datasets; *>250,000 are already
   processed and the rest are accessible using user-initiated processing.

Results

   The conceptual outline of GREIN is showed in Fig. [56]1. Individual
   RNA-seq datasets are processed by the GREP2 pipeline and stored locally
   as R Expression Sets. User can access and analyze preprocessed datasets
   via GREIN graphical user interface (GUI) or submit for processing
   datasets that have not yet been processed. GUI-driven workflows
   facilitate examination and visualization of data, statistical analysis,
   transcriptional signature construction, and systems biology
   interpretation of differentially expressed (DE) genes. Both GREIN and
   the back-end pipeline (GREP2) are written in R and released as Docker
   container and R package respectively. Graphical user interfaces for
   GREIN are implemented in Shiny^[57]16, a web framework for building
   dynamic web applications in R. The web instance at
   [58]https://shiny.ilincs.org/grein is deployed via robust Docker swarm
   of load-balanced Shiny servers. The complete GREIN infrastructure,
   including processing pipeline is deployed via Docker containers.

Figure 1.

   [59]Figure 1
   [60]Open in a new tab

   Schematic workflow of GREP2, web interface and outputs of GREIN. GEO
   datasets are systematically processed using GREP2 pipeline and stored
   within the back-end dataset library. GUI-driven GREIN workflows
   facilitate comprehensive analysis and visualization of processed
   datasets.

   User friendly GUI driven workflows in GREIN facilitate typical reuse
   scenarios for RNA-seq data such as examination of quality control
   measures and visualization of expression patterns in the whole dataset,
   sample size and power analysis for the purpose of informing
   experimental design of future studies, statistical differential gene
   expression, gene list enrichment, and network analysis. Besides
   standard two-group comparison, the differential gene expression
   analysis module also supports fitting of a generalized linear model
   that accounts for covariates or batch-effects. The interactive
   visualization and exploration tools implemented include cluster
   analysis, interactive heatmaps, principal component analysis (PCA),
   t-distributed stochastic neighbor embedding (t-SNE), etc.
   (Supplementary Table [61]S1). User can also search for ontological
   annotations of human RNA-seq samples and datasets provided by the
   MetaSRA project^[62]13. Each processed human RNA-seq sample is labelled
   with MetaSRA mapping of biomedical ontologies including Disease
   Ontology, Cell Ontology, Experimental Factor Ontology, Cellosaurus, and
   Uberon. Biological interpretation of differential gene expressions is
   aided by direct links to other online tools for performing typical
   post-hoc analyses such as the gene list and pathway enrichment analysis
   and the network analysis of differentially expressed (DE) genes. The
   connection to these analytical web services is implemented by
   submitting the differential gene expression signature (i.e., the list
   of average changes in gene expression and associated p-values for
   all/up/down regulated genes analyzed) to iLINCS^[63]17 (Integrative
   LINCS). iLINCS also provides the signatures connectivity analysis for
   recently released Connectivity Map L1000 signatures^[64]18. Detailed
   step-by-step instructions about GREIN analysis workflows are provided
   in the Supplementary Material and ‘Help’ section in GREIN.

Key functionalities

Search or submit for processing

   User can either search for an already processed GEO data set in the
   ‘Search for GEO series (GSE) accession’ box or submit a dataset for
   processing if the dataset is not already processed (Supplementary
   Fig. [65]S2). At this point in time, the vast majority of GEO human,
   mouse, and rat RNA-seq datasets have been preprocessed and the
   user-submission of GEO datasets for processing will be required only
   occasionally. User can check the processing status of the requested
   dataset in the ‘Processing console’ tab (Supplementary Fig. [66]S3).
   Other search options include keyword search through metadata of the
   datasets and search samples through biomedical ontologies via MetaSRA
   ontological annotations.

Explore dataset

   GREIN allows access to both raw and normalized (counts per million and
   transcript per million) gene and transcript level data. GREIN comes
   with several interactive and customizable tools to visualize expression
   patterns such as interactive heatmaps of clustered genes and samples,
   density plots for all or a subset of samples, between and within group
   variability analysis through 2D and 3D dimensionality reduction
   analyses and visualizations such as PCA and t-SNE (Fig. [67]2). User
   can also visualize expression profile of each gene separately
   (Supplementary Fig. [68]S6).

Figure 2.

   [69]Figure 2
   [70]Open in a new tab

   Exploratory analysis plots in GREIN. (A) Correlation heatmap shows a
   higher correlation within cell lines and low correlation between cell
   lines. Generally high correlations within each cell line indicate high
   quality of transcriptional profiles. (B) Hierarchical clustering based
   on Pearson correlation of top 500 most variable genes based on median
   absolute deviation as the variability measure. Data is normalized and
   centered to the mean. (C) Three-dimensional principal component
   analysis plot of the cell lines. (D) Two-dimensional t-SNE plot of
   treatment condition and cell line shows clear separation of the cell
   lines, and then the RNA fractions indicating two dominant sources of
   the variability between RNA-seq profiles.

Quality control

   The quality of the RNA-seq data in public repositories continues to be
   a major problem. In a recent study by Deelen et al.^[71]19, more than
   half of 65,000 processed public RNA-seq samples had to be removed due
   to QC issues. Rather than removing samples, GREIN provides a
   comprehensive quality control (QC) report of raw sequence data and
   sequence mapping for each sample (Supplementary Fig. [72]S7), and
   allows user to make a decision about which samples should be excluded
   from downstream analyses.

Statistical power analysis

   The power analysis module in GREIN facilitates calculation and
   visualization of statistical power of detecting differentially
   expressed genes in future studies utilizing similar biological samples.
   Estimating appropriate sample size for future studies with similar
   biological samples is often the key motivating factor in re-analysis of
   RNA-seq data. Power analysis also facilitates the post-hoc analysis of
   false negative rates in the current dataset. The lack of statistical
   power and differences in statistical power between genes can produce
   false negative results leading to wrong conclusions^[73]20. The ‘Power
   curve’ segment provides power estimates for different number of samples
   based on a single gene (Fig. [74]3A). User can modify the default
   values of the parameters. The ‘Detectability of genes’ plot visualizes
   power estimate of each of the genes based on the selected groups and
   gene-wise dispersion (Fig. [75]3B). Mean coverage of the genes are
   plotted against their biological variability and are displayed in two
   sets based on their detectability status (power ≥0.8 and power < 0.8).

Figure 3.

   [76]Figure 3
   [77]Open in a new tab

   Power analysis for assessing transcriptional changes in non-malignant
   MCF10A cell line. (A) Single-gene based power estimates for different
   number of samples in each group with a minimal fold change of 2 and
   statistical significance α = 0.01. (B) Gene-wise detectability on the
   log2CPM-BCOV plane with FDR ≤0.1 and two samples in each group.

Differential gene expression

   Creating and interpreting differential gene expression signature is a
   typical analysis scenario in RNA-seq experiments. With GREIN, user can
   create a signature by comparing gene expression between two groups of
   samples with or without adjustments for experimental covariates or
   batch effects. GREIN can handle complex experimental designs by
   providing the flexibility of rearranging groups and sub-groups or
   selecting specific samples. Differential expression signature can be
   visualized via interactive graphics that include heatmap of top
   differentially deregulated genes (Supplementary Fig. [78]S15) ranked by
   false discovery rate (FDR), log fold change vs. log average expression
   (MA) plot (Supplementary Fig. [79]S16), and gene detectability plot
   (Supplementary Fig. [80]S17). Differential expression signature, with
   or without accounting for potentially false negative results, can be
   directly exported to iLINCS for enrichment and connectivity analysis.

Use case: Analysis of transcriptional and translational regulation of hypoxia
in non-malignant breast epithelial and triple-negative breast cancer cell
lines

   We demonstrate the usage of GREIN by re-analyzing a recently published
   GEO RNA-seq data ([81]GSE104193). Sesé et al.^[82]21 examined the
   transcriptional and translational regulation of hormone-refractory
   triple-negative breast cancer (TNBC) subtype under a combination of
   hypoxia and mTOR (mechanistic target of rapamycin) inhibitor treatment.
   In particular, the authors analyzed the expression profiles of TNBC
   (MDA-MB-231) and non-malignant breast epithelial (MCF10A) cells exposed
   to normoxic (21% O[2]) and hypoxic (0.5% O[2]) conditions and/or
   treated with an mTORC1 and −2 inhibitor PP242. Each of the samples were
   sequenced for total (T) and polysome-bound (P) mRNA. The dataset
   contains 32 samples, representing two biological replicates for each
   combination of cell line, oxygen level, treatment status, and mRNA
   fraction.

   Exploratory analysis of the processed dataset in GREIN (Fig. [83]2)
   shows that the strongest source of variation in between samples comes
   from differences between the two cell lines. This is re-enforced by the
   correlation analysis of full expression profiles (Fig. [84]2A), the
   hierarchical clustering of top 500 highly variable genes based on
   median absolute deviation (Fig. [85]2B), 3D PCA plot of the samples
   (Fig. [86]2C), and the 2D t-SNE plot (Fig. [87]2D). Furthermore, high
   correlations between expression profiles for the same cell line
   (Fig. [88]2A) indicates good signal-to-noise in the gene expression
   measurements. The additional substructure of data indicated by the 2D
   t-SNE plot has been examined by painting samples according to different
   attributes (Supplementary Fig. [89]S1). This analysis revealed that
   separations within each cell line are induced by different mRNA
   fractions and then differences between experimental conditions.

   Next, we used GREIN to perform statistical power analysis based on the
   pattern of biological variability observed in this dataset. We
   considered transcriptional profiles of each cell line exposed to
   hypoxia and treated with or without PP242 which leads to four
   comparisons. Assuming an expression difference of at least two-fold
   between the groups, at the statistical significance of α = 0.01, and
   with only two replicates in each group, statistical power of a gene to
   be detected as differentially expressed is below 0.55 in all the
   comparisons (Table [90]2). Our analysis indicates that one would need
   four replicates per group to achieve 80% power detecting two-fold
   change in expression (Table [91]2 and Fig. [92]3A). In a typical
   RNA-seq experiment, a sequencing depth of 20–30 million is sufficient
   to quantify gene expression for almost all genes^[93]4,[94]22 which is
   also evident in this dataset. We also evaluated statistical power of
   each gene to be detected as differentially expressed from the
   ‘Detectability of genes’ plot. Average log of counts per million (CPM)
   values of the genes were plotted against gene-wise biological
   coefficient of variation (BCOV) and power was calculated for the
   corresponding genes (Fig. [95]3B). A controlled false discovery rate of
   0.05 and expected percentage of true positives of 10% was used to
   estimate statistical significance. We define a gene to be detectable as
   differentially expressed in hypoxic condition if its power is 0.8 or
   above. As expected, there exists an inverse relationship between BCOV
   and power (Fig. [96]3B). Also, power to detect differential expression
   of a gene increases with a higher log CPM or effect size.

Table 2.

   Statistical power analysis to assess transcriptional changes in
   malignant and non-malignant cell lines.
   Cell line Comparison groups Average sequencing depth (in million)
   Common BCOV Power (2 samples) Power (3 samples) Power (4 samples)
   MCF10A Hypoxia and normoxia 41.72 0.21 0.52 0.74 0.87
   MCF10A Hypoxia + PP242 and normoxia 38.18 0.31 0.28 0.44 0.59
   MDA-MB-231 Hypoxia and normoxia 27.86 0.24 0.38 0.73 0.86
   MDA-MB-231 Hypoxia + PP242 and normoxia 29.52 0.19 0.51 0.73 0.86
   [97]Open in a new tab

   With a minimal fold change of 2 between the groups and statistical
   significance at α = 0.01, all the comparisons are under-powered with
   two samples in each group. However, power increases as we increase the
   sample size.

   One of the goals of the study was to analyze transcriptional changes in
   hypoxic and normoxic conditions with and without PP242 treatment in
   both MCF10A and MDA-MB-231 cell lines. We created transcriptional
   signatures of hypoxia and hypoxia + PP242 in total mRNA by differential
   expression analysis between hypoxia and hypoxia + PP242 samples
   respectively against the control samples while adjusting for batch
   effect by treating ‘replicate’ as a covariate, for each cell line
   separately. We found a higher number of genes differentially expressed
   (DE) in MCF10A cell lines compared to MDA-MB-231 in both hypoxia and
   hypoxia + PP242 (Fig. [98]4A) indicating that perhaps the tumor cell
   line is better equipped to deal with hypoxia. This analysis also showed
   that most non-differentially expressed genes are also not detectable,
   indicating that they may represent false negative results. This is in
   accordance with the power analysis showing that 4 samples per group
   would be needed to consistently identify differentially expressed genes
   with average BCOV. To identify lower expressed genes an even higher
   sample size would be required.

Figure 4.

   [99]Figure 4
   [100]Open in a new tab

   Differential expression and detectability of the genes. (A) The number
   of genes (log10 scale) not differentially expressed and not detectable
   (NDE&NDT), differentially expressed (DE), and not differentially
   expressed but detectable (NDE&DT) in the comparisons with normoxia for
   total mRNA fraction. We call a gene detectable (DT) if its power ≥0.8
   and differentiable if FDR < 0.05. (B) The gene detectability plot for
   the first comparison (MCF10A and hypoxia) which visualizes the
   above-mentioned list of genes along with their respective fold changes
   (FC).

   To interpret differentially expressed genes in terms of affected
   biological pathways, we submitted the differential gene expression
   signatures of hypoxia to online enrichment tools (DAVID^[101]23,
   ToppGene^[102]24, Enrichr^[103]25, and Reactome^[104]26) via iLINCS.
   The submitted signatures included a combined list of DE and NDE&DT
   genes representing likely true positive and true negatives. Genes were
   selected based on a cutoff of 0.7 and 0.01 for statistical power and
   FDR respectively. Figure [105]5 illustrates the enrichment results
   obtained from ToppGene for the MCF10 hypoxia signature. Significantly
   enriched (FDR < 0.05) top 10 gene ontology (GO) categories from
   ToppGene and DAVID functional annotation tool include response to
   hypoxia, response to decreased oxygen levels, angiogenesis, regulation
   of cell proliferation, oxidation-reduction process, and response to
   abiotic stimulus that are common in both cell lines (Supplementary
   Table [106]S2 and Supplementary Table [107]S3). Most of these
   categories are consistent with the original study. In addition,
   ToppGene suite identified hypoxia induced factor (HIF-1-alpha)
   transcription factor network that was activated in both cell lines
   (Supplementary Table [108]S4 and Supplementary Table [109]S5).

Figure 5.

   [110]Figure 5
   [111]Open in a new tab

   Snapshot of some of the significant pathway and gene ontology (GO)
   categories from ToppGene via iLINCS. These categories are found in the
   comparison between hypoxia and normoxia in MCF10A cell line using a
   combined list of DE and NDE&DT genes. The red vertical line is the
   selected cutoff of 0.05.

   Finally, we utilized GREIN connection with iLINCS to “connect” the
   uploaded signature with LINCS^[112]27 consensus (CGS) gene knockdown
   signatures^[113]18. We found 3,727 LINCS consensus gene knockdown
   signatures that were significantly (pValue < 0.05) connected with our
   uploaded signature. The target genes of top 100 connected signatures
   were selected for further enrichment analysis. We found cellular
   response to hypoxia and regulation of Hypoxia-inducible Factor (HIF) by
   oxygen in the list of top 10 activated pathways in both cell lines
   (Supplementary Table [114]S6 and Supplementary Table [115]S7). While
   this analysis yields similar enriched functional categories as the
   initial enrichment analysis, it complements the original analysis by
   implicating several target genes that are not differentially expressed
   although they are sufficiently highly expressed to be detectable
   according to our power analyses. Tying these two results together
   implicates these genes as potential higher level regulators of the
   response to hypoxia.

Discussion

   The combination of the access to a vast number of preprocessed mRNA
   expression data and the diversity of the analytical tools implemented
   makes GREIN a unique new resource for reuse of public domain RNA-seq
   data. GREIN not only removes technical barriers in reusing GEO RNA-seq
   data for biomedical scientists, but also facilitates easy assessment of
   the validity and reproducibility of the analysis results while
   uncovering new insights of the experiments. Our case-study analysis
   indicated that GREIN can be used for quickly reproducing results of the
   published studies as well as for gleaning additional insights that go
   beyond the original analysis. The complete analysis can be reproduced
   in less than 10 minutes. The web instance deployed on our server
   provides no-overhead use and requires no technical expertise. Open
   source R packages and the Docker container provide a flexible and
   transparent way for computationally savvy users to deploy the complete
   infrastructure locally with very little effort.

   The GUI-driven analysis workflows implemented by GREIN covers a large
   portion of use cases for RNA-seq data analysis, making it the only tool
   that a scientist may need to meaningfully re-analyze GEO RNA-seq data.
   In Table [116]1 we provide a comprehensive comparison of GREIN with
   existing resources in terms of data processed and implemented analysis
   tools (Recount2^[117]7, ARCHS4^[118]8, Toil^[119]9, Skymap^[120]10, and
   Expression Atlas^[121]11). Two key features distinguish GREIN from
   other currently available resources, it provides access to an
   exhaustive collection of RNA-seq samples in GEO, and it provides by far
   the most comprehensive set of interactive tools for analyzing RNA-seq
   datasets. Of existing tools, ARCHS4 and SkyMap both aim to provide
   access to the similar set of processed samples, but lack the
   interactive analytical toolbox. Other resources provide access to fewer
   processed samples and interactive analysis tools. In addition to
   standard analysis and visualization tools, the GREIN’s power analysis
   workflow provides means to assess the statistical reasons for detecting
   or not-detecting specific differentially expressed genes. This kind of
   analyses is not common in the standard RNA-seq pipelines, but the
   results can be extremely useful when assessing false negative results.
   For example, when performing gene list enrichment analysis of
   differentially expressed genes, genes that do not meet a statistical
   significance cut-off are considered not differentially expressed.
   However, our power analysis indicates that the vast majority of these
   genes may simply be below detection limits for the available number of
   samples. GREIN allows user to distinguish between genes not detectable
   due to sample size limitations and/or their low expression levels, vs
   genes that are expressed at high enough expression levels but were not
   differentially expressed. This kind of resolution allows for more
   nuanced interpretation of analysis results. Off-line use of GREP2 and
   GREIN packages, as well as flexible export options enable additional
   analyses of GREIN-processed and pre-analyzed datasets.

Methods

GREIN back-end pipeline

   To consistently process GEO RNA-seq datasets through a robust and
   uniform system, we have developed GEO RNA-seq experiments processing
   pipeline (GREP2) which is available as an R package in CRAN. Both GREP2
   and GREIN are simultaneously running on different Docker containers.
   The whole processing workflow can be summarized in the following steps:
    1. Obtain GEO series accession ID (series type: Expression profiling
       by high throughput sequencing) from GEO that contain at least two
       human, mouse, or rat RNA-seq samples. We then retrieve metadata for
       each GEO series accession using Bioconductor package
       GEOquery^[122]28. We also obtain metadata files from the Sequence
       Read Archive (SRA) to get corresponding run information and merge
       both the GEO and SRA metadata.
    2. Download corresponding experiment run files from SRA using ‘ascp’
       utility of Aspera Connect^[123]29 and convert them into FASTQ file
       format using NCBI SRA toolkit^[124]30. All the downloaded files are
       stored in the local repository until processed.
    3. Run FastQC^[125]31 on each FASTQ file to generate QC reports and
       remove adapter sequences if necessary using Trimmomatic^[126]32.
    4. Quantify transcript abundances by mapping reads to reference
       transcriptome using Salmon^[127]33.
    5. Transcript level abundances are then then summarized to gene level
       using Bioconductor package tximport^[128]34. We use lengthScaledTPM
       option in the summarization step which gives estimated counts
       scaled up to library size while considering for transcript length.
       Gene annotation for Homo sapiens (GRCh38), Mus musculus (GRCm38),
       and Rattus norvegicus (Rnor_6.0) are obtained from Ensemble^[129]35
       (release-91).
    6. Compile FastQC reports and Salmon log files into a single
       interactive HTML report using MultiQC^[130]36.

Power analysis

   The power analysis in GREIN is performed using the Bioconductor package
   RNASeqPower^[131]4 which uses the following formula:
   [MATH: <mtable><mtr><mtd><mrow><msup><mrow><mo
   stretchy="true">(</mo><mrow><msub><mrow><mi
   mathvariant="bold-italic">z</mi></mrow><mrow><mn>1</mn><mo>−</mo><mfrac
   ><mi
   mathvariant="bold-italic">α</mi><mn>2</mn></mfrac></mrow></msub><mo>+</
   mo><msub><mrow><mi mathvariant="bold-italic">z</mi></mrow><mrow><mi
   mathvariant="bold-italic">β</mi></mrow></msub></mrow><mo
   stretchy="true">)</mo></mrow><mn>2</mn></msup><mo>=</mo><mfrac><mrow><m
   i
   mathvariant="bold-italic">n</mi><msub><mrow><mi>log</mi></mrow><mrow><m
   i mathvariant="bold-italic">e</mi></mrow></msub><msup><mrow><mi
   mathvariant="normal">Δ</mi></mrow><mn>2</mn></msup></mrow><mrow><mn>2</
   mn><mrow><mo stretchy="true">(</mo><mrow><mfrac><mn>1</mn><mi
   mathvariant="bold-italic">μ</mi></mfrac><mo>+</mo><msup><mrow><mi
   mathvariant="bold-italic">σ</mi></mrow><mn>2</mn></msup></mrow><mo
   stretchy="true">)</mo></mrow></mrow></mfrac><mo>#</mo></mrow></mtd></mt
   r></mtable> :MATH]
   1

   where, α is the target false positive rate, β is the target false
   negative rate or 1-β is power, n is the sample size, Δ is the effect
   size, μ is the average sequencing depth, and σ is the biological
   coefficient of variation (BCOV) calculated as the square root of the
   dispersion. We use common dispersion and tagwise dispersion estimates
   from Bioconductor package edgeR^[132]37 for computing power of a single
   gene and multiple genes respectively.

   Typically, thousands of genes are tested simultaneously for
   differential expression in RNA-seq experiments. Therefore, the above
   method for estimating power needs further adjustment to correct for
   multiple testing. Jung et al.^[133]38 derived an FDR correction formula
   and consequently the significance level (α^*) to calculate sample size
   for microarray data in the following form:
   [MATH: <mtable><mtr><mtd><mrow><msup><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>⁎</mo></msup><mo>=</mo><mfra
   c><mrow><msub><mrow><mi
   mathvariant="bold-italic">r</mi></mrow><mrow><mn>1</mn></mrow></msub><m
   i mathvariant="bold-italic">f</mi></mrow><mrow><msub><mrow><mi
   mathvariant="bold-italic">m</mi></mrow><mrow><mn>0</mn></mrow></msub><m
   o stretchy="false">(</mo><mn>1</mn><mo>−</mo><mi
   mathvariant="bold-italic">f</mi><mo
   stretchy="false">)</mo></mrow></mfrac><mo>#</mo></mrow></mtd></mtr></mt
   able> :MATH]
   2

   where, r[1] is the expected number of true positives in m[1] rejected
   null hypotheses, m[0] denote the number of genes for which null
   hypotheses are true, and f implies desired FDR level. Hence, to
   calculate power for each of the genes, we replace α with α^* in eq.
   ([134]1).

Differential expression analysis

   GREIN uses negative binomial generalized linear model as implemented in
   edgeR to find differentially expressed genes between sample groups.
   Data is normalized using trimmed mean of M-values (TMM) as implemented
   in edgeR. All the analyses are based on CPM values and genes are
   filtered at the onset with a cutoff of CPM > 0 in m samples, where m is
   the minimum sample size in any of the groups. Besides two-group
   comparison, GREIN also supports adjustment for experimental covariates
   or batch effects. A design matrix is constructed with the selected
   variable and groups. We use gene-wise negative binomial generalized
   linear models with quasi-likelihood tests and gene-wise exact tests to
   calculate differential expression between groups with and without
   covariates respectively. P-values are adjusted for multiple testing
   correction using Benjamini-Hochberg method^[135]39. Interactive
   visualization of the differentially expressed genes is also available
   via heatmap of the top ranked genes, MA plot, and gene detectability
   plot.

Supplementary information

   [136]Supplementary Information^ (4.7MB, pdf)

Acknowledgements