Abstract

   Differential gene expression analysis using RNA sequencing (RNA-seq)
   data is a standard approach for making biological discoveries. Ongoing
   large-scale efforts to process and normalize publicly available gene
   expression data enable rapid and systematic reanalysis. While several
   powerful tools systematically process RNA-seq data, enabling their
   reanalysis, few resources systematically recompute differentially
   expressed genes (DEGs) generated from individual studies. We developed
   a robust differential expression analysis pipeline to recompute 3162
   human DEG lists from The Cancer Genome Atlas, Genotype-Tissue
   Expression Consortium, and 142 studies within the Sequence Read
   Archive. After measuring the accuracy of the recomputed DEG lists, we
   built the Differential Expression Enrichment Tool (DEET), which enables
   users to interact with the recomputed DEG lists. DEET, available
   through CRAN and RShiny, systematically identifies which of the
   recomputed DEG lists share genes, pathways, and TF targets with a
   user's gene lists. DEET identifies relevant studies based on shared results
   with the user's gene lists, aiding in hypothesis generation and
   data-driven literature review.

INTRODUCTION

   RNA sequencing (RNA-seq) is commonly used to measure genome-wide
   transcriptional abundance within and across biological samples ([38]1).
   RNA-seq experiments typically compare RNA transcript abundances between
   two or more groups to calculate differentially expressed genes (DEGs)
   ([39]2). Interpreting DEG results often involves gene ontology (GO)
   enrichment ([40]3–5), public gene co-expression network comparisons
   using tools such as GeneMANIA and Enrichr ([41]6,[42]7), and
   directly comparing results to published studies. Currently, hundreds of
   thousands of human RNA-seq samples are publicly available within the
   Sequence Read Archive (SRA) ([43]8,[44]9), and accessing these data
   efficiently and meaningfully is an important step in RNA-seq analysis.

   Advances in large-scale analyses of publicly available RNA-seq data
   make it possible to interact with public data ([45]5–7) systematically.
   Reprocessing publicly available RNA-seq data before interpreting their
   results is essential because of technical variation inherent to the
   experimental and analytical steps of an RNA-seq study. Large-scale
   projects like recount, toil-recompute, and ARCHS4 ([46]10–13) remove
   unwanted variation in analysis by developing and applying efficient
   computational strategies to consistently align and enumerate RNA-seq
   data across tens of thousands of samples simultaneously ([47]10–13).
   Briefly, recount2 stores consistently reprocessed RNA-seq data from ∼70
   000 samples. Of these samples, ∼20 000 of them originated from
   consortia with complete metadata, with 9538 samples from The Cancer
   Genome Atlas (TCGA) ([48]14,[49]15) and 11 284 samples from the
   Genotype-Tissue Expression Consortium (GTEx) ([50]16). Modern public
   consortia of reprocessed RNA-seq data now combine to store over a
   million human and mouse RNA-seq samples ([51]10,[52]12). Using recount,
   ARCHS4, and other consistently processed RNA-seq databases, researchers
   can download and compare RNA-seq samples to their data ([53]17,[54]18)
   without worrying about any technical heterogeneity in RNA-seq data
   analysis.

   High-quality metadata is fundamental to analyzing consistently
   processed RNA-seq data properly. However, metadata is not always
   consistently stored. For example, Ellis et al. analyzed the metadata of
   49 564 human RNA-seq samples stored within the Sequence Read Archive
   (SRA) and found that sex was only reported in 3640 (7.3%) of those
   samples ([55]19). The Gene Expression Omnibus (GEO) and ArrayExpress
   ([56]20,[57]21) provide guidelines and greatly facilitate the
   submission of RNA-seq data and associated metadata. MetaSRA also
   improved the organization of public metadata by developing a
   semi-automated metadata normalization process to convert published
   metadata to a format comparable to metadata stored in the Encyclopedia
   of DNA Elements (ENCODE) ([58]22,[59]23). While these efforts and
   others facilitate the pairing of RNA-seq studies with metadata, there
   are still considerable inconsistencies in metadata between datasets
   regarding metadata organization and sample missingness. One solution to
   the problem of incomplete metadata was provided by Ellis et al. using
   the PhenoPredict ([60]19) package to improve the metadata within
   recount2 ([61]11). Specifically, PhenoPredict ([62]19) trained a
   metadata classifier from TCGA and GTEx RNA-seq data stored within
   recount2 before annotating the remaining ∼50 000 SRA samples within
   recount2, resulting in uniform metadata across recount2. Additional
   projects like recount-brain use a third party to manually annotate a
   consistent set of metadata for brain RNA-seq samples within recount
   ([63]24). Together, consistent RNA-seq count data and metadata allow
   for the development of pipelines to conduct high-throughput
   differential expression analysis.

   Several existing robust methodologies allow for querying extensive
   systematically processed RNA-seq data. For example, Enrichr and the
   Expression Atlas from ArrayExpress allow for the systematic querying of
   gene lists ([64]7,[65]25). Enrichr included co-expression from
   consistently reprocessed RNA-seq data in ARCHS4 ([66]12,[67]25), while
   GenomicSuperSignatures applies a principal component analysis approach
   to 536 studies to identify study-associated gene-expression patterns
   ([68]26).
   Two other tools have also reprocessed DEG lists to aid in biological
   discovery. Specifically, The Expression Atlas ([69]25,[70]27) contains
   co-expression and DEGs from many species and experiments, including
   over 330 pairwise DE comparisons from human RNA-seq alone. Secondly,
   Crow et al. used 635 pairwise human DE comparisons from consistently
   processed microarray data from the Gemma database to better understand
   common distributions of DEGs ([71]28,[72]29). These methods highlight
   the value of uniformly processed RNA-seq data, metadata, and
   differential expression; however, there is still a considerable need
   for larger-scale atlases of interactive DE.

   In this study, we describe the Differential Expression Enrichment Tool
   (DEET), a database and bioinformatic package that allows users to query
   systematically generated differential gene expression results from
   published RNA-seq studies. DEET's database contains 3162 consistently
   processed human pairwise differential gene expression comparisons from
   studies within recount2 ([73]11), spanning 99 tissues, 55 cell lines,
   and 985 conditions (486 from SRA, 433 from TCGA, 66 from GTEx). DEET
   allows users to input a list of genes with relevant coefficients (e.g.
   P-value, fold-change, GWAS effect size) to systematically query the
   gene expression and pathway enrichment profiles of thousands of
   consistent gene lists through gene set enrichment and correlational
   analyses. DEET and its database are accessible via a freely available
   library of DE comparisons, open-source R package
   ([74]https://cran.rstudio.com/web/packages/DEET/index.html), and Shiny
   App ([75]https://wilsonlab-sickkids-uoft.shinyapps.io/DEET-shiny/).

MATERIALS AND METHODS

   The purpose of the Differential Expression Enrichment Tool (DEET) is to
   facilitate comparing user-defined lists of differentially expressed
   genes (DEGs) against a uniformly computed and annotated compendium of
   DEGs (Figure [76]1A, B). To build the DEET database, we computed a
   compendium of 3162 unique, consistently processed human DEG
   comparisons, and developed supporting software (R package and Shiny
   app) to interact with the DEG compendium. For each pairwise comparison,
   DEGs were identified using a custom pipeline that uses factor analysis
   of metadata and DESeq2 for differential analysis. We chose to build our
   database using DESeq2 due to its widespread use and consistently
   positive benchmarking under many conditions and sample sizes
   ([77]30,[78]31). In addition, DESeq2 tends towards a more conservative
   estimate of detected DEGs, which is favorable when thousands of DE
   lists cannot be evaluated manually ([79]32). Next, for each DEG
   list, DEET performs pathway and TF target enrichment analysis. The
   pre-computed DEGs and enrichment results are stored in DEET.

Figure 1.


   Overview of the Differential Expression Enrichment Tool (DEET). (A)
   Schematic of how the consistently processed DEGs were computed and
   annotated. (B) Flowchart of DEET’s primary analysis. (C) Barplot of the
   number of comparisons from each DEG-comparison category in DEET.
   Categories plotted were derived from the categories labeling 635
   pairwise DE comparisons from Microarray studies in the Gemma database
   (Crow et al., 2019). We added sex, developmental staging, and
   combinations of treatments as additional categories. Bars are coloured
   by source (i.e. GTEx, TCGA and SRA). (D) Scatterplot showing the odds
   ratio of overlapping common DEGs between the DEET database and Crow
   et al. (2019). The X-axis represents the proportion of included genes,
   ranked from most common to least common. For example, the ‘1%’ point
   includes genes in the top 1% most common in either DEET or Crow
   et al. (2019). The Y-axis represents the odds ratio of
   over-representation of shared genes at each increment. Points in red
   represent increments with a significant over-representation of shared
   DEGs between the DEET database and Crow et al. (2019).

Data acquisition

   All RNA-seq count data were acquired from the ‘recount’ R package using
   the ‘download_study’ function with default parameters ([82]11).
   Metadata from studies within SRA, TCGA, and GTEx were acquired from
   multiple sources.

   SRA. Metadata for studies within SRA was acquired by using the
   ‘all_metadata’ function in the ‘recount’ R package and supplemented
   with the ‘human_matrix_v9.h5’ file in ARCHS4 ([83]8,[84]11,[85]12).
   Samples stored within recount-brain ([86]24) were further supplemented
   with ‘add_metadata(source = "recount_brain_v2")’ using the ‘recount’ R
   package ([87]11). Specifically, we extracted overlaid sample metadata
   in recount2 and ARCHS4 by their ‘geo_accession’. We then added the
   ‘title’ variable from ARCHS4 to the metadata stored in recount2
   ([88]11,[89]12,[90]19) ([91]Supplementary File S1). Lastly, we
   downloaded brief descriptions of each study from the DRA compendium
   ([92]https://trace.ddbj.nig.ac.jp/DRASearch/).

   TCGA. Metadata for The Cancer Genome Atlas (TCGA) was acquired from the
   ‘recount’ R package using the ‘all_metadata’ function ([93]11,[94]15).

   GTEx. Publicly available metadata for the Genotype-Tissue Expression
   (GTEx) consortium was acquired with the ‘all_metadata’ function
   ([95]11,[96]16). Privately available metadata for GTEx was acquired
   using dbGap (phs000424.v9) with all required ethical approvals and data
   protection (REB 1000063863).

Metadata pre-processing

   We needed to streamline the metadata within SRA, GTEx, and TCGA before we
   could perform differential analysis within each study and tissue.
   Streamlined metadata in combination with consistently reprocessed
   RNA-seq count data allowed for high-throughput differential expression
   analysis within each sample source.

   SRA. Metadata across different studies submitted to SRA is inherently
   inconsistent. Accordingly, within SRA, we focused on metadata
   compatible with the PhenoPredict R package ([97]19). These compatible
   metadata variables are tissue, cell type, sample source, sex, and
   sequencing strategy. Specifically, if the authors reported values for
   these variables, then PhenoPredict converted them to consistent
   variable names across datasets (e.g. ‘reported tissue’) and populated
   them with the reported value. PhenoPredict also matches reported metadata
   with predicted metadata variables based on the RNA-seq profile of each
   sample trained on the metadata and RNA-seq profiles within GTEx
   ([98]19). For the DEET database, we used the authors' reported metadata
   value when available and otherwise imputed it with the predicted
   metadata computed from the RNA-seq data itself. Predicted metadata
   incorporated into DEET were already generated and evaluated in the
   context of data accuracy and in the context of accurate differential
   analysis as part of the PhenoPredict study ([99]19). Cleaned
   sample-level metadata for included SRA comparisons were stored within
   the DEET database.

   TCGA. Metadata was first manually processed to remove possible
   inconsistencies. Specifically, we manually corrected misspelled drug
   names and merged generic and brand names (e.g. ibuprofen versus
   Advil). Variables whose values were recorded in different units (e.g.
   body temperature measured in Celsius versus Fahrenheit) were also
   converted so that every value used the most common unit. For example,
   if the majority of body temperatures were reported in Celsius, then
   every body temperature was converted to Celsius. Missing values of
   continuous variables were populated with a mean imputation stratified
   by sex. Missing values of categorical variables were populated with an
   ‘unknown’ label. Cleaned
   sample-level metadata for included TCGA comparisons were stored within
   the DEET database.

   GTEx. We did not detect metadata requiring manual corrections within
   GTEx. Like TCGA, missing continuous variables were populated with a
   mean imputation stratified by sex, and missing categorical variables
   were populated with an ‘unknown’ label. Cleaned sample-level metadata
   for included GTEx comparisons were not included in the DEET database
   because sample-level phenotypic information for the GTEx dataset is
   protected.
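
   The imputation rules described for TCGA and GTEx can be sketched as
   follows. This is a minimal illustration and not the code used to build
   DEET: the function name, the dictionary-based sample records, and the
   toy values are all assumptions made for the example.

```python
from statistics import mean

def impute_metadata(samples, continuous, categorical, strata_key="sex"):
    """Return copies of the sample records with missing values (None) filled in.

    Continuous variables receive the mean of the observed values within the
    same stratum (here, sex); categorical variables receive an 'unknown' label.
    """
    # Observed mean of each continuous variable within each stratum.
    stratum_means = {}
    for var in continuous:
        groups = {}
        for s in samples:
            if s.get(var) is not None:
                groups.setdefault(s[strata_key], []).append(s[var])
        stratum_means[var] = {g: mean(v) for g, v in groups.items()}

    imputed = []
    for s in samples:
        row = dict(s)
        for var in continuous:
            if row.get(var) is None:
                # Remains None if the stratum has no observed value at all.
                row[var] = stratum_means[var].get(row[strata_key])
        for var in categorical:
            if row.get(var) is None:
                row[var] = "unknown"
        imputed.append(row)
    return imputed

# Hypothetical toy records, for illustration only.
samples = [
    {"sex": "F", "age": 50, "smoker": "yes"},
    {"sex": "F", "age": 60, "smoker": None},
    {"sex": "F", "age": None, "smoker": "no"},
    {"sex": "M", "age": 40, "smoker": "no"},
    {"sex": "M", "age": None, "smoker": None},
]
cleaned = impute_metadata(samples, continuous=["age"], categorical=["smoker"])
```

   In this toy example the missing female age is filled with the female
   mean and the missing male age with the male mean, while missing
   categorical entries become ‘unknown’.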

Comparison exclusion and inclusion criteria for the DEET database

   Several criteria needed to be met for a comparison to be included in
   the DEET database. These inclusion and exclusion criteria were
   consistent across TCGA, SRA and GTEx. Comparisons were filtered if they
   had fewer than three biological replicates in each condition, if
   conditions were generic identifications (e.g. Patient ID 1–5 versus
   6–10), if conditions were compared across different tissues, or if the
   comparison had a complete stratification of metadata (e.g. all
   ‘drug-control’ were female and all ‘drug-treated’ were male).
   For time-series and stepwise dosage studies, we kept comparisons of
   each timepoint against the original reference point and comparisons
   between consecutive timepoints, while comparisons that skipped a
   timepoint (e.g. Time-2 versus Time-4 if Time-3 was present) were
   filtered. Comparisons where one condition was ‘NA’ or ‘unknown’ were
   filtered for interpretability. In addition, studies with more than three
   comparisons were filtered so that each treatment was only compared to
   an untreated control (e.g. each TCGA-drug was compared to the untreated
   condition). We removed these ‘treatment-a versus treatment-b’
   comparisons to avoid having DEET be primarily populated with
   permutations of DE comparisons that are challenging to interpret.
   Studies with three comparisons (i.e. Control, Treatment 1 and Treatment
   2) include every pairwise difference, including Treatment 1 versus
   Treatment 2, as it only added one extra comparison. Lastly, after DE
   was performed (See ‘High-throughput differential expression analysis’),
   comparisons with >10 000 DEGs or fewer than 5 DEGs were filtered.
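
   Several of the exclusion rules above are simple enough to express as a
   single predicate. The sketch below is illustrative only: the field
   names of the comparison record are hypothetical, and criteria that
   require manual judgment (generic identifiers, fully stratified
   metadata, time-series ordering) are deliberately left out.

```python
def passes_exclusion_criteria(comparison, min_replicates=3,
                              min_degs=5, max_degs=10_000):
    """Return True if a pairwise comparison survives the automatable filters."""
    labels = (comparison["condition_1"], comparison["condition_2"])
    # Conditions labelled 'NA' or 'unknown' are not interpretable.
    if any(str(label).lower() in {"na", "unknown"} for label in labels):
        return False
    # At least three biological replicates are required in each condition.
    if min(comparison["n_condition_1"], comparison["n_condition_2"]) < min_replicates:
        return False
    # Conditions must not be compared across different tissues.
    if comparison["tissue_1"] != comparison["tissue_2"]:
        return False
    # After DE: drop comparisons with fewer than 5 or more than 10 000 DEGs.
    return min_degs <= comparison["n_degs"] <= max_degs

# Hypothetical records, for illustration only.
kept = passes_exclusion_criteria({
    "condition_1": "control", "condition_2": "treated",
    "n_condition_1": 4, "n_condition_2": 3,
    "tissue_1": "liver", "tissue_2": "liver", "n_degs": 250,
})
dropped = passes_exclusion_criteria({
    "condition_1": "control", "condition_2": "unknown",
    "n_condition_1": 4, "n_condition_2": 3,
    "tissue_1": "liver", "tissue_2": "liver", "n_degs": 250,
})
```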

   While the exclusion criteria for comparisons originating from TCGA,
   SRA and GTEx were the same, the inclusion criteria for comparisons
   originating from these sources differed.

   SRA. SRA is a repository of unique studies. Therefore, comparison
   variables across studies in SRA were inconsistent. Comparisons were
   included in DEET if they passed the general exclusion criteria. The
   remaining comparisons were then paired with the description of each
   study found within the DRASearch ([100]https://ddbj.nig.ac.jp/search).
   Comparisons reflecting the study description were included, and
   comparisons that did not reflect the study description were flagged and
   only included if the comparison was not confounded by the primary
   comparison. Six of these studies, SRP043162 ([101]33), SRP063978
   ([102]34), SRP063980 ([103]34), SRP064561 ([104]35), SRP067214
   ([105]36) and SRP050892 ([106]37), each had multiple timepoints and
   tissues with two conditions (98 comparisons total). Our high-throughput
   pipelines would have treated these features as blocking factors instead
   of variables to stratify pairwise comparisons. Accordingly, we
   completed these DE comparisons manually and provided them with the
   ‘SRA-manual’ source identification. Lastly, 14 studies (21 comparisons)
   had their metadata manually supplemented with the recount-brain
   ([107]24) dataset. Metadata from recount-brain ([108]24) did not
   influence how DEs were calculated, but it did influence how these DE
   comparisons were described.

   TCGA. Over 10 000 samples contained sample information mapped to the
   same metadata table. Accordingly, variables from this curated TCGA
   metadata table were manually selected for their potential to provide
   biologically meaningful comparisons. Specifically, we included
   variables describing the tumour such as tumour presence, reoccurrence,
   stage, grade, histological diagnosis, and subdivision. In addition, we
   included variables describing tumour treatment (e.g. follow-up, drug
   treatment, and surgery performed). Variables specific to individual
   cancer types (e.g. estrogen receptor positivity, KRAS mutation
   presence) were included and automatically filtered from irrelevant
   cancers by DEET’s database exclusion criteria because these variables
   were missing or unknown in other tumour types. In addition, we included
   ordinally annotated medical conditions (e.g. presence of diabetes,
   presence of heart disease, chronic pancreatitis). Lastly,
   population-level variables, namely sex and weight, were included.
   Weight was compared using body mass index (BMI) and was grouped into
   broad categories provided by the Centers for Disease Control and
   Prevention
   ([109]https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html).

   GTEx. Like in TCGA, metadata variables within GTEx were chosen from
   GTEx's library of clinical variables
   ([110]https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetListOfAllObjects.cgi?study_id=phs000424.v4.p1&object_type=variable) for their
   ability to yield interpretable DE. Firstly, all clinical ordinal
   variables (e.g. presence of pneumonia at the time of death—yes or no)
   were included. Population variables, namely sex, age, race, BMI (with
   the same criteria as in TCGA), and Hardy Scale (i.e. death
   circumstances), were also included. Ages were binned into 20-year
   periods (i.e. 20–39, 40–59 and 60–79 years).
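
   The age binning can be written as a small helper. This is a sketch
   assuming integer ages in years; the function name and the choice to
   return None for ages outside the three bins are assumptions of the
   example.

```python
def gtex_age_bin(age):
    """Map an integer age in years to one of the 20-year bins, or None."""
    for low, high in ((20, 39), (40, 59), (60, 79)):
        if low <= age <= high:
            return f"{low}-{high}"
    return None  # Outside the three bins described in the text.

bins = [gtex_age_bin(a) for a in (20, 39, 40, 59, 60, 79, 85)]
```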

High-throughput differential expression analysis

   Each pairwise comparison has a different number of samples, different
   sample stratification, and potentially different combinations of
   categorical and continuous metadata to control for. Furthermore, each
   major source of samples contained a different set of metadata to
   control for (Figure [111]1A). Accordingly, our high-throughput
   differential analysis pipelines needed to be flexible for different
   experimental designs and variability in metadata. We used the variables
   below to account for population-level metadata within SRA, TCGA, and
   GTEx.

   SRA. We control for tissue or cell type, sequence strategy, and sex.

   TCGA. We control for tissue source, age, histological subtype and sex.

   GTEx. We control for age, time passed until sample freezing, Hardy
   Scale and sex.

   If we are measuring DEGs in a variable that we typically control for,
   for example, sex differences, then we do not control for sex in that
   comparison. We accounted for the variability in experimental designs
   within the DEET database by applying automated correspondence analysis
   to each comparison. Specifically, continuous metadata (e.g. age, sample
   freezing time, etc.) underwent an Escoffier transformation using the
   ‘ours’ R package ([112]38,[113]39). Categorical metadata (e.g. sex,
   tissue, etc.) underwent a disjunctive transformation using the ‘ours’ R
   package. We then reduced these metadata into a smaller set of
   explanatory variables with a correspondence analysis (CA) using the
   ‘epCA’ function in ExPosition ([114]38). We used a correspondence
   analysis instead of a principal component analysis (PCA) because the
   metadata used to control for comparisons within DEET were mixed (i.e.
   continuous, and categorical). Then, we generated a scree plot of the
   eigenvalues for every CA and every comparison. We picked the number of
   components using the elbow of each plot. Together, every pairwise
   comparison controlled for variables appropriate to the variation in
   their metadata.
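
   To make the two transformations concrete, the sketch below shows how
   they are commonly defined. It does not call the ‘ours’ or ExPosition
   packages and may differ from them in detail: we assume that the
   Escofier transformation standardizes a continuous variable to z and
   replaces it with the pair ((1 - z)/2, (1 + z)/2), and that the
   disjunctive transformation is one-hot coding of each category. The use
   of the population standard deviation is also an assumption.

```python
from statistics import mean, pstdev

def disjunctive_coding(values):
    """One-hot code a categorical variable; returns (levels, rows)."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

def escofier_coding(values):
    """Double a continuous variable into ((1 - z) / 2, (1 + z) / 2).

    z is the standardized value; the variable must not be constant,
    otherwise the standard deviation is zero.
    """
    mu, sd = mean(values), pstdev(values)
    z = [(v - mu) / sd for v in values]
    return [((1 - zi) / 2, (1 + zi) / 2) for zi in z]

# Hypothetical metadata, for illustration only.
levels, sex_rows = disjunctive_coding(["F", "M", "F"])
age_rows = escofier_coding([30, 50, 70])
```

   Under both codings every row of a coded variable sums to one, which is
   what allows continuous and categorical metadata to be placed in the
   same table before the correspondence analysis.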

   Differential gene expression analysis for all pairwise comparisons was
   completed with the likelihood ratio test in the DESeq2 R package
   ([115]2). The CA components selected for each comparison were included
   as covariates in the DE model. Genes were considered differentially
   expressed if
   they had an FDR-adjusted P-value < 0.05. The ‘downregulated’ group was
   decided alphabetically, as not all comparisons had a clear ‘case’ and
   ‘control’.
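
   The DEG threshold can be illustrated with a short sketch of the
   Benjamini-Hochberg procedure, which is the default FDR adjustment in
   DESeq2. We assume that default applies here; the text itself only
   says ‘FDR-adjusted’. The gene names and P-values are invented, and
   DESeq2's independent filtering is not reproduced.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that the adjusted
    # values never increase as the raw P-values decrease.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-gene P-values, for illustration only.
p = {"GENE_A": 0.001, "GENE_B": 0.008, "GENE_C": 0.039,
     "GENE_D": 0.041, "GENE_E": 0.6}
padj = dict(zip(p, benjamini_hochberg(list(p.values()))))
degs = [gene for gene, q in padj.items() if q < 0.05]
```

   Here GENE_C and GENE_D have raw P-values below 0.05 but adjusted
   values just above it, so only GENE_A and GENE_B would be called DE.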

   Next, we performed pathway enrichment for every pairwise DE comparison.
   Specifically, we inputted all genes in each comparison into
   ActivePathways ([116]40). We selected genes detected in each comparison
   as the statistical background. Genes with an FDR-adjusted P-value <0.05
   were labeled significant. Enriched pathways included both up-regulated
   and down-regulated genes. For both pathway and TF enrichment, we
   applied an FDR correction to the enrichment P-values. All pathways,
   regardless of significance, were returned. Pathways were derived from a
   curated homogeneous gene-set database
   ([117]http://download.baderlab.org/EM_Genesets/) including Gene
   Ontology, Reactome, Panther, and other publicly available gene sets
   ([118]4). We will refer to this curated database as the ‘pathway’
   database. Specifically, we used the
   ‘Human_GO_AllPathways_with_GO_iea_June_01_2021_symbol.gmt’ pathway
   enrichment file, where we included pathways with 15–2000 genes.
   Transcription factor (TF) targets were derived from the
   ‘Human_TranscriptionFactors_MSigdb_June_01_2021_symbol.gmt’ TF target
   file, where we included TFs with 15–5000 target genes.
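
   The size filter on gene sets can be sketched as below. The example
   assumes the usual tab-delimited GMT layout (set name, description,
   then one gene per field); the function name and the toy lines are
   hypothetical, and the defaults mirror the 15–2000 gene range used for
   pathways.

```python
def parse_gmt(lines, min_size=15, max_size=2000):
    """Parse GMT-formatted lines and keep gene sets within the size range."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # A valid record needs a name, a description, and genes.
        name = fields[0]
        genes = {g for g in fields[2:] if g}
        if min_size <= len(genes) <= max_size:
            gene_sets[name] = genes
    return gene_sets

# Toy records with a narrow size range so the filter is visible.
toy_lines = [
    "SET_A\tdescription\tG1\tG2\tG3",
    "SET_B\tdescription\tG1",
    "SET_C\tdescription\tG1\tG2\tG3\tG4\tG5",
]
kept_sets = parse_gmt(toy_lines, min_size=2, max_size=4)
```

   For the TF target file, the same function would be called with
   max_size=5000.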

Display of the differential expression comparisons within the DEET database

   Every comparison was named using the following format: Study ID:
   Cell/Tissue type. condition 1 versus condition 2. Cell and tissue
   types were identified from internal metadata when available and from
   the study summaries in the DRA compendium otherwise. In addition, for every
   pairwise comparison, the study name, source (SRA, TCGA, GTEx and
   SRA-manual) ([119]8,[120]15,[121]16), description from the DRA
   compendium, the number of samples (total, up-condition and
   down-condition), samples (total, up-condition, down-condition), tissue
   (including tumour from TCGA), number of DEs (total, up-condition,
   down-condition), age (mean ± sd), sex, top 15 DEGs—up, top 15
   DEGs—down, top 5 enriched pathways, and top 5 enriched TFs
   ([122]Supplementary File S1) are provided. PubMed IDs are also
   available for studies selected from SRA. Lastly, each pairwise
   comparison was given an overall category based on a DE category list
   decided in Crow et al. ([123]29). We also added the additional
   categories of sex, age and combinations of categories (e.g.
   treatment + timepoint) to accommodate our additional comparisons.

Comparing DE comparisons in the DEET database against their original studies

   We compared a subset of the pairwise DE comparisons we recomputed
   against the same comparisons stored within the supplemental data of
   their original studies.

   SRA. DEGs from Lin41 treatments vs. control in ESCs were obtained from
   [124]Supplementary file S2 from Worringer et al. ([125]41). DEGs from
   timepoints after FOXM1 inhibition in MCF-7 cells were obtained from the
   GEO file ([126]GSE58626) associated with Gormally et al. ([127]42).

   TCGA. Sex differences in DEGs for twelve different cancers summarized
   in the first table of Yuan et al. ([128]43) were obtained by contacting
   the authors directly.

   GTEx. Sex differences in DEGs for seventeen tissues were obtained from
   the supplementary tables of Lopez-Ramos et al. ([129]44).

   For all comparisons in the DEET database and the original study, we
   applied an absolute-value fold change cutoff of 1.5 and an FDR-adjusted
   P-value <0.05 cutoff. For each matching comparison, we evaluated the
   over-representation of overlapping DEGs between the original study and
   DEET’s evaluation of DEGs with a Fisher's exact test of overlapping
   genes using ‘cellmarker_enrich’ in the scMappR R package
   ([130]45,[131]46), where the background is the number of genes detected
   in the original study. We then tested whether applying the DEET
   enrichment tool to each original DE comparison would enrich for their
   analogous DEG comparison within the DEET database. We outputted the
   rank at which the analogous comparison was enriched. If the analogous
   comparison was not ranked one (i.e. the most enriched study), we
   outputted whether every more strongly enriched comparison contained the
   same primary variable (e.g. sex differences in a different tissue).
   Then, we measured the log2(Fold-change) similarity of genes that
   overlapped between the two studies using a Pearson's correlation.
   P-values for these Fisher's exact tests and correlations were FDR
   corrected, and the log2(Fold-change) of the genes designated as DE in
   the original study or the DEET database were plotted using the
   ‘ggplot2’ R package ([132]47).
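
   The over-representation test on overlapping DEGs can be illustrated
   with a one-sided Fisher's exact test, computed here directly from the
   hypergeometric distribution. This is a sketch of the general
   technique and not a reimplementation of ‘cellmarker_enrich’; the
   function name and the toy gene sets are assumptions.

```python
from math import comb

def overlap_enrichment(degs_a, degs_b, background):
    """One-sided Fisher's exact test for the overlap of two DEG lists.

    Returns (overlap size, P-value), where the P-value is the probability
    of seeing at least that many shared genes by chance, given the
    background of detected genes.
    """
    bg = set(background)
    a, b = set(degs_a) & bg, set(degs_b) & bg
    k, n_a, n_b, n_bg = len(a & b), len(a), len(b), len(bg)
    # Hypergeometric upper tail: P(X >= k).
    tail = sum(comb(n_a, i) * comb(n_bg - n_a, n_b - i)
               for i in range(k, min(n_a, n_b) + 1))
    return k, tail / comb(n_bg, n_b)

# Toy data: 20 detected genes, two 5-gene DEG lists sharing 4 genes.
background = [f"g{i}" for i in range(20)]
original_degs = ["g0", "g1", "g2", "g3", "g4"]
deet_degs = ["g0", "g1", "g2", "g3", "g10"]
n_shared, p_value = overlap_enrichment(original_degs, deet_degs, background)
```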

Implementation

   We provide an R package, DEET, and a Shiny applet that allow users to
   query a list of their genes against our 3162 consistently computed DEG
   lists. The DEET R package can be installed from CRAN
   ([133]Supplementary File S2; p. 1), and the Shiny applet can be found
   at ([134]https://wilsonlab-sickkids-uoft.shinyapps.io/DEET-shiny/). We
   also provide a workflow for users to query and visualize their DEGs
   against the DEET database ([135]Supplementary File S1). We only query
   significant DEGs in the DEET R package and Shiny App. Both data sets
   can be downloaded with the DEET_data_download() function in the R
   package. The DEGs from each pairwise comparison within the DEET
   database are also stored in the gene-matrix transpose (*.gmt) format,
   allowing users to incorporate the DEET database with other pathway
   enrichment tools such as g:Profiler and GSEA ([136]3,[137]4).

   The primary function of the DEET R package is to allow users to query
   their list of DEGs against the consistently computed DEGs within the
   DEET database by using the function DEET_enrich(). The optimal input
   into DEET’s enrichment function, DEET_enrich(), is a data frame of
   genes (human gene symbols) with an associated P-value and coefficient
   (e.g. Fold-change) in conjunction with a list of genes designating the
   statistical background. First, DEET internally applies the
   ActivePathways ([138]40) function to the user's gene list to identify
   enriched pathways and TFs using the same pathway and TF datasets stored within
   the DEET database. DEET then uses ActivePathways ([139]40) again to
   compute the enrichment of DEET comparisons at the gene, GO and TF
   levels. ActivePathways ([140]40) used all detected genes as the
   statistical background, Brown's P-value fusion method, and an
   FDR-adjusted P-value cutoff of 0.05. Then, DEET_enrich() enriches the
   users' inputted genes, pathways, and TF targets against the DEET
   database's DEGs, pathways, and TF targets. Then, DEET_enrich() computes
   the Spearman's and Pearson's correlation between the coefficients of
   the user's gene list and the log2(Fold-change) of DEGs within
   enriched pairwise comparisons in the DEET database. Finally, the
   P-values of these correlations are corrected with an FDR correction.
   Users can also adjust the significance thresholds of the FDR-adjusted
   P-values and log2(Fold-changes) of detected DEGs within this database
   using the adjust_DE_cutoffs() function. This function also provides
   instructions to repeat pathway enrichments of each comparison using the
   new cutoffs. Finally, if users do not want to recompute pathways but
   want new cutoffs, they can use the gene-centric equivalent to
   DEET_enrich(), DEET_enrich_genesonly(). This function is also
   compatible if users want to use Ensembl IDs to identify DE pseudogenes
   and non-coding RNAs that do not have a gene symbol. Together, DEET’s
   enrichment tool returns significantly enriched studies based on
   overlapping DEGs, pathways, and TFs with the flexibility of many
   different hypotheses and inputted data types.
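
   The correlation step can be sketched as follows. The code is a
   generic illustration of Pearson's and Spearman's correlation on the
   genes shared between a user's list and one DEET comparison. It is not
   the internals of DEET_enrich(), and the gene names and coefficients
   are invented.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical coefficients keyed by gene symbol.
user = {"A": 2.0, "B": -1.0, "C": 0.5, "D": 3.0}
deet_comparison = {"A": 1.5, "B": -2.0, "C": 1.0, "E": 0.3}
shared = sorted(set(user) & set(deet_comparison))
x = [user[g] for g in shared]
y = [deet_comparison[g] for g in shared]
r_pearson, r_spearman = pearson(x, y), spearman(x, y)
```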

   Optionally, DEET_enrich() may be used with a generic gene list (i.e.
   without P-values or coefficients). We assume an inputted list is
   unordered or in decreasing order of significance. If the gene list is
   ordered, we evenly space the P-values, with the least significant
   P-value being 0.049. The Pearson's correlation between the inputted
   gene list and the DEGs within the DEET database is excluded. If the
   inputted gene list is unordered, then all P-values are set to 0.049,
   and both Spearman's and Pearson's correlations between the users'
   inputted genes and the DEGs within the DEET database are excluded. If
   users do not provide a background set of genes, we assume the
   background set is all genes detected within the DEET database.
   Alternatively, users may leverage the P-values and coefficients of our
   precomputed DEG lists to enrich an unranked gene list. Briefly, the
   DEET_Input_as_Reference() function converts a user's gene list into a
   gene set database (i.e. a *.gmt file) of one gene list before enriching
   each DEG list (weighted by FDR-adjusted P-value) against the user's list
   of genes. This way, enrichment is detected not just by the overlap of
   precomputed DEGs with inputted genes but also by the significance of
   these precomputed DEGs.
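
   The handling of generic gene lists described earlier in this
   paragraph can be sketched as follows. The text fixes only the least
   significant P-value (0.049); the spacing used here, from 0.049/n up to
   0.049 in equal steps, is an assumption of the example, as is the
   function name.

```python
def pseudo_p_values(genes, ordered, least_significant=0.049):
    """Assign placeholder P-values to a gene list supplied without statistics.

    ordered=True  : genes are in decreasing order of significance and get
                    evenly spaced P-values ending at `least_significant`.
    ordered=False : every gene gets `least_significant`.
    """
    n = len(genes)
    if not ordered:
        return {g: least_significant for g in genes}
    # Assumed spacing: the i-th gene (1-based) receives 0.049 * i / n.
    return {g: least_significant * (i + 1) / n for i, g in enumerate(genes)}

ranked = pseudo_p_values(["TP53", "MYC", "EGFR", "KRAS"], ordered=True)
unranked = pseudo_p_values(["TP53", "MYC", "EGFR", "KRAS"], ordered=False)
```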

   The DEET R package also contains plotting functions to summarize the
   most significant studies based on each enrichment test and correlation
   within DEET_enrich(). The process_and_plot_DEET_enrich() function plots
   barplots of the most enriched studies based on gene set enrichment
   (ActivePathways ([141]40)) of overlapping DEGs, pathways, and TF
   targets. DEET also generates
   scatterplots of the most enriched studies based on Spearman's
   correlation analysis. All plots are generated using ggplot2 ([142]47),
   and DEET_enrich() returns the ggplot2 ([143]47) objects for each plot
   to allow researchers to customize plots further.

   Lastly, the DEET R package contains a function called
   DEET_feature_extract(), allowing researchers to identify genes whose
   log2(Fold-change) is associated with a response variable (e.g.
   fold-change of the gene of interest, and whether the study investigates
   cancer, etc.). Genes must be detected (not necessarily DE) in at least
   70% of studies (users can adjust this threshold) to be included in
   predicting the fold-changes of other genes. This threshold exists
   because highly sparse genes have the potential to over-predict the
   fold-changes of other genes detected in the same studies using both
   elastic net regressions and simple correlations. After filtering, genes
   are extracted by calculating the coefficients from a Gaussian family
   elastic net regression using the ‘glmnet’ R package ([144]48,[145]49)
   and Spearman's correlation between every gene and the response
   variable. If the response variable is categorical (e.g. comparison
   category), features are extracted by calculating the coefficients from
   a multinomial family elastic net regression and an ANOVA ([146]50)
   between each category within the response variable. Lastly, if the
   response variable is binary (e.g. enriches for TNFa pathway yes/no,
   Cancer study yes/no, etc.), features are extracted using a binomial
   family elastic net regression and a Wilcoxon's test ([147]51) between
   the two categories within the response variable.
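
   The detection filter and the correlation step for a continuous
   response can be sketched as follows. This is a stdlib-only Python
   illustration under stated assumptions, not the DEET implementation:
   undetected genes are encoded as None, pairs with a missing value are
   dropped before correlating, and all function names are hypothetical.
   The elastic net step is omitted.

```python
def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation undefined for a constant vector
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def extract_by_correlation(fold_changes, response, min_fraction=0.7):
    """fold_changes: {gene: [log2FC or None per comparison]}."""
    result = {}
    for gene, values in fold_changes.items():
        detected = [(v, r) for v, r in zip(values, response) if v is not None]
        if len(detected) / len(values) < min_fraction:
            continue  # too sparse: skipped, as with the 70% threshold
        x, y = zip(*detected)
        result[gene] = spearman(x, y)
    return result


fold_changes = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],     # detected in 5/5 comparisons
    "GENE_B": [None, None, 0.5, None, 1.0],  # detected in 2/5, filtered
    "GENE_C": [5.0, 3.0, None, 2.0, 1.0],    # detected in 4/5 comparisons
}
response = [0.1, 0.4, 0.5, 0.9, 1.2]
correlations = extract_by_correlation(fold_changes, response)
```

   Here GENE_B is removed by the filter, while GENE_A and GENE_C are
   perfectly monotone with the response in opposite directions.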

Clustering of studies within the DEET database

   Pairwise correlation analysis was completed for every pair of studies
   in the DEET database. Specifically, for each pair of studies, we took
   the genes DE in at least one of the two studies and computed the
   Pearson's correlation of their FDR-adjusted P-values. The R^2 values
   of these pairwise correlations were populated into a correlation
   matrix. We then computed the Euclidean distance matrix of the absolute
   value of the correlation matrix before performing hierarchical
   clustering using the Ward.D2 ([148]52) method with a height cut-off of
   30. The correlation matrix was clustered and plotted with the pheatmap
   R package ([149]47,[150]52). Median proportions of overlapping DEGs
   within each cluster were calculated by making a
   comparison-by-comparison matrix and populating it with the number of
   intersecting genes. Then, each row of the matrix was divided by the
   number of DEGs in that row's comparison. The median of this matrix was
   then calculated and represented by the barplot. For example, a value of
   0.075 for cluster 5 means that ‘on average, a comparison within cluster
   5 will share 7.5% of its DEGs with another comparison within cluster
   5’. Finally, we annotated the biological and hallmark gene-sets for
   each cluster using ActivePathways, using Brown's P-value fusion method.
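
   The median proportion of overlapping DEGs within a cluster can be
   sketched as follows. This is a minimal Python illustration with
   hypothetical function names and toy gene sets. Excluding the
   diagonal (a comparison trivially shares all of its DEGs with itself)
   is our assumption, as the text does not state how the diagonal was
   handled, and every comparison is assumed to contain at least one
   DEG.

```python
from statistics import median


def overlap_proportions(deg_sets):
    """Row-normalized comparison-by-comparison overlap matrix.

    Entry [row][col] is |DEGs(row) & DEGs(col)| / |DEGs(row)|.
    """
    names = list(deg_sets)
    return {
        row: {
            col: len(deg_sets[row] & deg_sets[col]) / len(deg_sets[row])
            for col in names
        }
        for row in names
    }


def median_shared_fraction(deg_sets):
    matrix = overlap_proportions(deg_sets)
    # Assumption: self-comparisons on the diagonal are left out.
    off_diagonal = [
        matrix[row][col] for row in matrix for col in matrix[row] if row != col
    ]
    return median(off_diagonal)


cluster = {
    "comparison_1": {"A", "B", "C", "D"},
    "comparison_2": {"A", "B", "E"},
    "comparison_3": {"A", "F"},
}
proportions = overlap_proportions(cluster)
shared = median_shared_fraction(cluster)
```

   Because each row is divided by its own DEG count, the matrix is not
   symmetric: comparison_1 shares 2 of its 4 DEGs with comparison_2,
   whereas comparison_2 shares 2 of its 3 DEGs with comparison_1.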

Case study: evaluating TNFa response in human endothelial cells

   We acquired the full edgeR results of differential expression analysis
   in both the intronic RNA-seq and exonic RNA-seq from the original
   authors of Alizada et al., 2021 ([151]53,[152]54). All detected genes
   in Alizada et al. ([153]54) were used as the statistical background.
   Genes were separated into up-regulated and down-regulated DEGs using
   the authors’ cut-offs of FDR < 0.1 and an absolute log2(Fold-change)
   of 0.6. Then, each gene list was
   inputted into the DEET_enrich() function using default parameters. We
   also generated matrices of all FDR-adjusted P-values where each row is
   a gene, and each column is an RNA-seq type (i.e. intronic RNA-seq and
   exonic RNA-seq). Genes with a log2(Fold-change) > 0 had their FDR set
   to 1 to focus on downregulated genes. These matrices were inputted into
   ActivePathways ([154]40) using default parameters. The *gmt file
   inputted into ActivePathways was the full list of DEGs stored within
   the DEET database and can be accessed with DEET_data_download().
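
   The construction of the gene-by-RNA-seq-type matrix can be sketched
   as follows. This is a minimal Python illustration with hypothetical
   names and toy values, not the code used in the case study. Setting
   the FDR to 1 for a gene missing from one RNA-seq type is our
   assumption; the text only specifies the rule for genes with a
   log2(Fold-change) > 0.

```python
def downregulated_fdr_matrix(results):
    """results: {rna_type: {gene: (log2_fold_change, fdr)}}.

    Returns {gene: {rna_type: fdr}} with the FDR of every up-regulated
    gene replaced by 1, so only down-regulated genes carry signal.
    """
    genes = sorted(set().union(*(table.keys() for table in results.values())))
    matrix = {}
    for gene in genes:
        row = {}
        for rna_type, table in results.items():
            if gene not in table:
                row[rna_type] = 1.0  # assumption: no evidence if undetected
                continue
            log2_fold_change, fdr = table[gene]
            row[rna_type] = 1.0 if log2_fold_change > 0 else fdr
        matrix[gene] = row
    return matrix


# Toy values for illustration only.
results = {
    "intronic": {"GENE_A": (-1.2, 0.001), "GENE_B": (3.0, 1e-20)},
    "exonic": {"GENE_A": (-0.8, 0.02), "GENE_C": (-0.7, 0.04)},
}
fdr_matrix = downregulated_fdr_matrix(results)
```

   A matrix of this shape (rows are genes, columns are evidence
   sources) is what a P-value fusion step would then combine per gene.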

RESULTS

Summary of the differential expression enrichment tool: atlas and R package

   We furthered the advancements in high-throughput RNA-seq analysis by
   re-computing thousands of pairwise differential analyses. Specifically,
   a total of 3162 comparisons were selected based on sample numbers and
   the interpretability of the comparisons. In total, 405 studies in
   recount2, the reprocessed RNA-seq count data used to recompute these
   DEG sets, contained at least five samples and one variable with two or
   more groups. After study filtering, 142 of these 405 studies remained
   to recompute differential analysis. Specifically, 162 studies were
   filtered due to insufficient sample sizes in one group (N < 3) within a
   study and/or because DESeq2 was unable to estimate parametric or local
   dispersions. The remaining 98 studies were filtered because their
   metadata variables with multiple conditions did not meet the DEET
   database's inclusion criteria (see ‘Materials and Methods’ for details).
   Briefly, these criteria included study-relatedness, metadata
   stratification, confounding, and the interpretability of an individual
   comparison. In addition, studies where fastq files were generated from
   a scRNA-seq protocol were filtered because their bioinformatic analysis
   and DE are inherently different from bulk and cell-sorted RNA-seq.
   Additionally, only potentially meaningful DE comparisons from within
   the original study and tissue (e.g. TCGA samples were not compared to
   GTEx samples, liver samples were not compared to kidney samples) were
   included (Figure [155]1A). It is important to note that filtered
   studies were not necessarily intended for differential analysis, and
   there was not an inherent flaw in the original studies but an
   incompatibility with DEET. Lastly, while no entire study was filtered
   because of the number of DEGs, 246 comparisons were filtered for
   containing more than 10,000 or fewer than 5 DEGs.
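
   The numeric parts of these filters can be written as two small
   predicates. This is a Python sketch with hypothetical function names
   that encodes only the thresholds quoted above (at least five
   samples, two or more groups, no group with N < 3, and between 5 and
   10,000 DEGs). The metadata-based criteria and the DESeq2 dispersion
   checks are not represented.

```python
def study_has_enough_samples(group_sizes, min_total=5, min_per_group=3):
    """group_sizes: number of samples in each group of one variable."""
    return (
        len(group_sizes) >= 2                   # two or more groups
        and sum(group_sizes) >= min_total       # at least five samples
        and min(group_sizes) >= min_per_group   # no group with N < 3
    )


def comparison_has_usable_deg_count(n_degs, min_degs=5, max_degs=10_000):
    """Reject comparisons with fewer than 5 or more than 10,000 DEGs."""
    return min_degs <= n_degs <= max_degs


checks = {
    "balanced_3_vs_3": study_has_enough_samples([3, 3]),
    "one_small_group": study_has_enough_samples([4, 2]),
    "single_group": study_has_enough_samples([6]),
}
```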

   Comparisons in GTEx (N = 1594 comparisons) ([156]16) and TCGA (N = 957
   comparisons) ([157]15) were chosen based on whether the metadata had
   discrete options in their clinical metadata sheets. The primary
   variable comparisons from SRA (N = 611 comparisons across 142 studies)
   ([158]8) were chosen based on their relationship to the authors'
   reported study description, which we added to DEET’s metadata. To
   provide an overview of the 985 types of DE comparisons in the DEET
   database, we sorted comparisons into 26 combinations of DE categories
   originally defined by Crow et al. ([159]29), with most categories
   related to ‘disease’ or ‘treatment’ (Figure [160]1C). Overall, we have
   recomputed 3162 differential expression analyses across 144 studies,
   including almost 1000 comparisons from TCGA and over 1500 comparisons
   from GTEx. Our database spans hundreds of different hypotheses,
   including but not limited to cell-line treatments, pairwise time-series
   comparisons, cancer treatments, population-level transcriptomic
   studies, and cellular development and differentiation. These
   comparisons' DEGs, enriched pathways and TFs, comparison-level
   metadata, and non-protected sample-level metadata are open source and
   publicly available for use (see Supplementary Data).

   We built DEET as a bioinformatic package that leverages and expands
   upon pre-existing gene-set enrichment tools to interact with our set
   of DEG comparisons in a user-friendly way, facilitating hypothesis
   generation and providing biological insight from user-defined
   differential gene expression results. To use DEET, users input a list
   of genes with an associated P-value and summary statistic (i.e.
   fold-change). DEET performs pathway term- and TF target-enrichment
   analysis using this gene list. DEET compares (a) the gene list itself
   against a database of the precomputed DEGs within this study and (b)
   enriched pathway terms and potential regulatory TFs with precomputed
   enrichment results. DEET returns and visualizes a set of RNA-seq
   experiments with similar results together with the genes and pathways
   responsible for the overlap between studies.

   DEET uses a ranked hypergeometric test provided by ActivePathways to
   compare a user-provided gene list to pre-computed DEGs ([161]40).
   Unlike the gene sets stored within GO and pathway databases, the gene
   lists used by DEET are weighted by P-value and fold-change. DEET
   correlates the DEG coefficients with the fold-changes of a user's DEG
   list and tests if other studies are changing in a similar pattern. In
   addition, DEET uses enriched GO terms and TFs based on the user's gene
   list to identify studies with similar pathway enrichments using the
   hypergeometric test in ActivePathways ([162]40). Lastly, DEET provides
   software for data visualization of enriched gene lists.

Global patterns of differentially expressed genes within the DEET database

   We first investigated the number of samples within each comparison
   within the DEET database. Specifically, we found a median of 127,
   141 and 12 samples per comparison from TCGA, GTEx and SRA sources,
   respectively. After accounting for the ratio of samples in each
   condition (see ‘Materials and Methods’ for details), the ‘scaled’
   sample sizes were 26, 13 and 7, respectively. As expected, we found that the
   number of DEGs was positively correlated with the ratio-scaled number
   of samples in every source ([163]Supplementary Figure S1). Furthermore,
   when accounting for the ratio in sample size, the variance in the total
   number of DEGs also decreases as the sample size increases
   ([164]Supplementary Figure S2).

   Previously, Crow et al. used 635 pairwise human DE comparisons from
   consistently processed microarray data from the Gemma database
   ([165]28,[166]29). They developed a ‘DE prior’ statistic, a
   multifunctionality analysis optimizing the rank ([167]55) of common
   DEGs that were predictive of gene expression in most studies ([168]29).
   Their 'DE prior' highlighted that genes related to sex, cellular
   response, extracellular matrix, and inflammation were commonly DE
   regardless of comparison, while housekeeping genes were uncommonly DE.
   Furthermore, due to the unbiased nature of the DE comparisons used to
   predict their 'DE prior', they predicted these DEGs to be robust across
   consortia. Therefore, we generated a 'DE prior' for the DEET database
   to be able to compare whether the overall patterns of differential
   expression within the DEET database replicate those in Crow et al.

   We found that building a DE prior from the DEGs stored within the DEET
   database yielded a correlated ranking of DEGs
   (P-value = 2.64 × 10^−171, rho = 0.215) to the 'DE prior' in Crow
   et al. Furthermore, the top 1% of DE genes in each ‘DE prior’ list were
   significantly overlapping (FDR-adjusted P-value = 1.37 × 10^−18,
   OR = 15.3), with 26 overlapping genes primarily related to the Y
   chromosome and inflammation (Figure [169]1D, [170]Supplementary Figure
   S3). We then repeated this analysis at 1% intervals. We found that the
   top 10% of genes significantly overlapped between the ‘DE prior’ from
   Crow et al. ([171]29) and the DE prior from the DEET database (Figure
   [172]1D, [173]Supplementary Figure S3). Together, the global patterns
   of DEG frequency within the DEET database replicate established
   differential expression patterns.
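
   The overlap of the top-ranked genes of two ‘DE prior’ lists can be
   sketched as follows. This is a minimal Python illustration with
   hypothetical names and toy rankings. It reports the sample odds
   ratio of a 2×2 table restricted to genes present in both rankings;
   the test statistic and odds-ratio estimator used in the analysis
   above may differ, and no P-value is computed here.

```python
def top_fraction(ranked_genes, fraction):
    """Set of genes in the top `fraction` of a best-first ranking."""
    k = max(1, int(len(ranked_genes) * fraction))
    return set(ranked_genes[:k])


def top_overlap(ranking_a, ranking_b, fraction):
    """Return (n_shared_top_genes, sample odds ratio)."""
    universe = set(ranking_a) & set(ranking_b)
    a = [g for g in ranking_a if g in universe]
    b = [g for g in ranking_b if g in universe]
    top_a, top_b = top_fraction(a, fraction), top_fraction(b, fraction)
    both = len(top_a & top_b)
    only_a = len(top_a - top_b)
    only_b = len(top_b - top_a)
    neither = len(universe) - both - only_a - only_b
    if only_a == 0 or only_b == 0:
        return both, float("inf")  # odds ratio is unbounded
    return both, (both * neither) / (only_a * only_b)


# Toy rankings of 20 genes; the second swaps a few positions.
ranking_a = [f"g{i}" for i in range(20)]
ranking_b = ["g1", "g0", "g2", "g7", "g3", "g4", "g5", "g6"] + [
    f"g{i}" for i in range(8, 20)
]
shared, odds_ratio = top_overlap(ranking_a, ranking_b, fraction=0.25)
```

   Repeating the call over a grid of fractions (0.01, 0.02, and so on)
   mirrors the 1% interval analysis described above.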

Distribution of DEG comparisons and pathways within the DEET database

   After profiling the DEGs within the DEET database, we investigated how
   the 3162 comparisons clustered based on their DE profile. We expected
   comparisons to be clustered by shared underlying biology and
   experimental design; however, many comparisons originate from
   population-level comparisons in large consortium datasets (e.g. age,
   sex, time of death, presence of pneumonia, etc., in GTEx). Accordingly,
   population versus experimental RNA-seq designs, such as those found in
   SRA, may also drive cluster structure. We indeed found that the
   comparison source played a substantial role in cluster formation, with
   7/23 clusters composed entirely of GTEx comparisons
   ([174]Supplementary Figure S4A, B, [175]Supplementary File S1). While
   TCGA is a population-level cohort, much of the metadata stored within
   TCGA is related to specific treatments (i.e. drug treatment). Like the
   sample source, the tissue of origin within the DE comparison also
   contributed to cluster identification. For example, clusters 20 and
   23 were composed primarily of GTEx comparisons in EBV-transformed
   lymphocytes and clusters 21–22 contained almost exclusively GTEx
   comparisons in different brain regions ([176]Supplementary Figure S4B).

   We investigated how many DEGs overlap between all pairwise comparisons
   within a cluster. We found that clusters primarily annotated by shared
   experimental design (i.e. clusters 1–4, 6–7, 9–10 and 13) shared an
   average of 21.0% (3.4–45.1%) of their DEGs with another comparison
   within the same cluster ([177]Supplementary Figure S4C). In contrast,
   clusters primarily annotated by source (GTEx, TCGA, and SRA) and
   tissue (i.e. clusters 11–12, 15, 18–23) only shared an average of
   6.49% (1.95–12.8%) of their DEGs with other comparisons in the same
   cluster ([178]Supplementary Figure S4C). Using ActivePathways
   ([179]40), which fuses P-values across different DE comparisons before
   conducting gene set enrichment, we annotated each cluster with
   pathways ([180]Supplementary Figure S4D) and the 50
   Hallmark gene sets ([181]Supplementary Figure S4E). Many clusters
   contained enrichment for development and immune response pathways in
   the Hallmark and the pathway gene sets. For example, the ‘Humoral
   immune response’ gene ontology was in the top 5 most enriched pathways
   for 7/23 clusters ([182]Supplementary Figure S4D), and the
   ‘Inflammatory response’ was in the top 5 most enriched Hallmarks
   in 12/23 clusters ([183]Supplementary Figure S4E). In addition, the
   ‘Kras signaling - down’ hallmark gene set was in the top 5 most
   enriched gene sets in 21/23 clusters ([184]Supplementary Figure S4E).
   This strong and consistent enrichment of KRAS signaling likely reflects
   a bias towards cancer-related experiments in the DEET database.
   Specifically, there are 957 comparisons from TCGA, all of which are
   considered cancer-related, 47 comparisons in GTEx investigating cancer,
   and 134 comparisons in SRA where ‘cancer’ or ‘tumour’ were part of the
   DE comparison name or description.

Differentially expressed genes within the DEET database reflect the findings in
the original studies

   We next evaluated how the gene lists within the DEET database reflect
   the DEGs reported in the original studies. We chose publicly available
   comparisons from each primary source within the DEET database (GTEx,
   TCGA, and SRA) using multiple experimental designs and expected
   perturbation strength. To verify if our DE comparisons made from GTEx
   data correspond to previously published analyses, we compared the
   pairwise analysis of sex differences within 17 tissues to what was
   reported in the original study ([185]44). To verify our DE analysis of
   TCGA data, we compared our results for the pairwise sex differences
   within the 12 tumour types to what was reported in the original study
   ([186]43). To verify our comparisons in SRA, we chose two studies: DEGs
   measured from (a) MCF-7 cells after FOXM1 inhibition (control t = 0
   versus 3, 6 and 9 h) ([187]42) and (b) Lin41-1 knockdown, and Lin41-2
   knockdown in human embryonic stem cells ([188]41). These comparisons
   also contain a wide range of perturbation strength, with a minimum of
   30 DEGs detected and a maximum of 2077 DEGs detected. Gormally
   et al. also completed qPCR for five genes at the t = 0 versus t = 6 h,
   which we replicated in our DE of that timepoint ([189]Supplementary
   Table S1). As expected, we found that each DEG list obtained from the
   original study either enriched for its own comparison as the single
   most enriched gene list (6/6 comparisons from SRA, 4/12 comparisons
   from TCGA, 12/17 comparisons from GTEx) or enriched for a study within
   the same source and comparison type but in a different tissue (Table
   [190]1, [191]Supplementary File S3). For example, sex differences in
   glioblastoma multiforme (GBM) stored within the supplementary files of
   Yuan et al. enriched for DEET-computed sex differences in GBM as the
   fifth most significant comparison, while the most
   significantly enriched comparison was sex differences in Uveal melanoma
   (UVM) within the TCGA cohort ([192]15) (Table [193]1,
   [194]Supplementary File S3). We also found that every pairwise
   comparison from these studies had a highly significant overlap in DEGs
   and highly correlated fold-changes in overlapping DEGs (Table [195]1,
   [196]Supplementary Figure S5). We captured 31.4–87.5% of the original
   DEGs, which is in line with differences that can occur when comparing
   any two commonly used differential analysis approaches to the same
   RNA-seq count matrix ([197]56).
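
   The ‘Genes captured’ statistic reported in Table 1 is the number of
   intersecting DEGs divided by the number of DEGs reported by the
   original study. The short Python sketch below reproduces three
   values using counts taken from Table 1; the function name is
   hypothetical.

```python
def genes_captured(intersecting_degs, study_degs):
    """Percentage of the original study's DEGs recovered by DEET."""
    return 100.0 * intersecting_degs / study_degs


# (intersecting DEGs, DEGs in the original study), taken from Table 1.
table_rows = {
    "GTEx Adipose Subcutaneous": (230, 482),
    "SRP043378 FOXM1 inhibition, 0 versus 6 h": (867, 1154),
    "SRP032743 Control versus siLIN41-2": (11, 35),
}
captured = {
    name: round(genes_captured(shared, total), 2)
    for name, (shared, total) in table_rows.items()
}
```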

Table 1.

   Overview of the performance evaluation testing the accuracy of the
   recomputed DEG compendium and DEET tool. This table consists of 35
   comparisons across four studies with statistics reporting on how well
   the DEGs from the original study overlap with their analogous
   comparison within the DEET database
   Columns: Study | Comparison/tissue | DEET_enrich() study rank | DEET_enrich() top comparison | Pearson's correlation of intersecting DEGs (R^2) | Pearson's correlation of intersecting DEGs (FDR) | FDR - hypergeometric test | Odds ratio of overlap | DEET-specific DEGs | Study-specific DEGs | Intersecting DEGs | Genes captured

   Lopez-Ramos et al., 2020 | Adipose Subcutaneous | 1 | - | 0.91 | 9.58E−87 | 5.20E−134 | 12.32 | 1063 | 482 | 230 | 47.72%
   Lopez-Ramos et al., 2020 | Adipose Visceral | 1 | - | 0.89 | 8.37E−21 | 4.30E−45 | 19.37 | 781 | 89 | 57 | 64.04%
   Lopez-Ramos et al., 2020 | Adrenal Gland | 1 | - | 0.84 | 1.77E−09 | 9.75E−28 | 22.06 | 695 | 48 | 32 | 66.67%
   Lopez-Ramos et al., 2020 | Artery Aorta | 1 | - | 0.86 | 2.82E−22 | 2.00E−61 | 22.96 | 612 | 127 | 72 | 56.69%
   Lopez-Ramos et al., 2020 | Artery Coronary | 1 | - | 0.83 | 1.63E−09 | 3.64E−23 | 13.22 | 929 | 63 | 34 | 53.97%
   Lopez-Ramos et al., 2020 | Artery Tibial | 1 | - | 0.87 | 3.25E−20 | 2.24E−47 | 16.76 | 654 | 139 | 63 | 45.32%
   Lopez-Ramos et al., 2020 | Brain Cerebellum | 7 | GTEx - Nucleus Accumbens - sex | 0.87 | 3.13E−10 | 8.08E−22 | 14.54 | 810 | 58 | 30 | 51.72%
   Lopez-Ramos et al., 2020 | Colon Sigmoid | 7 | GTEx - Lung - sex | 0.90 | 2.04E−10 | 6.19E−25 | 24.28 | 569 | 45 | 27 | 60.00%
   Lopez-Ramos et al., 2020 | Colon Transverse | 3 | GTEx - Stomach - sex | 0.82 | 1.48E−08 | 3.30E−30 | 28.30 | 502 | 51 | 31 | 60.78%
   Lopez-Ramos et al., 2020 | Esophagus Mucosa | 1 | - | 0.85 | 8.83E−18 | 1.23E−51 | 22.02 | 612 | 110 | 61 | 55.45%
   Lopez-Ramos et al., 2020 | Esophagus Muscularis | 23 | GTEx - Colon Sigmoid - sex | 0.87 | 6.62E−11 | 6.84E−28 | 21.60 | 602 | 57 | 32 | 56.14%
   Lopez-Ramos et al., 2020 | Heart Atrial Appendage | 1 | - | 0.90 | 8.65E−16 | 4.05E−35 | 21.43 | 600 | 75 | 41 | 54.67%
   Lopez-Ramos et al., 2020 | Lung | 1 | - | 0.85 | 1.27E−11 | 6.56E−42 | 37.90 | 397 | 63 | 39 | 61.90%
   Lopez-Ramos et al., 2020 | Pituitary Gland | 1 | - | 0.86 | 2.61E−22 | 1.18E−47 | 14.27 | 993 | 118 | 71 | 60.17%
   Lopez-Ramos et al., 2020 | Spleen | 12 | GTEx - Lung - sex | 0.83 | 1.04E−08 | 1.47E−26 | 21.09 | 676 | 50 | 31 | 62.00%
   Lopez-Ramos et al., 2020 | Stomach | 1 | - | 0.81 | 3.21E−09 | 2.00E−36 | 32.06 | 468 | 57 | 36 | 63.16%
   Lopez-Ramos et al., 2020 | Thyroid | 1 | - | 0.94 | 1.62E−118 | 2.30E−144 | 12.84 | 1186 | 437 | 245 | 56.06%
   Yuan et al., 2016 | BLCA | 47 | TCGA - UVM - sex | 0.80 | 3.77E−06 | 3.06E−18 | 18.80 | 939 | 48 | 23 | 47.92%
   Yuan et al., 2016 | COAD | 12 | TCGA - PAAD - sex | 0.72 | 4.86E−05 | 3.06E−30 | 75.45 | 587 | 36 | 25 | 69.44%
   Yuan et al., 2016 | GBM | 5 | TCGA - UVM - sex | 0.83 | 6.39E−06 | 6.74E−30 | 131.78 | 273 | 31 | 20 | 64.52%
   Yuan et al., 2016 | HNSC | 56 | TCGA - UVM - sex | 0.88 | 1.12E−11 | 1.55E−25 | 19.87 | 1242 | 60 | 34 | 56.67%
   Yuan et al., 2016 | KIRC | 1 | - | 0.88 | 7.99E−74 | 3.66E−74 | 5.97 | 2192 | 538 | 227 | 42.19%
   Yuan et al., 2016 | KIRP | 1 | - | 0.80 | 4.59E−100 | 2.17E−139 | 6.04 | 2383 | 1007 | 451 | 44.79%
   Yuan et al., 2016 | LGG | 26 | TCGA - CHOL - sex | 0.78 | 5.39E−06 | 1.50E−31 | 85.85 | 518 | 36 | 25 | 69.44%
   Yuan et al., 2016 | LIHC | 1 | - | 0.63 | 1.73E−17 | 4.57E−60 | 7.82 | 1859 | 325 | 144 | 44.31%
   Yuan et al., 2016 | LUAD | 1 | - | 0.86 | 4.85E−26 | 6.89E−61 | 18.38 | 1038 | 170 | 85 | 50.00%
   Yuan et al., 2016 | LUSC | 5 | TCGA - PAAD - sex | 0.88 | 1.65E−11 | 4.28E−24 | 16.29 | 947 | 74 | 33 | 44.59%
   Yuan et al., 2016 | READ | 3 | TCGA - UVM - sex | 0.80 | 7.11E−06 | 8.15E−35 | 180.59 | 241 | 32 | 22 | 68.75%
   Yuan et al., 2016 | THCA | 5 | TCGA - UVM - sex | 0.87 | 4.13E−09 | 1.29E−34 | 65.83 | 280 | 56 | 27 | 48.21%
   SRP043378 | FOXM1 Inhibition, 0 versus 6 H | 1 | - | 0.95 | 0 | 0 | 22.72 | 1510 | 1154 | 867 | 75.13%
   SRP043378 | FOXM1 Inhibition, 0 versus 9 H | 1 | - | 0.94 | 0 | 0 | 24.28 | 1486 | 1180 | 897 | 76.02%
   SRP043378 | FOXM1 Inhibition, 0 versus 3 H | 1 | - | 0.95 | 1.31E−265 | 0 | 19.27 | 790 | 927 | 517 | 55.77%
   SRP032743 | Control versus siLIN41-1 | 1 | - | 0.98 | 2.15E−23 | 1.89E−52 | 307.23 | 782 | 40 | 35 | 87.50%
   SRP032743 | Control versus siLIN41-2 | 1 | - | 0.94 | 1.77E−05 | 3.76E−18 | 109.45 | 146 | 35 | 11 | 31.43%
   SRP032743 | siLIN41-1 versus siLIN41-2 | 1 | - | −0.96 | 2.40E−07 | 1.16E−24 | 414.12 | 272 | 17 | 13 | 76.47%

   Lastly, when looking at the total number of DEGs, we found a similar
   number or, in most cases, more DEGs in the comparisons within the DEET
   database compared to the original studies (Table [199]1). Differences
   in alignment, gene counting and normalization, and differential
   analysis all influence DEG detection and dispersion estimates, thus
   impacting the total number of DEGs. In particular, DEET-specific
   non-coding DEG detection partially explains why DEET detects more DEGs
   than many of the original comparisons. Specifically, DEET-specific
   DEGs are, on average, 6.8× (0.65–36.3) more likely to be non-coding
   genes than DEGs shared between the DEET database and the original
   study ([200]Supplementary Figure S6). Overall, the automated
   differential expression pipeline used by DEET accurately captured the
   DEGs from their original studies.

DEET identifies relevant studies when applied to TNFa-mediated inflammation

   To demonstrate how DEET can be used to explore user-generated DEG
   lists, we took our lab's previously published analysis of human aortic
   endothelial cells (HAoEC) treated with the proinflammatory cytokine
   tumour necrosis factor-alpha (TNFa) ([201]54). TNFa stimulation
   activates the transcription factor complex NF-κB and drives rapid
   proinflammatory gene expression. This study has a 45-min post-TNFa
   treatment versus
   untreated comparison. Two DEG lists were generated: one conventional
   comparison looking at exonic RNA and another comparing intronic RNA
   (which can be used as a proxy for actively regulated genes ([202]54)).

   We applied DEET’s enrichment tool function to both the intronic- and
   exonic-calculated, TNFa-induced (upregulated) DEGs. We found that both
   intronic- and exonic-derived DEGs from Alizada et al. ([203]54)
   retrieve comparisons related to TNFa treatment and bacterial infection
   (Figure [204]2A, [205]Supplementary Figure S7). For example, the top 15
   most enriched studies from each list include studies measuring gene
   expression after <1 h of TNFa treatment (TNFa treatment to breast
   cancer cells for 40 min ([206]57) and TNFa treatment to neutrophils for
   1 h ([207]58)).

Figure 2.

   Summary of the gene-centric output of DEET’s function applied to
   upregulated DEGs after TNFa treatment in Human Aortic endothelial cells
   (HAoECs) for 45 min from Alizada et al. (2021). (A) Barplot of the top
   15 most enriched pairwise comparisons based on overlapping DEGs from
   intronic RNA-seq. Rows are different comparisons within DEET, and the
   barplot is the −log10(FDR-adjusted P-value) of gene set enrichment
   computed by ActivePathways. (B) Scatterplot of the log2(Fold-changes)
   of the upregulated DEGs in Alizada et al. (2021) from intronic RNA-seq
   (x-axis) versus the DEGs in SRP043378 between 0 (naive) and 6 h of
   FOXM1 inhibition (y-axis). Points are individual genes. Grey points are
   only DE in one study, purple points are DE in the same direction
   between studies, and orange points are DE in the opposite direction.
   For (A), comparisons annotated with a blue symbol are treatments of
   TNFa in different cell-lines. Comparisons annotated with a yellow
   symbol originate from infection and immune disorders studies.
   Comparisons annotated with an orange symbol originate from SRP043378,
   Gormally et al. (2014), which investigates differences in gene
   expression in MCF-7 cells after FOXM1 inhibition for 0 (naive), 3, 6 and
   9 h.

   One important motivation for using DEET is to facilitate identifying
   new connections between one's gene list and other studies that do not
   share a common experimental design. For example, the above DEET
   analysis of TNFa-treated endothelial cells returned a methods-based
   study looking at the effect of overexpressing NF-κB subunits RELA and
   NFKB1 in HEK293 cells ([210]59) and another study of macrophages
   infected with Mycobacterium abscessus ([211]60). We also retrieved a
   study that, at first glance, did not contain an obvious connection to
   proinflammatory gene responses but rather investigated differences in
   gene expression after FOXM1 inhibition in MCF7 breast cancer cells for
   0 (naive) versus 6 h ([212]42) (Figure [213]2A). We found that
   overlapping DEGs had correlated fold-changes (exonic RNA-seq DEGs: 93
   same-sign DEGs, R^2 = 0.595, FDR-adjusted P-value = 7.60 × 10^−7,
   intronic RNA-seq DEGs: 153 same-sign DEGs, R^2 = 0.342, FDR-adjusted
   P-value = 0.0534) ([214]Supplementary Figure S7, Figure [215]2B). While
   FOXM1 is often studied as a transcription factor that plays a role in
   proliferation and differentiation ([216]42), previous studies link
   FOXM1 to TNF signaling through extensive chromatin co-localization of
   FOXM1 and NF-κB ([217]61). Accordingly, while FOXM1 binding was not
   directly studied in ([218]54), the biological processes have previously
   been linked ([219]61). These 153 overlapping genes significantly enrich
   the ‘TNFa signaling via NFkB’ hallmark gene set (54 genes, FDR-adjusted
   P-value = 2.175 × 10^−66). Investigating what genes overlap between the
   inputted genes and enriched comparisons is particularly important
   because genes within our database may be DE due to technical reasons
   such as collagenase digestion and RNA processing times in short-term
   experiments ([220]29,[221]62). Accordingly, even enriched comparisons
   with a similar experimental design should be scrutinized at the
   overlapping gene level before concluding that enrichment is due to
   shared biology. Lastly, DEET is also designed to identify significantly
   associated comparisons based on overlapping pathway and TF-target terms
   obtained from user-submitted DEG lists. Using the above NF-κB DEG list
   and associated pathway and TF-target terms, we identified additional
   DEG comparisons within the DEET dataset driven by pathway terms ‘TNFa
   signaling via NF-κB’, ‘Response to lipopolysaccharide’, and ‘response
   to molecule of bacterial origin’ (Figure [222]3A, [223]B). Together,
   DEET can identify diverse, biologically relevant studies, and by
   interpreting the shared DEGs between these studies, users can generate
   hypotheses as to why these gene lists are linked, thus aiding data
   interpretation.
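
   Scrutinizing an enriched comparison at the overlapping-gene level
   starts by sorting genes into same-direction, opposite-direction and
   study-specific groups, as coloured in Figure 2B. The Python sketch
   below is a minimal illustration with hypothetical names and toy
   fold-changes. It assumes that each input holds only DEGs with a
   non-zero log2(Fold-change).

```python
def classify_overlap(user_fc, study_fc):
    """Both arguments map gene -> log2 fold-change, DEGs only.

    Returns (same_direction, opposite_direction, de_in_one_list_only).
    """
    same, opposite = [], []
    for gene in sorted(set(user_fc) & set(study_fc)):
        if user_fc[gene] * study_fc[gene] > 0:
            same.append(gene)      # DE in the same direction in both lists
        else:
            opposite.append(gene)  # DE in opposite directions
    only_one = sorted(set(user_fc) ^ set(study_fc))
    return same, opposite, only_one


# Toy fold-changes for illustration only.
user_degs = {"G1": 2.0, "G2": 1.5, "G3": 0.8}
study_degs = {"G1": 1.0, "G2": -0.7, "G4": 2.2}
same, opposite, only_one = classify_overlap(user_degs, study_degs)
```

   The same-direction genes are the natural input for a follow-up
   enrichment or correlation of fold-changes between the two lists.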

Figure 3.

   Summary of the pathway- and transcription-factor-centric output of
   DEET’s function applied to upregulated DEGs after TNFa treatment in
   Human Aortic endothelial cells (HAoECs) for 45 min from Alizada
   et al. (2021). (A) Barplot of the top 10 most enriched pairwise
   comparisons based on overlapping biological pathways from intronic
   RNA-seq. Rows are different comparisons within DEET, and the barplot is
   the −log10(FDR-adjusted P-value) of path-set enrichment. (B) Barplot
   of the top 10 most enriched pairwise comparisons based on overlapping
   TFs from intronic RNA-seq. Rows are different comparisons within DEET,
   and the barplot is the −log10(FDR-adjusted P-value) of the TF-set.
   For (A) and (B), comparisons annotated with a blue symbol are
   treatments of TNFa in different cell-lines. Comparisons annotated with
   a yellow symbol originate from infection and immune disorders studies.
   Comparisons annotated with an orange symbol originate from SRP043378,
   Gormally et al. (2014), which investigates differences in gene
   expression in MCF-7 cells after FOXM1 inhibition for 0 (naive), 3, 6 and
   9 h.

   To further demonstrate the potential use of DEET and to provide an
   example where DEET was able to reveal novel biological insights that
   might be missed by traditional pathway enrichment analysis, we queried
   the list of genes downregulated after TNFa treatment. Such
   downregulated genes are known to have a weaker signal than upregulated
   genes and are often related to genes involved in cell-type-specific
   processes ([226]63). We identified seven enriched comparisons using
   downregulated genes identified by integrating exonic and intronic DEGs
   ([227]40). Interestingly, one comparison investigated breast cancer
   cells with both estradiol and TNFa treatment for 40 min ([228]57) (6
   genes, FDR-adjusted P-value = 5.34 × 10^−4), and another
   investigated the ‘11–18’ lung adenocarcinoma cell line after
   pharmacological activation and inactivation of NF-κB ([229]64) (5
   genes, FDR-adjusted P-value = 2.66 × 10^−3) ([230]Supplementary Figure
   S8). In contrast, traditional pathway enrichment ([231]65) only
   identified pathways related to cell-lineage specificity
   ([232]Supplementary Figure S8). We then investigated whether the six
   overlapping genes between Alizada et al.’s ([233]54) downregulated
   genes and SRP044608 (estradiol + TNFa treatment) ([234]57) have been
   previously linked to TNFa in the literature. Two overlapping genes,
   TXNIP ([235]66) and SMAD7 ([236]67), are negatively correlated with
   TNFa treatment, whereas the response of the other genes to TNFa varied
   with the biological context ([237]68–72).

DEET identifies individual gene-gene associations across datasets

   Lastly, DEGs that show correlated expression changes across different
   conditions are more likely to be part of the same biological pathway
   and undergo shared gene regulation ([238]3,[239]73). We can leverage
   the associations of fold-changes between genes across all the
   comparisons in the DEET database to identify genes that may be under
   the same regulation. Specifically, the DEET_feature_extract() function
   detects genes associated with an input variable that can be assigned to
   every comparison (e.g. a gene of interest, whether the comparison
   investigated cancer, etc.) using an elastic net regression ([240]48) in
   conjunction with correlation analysis.

   We applied the elastic net component of the DEET_feature_extract()
   function to every gene within the DEET database that was detected (not
   necessarily DE) in at least 70% of studies to identify which genes best
   predicted the fold-changes of other genes in our database. Then, we
   built a gene-by-gene matrix populated by how well the fold-change of a
   gene in column ‘j’ predicts the expression of the gene in row ‘i’ (see
   Supplementary Data). Accordingly, by summing the columns of this
   matrix, we can identify which genes are the best predictors of
   differential expression. As expected, we found that the top 1% of the
   best predictors of differential expression enriched for the
   ‘DNA-binding transcription factor activity, RNA polymerase II-specific’
   gene ontology (34 genes, FDR-adjusted P-value = 3.63 × 10^–7) more
   strongly than any other gene set ([241]Supplementary Figure S9A).
   Furthermore, the most predictive genes do not overlap with the top 1%
   of genes in DEET’s ‘DE prior’ and, overall, the two rankings are not
   correlated, reflecting that how well a gene predicts differential
   expression is independent of how frequently the gene is differentially
   expressed ([242]Supplementary Figure S9B).
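
   The column-sum ranking described above can be sketched in a few lines
   of Python. This is a minimal illustration, assuming a small square
   matrix of non-negative predictive scores; the gene names, the scores
   and the rank_predictors() helper are hypothetical and are not taken
   from the DEET database or package.

```python
def rank_predictors(genes, score):
    """Rank genes by how well they predict all other genes.

    score[i][j] holds how well the fold-change of gene j (column)
    predicts gene i (row). Summing a column gives the overall
    predictive strength of that gene.
    """
    n = len(genes)
    col_sums = [sum(score[i][j] for i in range(n)) for j in range(n)]
    order = sorted(range(n), key=lambda j: col_sums[j], reverse=True)
    return [(genes[j], col_sums[j]) for j in order]


# Made-up scores for three hypothetical genes (rows and columns share the same order).
genes = ["TF_1", "GENE_2", "GENE_3"]
score = [
    [0.0, 0.25, 0.0],
    [0.5, 0.0, 0.5],
    [0.75, 0.0, 0.0],
]

ranking = rank_predictors(genes, score)
for gene, total in ranking:
    print(gene, total)
```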

   To showcase the results of feature extraction for a single gene, we
   looked for genes whose fold changes are correlated with that of the
   TNFα-encoding gene TNF. The 17 genes retrieved by DEET were enriched
   for ‘TNFα signaling via NF-κB’ more than any other gene ontology
   (FDR-adjusted P-value = 3.71 × 10^−11) ([243]Supplementary Figure S9C)
   and included well-known TNFα signaling genes NFKBIA (rank 2) and SEMA4A
   (rank 6) ([244]Supplementary Figure S9C). Interestingly, the
   top-ranked gene was CCDC7 ([245]Supplementary Figure S9D), a gene that
   is not annotated as a hallmark of TNFα signaling. Supporting the
   relevance of this hit, CCDC7 has been shown to simultaneously activate
   interleukin-6 and the vascular endothelial growth factor ([246]74),
   which TNFα can also do ([247]75–77). Notably, comparisons within the
   DEET database where both CCDC7 and TNF are DE did not include studies
   investigating short-term TNFα treatment. Instead, they included studies
   involving tumour versus non-tumour, bacterial infection and Crohn's
   disease. Together, this vignette demonstrates how DEET can be used to
   obtain meaningful information from DEG comparisons made from uniformly
   processed public RNA-seq data.
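
   To make the single-gene query concrete, the sketch below ranks genes
   by the correlation of their fold-changes with those of a target gene
   across comparisons. It is an illustration under stated assumptions,
   not DEET's procedure: Pearson correlation, the 0.8 cutoff, the
   minimum of three shared comparisons, the gene names and the values
   are all choices made for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlated_genes(fold_changes, target, min_abs_r=0.8, min_shared=3):
    """Return (gene, r) pairs sorted by |r|, strongest first.

    fold_changes maps a gene to its log2 fold-change in each comparison,
    with None where the gene was not detected.
    """
    hits = []
    for gene, values in fold_changes.items():
        if gene == target:
            continue
        # Keep only comparisons in which both genes were detected.
        pairs = [(a, b) for a, b in zip(fold_changes[target], values)
                 if a is not None and b is not None]
        if len(pairs) < min_shared:
            continue
        r = pearson([a for a, _ in pairs], [b for _, b in pairs])
        if abs(r) >= min_abs_r:
            hits.append((gene, r))
    return sorted(hits, key=lambda h: abs(h[1]), reverse=True)


# Made-up fold-changes across five comparisons for hypothetical genes.
fold_changes = {
    "TARGET": [1.0, 2.0, 3.0, 4.0, None],
    "GENE_UP": [2.0, 4.0, 6.0, 8.0, 1.0],
    "GENE_DOWN": [-1.0, -2.0, -2.5, -4.0, 0.5],
    "GENE_FLAT": [0.5, -0.5, 0.5, -0.5, 0.0],
}

hits = correlated_genes(fold_changes, "TARGET")
for gene, r in hits:
    print(gene, round(r, 3))
```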

DISCUSSION

   DEET allows users to compare their DE gene lists to a curated atlas
   of 3162 DEG comparisons originating from GTEx ([248]16), TCGA
   ([249]14) and studies within SRA ([250]78). We envision DEET being used
   alongside established and emerging tools that leverage uniformly
   processed data to allow users to discover biological patterns within
   their RNA-seq data (e.g. ([251]7,[252]26,[253]27)).

   A major challenge for implementing a tool like DEET, which investigates
   differential gene expression results in public data ([254]29), lies in
   the scalability and consistency of publicly available metadata. We were
   able to build the DEET database because the PhenoPredict ([255]19) tool
   annotated necessary metadata across every sample within SRA. However,
   there was considerable manual curation and study filtering even with
   this consistent annotation. The first major way to improve these
   annotations is through the continued development and use of metadata
   prediction algorithms like PhenoPredict ([256]19) and of algorithms
   that automatically process existing metadata within SRA ([257]8), as in
   MetaSRA ([258]23) and ffq ([259]https://github.com/pachterlab/ffq).
   The second major way to improve these annotations will be through
   community- and consortium-driven manual annotation of metadata, such
   as with the Biostudies and GEOMetaCuration tools
   ([260]79,[261]80). In the context of
   differential analysis, allowing researchers to report which variables
   are the experimental, stratifying, blocking, and covariate variables
   will be invaluable for tools like DEET to encompass larger uniformly
   processed datasets such as those provided by RNASeq-er ([262]80),
   recount3 ([263]10), ARCHS4 ([264]12) and refine.bio
   ([265]https://www.refine.bio/), which collectively contain more RNA-seq
   studies from human and non-human species ([266]10,[267]12).

   Including model organism studies in differential gene expression
   databases is of great value given the greater diversity and controlled
   nature of study designs (i.e. tissue types, experimental variables,
   genetic backgrounds), which are not possible for human studies. In
   addition, public RNA-seq from model organisms will contain many
   smaller-scale, hypothesis-driven experiments compared to TCGA and
   GTEx. Future developments of DEET would extend its database to
   searchable, consistently analyzed, and curated differential expression
   analyses collected from multiple species in Expression Atlas ([268]27).
   Lastly, extending DEET to be able to search differential comparisons
   derived from consistent experiments beyond RNA-seq would be a logical
   next step to harness ongoing efforts for systematic analysis of public
   data from different genomic techniques such as scRNA-seq
   ([269]20,[270]81), accessible chromatin profiling (ATAC-seq/DNAse-seq)
   ([271]82,[272]83), and protein-DNA interactions mapping (ChIP-seq and
   in the future CUT&RUN/TAG) ([273]84–87). In summary, by allowing users
   to rapidly connect their gene lists to a curated set of uniformly
   processed differential gene expression analyses, tools like DEET will
   facilitate access to the treasure trove of public RNA-seq data.

DATA AVAILABILITY

   Code and data to regenerate the figures within this dataset can be
   found at figshare
   ([274]https://doi.org/10.6084/m9.figshare.20427774.v2). Code and data
   to rebuild the DEET database can be found at figshare; however,
   dbGaP-protected data are excluded from this code
   ([275]https://doi.org/10.6084/m9.figshare.20425464.v1). A stable
   dataset of the DEET database at the time of submission can be found on
   Zenodo ([276]https://zenodo.org/record/7321664#.Y5j203bMI2w). The
   developmental dataset of the DEET database can be found at
   ([277]http://wilsonlab.org/public/DEET_data).

Supplementary Material

   lqad003_Supplemental_Files (5.7MB, zip)

ACKNOWLEDGEMENTS