Source: https://github.com/markziemann/combined_enrichment

Background

  • Functional enrichment analysis is a very frequently used approach to understand the broader trends within omics data, especially gene expression microarrays and RNA-seq.

  • There are many algorithms, but ORA and FCS appears to be the most popular.

  • FCS is an approach whereby all genes are given a score depending on some differential expression metric, then an algorithm evaluates whether members of a gene set are collectively more up- or down-regulated.

  • In a general sense, ORA is conducted first by selecting a set of genes of interest and then determining with a statistical test whether the genes of interest are over-represented for particular functional annotations.

  • In the literature, ORA appears to be conducted in two ways. Either by considering all differentially regulated genes as the genes of interest in a single test. Or by considering up- and down-regulated genes in separate ORA tests.

  • On the face of it, these two approaches are valid, however they address different null hypotheses. The former is simply a test for dysregulation irrespective of fold change direction. The latter is a directional test.

  • There is a common view that the combined analysis is the right approach as a gene set could contain a mixture of up and down-regulated members. Such gene sets could be missed with the separate approach.

  • On the other hand, it has been already shown that due to the correlated nature of genes in a gene set, separate analysis of up and downregulated genes is more sensitive than combined analysis.

  • The popularity of gene set analysis methods does not correlated with performance. Rather, popularity is inflenced by first-mover advantage and ease of use [2].

Aims and hypothesis

  • Here we investigate the apparent rift between method performance and popularity with a combination of approaches.

  • Firstly, we survey how common it is to conduct combined or separate ORA analysis.

  • Second, using the Illumina bodymap expression dataset we will examine the correlation structure of genes in a set.

  • Next, with seven independent RNA-seq datasets, we ask whether it is worthwhile to conduct combined gene set analysis.

  • Lastly, we conduct a survey asking researchers about the reasons for selecting the methods in their articles, which should highlight reasons for why sub-optimal methods are used.

Methods

Survey of published enrichment practices

Previously, we compiled a set of 186 open-access articles with terms “enrichment analysis”, “pathway analysis” or “ontology analysis” in the abstract [3]. From this set of articles, we selected subsetted for articles that conducted differential expression analysis. These were examined for whether functional enrichment analysis was conducted based on combined, separate or both.

Examination of correlation structure in RNA-seq data

Illumina bodymap data was downloaded from the DEE2 database with the getDEE2 R package. Genes with fewer than 10 reads on average across the samples set were discarded. Gene expression correlation analysis (Spearman) was conducted for each gene set with 10 or members in the dataset. These correlation values were compared, to correlation values derived from a set of randomly selected genes using a t-test.

Impact of combined or separate enrichment analysis on real RNA-seq data

To seven publicly available RNA-seq datasets were downloaded from DEE2 on 27th January 2022 [4]. Transcript level counts were aggregated to genes using the getDEE2 R package v1.2.0. Next, genes with an average of less than 10 reads per sample were omitted from downstream analysis. Differential expression statistical analysis was conducted with DESeq2 v1.32.0 [5] to identify genes altered by high glucose exposure. For gene set analysis, human Reactome gene sets [6] were downloaded in GMT format from the Reactome website (accessed 27th January 2022).

Over-representation analysis was conducted using clusterProfiler R package (v4.0.5) enricher function that implements a hypergeometric test [34]. No fold-change threshold was used to select genes for ORA. The background gene set consisted of all detected genes.

FCS was performed using the mitch R package v1.4.1 with default settings, which uses a rank-ANOVA statistical test [10]. Differentially expressed genes with FDR<0.05 were used for ORA analysis using the clusterProfiler R package (v4.0.5) enricher function that implements a hypergeometric test [34].

For genes and gene sets, a false discovery rate adjusted p-value (FDR) of 0.05 was considered significant. Analyses were conducted in R version 4.1.2. Details of the contrasts examined are shown in Table 1.

SRA accession and citation Control datasets Case datasets Genes detected Genes differentially expressed
SRP128998 [31] GSM2932797 GSM2932798 GSM2932799 GSM2932791 GSM2932792 GSM2932793 15635 3472
SRP038101 [35] GSM1329862 GSM1329863 GSM1329864 GSM1329859 GSM1329860 GSM1329861 13926 3589
SRP037718 [36] GSM1326472 GSM1326473 GSM1326474 GSM1326469 GSM1326470 GSM1326471 15477 9488
SRP096177 [37] GSM2448985 GSM2448986 GSM2448987 GSM2448982 GSM2448983 GSM2448984 15607 5150
SRP247621 [38] GSM4300737 GSM4300738 GSM4300739 GSM4300731 GSM4300732 GSM4300733 14288 230
SRP253951 [39] GSM4462339 GSM4462340 GSM4462341 GSM4462336 GSM4462337 GSM4462338 15182 8588
SRP068733 [40] GSM2044431 GSM2044432 GSM2044433 GSM2044428 GSM2044429 GSM2044430 14255 7365

Survey of methodological choices by researchers

We sent a questionnaire to authors of the 186 open access articles selected above and asked the following questions:

  • In your 2019 article, what tool did you use?

  • Did you select this tool based on its previous popularity in similar journals?

  • Did you select this tool based on its performance in benchmarking studies?

  • Did you select this tool based on its ease of use?

  • What other motivators prompted you to use this tool?

  • Are you aware of the benchmarked performance of your chosen tool against some other methods?

  • Do you think it’s important to use a custom background (reference) gene list?

  • Do you think it’s important to perform false discovery rate correction for multiple tests?

  • For your 2019 article, did you perform separate analysis of up- and down-regulated genes, or combined?

  • Why did you choose this approach?

  • How long have you been using bioinformatics tools?

  • Would you consider yourself a beginner in computational biology or a seasoned veteran?

A similar questionnaire was circulated on social media to users of ORA tools such as DAVID, PANTHER, KOBAS and clusterProfiler, asking about their last

  • Over the last three years, which ORA tool have you used the most in your publications?

Results

Most ORA are conducted using a combined set of up- and down-regulated genes

From 109 selected open-articles that performed ORA on differential expression data, 28 (25%) considered the up- and down-regulated gene lists in separate tests, while 76 (69%) performed combined analysis of up- and downregulated genes (Fig 1). Five of the 109 studies did not describe whether the gene lists were combined or not. This finding outlines that ORA with combined gene set is much more common than separated analysis.

Members of curated gene sets are positively correlated

Combined analysis is a problem if genes in a set are highly correlated. Hong demonstrated this correlation by showing that genes in a set were more likely to show imbalanced gene regulation direction. We thought we would investigate this further by looking at correlation structure of Reactome gene set members in the Illumina Bodymap RNA-seq dataset. There were 21,991 genes above our detection threshold across these 48 samples. Spearman correlation coefficients (rho) were calculated between genes in each set. For comparison, rho was also calculated for members of a set of randomly selected genes of equal size. The t-test statistic distribution indicates a strong positive skew, suggesting some degree of correlation for most gene sets (Fig 2). When the t-test p-values were considered, 1132 gene sets were indicative of positive correlation (p<0.05), 65 showed negative correlation (p<0.05) and 332 were neutral (p>0.05). RNA polymerase II transcription is shown as an example of a positively correlated set. Transport of small molecules is shown as an example of a negatively correlated set. This data shows that it is more common for members of a gene set to be positively correlated.

Combined analysis of up- and down-regulated genes severely impairs sensitivity of ORA tests

We examined seven independent differential RNA-seq experiments, each with 230 to 9488 differentially expressed genes. ORA was conducted both with combined and separated gene lists, revealing that separated gene list analysis consistently detected a larger proportion of differentially regulated gene sets (Fig 3). Nevertheless there was a small number of gene sets which were specific to the combined analysis.

We also performed FCS for these same profiles, which consistently yielded more statistically significant sets than with ORA. Of note, the combined analysis with FCS identified a larger proportion of gene sets as compared to ORA combined.

These results confirm that ORA with separate analysis of up- and down-regulated gene lists is more sensitive than the combined approach. Furthermore, FCS was more sentitive in all tests conducted.

Discussion

  • This finding outlines that ORA with combined gene set is much more common than separated analysis.

  • This data shows that it is more common for members of a gene set to be positively correlated.

  • These results confirm that ORA with separate analysis of up- and down-regulated gene lists is more sensitive than the combined approach.

ORA methodological choices by researchers are not driven by benchmarking performance. More likely they are driven by precidence and ease of use.

References

  1. Hong G, Zhang W, Li H, Shen X, Guo Z. Separate enrichment analysis of pathways for up- and downregulated genes. J R Soc Interface. 2013 Dec 18;11(92):20130950. doi: 10.1098/rsif.2013.0950. PMID: 24352673; PMCID: PMC3899863.

  2. Xie C, Jauhari S, Mora A. Popularity and performance of bioinformatics software: the case of gene set analysis. BMC Bioinformatics. 2021 Apr 15;22(1):191. doi: 10.1186/s12859-021-04124-5. PMID: 33858350; PMCID: PMC8050894.

  3. Wijesooriya K, Jadaan SA, Perera KL, Kaur T, Ziemann M. Guidelines for reliable and reproducible functional enrichment analysis. bioRxiv 2021.09.06.459114; doi: https://doi.org/10.1101/2021.09.06.459114