Source: https://github.com/markziemann/enrichment_recipe

Introduction

This guide is a Rmarkdown script that conducts differential expression and enrichment analysis, which are very popular workflows for transcriptome data.

The goal of this work is to understand how ClusterProfiler manages the background list, as we have observed some weird behaviour. In particular we see that when we provide a custom background, those genes do not appear in the final analysis unless they are also included in the gene list.

In the code chunk below called libs, you can add and remove required R library dependancies. Check that the libraries listed here match the Dockerfile, otherwise you might get errors.

suppressPackageStartupMessages({
  library("DESeq2")
  library("kableExtra")
  library("clusterProfiler")
  library("fgsea")
  library("eulerr")
  library("gplots")
  library("IlluminaHumanMethylation450kanno.ilmn12.hg19")
})

For this report I will be using proteomics data from Stukalov et al (https://pubmed.ncbi.nlm.nih.gov/33845483/), who profiled the effect of SARS-COV2 infection on A549 cells.

Here we used the 24hr profile, as there were relatively more changes found.

Probable contaminants were removed, as well as viral proteins. This left 5825 proteins, including 165 downregulated and 44 upregulated.

The data were obtained from Supplementary Table 5 and saved to the /data folder.

x <- read.table("../data/27651444_smoking_current_vs_never_smoking_16938071371108067.tsv",header=TRUE,sep="\t")

Get background gene list.

anno = getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
bg <- anno$UCSC_RefGene_Name
bg <- unique(unlist(strsplit(bg,";")))
length(bg)
## [1] 21231

Get up and down-methylated list.

defup <- subset(x,beta>0)$gene
defup <- defup[defup!="-"]
defup <- unique(unlist(strsplit(defup,";")))
length(defup)
## [1] 2137
defdn <- subset(x,beta<0)$gene
defdn <- defdn[defdn!="-"]
defdn <- unique(unlist(strsplit(defdn,";")))
length(defdn)
## [1] 2786

Gene sets. Get the gene sets loaded in R. These are KEGG gene sets

# from https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp 16th June 2023
genesets2 <- read.gmt("../ref/c2.cp.kegg.v2023.1.Hs.symbols.gmt")
gsets <- gmtPathways("../ref/c2.cp.kegg.v2023.1.Hs.symbols.gmt")

message(paste("number of genes described in the annotation set:",length(unique(genesets2$gene))))
## number of genes described in the annotation set: 5244

KEGG pathway analysis

In the next sections I will run enrichment analysis with over-representation analysis (ORA) test and compare it to functional class scoring. I will also investigate some strange behviour of the ORA tool clusterprofiler.

ORA with Clusterprofiler custom background with otherwise default analysis

I’ve compiled a reporting checklist:

Reporting criteria Method/resource used
Origin of gene sets KEGG (2023-06-16)
Tool used ClusterProfiler (check version at foot of report)
Statistical test used hypergeometric test
P-value correction for multiple comparisons FDR method
Background list Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria DESeq2 FDR<0.05
ORA directionality Separate tests for up- and down-regulation
Data availability Stukalov et al 2021: PMID 33845483
Other parameters Min gene set size of 10

Here I provide a custom background gene list.

Enrichment firstly with upregulated genes

n_pw=10

ora_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_up$geneID <- NULL
ora_up <- subset(ora_up,p.adjust<0.05 & Count >=5)
ora_ups <- rownames(ora_up)

gr <- as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",2))

ora_up$es <- gr/br
ora_up <- ora_up[order(-ora_up$es),]
ora_up$Description=NULL

head(ora_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)
Top upregulated pathways in LPS treatment
ID GeneRatio BgRatio pvalue p.adjust qvalue Count es
KEGG_HEMATOPOIETIC_CELL_LINEAGE 22/556 82/4904 0.0000742 0.0129138 0.0117185 22 2.366380
KEGG_FOCAL_ADHESION 38/556 195/4904 0.0004755 0.0413714 0.0375421 38 1.718797
topup2 <- rev(head(ora_up$es,10))
names(topup2) <- rev(head(ora_up$ID,10))

The reported background size does not match the dataset.

Now repeat with the downregulated genes

ora_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_dn$geneID <- NULL
ora_dn <- subset(ora_dn,p.adjust<0.05 & Count >=5)
ora_dns <- rownames(ora_dn)

gr <- as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",2))

ora_dn$es <- gr/br
ora_dn <- ora_dn[order(-ora_dn$es),]
ora_dn$Description=NULL

head(ora_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)
Top downregulated pathways in LPS treatment
ID GeneRatio BgRatio pvalue p.adjust qvalue Count es
KEGG_RIBOSOME 31/751 87/4904 0.0000020 0.0003566 0.0002965 31 2.326767
KEGG_PATHOGENIC_ESCHERICHIA_COLI_INFECTION 18/751 53/4904 0.0005646 0.0330185 0.0274496 18 2.217722
KEGG_CHRONIC_MYELOID_LEUKEMIA 22/751 73/4904 0.0009380 0.0330185 0.0274496 22 1.967933
KEGG_REGULATION_OF_ACTIN_CYTOSKELETON 49/751 208/4904 0.0009285 0.0330185 0.0274496 49 1.538308
KEGG_PATHWAYS_IN_CANCER 71/751 320/4904 0.0004751 0.0330185 0.0274496 71 1.448835
KEGG_MAPK_SIGNALING_PATHWAY 58/751 262/4904 0.0016565 0.0485894 0.0403943 58 1.445564
topdn2 <- head(ora_dn$es,n_pw)
names(topdn2) <- head(ora_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE KEGGs",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA test")
}

ORA with Clusterprofiler with custom background and special gene set list to fix the potential bug

I’ve compiled a reporting checklist:

Reporting criteria Method/resource used
Origin of gene sets KEGG (2023-06-16)
Tool used ClusterProfiler (check version at foot of report)
Statistical test used hypergeometric test
P-value correction for multiple comparisons FDR method
Background list Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria DESeq2 FDR<0.05
ORA directionality Separate tests for up- and down-regulation
Data availability Data availability
Other parameters Min gene set size of 10 + BUG FIX

Here I provide a background gene list for cluterprofiler to use, like a user should.

I have also done some modification to the KEGG gene sets, to en

Now filter the gene names into three lists, up-regulated, down-regulated and background. Background is simply all genes that were detected.

Adding all detected genes to the background appears to improve results!

bgdf <- data.frame("background",bg)
colnames(bgdf) <- c("term","gene")
genesets2 <- rbind(genesets2,bgdf)

Enrichment firstly with upregulated genes

n_pw=10

orafix_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_up$geneID <- NULL
orafix_up <- subset(orafix_up,p.adjust<0.05 & Count >=5)
orafix_ups <- rownames(orafix_up)

gr <- as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",2))

orafix_up$es <- gr/br
orafix_up <- orafix_up[order(-orafix_up$es),]
orafix_up$Description=NULL

head(orafix_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)
Top upregulated pathways in LPS treatment
ID GeneRatio BgRatio pvalue p.adjust qvalue Count es
KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION 12/2137 44/21231 0.0010285 0.0255655 0.0219619 12 2.709533
KEGG_HEMATOPOIETIC_CELL_LINEAGE 22/2137 82/21231 0.0000135 0.0023442 0.0020138 22 2.665476
KEGG_AXON_GUIDANCE 26/2137 127/21231 0.0003362 0.0146231 0.0125619 26 2.033928
KEGG_FOCAL_ADHESION 38/2137 195/21231 0.0000517 0.0045008 0.0038664 38 1.936043
KEGG_REGULATION_OF_ACTIN_CYTOSKELETON 39/2137 208/21231 0.0000996 0.0057756 0.0049615 39 1.862804
KEGG_CHEMOKINE_SIGNALING_PATHWAY 32/2137 185/21231 0.0016460 0.0358010 0.0307546 32 1.718479
topup2 <- rev(head(orafix_up$es,10))
names(topup2) <- rev(head(orafix_up$ID,10))

Now with the downregulated genes

orafix_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_dn$geneID <- NULL
orafix_dn <- subset(orafix_dn,p.adjust<0.05 & Count >=5)
orafix_dns <- rownames(orafix_dn)

gr <- as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",2))

orafix_dn$es <- gr/br
orafix_dn <- orafix_dn[order(-orafix_dn$es),]
orafix_dn$Description=NULL

head(orafix_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)
Top downregulated pathways in LPS treatment
ID GeneRatio BgRatio pvalue p.adjust qvalue Count es
KEGG_RIBOSOME 31/2786 87/21231 0.0000001 0.0000137 0.0000099 31 2.715387
KEGG_TYPE_I_DIABETES_MELLITUS 14/2786 41/21231 0.0004743 0.0092754 0.0067125 14 2.602157
KEGG_PATHOGENIC_ESCHERICHIA_COLI_INFECTION 18/2786 53/21231 0.0000838 0.0029482 0.0021336 18 2.588129
KEGG_GRAFT_VERSUS_HOST_DISEASE 12/2786 37/21231 0.0019706 0.0165151 0.0119517 12 2.471547
KEGG_ALLOGRAFT_REJECTION 11/2786 35/21231 0.0039553 0.0248619 0.0179922 11 2.395047
KEGG_CHRONIC_MYELOID_LEUKEMIA 22/2786 73/21231 0.0001115 0.0032705 0.0023668 22 2.296620
KEGG_PANCREATIC_CANCER 20/2786 69/21231 0.0003989 0.0087753 0.0063506 20 2.208870
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY 19/2786 66/21231 0.0006059 0.0104657 0.0075738 19 2.193810
KEGG_TYPE_II_DIABETES_MELLITUS 13/2786 46/21231 0.0050390 0.0305816 0.0221314 13 2.153649
KEGG_ADHERENS_JUNCTION 19/2786 68/21231 0.0009063 0.0106334 0.0076952 19 2.129286
topdn2 <- head(orafix_dn$es,n_pw)
names(topdn2) <- head(orafix_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE KEGGs",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA fix")
}

Compare pathway results

v1 <- list("ORA up"=ora_ups, "ORA fix up"=orafix_ups)
v2 <- list("ORA dn"=ora_dns, "ORA fix dn"=orafix_dns)
v3 <- list("ORA"=union(ora_ups,ora_dns),
  "ORA fix"=union(orafix_ups,orafix_dns))

par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2),main="upregulated KEGG pathways")

plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2),main="downregulated KEGG pathways")

plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated KEGG pathways")

svg("prot_kegg_euler.svg")
plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated KEGG pathways")
dev.off()
## png 
##   2

Jaccard index comparing ORA and ORA fix.

jaccard <- function(a, b) {
    intersection = length(intersect(a, b))
    union = length(a) + length(b) - intersection
    return (intersection/union)
}

ora <- c(ora_ups,ora_dns)
orafix <- c(orafix_ups,orafix_dns)

jaccard(ora,orafix)
## [1] 0.195122

List gene sets identified only in the “fixed” analysis.

ora <- union(ora_ups,ora_dns)

orafix <- union(orafix_ups,orafix_dns)

setdiff(orafix,ora)
##  [1] "KEGG_AXON_GUIDANCE"                           
##  [2] "KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION" 
##  [3] "KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION"
##  [4] "KEGG_CHEMOKINE_SIGNALING_PATHWAY"             
##  [5] "KEGG_ENDOCYTOSIS"                             
##  [6] "KEGG_PANCREATIC_CANCER"                       
##  [7] "KEGG_TYPE_I_DIABETES_MELLITUS"                
##  [8] "KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY"         
##  [9] "KEGG_NEUROTROPHIN_SIGNALING_PATHWAY"          
## [10] "KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION"     
## [11] "KEGG_ADHERENS_JUNCTION"                       
## [12] "KEGG_VIRAL_MYOCARDITIS"                       
## [13] "KEGG_INSULIN_SIGNALING_PATHWAY"               
## [14] "KEGG_MELANOMA"                                
## [15] "KEGG_RENAL_CELL_CARCINOMA"                    
## [16] "KEGG_FC_GAMMA_R_MEDIATED_PHAGOCYTOSIS"        
## [17] "KEGG_GRAFT_VERSUS_HOST_DISEASE"               
## [18] "KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY"       
## [19] "KEGG_PYRIMIDINE_METABOLISM"                   
## [20] "KEGG_JAK_STAT_SIGNALING_PATHWAY"              
## [21] "KEGG_ERBB_SIGNALING_PATHWAY"                  
## [22] "KEGG_NON_SMALL_CELL_LUNG_CANCER"              
## [23] "KEGG_PHOSPHATIDYLINOSITOL_SIGNALING_SYSTEM"   
## [24] "KEGG_ALLOGRAFT_REJECTION"                     
## [25] "KEGG_TYPE_II_DIABETES_MELLITUS"               
## [26] "KEGG_COLORECTAL_CANCER"                       
## [27] "KEGG_TGF_BETA_SIGNALING_PATHWAY"              
## [28] "KEGG_TIGHT_JUNCTION"                          
## [29] "KEGG_INOSITOL_PHOSPHATE_METABOLISM"

Reactome Gene sets

ORA with Clusterprofiler custom background with otherwise default analysis

I’ve compiled a reporting checklist:

Reporting criteria Method/resource used
Origin of gene sets Reactome (2023-03-06)
Tool used ClusterProfiler (check version at foot of report)
Statistical test used hypergeometric test
P-value correction for multiple comparisons FDR method
Background list Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria DESeq2 FDR<0.05
ORA directionality Separate tests for up- and down-regulation
Data availability Data availability
Other parameters Min gene set size of 10

Here I provide a custom background gene list.

Get the gene sets loaded in R. These are Reactome gene sets

genesets2 <- read.gmt("../ref/ReactomePathways_2023-03-06.gmt")
gsets <- gmtPathways("../ref/ReactomePathways_2023-03-06.gmt")

message(paste("number of genes described in the annotation set:",length(unique(genesets2$gene))))
## number of genes described in the annotation set: 11457

Enrichment firstly with upregulated genes

n_pw=10

ora_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_up$geneID <- NULL
ora_up <- subset(ora_up,p.adjust<0.05 & Count >=5)
ora_ups <- rownames(ora_up)

gr <- as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",2))

ora_up$es <- gr/br
ora_up <- ora_up[order(-ora_up$es),]
ora_up$Description=NULL

head(ora_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)
Top upregulated pathways in LPS treatment
ID GeneRatio BgRatio pvalue p.adjust qvalue Count es
NOTCH3 Intracellular Domain Regulates Transcription 10/1129 25/10174 0.0001856 0.0238237 0.0220051 10 3.604606
FCGR3A-mediated IL10 synthesis 13/1129 38/10174 0.0001364 0.0196945 0.0181911 13 3.082887
Vasopressin regulates renal water homeostasis via Aquaporins 14/1129 43/10174 0.0001395 0.0196945 0.0181911 14 2.933981
Signaling by NOTCH3 15/1129 48/10174 0.0001371 0.0196945 0.0181911 15 2.816098
RHOB GTPase cycle 17/1129 63/10174 0.0003666 0.0398149 0.0367755 17 2.431678
G alpha (12/13) signalling events 19/1129 73/10174 0.0002815 0.0331221 0.0305936 19 2.345463
topup2 <- rev(head(ora_up$es,10))
names(topup2) <- rev(head(ora_up$ID,10))

Now repeat with the downregulated genes

ora_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_dn$geneID <- NULL
ora_dn <- subset(ora_dn,p.adjust<0.05 & Count >=5)
ora_dns <- rownames(ora_dn)

gr <- as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",2))

ora_dn$es <- gr/br
ora_dn <- ora_dn[order(-ora_dn$es),]
ora_dn$Description=NULL

head(ora_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)
Top downregulated pathways in LPS treatment
ID GeneRatio BgRatio pvalue p.adjust qvalue Count es
Sema3A PAK dependent Axon repulsion 9/1484 16/10174 0.0001263 0.0052774 0.0047241 9 3.856385
FGFR2 alternative splicing 11/1484 25/10174 0.0003788 0.0126604 0.0113330 11 3.016550
Eukaryotic Translation Elongation 35/1484 94/10174 0.0000000 0.0000329 0.0000295 35 2.552690
Peptide chain elongation 32/1484 90/10174 0.0000006 0.0001049 0.0000939 32 2.437616
Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC) 34/1484 96/10174 0.0000003 0.0000694 0.0000621 34 2.428094
RUNX1 interacts with co-factors whose precise effect on RUNX1 targets is not known 13/1484 37/10174 0.0014930 0.0423680 0.0379260 13 2.408793
SARS-CoV-1 modulates host translation machinery 13/1484 37/10174 0.0014930 0.0423680 0.0379260 13 2.408793
Eukaryotic Translation Termination 33/1484 94/10174 0.0000005 0.0001049 0.0000939 33 2.406822
Viral mRNA Translation 31/1484 90/10174 0.0000018 0.0001971 0.0001764 31 2.361441
Nonsense Mediated Decay (NMD) enhanced by the Exon Junction Complex (EJC) 39/1484 114/10174 0.0000001 0.0000411 0.0000368 39 2.345404
topdn2 <- head(ora_dn$es,n_pw)
names(topdn2) <- head(ora_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE Reactomes",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA test")
}

ORA with Clusterprofiler with custom background and special gene set list to fix the potential bug

I’ve compiled a reporting checklist:

Reporting criteria Method/resource used
Origin of gene sets Reactome (2023-03-06)
Tool used ClusterProfiler (check version at foot of report)
Statistical test used hypergeometric test
P-value correction for multiple comparisons FDR method
Background list Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria DESeq2 FDR<0.05
ORA directionality Separate tests for up- and down-regulation
Data availability Data availability
Other parameters Min gene set size of 10 + BUG FIX

Adding all detected genes to the background appears to improve results!

bgdf <- data.frame("background",bg)
colnames(bgdf) <- c("term","gene")
genesets2 <- rbind(genesets2,bgdf)

Enrichment firstly with upregulated genes

orafix_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_up$geneID <- NULL
orafix_up <- subset(orafix_up,p.adjust<0.05 & Count >=5)
orafix_ups <- rownames(orafix_up)

gr <- as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",2))

orafix_up$es <- gr/br
orafix_up <- orafix_up[order(-orafix_up$es),]
orafix_up$Description=NULL

head(orafix_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)
Top upregulated pathways in LPS treatment
ID GeneRatio BgRatio pvalue p.adjust qvalue Count es
FCGR activation 6/2137 12/21231 0.0005580 0.0256074 0.0228126 6 4.967478
NOTCH3 Intracellular Domain Regulates Transcription 10/2137 25/21231 0.0000824 0.0089505 0.0079736 10 3.973982
FCGR3A-mediated IL10 synthesis 13/2137 38/21231 0.0000506 0.0064933 0.0057846 13 3.398801
Antigen activates B Cell Receptor (BCR) leading to generation of second messengers 10/2137 30/21231 0.0004734 0.0256074 0.0228126 10 3.311652
Transcriptional regulation of granulopoiesis 10/2137 30/21231 0.0004734 0.0256074 0.0228126 10 3.311652
Vasopressin regulates renal water homeostasis via Aquaporins 14/2137 43/21231 0.0000491 0.0064933 0.0057846 14 3.234637
topup2 <- rev(head(orafix_up$es,10))
names(topup2) <- rev(head(orafix_up$ID,10))

Now with the downregulated genes

n_pw=10

orafix_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_dn$geneID <- NULL
orafix_dn <- subset(orafix_dn,p.adjust<0.05 & Count >=5)
orafix_dns <- rownames(orafix_dn)

gr <- as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",2))

orafix_dn$es <- gr/br
orafix_dn <- orafix_dn[order(-orafix_dn$es),]
orafix_dn$Description=NULL

head(orafix_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)
Top downregulated pathways in LPS treatment
ID GeneRatio BgRatio pvalue p.adjust qvalue Count es
Sema3A PAK dependent Axon repulsion 9/2786 16/21231 0.0000545 0.0017439 0.0014072 9 4.286589
Signaling by Leptin 6/2786 11/21231 0.0012988 0.0244175 0.0197042 6 4.156693
CD28 dependent Vav1 pathway 6/2786 12/21231 0.0023099 0.0376382 0.0303729 6 3.810302
FGFR2 alternative splicing 11/2786 25/21231 0.0001474 0.0043483 0.0035089 11 3.353065
Ephrin signaling 8/2786 19/21231 0.0017106 0.0302683 0.0244257 8 3.208675
Growth hormone receptor signaling 9/2786 24/21231 0.0023232 0.0376382 0.0303729 9 2.857726
TNFR1-induced proapoptotic signaling 9/2786 24/21231 0.0023232 0.0376382 0.0303729 9 2.857726
Eukaryotic Translation Elongation 35/2786 94/21231 0.0000000 0.0000011 0.0000009 35 2.837459
TNFs bind their physiological receptors 10/2786 27/21231 0.0014960 0.0267850 0.0216147 10 2.822446
Signaling by Erythropoietin 9/2786 25/21231 0.0032121 0.0460101 0.0371288 9 2.743417
topdn2 <- head(orafix_dn$es,n_pw)
names(topdn2) <- head(orafix_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE Reactomes",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA fix")
}

Compare pathway results

v1 <- list("ORA up"=ora_ups, "ORA fix up"=orafix_ups)
v2 <- list("ORA dn"=ora_dns, "ORA fix dn"=orafix_dns)
v3 <- list("ORA"=union(ora_ups,ora_dns),
  "ORA fix"=union(orafix_ups,orafix_dns))

par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2),main="upregulated Reactomes")

plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2),main="downregulated Reactomes")

plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated Reactomes")

svg("prot_reactome_euler.svg")
plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated Reactomes")
dev.off()
## png 
##   2

Jaccard index comparing ORA and ORA fix.

jaccard <- function(a, b) {
    intersection = length(intersect(a, b))
    union = length(a) + length(b) - intersection
    return (intersection/union)
}

ora <- c(ora_ups,ora_dns)
orafix <- c(orafix_ups,orafix_dns)

jaccard(ora,orafix)
## [1] 0.4038462

List gene sets identified only in the “fixed” analysis.

ora <- union(ora_ups,ora_dns)

orafix <- union(orafix_ups,orafix_dns)

setdiff(orafix,ora)
##  [1] "Platelet activation, signaling and aggregation"                                      
##  [2] "Muscle contraction"                                                                  
##  [3] "Sensory processing of sound"                                                         
##  [4] "Sensory processing of sound by inner hair cells of the cochlea"                      
##  [5] "Cell death signalling via NRAGE, NRIF and NADE"                                      
##  [6] "NCAM signaling for neurite out-growth"                                               
##  [7] "Aquaporin-mediated transport"                                                        
##  [8] "NRAGE signals death through JNK"                                                     
##  [9] "Antigen activates B Cell Receptor (BCR) leading to generation of second messengers"  
## [10] "Transcriptional regulation of granulopoiesis"                                        
## [11] "NCAM1 interactions"                                                                  
## [12] "Anti-inflammatory response favouring Leishmania parasite infection"                  
## [13] "Leishmania parasite growth and survival"                                             
## [14] "FCGR activation"                                                                     
## [15] "RHOC GTPase cycle"                                                                   
## [16] "Extracellular matrix organization"                                                   
## [17] "O-glycosylation of TSR domain-containing proteins"                                   
## [18] "GPCR ligand binding"                                                                 
## [19] "RHOD GTPase cycle"                                                                   
## [20] "G alpha (q) signalling events"                                                       
## [21] "Glucagon signaling in metabolic regulation"                                          
## [22] "Integration of energy metabolism"                                                    
## [23] "Cardiac conduction"                                                                  
## [24] "Fatty acyl-CoA biosynthesis"                                                         
## [25] "Adrenaline,noradrenaline inhibits insulin secretion"                                 
## [26] "RHOF GTPase cycle"                                                                   
## [27] "EPH-ephrin mediated repulsion of cells"                                              
## [28] "p75 NTR receptor-mediated signalling"                                                
## [29] "Regulation of insulin secretion"                                                     
## [30] "Metabolism of proteins"                                                              
## [31] "Innate Immune System"                                                                
## [32] "Translation"                                                                         
## [33] "SUMO E3 ligases SUMOylate target proteins"                                           
## [34] "Metabolism of RNA"                                                                   
## [35] "Signaling by NOTCH"                                                                  
## [36] "TNFR2 non-canonical NF-kB pathway"                                                   
## [37] "TCF dependent signaling in response to WNT"                                          
## [38] "Signaling by FGFR2"                                                                  
## [39] "Metabolism"                                                                          
## [40] "Adaptive Immune System"                                                              
## [41] "SARS-CoV-1-host interactions"                                                        
## [42] "Signaling by NTRK1 (TRKA)"                                                           
## [43] "Signaling by ALK fusions and activated point mutants"                                
## [44] "Signaling by ALK in cancer"                                                          
## [45] "Signaling by TGFB family members"                                                    
## [46] "SUMOylation"                                                                         
## [47] "Gene expression (Transcription)"                                                     
## [48] "Intracellular signaling by second messengers"                                        
## [49] "Activation of anterior HOX genes in hindbrain development during early embryogenesis"
## [50] "Activation of HOX genes during differentiation"                                      
## [51] "RHOV GTPase cycle"                                                                   
## [52] "Signaling by FGFR"                                                                   
## [53] "Transcriptional Regulation by TP53"                                                  
## [54] "RNA Polymerase II Transcription"                                                     
## [55] "Signaling by Leptin"                                                                 
## [56] "Signaling by WNT"                                                                    
## [57] "Netrin-1 signaling"                                                                  
## [58] "Generic Transcription Pathway"                                                       
## [59] "TNFs bind their physiological receptors"                                             
## [60] "Ephrin signaling"                                                                    
## [61] "Notch-HLH transcription pathway"                                                     
## [62] "The role of Nef in HIV-1 replication and disease pathogenesis"                       
## [63] "TNF signaling"                                                                       
## [64] "Metabolism of nucleotides"                                                           
## [65] "CD28 dependent Vav1 pathway"                                                         
## [66] "Growth hormone receptor signaling"                                                   
## [67] "TNFR1-induced proapoptotic signaling"                                                
## [68] "Transcriptional regulation by RUNX3"                                                 
## [69] "CD28 co-stimulation"                                                                 
## [70] "PIP3 activates AKT signaling"                                                        
## [71] "Interferon alpha/beta signaling"                                                     
## [72] "Post-translational protein modification"                                             
## [73] "RHOU GTPase cycle"                                                                   
## [74] "Nuclear Receptor transcription pathway"                                              
## [75] "GPVI-mediated activation cascade"                                                    
## [76] "Ribosomal scanning and start codon recognition"                                      
## [77] "Translation initiation complex formation"                                            
## [78] "Signaling by Erythropoietin"

Session information

For reproducibility


Click HERE to show session info

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
##  [2] minfi_1.46.0                                      
##  [3] bumphunter_1.42.0                                 
##  [4] locfit_1.5-9.8                                    
##  [5] iterators_1.0.14                                  
##  [6] foreach_1.5.2                                     
##  [7] Biostrings_2.68.1                                 
##  [8] XVector_0.40.0                                    
##  [9] gplots_3.1.3                                      
## [10] eulerr_7.0.0                                      
## [11] fgsea_1.26.0                                      
## [12] clusterProfiler_4.8.3                             
## [13] kableExtra_1.3.4                                  
## [14] DESeq2_1.40.2                                     
## [15] SummarizedExperiment_1.30.2                       
## [16] Biobase_2.60.0                                    
## [17] MatrixGenerics_1.12.3                             
## [18] matrixStats_1.0.0                                 
## [19] GenomicRanges_1.52.0                              
## [20] GenomeInfoDb_1.36.2                               
## [21] IRanges_2.34.1                                    
## [22] S4Vectors_0.38.1                                  
## [23] BiocGenerics_0.46.0                               
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.1             BiocIO_1.10.0            
##   [3] bitops_1.0-7              ggplotify_0.1.2          
##   [5] filelock_1.0.2            tibble_3.2.1             
##   [7] polyclip_1.10-4           preprocessCore_1.62.1    
##   [9] XML_3.99-0.14             lifecycle_1.0.3          
##  [11] base64_2.0.1              lattice_0.21-8           
##  [13] MASS_7.3-60               scrime_1.3.5             
##  [15] magrittr_2.0.3            sass_0.4.7               
##  [17] limma_3.56.2              rmarkdown_2.24           
##  [19] jquerylib_0.1.4           yaml_2.3.7               
##  [21] askpass_1.2.0             doRNG_1.8.6              
##  [23] cowplot_1.1.1             DBI_1.1.3                
##  [25] RColorBrewer_1.1-3        abind_1.4-5              
##  [27] zlibbioc_1.46.0           quadprog_1.5-8           
##  [29] rvest_1.0.3               purrr_1.0.2              
##  [31] ggraph_2.1.0              RCurl_1.98-1.12          
##  [33] yulab.utils_0.0.9         tweenr_2.0.2             
##  [35] rappdirs_0.3.3            GenomeInfoDbData_1.2.10  
##  [37] enrichplot_1.20.1         ggrepel_0.9.3            
##  [39] tidytree_0.4.5            genefilter_1.82.1        
##  [41] annotate_1.78.0           svglite_2.1.1            
##  [43] DelayedMatrixStats_1.22.6 codetools_0.2-19         
##  [45] DelayedArray_0.26.7       DOSE_3.26.1              
##  [47] xml2_1.3.5                ggforce_0.4.1            
##  [49] tidyselect_1.2.0          aplot_0.2.0              
##  [51] farver_2.1.1              beanplot_1.3.1           
##  [53] viridis_0.6.4             BiocFileCache_2.8.0      
##  [55] illuminaio_0.42.0         webshot_0.5.5            
##  [57] GenomicAlignments_1.36.0  jsonlite_1.8.7           
##  [59] multtest_2.56.0           tidygraph_1.2.3          
##  [61] survival_3.5-7            systemfonts_1.0.4        
##  [63] polylabelr_0.2.0          tools_4.3.1              
##  [65] progress_1.2.2            treeio_1.24.3            
##  [67] Rcpp_1.0.11               glue_1.6.2               
##  [69] gridExtra_2.3             xfun_0.40                
##  [71] qvalue_2.32.0             dplyr_1.1.3              
##  [73] HDF5Array_1.28.1          withr_2.5.0              
##  [75] fastmap_1.1.1             rhdf5filters_1.12.1      
##  [77] fansi_1.0.4               openssl_2.1.0            
##  [79] caTools_1.18.2            digest_0.6.33            
##  [81] R6_2.5.1                  gridGraphics_0.5-1       
##  [83] colorspace_2.1-0          GO.db_3.17.0             
##  [85] gtools_3.9.4              biomaRt_2.56.1           
##  [87] RSQLite_2.3.1             utf8_1.2.3               
##  [89] tidyr_1.3.0               generics_0.1.3           
##  [91] data.table_1.14.8         rtracklayer_1.60.1       
##  [93] prettyunits_1.1.1         graphlayouts_1.0.0       
##  [95] httr_1.4.7                S4Arrays_1.0.6           
##  [97] scatterpie_0.2.1          pkgconfig_2.0.3          
##  [99] gtable_0.3.4              blob_1.2.4               
## [101] siggenes_1.74.0           shadowtext_0.1.2         
## [103] htmltools_0.5.6           scales_1.2.1             
## [105] png_0.1-8                 ggfun_0.1.2              
## [107] knitr_1.43                rstudioapi_0.15.0        
## [109] tzdb_0.4.0                reshape2_1.4.4           
## [111] rjson_0.2.21              nlme_3.1-163             
## [113] curl_5.0.2                cachem_1.0.8             
## [115] rhdf5_2.44.0              stringr_1.5.0            
## [117] KernSmooth_2.23-22        HDO.db_0.99.1            
## [119] AnnotationDbi_1.62.2      restfulr_0.0.15          
## [121] GEOquery_2.68.0           reshape_0.8.9            
## [123] pillar_1.9.0              grid_4.3.1               
## [125] vctrs_0.6.3               dbplyr_2.3.3             
## [127] xtable_1.8-4              evaluate_0.21            
## [129] readr_2.1.4               GenomicFeatures_1.52.2   
## [131] cli_3.6.1                 compiler_4.3.1           
## [133] Rsamtools_2.16.0          rlang_1.1.1              
## [135] crayon_1.5.2              rngtools_1.5.2           
## [137] nor1mix_1.3-0             mclust_6.0.0             
## [139] plyr_1.8.8                stringi_1.7.12           
## [141] viridisLite_0.4.2         BiocParallel_1.34.2      
## [143] munsell_0.5.0             lazyeval_0.2.2           
## [145] GOSemSim_2.26.1           Matrix_1.6-1             
## [147] hms_1.1.3                 patchwork_1.1.3          
## [149] sparseMatrixStats_1.12.2  bit64_4.0.5              
## [151] Rhdf5lib_1.22.0           ggplot2_3.4.3            
## [153] KEGGREST_1.40.0           highr_0.10               
## [155] igraph_1.5.1              memoise_2.0.1            
## [157] bslib_0.5.1               ggtree_3.8.2             
## [159] fastmatch_1.1-4           bit_4.0.5                
## [161] downloader_0.4            ape_5.7-1                
## [163] gson_0.1.0