Source: https://github.com/markziemann/enrichment_recipe

Introduction

This guide is a Rmarkdown script that conducts differential expression and enrichment analysis, which are very popular workflows for transcriptome data.

The goal of this work is to understand how ClusterProfiler manages the background list, as we have observed some weird behaviour. In particular we see that when we provide a custom background, those genes do not appear in the final analysis unless they are also included in the gene list.

In the code chunk below called libs, you can add and remove required R library dependancies. Check that the libraries listed here match the Dockerfile, otherwise you might get errors.

suppressPackageStartupMessages({
  library("DESeq2")
  library("kableExtra")
  library("clusterProfiler")
  library("fgsea")
  library("eulerr")
  library("gplots")
  library("IlluminaHumanMethylation450kanno.ilmn12.hg19")
})

For this report I will be using proteomics data from Stukalov et al (https://pubmed.ncbi.nlm.nih.gov/33845483/), who profiled the effect of SARS-COV2 infection on A549 cells.

Here we used the 24hr profile, as there were relatively more changes found.

Probable contaminants were removed, as well as viral proteins. This left 5825 proteins, including 165 downregulated and 44 upregulated.

The data were obtained from Supplementary Table 5 and saved to the /data folder.

x <- read.table("../data/27651444_smoking_current_vs_never_smoking_16938071371108067.tsv",header=TRUE,sep="\t")

Get background gene list.

anno = getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
bg <- anno$UCSC_RefGene_Name
bg <- unique(unlist(strsplit(bg,";")))
length(bg)

## [1] 21231

Get up and down-methylated list.

defup <- subset(x,beta>0)$gene
defup <- defup[defup!="-"]
defup <- unique(unlist(strsplit(defup,";")))
length(defup)

## [1] 2137

defdn <- subset(x,beta<0)$gene
defdn <- defdn[defdn!="-"]
defdn <- unique(unlist(strsplit(defdn,";")))
length(defdn)

## [1] 2786

Gene sets. Get the gene sets loaded in R. These are KEGG gene sets

# from https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp 16th June 2023
genesets2 <- read.gmt("../ref/c2.cp.kegg.v2023.1.Hs.symbols.gmt")
gsets <- gmtPathways("../ref/c2.cp.kegg.v2023.1.Hs.symbols.gmt")

message(paste("number of genes described in the annotation set:",length(unique(genesets2$gene))))

## number of genes described in the annotation set: 5244

KEGG pathway analysis

In the next sections I will run enrichment analysis with over-representation analysis (ORA) test and compare it to functional class scoring. I will also investigate some strange behviour of the ORA tool clusterprofiler.

ORA with Clusterprofiler custom background with otherwise default analysis

I’ve compiled a reporting checklist:

Reporting criteria	Method/resource used
Origin of gene sets	KEGG (2023-06-16)
Tool used	ClusterProfiler (check version at foot of report)
Statistical test used	hypergeometric test
P-value correction for multiple comparisons	FDR method
Background list	Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria	DESeq2 FDR<0.05
ORA directionality	Separate tests for up- and down-regulation
Data availability	Stukalov et al 2021: PMID 33845483
Other parameters	Min gene set size of 10

Here I provide a custom background gene list.

Enrichment firstly with upregulated genes

n_pw=10

ora_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_up$geneID <- NULL
ora_up <- subset(ora_up,p.adjust<0.05 & Count >=5)
ora_ups <- rownames(ora_up)

gr <- as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",2))

ora_up$es <- gr/br
ora_up <- ora_up[order(-ora_up$es),]
ora_up$Description=NULL

head(ora_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)

Top upregulated pathways in LPS treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
KEGG_HEMATOPOIETIC_CELL_LINEAGE	22/556	82/4904	0.0000742	0.0129138	0.0117185	22	2.366380
KEGG_FOCAL_ADHESION	38/556	195/4904	0.0004755	0.0413714	0.0375421	38	1.718797

topup2 <- rev(head(ora_up$es,10))
names(topup2) <- rev(head(ora_up$ID,10))

The reported background size does not match the dataset.

Now repeat with the downregulated genes

ora_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_dn$geneID <- NULL
ora_dn <- subset(ora_dn,p.adjust<0.05 & Count >=5)
ora_dns <- rownames(ora_dn)

gr <- as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",2))

ora_dn$es <- gr/br
ora_dn <- ora_dn[order(-ora_dn$es),]
ora_dn$Description=NULL

head(ora_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)

Top downregulated pathways in LPS treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
KEGG_RIBOSOME	31/751	87/4904	0.0000020	0.0003566	0.0002965	31	2.326767
KEGG_PATHOGENIC_ESCHERICHIA_COLI_INFECTION	18/751	53/4904	0.0005646	0.0330185	0.0274496	18	2.217722
KEGG_CHRONIC_MYELOID_LEUKEMIA	22/751	73/4904	0.0009380	0.0330185	0.0274496	22	1.967933
KEGG_REGULATION_OF_ACTIN_CYTOSKELETON	49/751	208/4904	0.0009285	0.0330185	0.0274496	49	1.538308
KEGG_PATHWAYS_IN_CANCER	71/751	320/4904	0.0004751	0.0330185	0.0274496	71	1.448835
KEGG_MAPK_SIGNALING_PATHWAY	58/751	262/4904	0.0016565	0.0485894	0.0403943	58	1.445564

topdn2 <- head(ora_dn$es,n_pw)
names(topdn2) <- head(ora_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE KEGGs",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA test")
}

ORA with Clusterprofiler with custom background and special gene set list to fix the potential bug

I’ve compiled a reporting checklist:

Reporting criteria	Method/resource used
Origin of gene sets	KEGG (2023-06-16)
Tool used	ClusterProfiler (check version at foot of report)
Statistical test used	hypergeometric test
P-value correction for multiple comparisons	FDR method
Background list	Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria	DESeq2 FDR<0.05
ORA directionality	Separate tests for up- and down-regulation
Data availability	Data availability
Other parameters	Min gene set size of 10 + BUG FIX

Here I provide a background gene list for cluterprofiler to use, like a user should.

I have also done some modification to the KEGG gene sets, to en

Now filter the gene names into three lists, up-regulated, down-regulated and background. Background is simply all genes that were detected.

Adding all detected genes to the background appears to improve results!

bgdf <- data.frame("background",bg)
colnames(bgdf) <- c("term","gene")
genesets2 <- rbind(genesets2,bgdf)

Enrichment firstly with upregulated genes

n_pw=10

orafix_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_up$geneID <- NULL
orafix_up <- subset(orafix_up,p.adjust<0.05 & Count >=5)
orafix_ups <- rownames(orafix_up)

gr <- as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",2))

orafix_up$es <- gr/br
orafix_up <- orafix_up[order(-orafix_up$es),]
orafix_up$Description=NULL

head(orafix_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)

Top upregulated pathways in LPS treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION	12/2137	44/21231	0.0010285	0.0255655	0.0219619	12	2.709533
KEGG_HEMATOPOIETIC_CELL_LINEAGE	22/2137	82/21231	0.0000135	0.0023442	0.0020138	22	2.665476
KEGG_AXON_GUIDANCE	26/2137	127/21231	0.0003362	0.0146231	0.0125619	26	2.033928
KEGG_FOCAL_ADHESION	38/2137	195/21231	0.0000517	0.0045008	0.0038664	38	1.936043
KEGG_REGULATION_OF_ACTIN_CYTOSKELETON	39/2137	208/21231	0.0000996	0.0057756	0.0049615	39	1.862804
KEGG_CHEMOKINE_SIGNALING_PATHWAY	32/2137	185/21231	0.0016460	0.0358010	0.0307546	32	1.718479

topup2 <- rev(head(orafix_up$es,10))
names(topup2) <- rev(head(orafix_up$ID,10))

Now with the downregulated genes

orafix_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_dn$geneID <- NULL
orafix_dn <- subset(orafix_dn,p.adjust<0.05 & Count >=5)
orafix_dns <- rownames(orafix_dn)

gr <- as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",2))

orafix_dn$es <- gr/br
orafix_dn <- orafix_dn[order(-orafix_dn$es),]
orafix_dn$Description=NULL

head(orafix_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)

Top downregulated pathways in LPS treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
KEGG_RIBOSOME	31/2786	87/21231	0.0000001	0.0000137	0.0000099	31	2.715387
KEGG_TYPE_I_DIABETES_MELLITUS	14/2786	41/21231	0.0004743	0.0092754	0.0067125	14	2.602157
KEGG_PATHOGENIC_ESCHERICHIA_COLI_INFECTION	18/2786	53/21231	0.0000838	0.0029482	0.0021336	18	2.588129
KEGG_GRAFT_VERSUS_HOST_DISEASE	12/2786	37/21231	0.0019706	0.0165151	0.0119517	12	2.471547
KEGG_ALLOGRAFT_REJECTION	11/2786	35/21231	0.0039553	0.0248619	0.0179922	11	2.395047
KEGG_CHRONIC_MYELOID_LEUKEMIA	22/2786	73/21231	0.0001115	0.0032705	0.0023668	22	2.296620
KEGG_PANCREATIC_CANCER	20/2786	69/21231	0.0003989	0.0087753	0.0063506	20	2.208870
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY	19/2786	66/21231	0.0006059	0.0104657	0.0075738	19	2.193810
KEGG_TYPE_II_DIABETES_MELLITUS	13/2786	46/21231	0.0050390	0.0305816	0.0221314	13	2.153649
KEGG_ADHERENS_JUNCTION	19/2786	68/21231	0.0009063	0.0106334	0.0076952	19	2.129286

topdn2 <- head(orafix_dn$es,n_pw)
names(topdn2) <- head(orafix_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE KEGGs",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA fix")
}

Compare pathway results

v1 <- list("ORA up"=ora_ups, "ORA fix up"=orafix_ups)
v2 <- list("ORA dn"=ora_dns, "ORA fix dn"=orafix_dns)
v3 <- list("ORA"=union(ora_ups,ora_dns),
  "ORA fix"=union(orafix_ups,orafix_dns))

par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2),main="upregulated KEGG pathways")

plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2),main="downregulated KEGG pathways")

plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated KEGG pathways")

svg("prot_kegg_euler.svg")
plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated KEGG pathways")
dev.off()

## png 
##   2

Jaccard index comparing ORA and ORA fix.

jaccard <- function(a, b) {
    intersection = length(intersect(a, b))
    union = length(a) + length(b) - intersection
    return (intersection/union)
}

ora <- c(ora_ups,ora_dns)
orafix <- c(orafix_ups,orafix_dns)

jaccard(ora,orafix)

## [1] 0.195122

List gene sets identified only in the “fixed” analysis.

ora <- union(ora_ups,ora_dns)

orafix <- union(orafix_ups,orafix_dns)

setdiff(orafix,ora)

##  [1] "KEGG_AXON_GUIDANCE"                           
##  [2] "KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION" 
##  [3] "KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION"
##  [4] "KEGG_CHEMOKINE_SIGNALING_PATHWAY"             
##  [5] "KEGG_ENDOCYTOSIS"                             
##  [6] "KEGG_PANCREATIC_CANCER"                       
##  [7] "KEGG_TYPE_I_DIABETES_MELLITUS"                
##  [8] "KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY"         
##  [9] "KEGG_NEUROTROPHIN_SIGNALING_PATHWAY"          
## [10] "KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION"     
## [11] "KEGG_ADHERENS_JUNCTION"                       
## [12] "KEGG_VIRAL_MYOCARDITIS"                       
## [13] "KEGG_INSULIN_SIGNALING_PATHWAY"               
## [14] "KEGG_MELANOMA"                                
## [15] "KEGG_RENAL_CELL_CARCINOMA"                    
## [16] "KEGG_FC_GAMMA_R_MEDIATED_PHAGOCYTOSIS"        
## [17] "KEGG_GRAFT_VERSUS_HOST_DISEASE"               
## [18] "KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY"       
## [19] "KEGG_PYRIMIDINE_METABOLISM"                   
## [20] "KEGG_JAK_STAT_SIGNALING_PATHWAY"              
## [21] "KEGG_ERBB_SIGNALING_PATHWAY"                  
## [22] "KEGG_NON_SMALL_CELL_LUNG_CANCER"              
## [23] "KEGG_PHOSPHATIDYLINOSITOL_SIGNALING_SYSTEM"   
## [24] "KEGG_ALLOGRAFT_REJECTION"                     
## [25] "KEGG_TYPE_II_DIABETES_MELLITUS"               
## [26] "KEGG_COLORECTAL_CANCER"                       
## [27] "KEGG_TGF_BETA_SIGNALING_PATHWAY"              
## [28] "KEGG_TIGHT_JUNCTION"                          
## [29] "KEGG_INOSITOL_PHOSPHATE_METABOLISM"

Reactome Gene sets

ORA with Clusterprofiler custom background with otherwise default analysis

I’ve compiled a reporting checklist:

Reporting criteria	Method/resource used
Origin of gene sets	Reactome (2023-03-06)
Tool used	ClusterProfiler (check version at foot of report)
Statistical test used	hypergeometric test
P-value correction for multiple comparisons	FDR method
Background list	Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria	DESeq2 FDR<0.05
ORA directionality	Separate tests for up- and down-regulation
Data availability	Data availability
Other parameters	Min gene set size of 10

Here I provide a custom background gene list.

Get the gene sets loaded in R. These are Reactome gene sets

genesets2 <- read.gmt("../ref/ReactomePathways_2023-03-06.gmt")
gsets <- gmtPathways("../ref/ReactomePathways_2023-03-06.gmt")

message(paste("number of genes described in the annotation set:",length(unique(genesets2$gene))))

## number of genes described in the annotation set: 11457

Enrichment firstly with upregulated genes

n_pw=10

ora_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_up$geneID <- NULL
ora_up <- subset(ora_up,p.adjust<0.05 & Count >=5)
ora_ups <- rownames(ora_up)

gr <- as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",2))

ora_up$es <- gr/br
ora_up <- ora_up[order(-ora_up$es),]
ora_up$Description=NULL

head(ora_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)

Top upregulated pathways in LPS treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
NOTCH3 Intracellular Domain Regulates Transcription	10/1129	25/10174	0.0001856	0.0238237	0.0220051	10	3.604606
FCGR3A-mediated IL10 synthesis	13/1129	38/10174	0.0001364	0.0196945	0.0181911	13	3.082887
Vasopressin regulates renal water homeostasis via Aquaporins	14/1129	43/10174	0.0001395	0.0196945	0.0181911	14	2.933981
Signaling by NOTCH3	15/1129	48/10174	0.0001371	0.0196945	0.0181911	15	2.816098
RHOB GTPase cycle	17/1129	63/10174	0.0003666	0.0398149	0.0367755	17	2.431678
G alpha (12/13) signalling events	19/1129	73/10174	0.0002815	0.0331221	0.0305936	19	2.345463

topup2 <- rev(head(ora_up$es,10))
names(topup2) <- rev(head(ora_up$ID,10))

Now repeat with the downregulated genes

ora_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_dn$geneID <- NULL
ora_dn <- subset(ora_dn,p.adjust<0.05 & Count >=5)
ora_dns <- rownames(ora_dn)

gr <- as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",2))

ora_dn$es <- gr/br
ora_dn <- ora_dn[order(-ora_dn$es),]
ora_dn$Description=NULL

head(ora_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)

Top downregulated pathways in LPS treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
Sema3A PAK dependent Axon repulsion	9/1484	16/10174	0.0001263	0.0052774	0.0047241	9	3.856385
FGFR2 alternative splicing	11/1484	25/10174	0.0003788	0.0126604	0.0113330	11	3.016550
Eukaryotic Translation Elongation	35/1484	94/10174	0.0000000	0.0000329	0.0000295	35	2.552690
Peptide chain elongation	32/1484	90/10174	0.0000006	0.0001049	0.0000939	32	2.437616
Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC)	34/1484	96/10174	0.0000003	0.0000694	0.0000621	34	2.428094
RUNX1 interacts with co-factors whose precise effect on RUNX1 targets is not known	13/1484	37/10174	0.0014930	0.0423680	0.0379260	13	2.408793
SARS-CoV-1 modulates host translation machinery	13/1484	37/10174	0.0014930	0.0423680	0.0379260	13	2.408793
Eukaryotic Translation Termination	33/1484	94/10174	0.0000005	0.0001049	0.0000939	33	2.406822
Viral mRNA Translation	31/1484	90/10174	0.0000018	0.0001971	0.0001764	31	2.361441
Nonsense Mediated Decay (NMD) enhanced by the Exon Junction Complex (EJC)	39/1484	114/10174	0.0000001	0.0000411	0.0000368	39	2.345404

topdn2 <- head(ora_dn$es,n_pw)
names(topdn2) <- head(ora_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE Reactomes",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA test")
}

ORA with Clusterprofiler with custom background and special gene set list to fix the potential bug

I’ve compiled a reporting checklist:

Reporting criteria	Method/resource used
Origin of gene sets	Reactome (2023-03-06)
Tool used	ClusterProfiler (check version at foot of report)
Statistical test used	hypergeometric test
P-value correction for multiple comparisons	FDR method
Background list	Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria	DESeq2 FDR<0.05
ORA directionality	Separate tests for up- and down-regulation
Data availability	Data availability
Other parameters	Min gene set size of 10 + BUG FIX

Adding all detected genes to the background appears to improve results!

bgdf <- data.frame("background",bg)
colnames(bgdf) <- c("term","gene")
genesets2 <- rbind(genesets2,bgdf)

Enrichment firstly with upregulated genes

orafix_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_up$geneID <- NULL
orafix_up <- subset(orafix_up,p.adjust<0.05 & Count >=5)
orafix_ups <- rownames(orafix_up)

gr <- as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",2))

orafix_up$es <- gr/br
orafix_up <- orafix_up[order(-orafix_up$es),]
orafix_up$Description=NULL

head(orafix_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)

Top upregulated pathways in LPS treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
FCGR activation	6/2137	12/21231	0.0005580	0.0256074	0.0228126	6	4.967478
NOTCH3 Intracellular Domain Regulates Transcription	10/2137	25/21231	0.0000824	0.0089505	0.0079736	10	3.973982
FCGR3A-mediated IL10 synthesis	13/2137	38/21231	0.0000506	0.0064933	0.0057846	13	3.398801
Antigen activates B Cell Receptor (BCR) leading to generation of second messengers	10/2137	30/21231	0.0004734	0.0256074	0.0228126	10	3.311652
Transcriptional regulation of granulopoiesis	10/2137	30/21231	0.0004734	0.0256074	0.0228126	10	3.311652
Vasopressin regulates renal water homeostasis via Aquaporins	14/2137	43/21231	0.0000491	0.0064933	0.0057846	14	3.234637

topup2 <- rev(head(orafix_up$es,10))
names(topup2) <- rev(head(orafix_up$ID,10))

Now with the downregulated genes

n_pw=10

orafix_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 5000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_dn$geneID <- NULL
orafix_dn <- subset(orafix_dn,p.adjust<0.05 & Count >=5)
orafix_dns <- rownames(orafix_dn)

gr <- as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",2))

orafix_dn$es <- gr/br
orafix_dn <- orafix_dn[order(-orafix_dn$es),]
orafix_dn$Description=NULL

head(orafix_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
  kable_paper("hover", full_width = F)

Top downregulated pathways in LPS treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
Sema3A PAK dependent Axon repulsion	9/2786	16/21231	0.0000545	0.0017439	0.0014072	9	4.286589
Signaling by Leptin	6/2786	11/21231	0.0012988	0.0244175	0.0197042	6	4.156693
CD28 dependent Vav1 pathway	6/2786	12/21231	0.0023099	0.0376382	0.0303729	6	3.810302
FGFR2 alternative splicing	11/2786	25/21231	0.0001474	0.0043483	0.0035089	11	3.353065
Ephrin signaling	8/2786	19/21231	0.0017106	0.0302683	0.0244257	8	3.208675
Growth hormone receptor signaling	9/2786	24/21231	0.0023232	0.0376382	0.0303729	9	2.857726
TNFR1-induced proapoptotic signaling	9/2786	24/21231	0.0023232	0.0376382	0.0303729	9	2.857726
Eukaryotic Translation Elongation	35/2786	94/21231	0.0000000	0.0000011	0.0000009	35	2.837459
TNFs bind their physiological receptors	10/2786	27/21231	0.0014960	0.0267850	0.0216147	10	2.822446
Signaling by Erythropoietin	9/2786	25/21231	0.0032121	0.0460101	0.0371288	9	2.743417

topdn2 <- head(orafix_dn$es,n_pw)
names(topdn2) <- head(orafix_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE Reactomes",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA fix")
}

Compare pathway results

v1 <- list("ORA up"=ora_ups, "ORA fix up"=orafix_ups)
v2 <- list("ORA dn"=ora_dns, "ORA fix dn"=orafix_dns)
v3 <- list("ORA"=union(ora_ups,ora_dns),
  "ORA fix"=union(orafix_ups,orafix_dns))

par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2),main="upregulated Reactomes")

plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2),main="downregulated Reactomes")

plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated Reactomes")

svg("prot_reactome_euler.svg")
plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated Reactomes")
dev.off()

## png 
##   2

Jaccard index comparing ORA and ORA fix.

jaccard <- function(a, b) {
    intersection = length(intersect(a, b))
    union = length(a) + length(b) - intersection
    return (intersection/union)
}

ora <- c(ora_ups,ora_dns)
orafix <- c(orafix_ups,orafix_dns)

jaccard(ora,orafix)

## [1] 0.4038462

List gene sets identified only in the “fixed” analysis.

ora <- union(ora_ups,ora_dns)

orafix <- union(orafix_ups,orafix_dns)

setdiff(orafix,ora)

##  [1] "Platelet activation, signaling and aggregation"                                      
##  [2] "Muscle contraction"                                                                  
##  [3] "Sensory processing of sound"                                                         
##  [4] "Sensory processing of sound by inner hair cells of the cochlea"                      
##  [5] "Cell death signalling via NRAGE, NRIF and NADE"                                      
##  [6] "NCAM signaling for neurite out-growth"                                               
##  [7] "Aquaporin-mediated transport"                                                        
##  [8] "NRAGE signals death through JNK"                                                     
##  [9] "Antigen activates B Cell Receptor (BCR) leading to generation of second messengers"  
## [10] "Transcriptional regulation of granulopoiesis"                                        
## [11] "NCAM1 interactions"                                                                  
## [12] "Anti-inflammatory response favouring Leishmania parasite infection"                  
## [13] "Leishmania parasite growth and survival"                                             
## [14] "FCGR activation"                                                                     
## [15] "RHOC GTPase cycle"                                                                   
## [16] "Extracellular matrix organization"                                                   
## [17] "O-glycosylation of TSR domain-containing proteins"                                   
## [18] "GPCR ligand binding"                                                                 
## [19] "RHOD GTPase cycle"                                                                   
## [20] "G alpha (q) signalling events"                                                       
## [21] "Glucagon signaling in metabolic regulation"                                          
## [22] "Integration of energy metabolism"                                                    
## [23] "Cardiac conduction"                                                                  
## [24] "Fatty acyl-CoA biosynthesis"                                                         
## [25] "Adrenaline,noradrenaline inhibits insulin secretion"                                 
## [26] "RHOF GTPase cycle"                                                                   
## [27] "EPH-ephrin mediated repulsion of cells"                                              
## [28] "p75 NTR receptor-mediated signalling"                                                
## [29] "Regulation of insulin secretion"                                                     
## [30] "Metabolism of proteins"                                                              
## [31] "Innate Immune System"                                                                
## [32] "Translation"                                                                         
## [33] "SUMO E3 ligases SUMOylate target proteins"                                           
## [34] "Metabolism of RNA"                                                                   
## [35] "Signaling by NOTCH"                                                                  
## [36] "TNFR2 non-canonical NF-kB pathway"                                                   
## [37] "TCF dependent signaling in response to WNT"                                          
## [38] "Signaling by FGFR2"                                                                  
## [39] "Metabolism"                                                                          
## [40] "Adaptive Immune System"                                                              
## [41] "SARS-CoV-1-host interactions"                                                        
## [42] "Signaling by NTRK1 (TRKA)"                                                           
## [43] "Signaling by ALK fusions and activated point mutants"                                
## [44] "Signaling by ALK in cancer"                                                          
## [45] "Signaling by TGFB family members"                                                    
## [46] "SUMOylation"                                                                         
## [47] "Gene expression (Transcription)"                                                     
## [48] "Intracellular signaling by second messengers"                                        
## [49] "Activation of anterior HOX genes in hindbrain development during early embryogenesis"
## [50] "Activation of HOX genes during differentiation"                                      
## [51] "RHOV GTPase cycle"                                                                   
## [52] "Signaling by FGFR"                                                                   
## [53] "Transcriptional Regulation by TP53"                                                  
## [54] "RNA Polymerase II Transcription"                                                     
## [55] "Signaling by Leptin"                                                                 
## [56] "Signaling by WNT"                                                                    
## [57] "Netrin-1 signaling"                                                                  
## [58] "Generic Transcription Pathway"                                                       
## [59] "TNFs bind their physiological receptors"                                             
## [60] "Ephrin signaling"                                                                    
## [61] "Notch-HLH transcription pathway"                                                     
## [62] "The role of Nef in HIV-1 replication and disease pathogenesis"                       
## [63] "TNF signaling"                                                                       
## [64] "Metabolism of nucleotides"                                                           
## [65] "CD28 dependent Vav1 pathway"                                                         
## [66] "Growth hormone receptor signaling"                                                   
## [67] "TNFR1-induced proapoptotic signaling"                                                
## [68] "Transcriptional regulation by RUNX3"                                                 
## [69] "CD28 co-stimulation"                                                                 
## [70] "PIP3 activates AKT signaling"                                                        
## [71] "Interferon alpha/beta signaling"                                                     
## [72] "Post-translational protein modification"                                             
## [73] "RHOU GTPase cycle"                                                                   
## [74] "Nuclear Receptor transcription pathway"                                              
## [75] "GPVI-mediated activation cascade"                                                    
## [76] "Ribosomal scanning and start codon recognition"                                      
## [77] "Translation initiation complex formation"                                            
## [78] "Signaling by Erythropoietin"

Session information

For reproducibility

Click HERE to show session info

sessionInfo()

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
##  [2] minfi_1.46.0                                      
##  [3] bumphunter_1.42.0                                 
##  [4] locfit_1.5-9.8                                    
##  [5] iterators_1.0.14                                  
##  [6] foreach_1.5.2                                     
##  [7] Biostrings_2.68.1                                 
##  [8] XVector_0.40.0                                    
##  [9] gplots_3.1.3                                      
## [10] eulerr_7.0.0                                      
## [11] fgsea_1.26.0                                      
## [12] clusterProfiler_4.8.3                             
## [13] kableExtra_1.3.4                                  
## [14] DESeq2_1.40.2                                     
## [15] SummarizedExperiment_1.30.2                       
## [16] Biobase_2.60.0                                    
## [17] MatrixGenerics_1.12.3                             
## [18] matrixStats_1.0.0                                 
## [19] GenomicRanges_1.52.0                              
## [20] GenomeInfoDb_1.36.2                               
## [21] IRanges_2.34.1                                    
## [22] S4Vectors_0.38.1                                  
## [23] BiocGenerics_0.46.0                               
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.1             BiocIO_1.10.0            
##   [3] bitops_1.0-7              ggplotify_0.1.2          
##   [5] filelock_1.0.2            tibble_3.2.1             
##   [7] polyclip_1.10-4           preprocessCore_1.62.1    
##   [9] XML_3.99-0.14             lifecycle_1.0.3          
##  [11] base64_2.0.1              lattice_0.21-8           
##  [13] MASS_7.3-60               scrime_1.3.5             
##  [15] magrittr_2.0.3            sass_0.4.7               
##  [17] limma_3.56.2              rmarkdown_2.24           
##  [19] jquerylib_0.1.4           yaml_2.3.7               
##  [21] askpass_1.2.0             doRNG_1.8.6              
##  [23] cowplot_1.1.1             DBI_1.1.3                
##  [25] RColorBrewer_1.1-3        abind_1.4-5              
##  [27] zlibbioc_1.46.0           quadprog_1.5-8           
##  [29] rvest_1.0.3               purrr_1.0.2              
##  [31] ggraph_2.1.0              RCurl_1.98-1.12          
##  [33] yulab.utils_0.0.9         tweenr_2.0.2             
##  [35] rappdirs_0.3.3            GenomeInfoDbData_1.2.10  
##  [37] enrichplot_1.20.1         ggrepel_0.9.3            
##  [39] tidytree_0.4.5            genefilter_1.82.1        
##  [41] annotate_1.78.0           svglite_2.1.1            
##  [43] DelayedMatrixStats_1.22.6 codetools_0.2-19         
##  [45] DelayedArray_0.26.7       DOSE_3.26.1              
##  [47] xml2_1.3.5                ggforce_0.4.1            
##  [49] tidyselect_1.2.0          aplot_0.2.0              
##  [51] farver_2.1.1              beanplot_1.3.1           
##  [53] viridis_0.6.4             BiocFileCache_2.8.0      
##  [55] illuminaio_0.42.0         webshot_0.5.5            
##  [57] GenomicAlignments_1.36.0  jsonlite_1.8.7           
##  [59] multtest_2.56.0           tidygraph_1.2.3          
##  [61] survival_3.5-7            systemfonts_1.0.4        
##  [63] polylabelr_0.2.0          tools_4.3.1              
##  [65] progress_1.2.2            treeio_1.24.3            
##  [67] Rcpp_1.0.11               glue_1.6.2               
##  [69] gridExtra_2.3             xfun_0.40                
##  [71] qvalue_2.32.0             dplyr_1.1.3              
##  [73] HDF5Array_1.28.1          withr_2.5.0              
##  [75] fastmap_1.1.1             rhdf5filters_1.12.1      
##  [77] fansi_1.0.4               openssl_2.1.0            
##  [79] caTools_1.18.2            digest_0.6.33            
##  [81] R6_2.5.1                  gridGraphics_0.5-1       
##  [83] colorspace_2.1-0          GO.db_3.17.0             
##  [85] gtools_3.9.4              biomaRt_2.56.1           
##  [87] RSQLite_2.3.1             utf8_1.2.3               
##  [89] tidyr_1.3.0               generics_0.1.3           
##  [91] data.table_1.14.8         rtracklayer_1.60.1       
##  [93] prettyunits_1.1.1         graphlayouts_1.0.0       
##  [95] httr_1.4.7                S4Arrays_1.0.6           
##  [97] scatterpie_0.2.1          pkgconfig_2.0.3          
##  [99] gtable_0.3.4              blob_1.2.4               
## [101] siggenes_1.74.0           shadowtext_0.1.2         
## [103] htmltools_0.5.6           scales_1.2.1             
## [105] png_0.1-8                 ggfun_0.1.2              
## [107] knitr_1.43                rstudioapi_0.15.0        
## [109] tzdb_0.4.0                reshape2_1.4.4           
## [111] rjson_0.2.21              nlme_3.1-163             
## [113] curl_5.0.2                cachem_1.0.8             
## [115] rhdf5_2.44.0              stringr_1.5.0            
## [117] KernSmooth_2.23-22        HDO.db_0.99.1            
## [119] AnnotationDbi_1.62.2      restfulr_0.0.15          
## [121] GEOquery_2.68.0           reshape_0.8.9            
## [123] pillar_1.9.0              grid_4.3.1               
## [125] vctrs_0.6.3               dbplyr_2.3.3             
## [127] xtable_1.8-4              evaluate_0.21            
## [129] readr_2.1.4               GenomicFeatures_1.52.2   
## [131] cli_3.6.1                 compiler_4.3.1           
## [133] Rsamtools_2.16.0          rlang_1.1.1              
## [135] crayon_1.5.2              rngtools_1.5.2           
## [137] nor1mix_1.3-0             mclust_6.0.0             
## [139] plyr_1.8.8                stringi_1.7.12           
## [141] viridisLite_0.4.2         BiocParallel_1.34.2      
## [143] munsell_0.5.0             lazyeval_0.2.2           
## [145] GOSemSim_2.26.1           Matrix_1.6-1             
## [147] hms_1.1.3                 patchwork_1.1.3          
## [149] sparseMatrixStats_1.12.2  bit64_4.0.5              
## [151] Rhdf5lib_1.22.0           ggplot2_3.4.3            
## [153] KEGGREST_1.40.0           highr_0.10               
## [155] igraph_1.5.1              memoise_2.0.1            
## [157] bslib_0.5.1               ggtree_3.8.2             
## [159] fastmatch_1.1-4           bit_4.0.5                
## [161] downloader_0.4            ape_5.7-1                
## [163] gson_0.1.0

Examining background gene lists in clusterProfiler - methylation example

Mark Ziemann & Anusuiya Bora

2023-09-04

Introduction

KEGG pathway analysis

ORA with Clusterprofiler custom background with otherwise default analysis

ORA with Clusterprofiler with custom background and special gene set list to fix the potential bug

Compare pathway results

Reactome Gene sets

ORA with Clusterprofiler custom background with otherwise default analysis

ORA with Clusterprofiler with custom background and special gene set list to fix the potential bug

Compare pathway results

Session information