Source: https://github.com/markziemann/background

Introduction

This guide is a Rmarkdown script that conducts differential expression and enrichment analysis, which are very popular workflows for transcriptome data.

The goal of this work is to understand how ClusterProfiler manages the background list, as we have observed some weird behaviour. In particular we see that when we provide a custom background, those genes do not appear in the final analysis unless they are also included in the gene list.

In the code chunk below called libs, you can add and remove required R library dependancies. Check that the libraries listed here match the Dockerfile, otherwise you might get errors.

suppressPackageStartupMessages({
  library("getDEE2")
  library("DESeq2")
  library("kableExtra")
  library("clusterProfiler")
  library("fgsea")
  library("eulerr")
  library("gplots")
})

For this guide I will be using bulk RNA-seq data from a previous study, which is deposited at NCBI GEO and SRA under accession numbers: GSE55123 and SRP038101 (Lund et al, 2014). The experiment is designed to investigate the effect of Azacitidine treatment on AML3 cells.

The raw data have been processed by the DEE2 project, and the summary gene expression counts are available at the dee2.io website, and programmatically with the getDEE2 bioconductor package (Ziemann et al, 2019).

Alternatively, you could fetch data from another resource like NCBI GEO, Zenodo or from the host storage drive.

mdat <- getDEE2Metadata("hsapiens")

# get sample sheet
ss <- subset(mdat,SRP_accession=="SRP038101")

# fetch the whole set of RNA-seq data
x <- getDEE2("hsapiens",ss$SRR_accession , metadata=mdat, legacy=TRUE)

## For more information about DEE2 QC metrics, visit
##     https://github.com/markziemann/dee2/blob/master/qc/qc_metrics.md

mx <- x$GeneCounts
rownames(mx) <- paste(rownames(mx),x$GeneInfo$GeneSymbol)
dim(mx)

## [1] 58302     6

# aza no filtering
ss$trt <- grepl("Treated",ss$Experiment_title)

ss %>%
  kbl(caption="sample sheet for Aza treatment in AML3 cells") %>%
  kable_paper("hover", full_width = F)

sample sheet for Aza treatment in AML3 cells
	SRR_accession	QC_summary	SRX_accession	SRS_accession	SRP_accession	Experiment_title	trt
156267	SRR1171523	PASS	SRX472607	SRS559064	SRP038101	GSM1329859: Untreated.1; Homo sapiens; RNA-Seq	FALSE
156268	SRR1171524	WARN(3,4)	SRX472608	SRS559066	SRP038101	GSM1329860: Untreated.2; Homo sapiens; RNA-Seq	FALSE
156269	SRR1171525	WARN(3,4)	SRX472609	SRS559065	SRP038101	GSM1329861: Untreated.3; Homo sapiens; RNA-Seq	FALSE
156270	SRR1171526	WARN(3,4)	SRX472610	SRS559068	SRP038101	GSM1329862: Treated.1; Homo sapiens; RNA-Seq	TRUE
156271	SRR1171527	WARN(3,4)	SRX472611	SRS559067	SRP038101	GSM1329863: Treated.2; Homo sapiens; RNA-Seq	TRUE
156272	SRR1171528	WARN(3,4)	SRX472612	SRS559069	SRP038101	GSM1329864: Treated.3; Homo sapiens; RNA-Seq	TRUE

Data quality control

QC is important, even if you are using public transcriptome data. For RNA-seq it is a good idea to show the number of reads for each sample.

par(mar=c(5,7,5,1))
barplot(rev(colSums(mx)),horiz=TRUE,las=1,main="number of reads per sample in SRP038101")

Now make a MDS plot.

mds <- cmdscale(dist(t(mx)))

# expand plot area
XMIN=min(mds[,1])*1.3
XMAX=max(mds[,1])*1.3
YMIN=min(mds[,2])*1.3
YMAX=max(mds[,2])*1.3

cols <- as.character(grepl("Treated",ss$Experiment_title))
cols <- gsub("FALSE","lightblue",cols)
cols <- gsub("TRUE","pink",cols)
plot(mds, xlab="Coordinate 1", ylab="Coordinate 2",
  xlim=c(XMIN,XMAX),ylim=c(YMIN,YMAX),
  pch=19,cex=2,col=cols, main="MDS plot")
text(cmdscale(dist(t(mx))), labels=colnames(mx) )

Differential expression analysis

I will be using DESeq2 for DE analysis, the count matrix is prefiltered using a detection threshold of 10 reads per sample across all samples. All genes that meet the detection threshold will comprise the background list. The first 6 rows of the count matrix are shows.

mxf <- mx[which(rowMeans(mx)>=10),]
dim(mxf)

## [1] 13168     6

head(mxf,6) %>%
  kbl(caption="Count matrix format") %>%
  kable_paper("hover", full_width = F)

Count matrix format
	SRR1171523	SRR1171524	SRR1171525	SRR1171526	SRR1171527	SRR1171528
ENSG00000225630 MTND2P28	494	396	340	333	415	418
ENSG00000237973 MTCO1P12	52	39	40	30	37	29
ENSG00000248527 MTATP6P1	853	544	537	582	702	716
ENSG00000228327 AL669831.1	17	13	21	21	22	12
ENSG00000228794 LINC01128	42	27	30	32	40	23
ENSG00000230699 AL645608.3	20	11	13	10	15	22

dds <- DESeqDataSetFromMatrix(countData = mxf , colData = ss, design = ~ trt )
res <- DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

z <- results(res)
vsd <- vst(dds, blind=FALSE)
zz <-cbind(as.data.frame(z),mxf)
def <-as.data.frame(zz[order(zz$pvalue),])

head(def,10) %>%
  kbl(caption="Top DE genes for Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top DE genes for Aza treatment
	baseMean	log2FoldChange	lfcSE	stat	SRR1171523	SRR1171524	SRR1171525	SRR1171526	SRR1171527	SRR1171528
ENSG00000165949 IFI27	1960.1970	-3.384492	0.0938869	-36.04861	4689	3583	2758	309	384	334
ENSG00000090382 LYZ	7596.0299	-1.650342	0.0561143	-29.41036	14212	10237	10789	3476	3764	3738
ENSG00000115461 IGFBP5	531.2217	-5.071157	0.1795239	-28.24781	1320	823	1025	23	26	43
ENSG00000157601 MX1	827.1511	-2.877795	0.1047823	-27.46450	1732	1501	1206	184	223	186
ENSG00000111331 OAS3	2127.2010	-2.661214	0.0972124	-27.37525	4204	3977	2972	562	614	560
ENSG00000070915 SLC12A3	424.5509	-3.374852	0.1298671	-25.98697	1012	721	653	63	85	76
ENSG00000234745 HLA-B	3197.0159	-1.431566	0.0604169	-23.69479	6085	4256	4023	1590	1872	1719
ENSG00000137965 IFI44	409.0957	-2.978581	0.1319352	-22.57608	829	740	635	76	111	89
ENSG00000204525 HLA-C	1631.6421	-1.461550	0.0660214	-22.13750	3112	2150	2106	791	923	891
ENSG00000110042 DTX4	524.1318	-2.470219	0.1173182	-21.05572	1166	883	688	166	168	145

Make a smear plot.

sigf <- subset(def,padj<=0.05)

DET=nrow(mxf)
NSIG=nrow(sigf)
NUP=nrow(subset(sigf,log2FoldChange>0))
NDN=nrow(subset(sigf,log2FoldChange<0))

HEADER=paste(DET,"detected genes;",NSIG,"w/FDR<0.05;",NUP,"up;",NDN,"down")

plot(log10(def$baseMean) ,def$log2FoldChange,
  cex=0.6,pch=19,col="darkgray",
  main="5-azacitidine treatment in AML3 cells",
  xlab="log10(basemean)",ylab="log2(fold change)")

points(log10(sigf$baseMean) ,sigf$log2FoldChange,
  cex=0.6,pch=19,col="red")

mtext(HEADER)

write.table(def,file="../../data/aza_rnaseq.tsv")

KEGG pathway analysis

In the next sections I will run enrichment analysis with over-representation analysis (ORA) test and compare it to functional class scoring. I will also investigate some strange behviour of the ORA tool clusterprofiler.

Get gene sets

Get the gene sets loaded in R. These are KEGG gene sets

# from https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp 16th June 2023
genesets2 <- read.gmt("../../ref/c2.cp.kegg.v2023.1.Hs.symbols.gmt")
gsets <- gmtPathways("../../ref/c2.cp.kegg.v2023.1.Hs.symbols.gmt")

message(paste("number of genes described in the annotation set:",length(unique(genesets2$gene))))

## number of genes described in the annotation set: 5244

Now filter the gene names into three lists, up-regulated, down-regulated and background. Background is simply all genes that were detected.

defup <- rownames(subset(def,padj<=0.05 & log2FoldChange>0))
defup <- unique(sapply(strsplit(defup," "),"[[",2))

defdn <- rownames(subset(def,padj<=0.05 & log2FoldChange<0))
defdn <- unique(sapply(strsplit(defdn," "),"[[",2))

bg <- rownames(def)
bg <- unique(sapply(strsplit(bg," "),"[[",2))

message(paste("number of genes in background:",length(bg)))

## number of genes in background: 13164

writeLines(text=unique(defup),con="aza_up.txt")
writeLines(text=unique(defdn),con="aza_dn.txt")
writeLines(text=unique(bg),con="aza_bg.txt")

ORA with Clusterprofiler custom background with otherwise default analysis

I’ve compiled a reporting checklist:

Reporting criteria	Method/resource used
Origin of gene sets	KEGG (2023-06-16)
Tool used	ClusterProfiler (check version at foot of report)
Statistical test used	hypergeometric test
P-value correction for multiple comparisons	FDR method
Background list	Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria	DESeq2 FDR<0.05
ORA directionality	Separate tests for up- and down-regulation
Data availability	via DEE2 at accession SRP038101 (human)
Other parameters	Min gene set size of 10

Here I provide a custom background gene list.

Enrichment firstly with upregulated genes

n_pw=10

ora_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 500000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_up$geneID <- NULL
ora_up <- subset(ora_up,p.adjust<0.05 & Count >=5)
ora_ups <- rownames(ora_up)

gr <- as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",2))

ora_up$es <- gr/br
ora_up <- ora_up[order(-ora_up$es),]
ora_up$Description=NULL

head(ora_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top upregulated pathways in Aza treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
KEGG_MISMATCH_REPAIR	10/406	22/3134	0.0001811	0.0044696	0.0042686	10	3.508733
KEGG_DNA_REPLICATION	15/406	35/3134	0.0000104	0.0005222	0.0004987	15	3.308234
KEGG_CELL_CYCLE	44/406	115/3134	0.0000000	0.0000000	0.0000000	44	2.953438
KEGG_NUCLEOTIDE_EXCISION_REPAIR	16/406	42/3134	0.0000314	0.0009481	0.0009054	16	2.940652
KEGG_BASE_EXCISION_REPAIR	12/406	33/3134	0.0005182	0.0086936	0.0083027	12	2.806986
KEGG_RNA_POLYMERASE	9/406	26/3134	0.0038482	0.0484237	0.0462464	9	2.672035

topup2 <- rev(head(ora_up$es,10))
names(topup2) <- rev(head(ora_up$ID,10))

The reported background size does not match the dataset.

Now repeat with the downregulated genes

ora_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 500000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_dn$geneID <- NULL
ora_dn <- subset(ora_dn,p.adjust<0.05 & Count >=5)
ora_dns <- rownames(ora_dn)

gr <- as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",2))

ora_dn$es <- gr/br
ora_dn <- ora_dn[order(-ora_dn$es),]
ora_dn$Description=NULL

head(ora_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top downregulated pathways in Aza treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
KEGG_ALLOGRAFT_REJECTION	8/608	11/3134	0.0001822	0.0030068	0.0023402	8	3.748804
KEGG_GRAFT_VERSUS_HOST_DISEASE	8/608	11/3134	0.0001822	0.0030068	0.0023402	8	3.748804
KEGG_AUTOIMMUNE_THYROID_DISEASE	7/608	10/3134	0.0006959	0.0073552	0.0057246	7	3.608224
KEGG_TYPE_I_DIABETES_MELLITUS	9/608	14/3134	0.0002893	0.0039782	0.0030963	9	3.313675
KEGG_OTHER_GLYCAN_DEGRADATION	7/608	13/3134	0.0057712	0.0380901	0.0296459	7	2.775557
KEGG_CELL_ADHESION_MOLECULES_CAMS	30/608	59/3134	0.0000000	0.0000037	0.0000028	30	2.620986
KEGG_HEMATOPOIETIC_CELL_LINEAGE	19/608	39/3134	0.0000311	0.0010268	0.0007992	19	2.511218
KEGG_COMPLEMENT_AND_COAGULATION_CASCADES	10/608	21/3134	0.0031211	0.0245228	0.0190863	10	2.454574
KEGG_LEISHMANIA_INFECTION	23/608	49/3134	0.0000100	0.0005478	0.0004264	23	2.419509
KEGG_SYSTEMIC_LUPUS_ERYTHEMATOSUS	9/608	20/3134	0.0079453	0.0485544	0.0377903	9	2.319572

topdn2 <- head(ora_dn$es,n_pw)
names(topdn2) <- head(ora_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE KEGGs",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA test")
}

ORA with Clusterprofiler with custom background and special gene set list to fix the potential bug

I’ve compiled a reporting checklist:

Reporting criteria	Method/resource used
Origin of gene sets	KEGG (2023-06-16)
Tool used	ClusterProfiler (check version at foot of report)
Statistical test used	hypergeometric test
P-value correction for multiple comparisons	FDR method
Background list	Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria	DESeq2 FDR<0.05
ORA directionality	Separate tests for up- and down-regulation
Data availability	via DEE2 at accession SRP038101 (human)
Other parameters	Min gene set size of 10 + BUG FIX

Here I provide a background gene list for cluterprofiler to use, like a user should.

I have also done some modification to the KEGG gene sets, to en

Now filter the gene names into three lists, up-regulated, down-regulated and background. Background is simply all genes that were detected.

defup <- rownames(subset(def,padj<=0.05 & log2FoldChange>0))
defup <- unique(sapply(strsplit(defup," "),"[[",2))

defdn <- rownames(subset(def,padj<=0.05 & log2FoldChange<0))
defdn <- unique(sapply(strsplit(defdn," "),"[[",2))

bg <- rownames(def)
bg <- unique(sapply(strsplit(bg," "),"[[",2))
message(paste("number of genes in background:",length(bg)))

## number of genes in background: 13164

Adding all detected genes to the background appears to improve results!

bgdf <- data.frame("background",bg)
colnames(bgdf) <- c("term","gene")
genesets2 <- rbind(genesets2,bgdf)

Enrichment firstly with upregulated genes

n_pw=10

orafix_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 500000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_up$geneID <- NULL
orafix_up <- subset(orafix_up,p.adjust<0.05 & Count >=5)
orafix_ups <- rownames(orafix_up)

gr <- as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",2))

orafix_up$es <- gr/br
orafix_up <- orafix_up[order(-orafix_up$es),]
orafix_up$Description=NULL

head(orafix_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top upregulated pathways in Aza treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
KEGG_MISMATCH_REPAIR	10/1672	22/13164	0.0001611	0.0039535	0.0037783	10	3.578730
KEGG_DNA_REPLICATION	15/1672	35/13164	0.0000091	0.0004610	0.0004406	15	3.374231
KEGG_CELL_CYCLE	44/1672	115/13164	0.0000000	0.0000000	0.0000000	44	3.012357
KEGG_NUCLEOTIDE_EXCISION_REPAIR	16/1672	42/13164	0.0000275	0.0008364	0.0007993	16	2.999316
KEGG_BASE_EXCISION_REPAIR	12/1672	33/13164	0.0004583	0.0077393	0.0073963	12	2.862984
KEGG_RNA_POLYMERASE	9/1672	26/13164	0.0034659	0.0439015	0.0419557	9	2.725340

topup2 <- rev(head(orafix_up$es,10))
names(topup2) <- rev(head(orafix_up$ID,10))

Now with the downregulated genes

orafix_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 500000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_dn$geneID <- NULL
orafix_dn <- subset(orafix_dn,p.adjust<0.05 & Count >=5)
orafix_dns <- rownames(orafix_dn)

gr <- as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",2))

orafix_dn$es <- gr/br
orafix_dn <- orafix_dn[order(-orafix_dn$es),]
orafix_dn$Description=NULL

head(orafix_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top downregulated pathways in Aza treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
KEGG_ALLOGRAFT_REJECTION	8/1926	11/13164	0.0000226	0.0002255	0.0001444	8	4.970830
KEGG_GRAFT_VERSUS_HOST_DISEASE	8/1926	11/13164	0.0000226	0.0002255	0.0001444	8	4.970830
KEGG_AUTOIMMUNE_THYROID_DISEASE	7/1926	10/13164	0.0001134	0.0008184	0.0005241	7	4.784424
KEGG_TYPE_I_DIABETES_MELLITUS	9/1926	14/13164	0.0000300	0.0002493	0.0001597	9	4.393859
KEGG_OTHER_GLYCAN_DEGRADATION	7/1926	13/13164	0.0010809	0.0058674	0.0037578	7	3.680326
KEGG_CELL_ADHESION_MOLECULES_CAMS	30/1926	59/13164	0.0000000	0.0000000	0.0000000	30	3.475368
KEGG_INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION	7/1926	14/13164	0.0018899	0.0091194	0.0058406	7	3.417445
KEGG_HEMATOPOIETIC_CELL_LINEAGE	19/1926	39/13164	0.0000005	0.0000109	0.0000070	19	3.329819
KEGG_COMPLEMENT_AND_COAGULATION_CASCADES	10/1926	21/13164	0.0003293	0.0021863	0.0014002	10	3.254710
KEGG_LEISHMANIA_INFECTION	23/1926	49/13164	0.0000001	0.0000034	0.0000022	23	3.208214

topdn2 <- head(orafix_dn$es,n_pw)
names(topdn2) <- head(orafix_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE KEGGs",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA fix")
}

Compare pathway results

v1 <- list("ORA up"=ora_ups, "ORA fix up"=orafix_ups)
v2 <- list("ORA dn"=ora_dns, "ORA fix dn"=orafix_dns)
v3 <- list("ORA"=union(ora_ups,ora_dns),
  "ORA fix"=union(orafix_ups,orafix_dns))

par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2),main="upregulated KEGG pathways")

plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2),main="downregulated KEGG pathways")

plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated KEGG pathways")

svg("rna_kegg_euler.svg")
plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated KEGG pathways")
dev.off()

## png 
##   2

Jaccard index comparing ORA and ORA fix.

jaccard <- function(a, b) {
    intersection = length(intersect(a, b))
    union = length(a) + length(b) - intersection
    return (intersection/union)
}

ora <- c(ora_ups,ora_dns)
orafix <- c(orafix_ups,orafix_dns)

jaccard(ora,orafix)

## [1] 0.5909091

List gene sets identified only in the “fixed” analysis.

ora <- union(ora_ups,ora_dns)

orafix <- union(orafix_ups,orafix_dns)

setdiff(orafix,ora)

##  [1] "KEGG_CALCIUM_SIGNALING_PATHWAY"                   
##  [2] "KEGG_GAP_JUNCTION"                                
##  [3] "KEGG_GNRH_SIGNALING_PATHWAY"                      
##  [4] "KEGG_GLYCOLYSIS_GLUCONEOGENESIS"                  
##  [5] "KEGG_FC_GAMMA_R_MEDIATED_PHAGOCYTOSIS"            
##  [6] "KEGG_ACUTE_MYELOID_LEUKEMIA"                      
##  [7] "KEGG_INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION"
##  [8] "KEGG_TOLL_LIKE_RECEPTOR_SIGNALING_PATHWAY"        
##  [9] "KEGG_B_CELL_RECEPTOR_SIGNALING_PATHWAY"           
## [10] "KEGG_ENDOCYTOSIS"                                 
## [11] "KEGG_APOPTOSIS"                                   
## [12] "KEGG_GLUTATHIONE_METABOLISM"                      
## [13] "KEGG_TGF_BETA_SIGNALING_PATHWAY"                  
## [14] "KEGG_GLIOMA"                                      
## [15] "KEGG_NOTCH_SIGNALING_PATHWAY"                     
## [16] "KEGG_MELANOGENESIS"                               
## [17] "KEGG_GLYCOSAMINOGLYCAN_DEGRADATION"               
## [18] "KEGG_RIG_I_LIKE_RECEPTOR_SIGNALING_PATHWAY"       
## [19] "KEGG_AXON_GUIDANCE"                               
## [20] "KEGG_TYPE_II_DIABETES_MELLITUS"                   
## [21] "KEGG_OLFACTORY_TRANSDUCTION"                      
## [22] "KEGG_MELANOMA"                                    
## [23] "KEGG_CHRONIC_MYELOID_LEUKEMIA"                    
## [24] "KEGG_ARGININE_AND_PROLINE_METABOLISM"             
## [25] "KEGG_STEROID_BIOSYNTHESIS"                        
## [26] "KEGG_P53_SIGNALING_PATHWAY"                       
## [27] "KEGG_SPHINGOLIPID_METABOLISM"

Reactome Gene sets

ORA with Clusterprofiler custom background with otherwise default analysis

I’ve compiled a reporting checklist:

Reporting criteria	Method/resource used
Origin of gene sets	Reactome (2023-03-06)
Tool used	ClusterProfiler (check version at foot of report)
Statistical test used	hypergeometric test
P-value correction for multiple comparisons	FDR method
Background list	Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria	DESeq2 FDR<0.05
ORA directionality	Separate tests for up- and down-regulation
Data availability	via DEE2 at accession SRP038101 (human)
Other parameters	Min gene set size of 10

Here I provide a custom background gene list.

Get the gene sets loaded in R. These are Reactome gene sets

genesets2 <- read.gmt("../../ref/ReactomePathways_2023-03-06.gmt")
gsets <- gmtPathways("../../ref/ReactomePathways_2023-03-06.gmt")

message(paste("number of genes described in the annotation set:",length(unique(genesets2$gene))))

## number of genes described in the annotation set: 11457

Enrichment firstly with upregulated genes

n_pw=10

ora_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 500000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_up$geneID <- NULL
ora_up <- subset(ora_up,p.adjust<0.05 & Count >=5)
ora_ups <- rownames(ora_up)

gr <- as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",2))

ora_up$es <- gr/br
ora_up <- ora_up[order(-ora_up$es),]
ora_up$Description=NULL

head(ora_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top upregulated pathways in Aza treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
Condensation of Prometaphase Chromosomes	9/1050	11/6944	0.0000017	0.0000309	0.0000259	9	5.410909
Folding of actin by CCT/TriC	7/1050	10/6944	0.0001398	0.0016736	0.0014045	7	4.629333
Interactions of Rev with host cellular proteins	24/1050	36/6944	0.0000000	0.0000000	0.0000000	24	4.408889
Nuclear import of Rev protein	22/1050	33/6944	0.0000000	0.0000000	0.0000000	22	4.408889
Postmitotic nuclear pore complex (NPC) reformation	18/1050	27/6944	0.0000000	0.0000001	0.0000001	18	4.408889
Rev-mediated nuclear export of HIV RNA	22/1050	34/6944	0.0000000	0.0000000	0.0000000	22	4.279216

topup2 <- rev(head(ora_up$es,10))
names(topup2) <- rev(head(ora_up$ID,10))

Now repeat with the downregulated genes

ora_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 500000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_dn$geneID <- NULL
ora_dn <- subset(ora_dn,p.adjust<0.05 & Count >=5)
ora_dns <- rownames(ora_dn)

gr <- as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",2))

ora_dn$es <- gr/br
ora_dn <- ora_dn[order(-ora_dn$es),]
ora_dn$Description=NULL

head(ora_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top downregulated pathways in Aza treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
Interleukin-20 family signaling	10/1178	12/6944	0.0000009	0.0000594	0.0000530	10	4.912281
Endosomal/Vacuolar pathway	9/1178	12/6944	0.0000152	0.0007214	0.0006439	9	4.421053
Interferon alpha/beta signaling	41/1178	55/6944	0.0000000	0.0000000	0.0000000	41	4.394258
Interferon gamma signaling	40/1178	63/6944	0.0000000	0.0000000	0.0000000	40	3.742690
Interleukin-10 signaling	17/1178	27/6944	0.0000001	0.0000104	0.0000093	17	3.711501
Transcriptional regulation of pluripotent stem cells	7/1178	12/6944	0.0014257	0.0337226	0.0301007	7	3.438597
IRAK4 deficiency (TLR2/4)	8/1178	14/6944	0.0007663	0.0229901	0.0205209	8	3.368421
Binding and Uptake of Ligands by Scavenger Receptors	9/1178	16/6944	0.0004129	0.0141085	0.0125932	9	3.315790
Other interleukin signaling	10/1178	18/6944	0.0002230	0.0080676	0.0072011	10	3.274854
Formation of the cornified envelope	8/1178	15/6944	0.0013992	0.0337226	0.0301007	8	3.143860

topdn2 <- head(ora_dn$es,n_pw)
names(topdn2) <- head(ora_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE Reactomes",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA test")
}

ORA with Clusterprofiler with custom background and special gene set list to fix the potential bug

I’ve compiled a reporting checklist:

Reporting criteria	Method/resource used
Origin of gene sets	Reactome (2023-03-06)
Tool used	ClusterProfiler (check version at foot of report)
Statistical test used	hypergeometric test
P-value correction for multiple comparisons	FDR method
Background list	Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria	DESeq2 FDR<0.05
ORA directionality	Separate tests for up- and down-regulation
Data availability	via DEE2 at accession SRP038101 (human)
Other parameters	Min gene set size of 10 + BUG FIX

Adding all detected genes to the background appears to improve results!

bgdf <- data.frame("background",bg)
colnames(bgdf) <- c("term","gene")
genesets2 <- rbind(genesets2,bgdf)

Enrichment firstly with upregulated genes

orafix_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 500000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_up$geneID <- NULL
orafix_up <- subset(orafix_up,p.adjust<0.05 & Count >=5)
orafix_ups <- rownames(orafix_up)

gr <- as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",2))

orafix_up$es <- gr/br
orafix_up <- orafix_up[order(-orafix_up$es),]
orafix_up$Description=NULL

head(orafix_up) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top upregulated pathways in Aza treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
Condensation of Prometaphase Chromosomes	9/1672	11/13164	4.00e-07	0.0000055	0.0000043	9	6.441714
Folding of actin by CCT/TriC	7/1672	10/13164	4.45e-05	0.0004779	0.0003726	7	5.511244
Interactions of Rev with host cellular proteins	24/1672	36/13164	0.00e+00	0.0000000	0.0000000	24	5.248804
Nuclear import of Rev protein	22/1672	33/13164	0.00e+00	0.0000000	0.0000000	22	5.248804
Postmitotic nuclear pore complex (NPC) reformation	18/1672	27/13164	0.00e+00	0.0000000	0.0000000	18	5.248804
Rev-mediated nuclear export of HIV RNA	22/1672	34/13164	0.00e+00	0.0000000	0.0000000	22	5.094427

topup2 <- rev(head(orafix_up$es,10))
names(topup2) <- rev(head(orafix_up$ID,10))

Now with the downregulated genes

n_pw=10

orafix_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 500000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

orafix_dn$geneID <- NULL
orafix_dn <- subset(orafix_dn,p.adjust<0.05 & Count >=5)
orafix_dns <- rownames(orafix_dn)

gr <- as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",2))

orafix_dn$es <- gr/br
orafix_dn <- orafix_dn[order(-orafix_dn$es),]
orafix_dn$Description=NULL

head(orafix_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top downregulated pathways in Aza treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
Interleukin-20 family signaling	10/1926	12/13164	0.0000002	0.0000108	0.0000089	10	5.695742
Endosomal/Vacuolar pathway	9/1926	12/13164	0.0000044	0.0001919	0.0001589	9	5.126168
Interferon alpha/beta signaling	41/1926	55/13164	0.0000000	0.0000000	0.0000000	41	5.095101
Interferon gamma signaling	40/1926	63/13164	0.0000000	0.0000000	0.0000000	40	4.339613
Interleukin-10 signaling	17/1926	27/13164	0.0000000	0.0000009	0.0000007	17	4.303450
TNFs bind their physiological receptors	6/1926	10/13164	0.0012021	0.0196365	0.0162539	6	4.100935
Transcriptional regulation of pluripotent stem cells	7/1926	12/13164	0.0005709	0.0111549	0.0092334	7	3.987020
IRAK4 deficiency (TLR2/4)	8/1926	14/13164	0.0002719	0.0066947	0.0055415	8	3.905652
Binding and Uptake of Ligands by Scavenger Receptors	9/1926	16/13164	0.0001299	0.0041000	0.0033937	9	3.844626
Other interleukin signaling	10/1926	18/13164	0.0000622	0.0021881	0.0018112	10	3.797162

topdn2 <- head(orafix_dn$es,n_pw)
names(topdn2) <- head(orafix_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

top <- c(topdn2,topup2)

if ( length(top) > 1 ) {
  barplot(c(top),
    horiz=TRUE,las=1,cex.names=0.65,col=cols,
    main="top DE Reactomes",
    xlab="ES",
    cex.main=0.9)
  mtext("ORA fix")
}

ORA with FORA with custom background

Upregulated genes.

fres_up <- fora(pathways = gsets, genes = defup, universe = bg,
  minSize = 5, maxSize = 500000)

fora_up <- subset(fres_up,padj<0.05)$pathway

Downregulated genes.

fres_dn <- fora(pathways = gsets, genes = defdn, universe = bg,
  minSize = 5, maxSize = 500000)

fora_dn <- subset(fres_dn,padj<0.05)$pathway

Compare pathway results

Compare clusterprofiler approaches.

v1 <- list("ORA up"=ora_ups, "ORA fix up"=orafix_ups)
v2 <- list("ORA dn"=ora_dns, "ORA fix dn"=orafix_dns)
v3 <- list("ORA"=union(ora_ups,ora_dns),
  "ORA fix"=union(orafix_ups,orafix_dns))

par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2),main="upregulated Reactomes")

plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2),main="downregulated Reactomes")

plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated Reactomes")

svg("rna_reactome_euler.svg")
plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated Reactomes")
dev.off()

## png 
##   2

Jaccard index comparing ORA and ORA fix.

jaccard <- function(a, b) {
    intersection = length(intersect(a, b))
    union = length(a) + length(b) - intersection
    return (intersection/union)
}

ora <- c(ora_ups,ora_dns)
orafix <- c(orafix_ups,orafix_dns)

jaccard(ora,orafix)

## [1] 0.6104972

List gene sets identified only in the “fixed” analysis.

ora <- union(ora_ups,ora_dns)

orafix <- union(orafix_ups,orafix_dns)

setdiff(orafix,ora)

##   [1] "Signaling by Rho GTPases, Miro GTPases and RHOBTB3"                                   
##   [2] "RNA Polymerase II Transcription"                                                      
##   [3] "Post-translational protein modification"                                              
##   [4] "Signaling by Rho GTPases"                                                             
##   [5] "Metabolism of proteins"                                                               
##   [6] "rRNA processing"                                                                      
##   [7] "rRNA processing in the nucleus and cytosol"                                           
##   [8] "Signaling by Nuclear Receptors"                                                       
##   [9] "SARS-CoV-2-host interactions"                                                         
##  [10] "APC/C-mediated degradation of cell cycle proteins"                                    
##  [11] "Regulation of mitotic cell cycle"                                                     
##  [12] "Disease"                                                                              
##  [13] "SARS-CoV Infections"                                                                  
##  [14] "Major pathway of rRNA processing in the nucleolus and cytosol"                        
##  [15] "Anchoring of the basal body to the plasma membrane"                                   
##  [16] "Generic Transcription Pathway"                                                        
##  [17] "DNA Replication Pre-Initiation"                                                       
##  [18] "Regulation of APC/C activators between G1/S and early anaphase"                       
##  [19] "Translesion synthesis by Y family DNA polymerases bypasses lesions on DNA template"   
##  [20] "Chaperonin-mediated protein folding"                                                  
##  [21] "Positive epigenetic regulation of rRNA expression"                                    
##  [22] "Disorders of transmembrane transporters"                                              
##  [23] "Cellular responses to stress"                                                         
##  [24] "Protein folding"                                                                      
##  [25] "ESR-mediated signaling"                                                               
##  [26] "Cooperation of Prefoldin and TriC/CCT  in actin and tubulin folding"                  
##  [27] "RNA Polymerase I Transcription Termination"                                           
##  [28] "Chromatin modifying enzymes"                                                          
##  [29] "Chromatin organization"                                                               
##  [30] "SARS-CoV-2 Infection"                                                                 
##  [31] "Cellular responses to stimuli"                                                        
##  [32] "RHOBTB GTPase Cycle"                                                                  
##  [33] "Recruitment of NuMA to mitotic centrosomes"                                           
##  [34] "Centrosome maturation"                                                                
##  [35] "Recruitment of mitotic centrosome proteins and complexes"                             
##  [36] "RNA Polymerase III Transcription Initiation From Type 2 Promoter"                     
##  [37] "Loss of Nlp from mitotic centrosomes"                                                 
##  [38] "Loss of proteins required for interphase microtubule organization from the centrosome"
##  [39] "Regulation of mRNA stability by proteins that bind AU-rich elements"                  
##  [40] "Influenza Infection"                                                                  
##  [41] "E2F mediated regulation of DNA replication"                                           
##  [42] "Regulation of PLK1 Activity at G2/M Transition"                                       
##  [43] "Cooperation of PDCL (PhLP1) and TRiC/CCT in G-protein beta folding"                   
##  [44] "RNA Polymerase III Transcription Initiation From Type 1 Promoter"                     
##  [45] "Assembly of the ORC complex at the origin of replication"                             
##  [46] "SLBP independent Processing of Histone Pre-mRNAs"                                     
##  [47] "Nucleotide biosynthesis"                                                              
##  [48] "Processing of SMDT1"                                                                  
##  [49] "Activation of APC/C and APC/C:Cdc20 mediated degradation of mitotic proteins"         
##  [50] "Meiotic recombination"                                                                
##  [51] "B-WICH complex positively regulates rRNA expression"                                  
##  [52] "RNA Polymerase III Abortive And Retractive Initiation"                                
##  [53] "RNA Polymerase III Transcription"                                                     
##  [54] "RND2 GTPase cycle"                                                                    
##  [55] "Regulation of cholesterol biosynthesis by SREBP (SREBF)"                              
##  [56] "Cyclin E associated events during G1/S transition"                                    
##  [57] "Prefoldin mediated transfer of substrate  to CCT/TriC"                                
##  [58] "mRNA 3'-end processing"                                                               
##  [59] "Aberrant regulation of mitotic G1/S transition in cancer due to RB1 defects"          
##  [60] "Defective binding of RB1 mutants to E2F1,(E2F2, E2F3)"                                
##  [61] "Aberrant regulation of mitotic cell cycle due to RB1 defects"                         
##  [62] "Activation of BH3-only proteins"                                                      
##  [63] "Formation of the beta-catenin:TCF transactivating complex"                            
##  [64] "SLBP Dependent Processing of Replication-Dependent Histone Pre-mRNAs"                 
##  [65] "The role of GTSE1 in G2/M progression after G2 checkpoint"                            
##  [66] "Cyclin A:Cdk2-associated events at S phase entry"                                     
##  [67] "DNA Double Strand Break Response"                                                     
##  [68] "Deubiquitination"                                                                     
##  [69] "Developmental Biology"                                                                
##  [70] "Biological oxidations"                                                                
##  [71] "Leishmania infection"                                                                 
##  [72] "Neuronal System"                                                                      
##  [73] "RAC1 GTPase cycle"                                                                    
##  [74] "Toll-like Receptor Cascades"                                                          
##  [75] "Cell-Cell communication"                                                              
##  [76] "PI5P, PP2A and IER3 Regulate PI3K/AKT Signaling"                                      
##  [77] "Muscle contraction"                                                                   
##  [78] "Transmission across Chemical Synapses"                                                
##  [79] "Antigen processing-Cross presentation"                                                
##  [80] "RHOA GTPase cycle"                                                                    
##  [81] "Semaphorin interactions"                                                              
##  [82] "Interleukin-3, Interleukin-5 and GM-CSF signaling"                                    
##  [83] "Collagen biosynthesis and modifying enzymes"                                          
##  [84] "Negative regulation of the PI3K/AKT network"                                          
##  [85] "Attachment and Entry_9694614"                                                         
##  [86] "MyD88 deficiency (TLR2/4)"                                                            
##  [87] "Toll Like Receptor 4 (TLR4) Cascade"                                                  
##  [88] "TNFs bind their physiological receptors"                                              
##  [89] "Constitutive Signaling by Aberrant PI3K in Cancer"                                    
##  [90] "Phase I - Functionalization of compounds"                                             
##  [91] "Glycosaminoglycan metabolism"                                                         
##  [92] "Signaling by cytosolic FGFR1 fusion mutants"                                          
##  [93] "Signaling by SCF-KIT"                                                                 
##  [94] "Non-integrin membrane-ECM interactions"                                               
##  [95] "Metabolism of fat-soluble vitamins"                                                   
##  [96] "Diseases of signal transduction by growth factor receptors and second messengers"     
##  [97] "Ephrin signaling"                                                                     
##  [98] "Sema3A PAK dependent Axon repulsion"                                                  
##  [99] "ER-Phagosome pathway"                                                                 
## [100] "Signaling by PDGF"                                                                    
## [101] "CDC42 GTPase cycle"                                                                   
## [102] "Glycosphingolipid metabolism"                                                         
## [103] "Opioid Signalling"                                                                    
## [104] "Elastic fibre formation"                                                              
## [105] "Interleukin-15 signaling"                                                             
## [106] "Regulation of TLR by endogenous ligand"                                               
## [107] "Assembly and cell surface presentation of NMDA receptors"                             
## [108] "RAF-independent MAPK1/3 activation"                                                   
## [109] "Collagen formation"                                                                   
## [110] "Heparan sulfate/heparin (HS-GAG) metabolism"                                          
## [111] "Signaling by NOTCH3"                                                                  
## [112] "MyD88:MAL(TIRAP) cascade initiated on plasma membrane"                                
## [113] "Toll Like Receptor 2 (TLR2) Cascade"                                                  
## [114] "Toll Like Receptor TLR1:TLR2 Cascade"                                                 
## [115] "Toll Like Receptor TLR6:TLR2 Cascade"                                                 
## [116] "RHO GTPase cycle"                                                                     
## [117] "Nuclear Events (kinase and transcription factor activation)"                          
## [118] "Chondroitin sulfate/dermatan sulfate metabolism"                                      
## [119] "Activation of Matrix Metalloproteinases"                                              
## [120] "Metabolism of steroid hormones"                                                       
## [121] "The NLRP3 inflammasome"                                                               
## [122] "Neurotransmitter receptors and postsynaptic signal transmission"                      
## [123] "Amyloid fiber formation"                                                              
## [124] "Sphingolipid metabolism"                                                              
## [125] "Activated NOTCH1 Transmits Signal to the Nucleus"                                     
## [126] "Sensory processing of sound by outer hair cells of the cochlea"                       
## [127] "Signaling by CSF3 (G-CSF)"                                                            
## [128] "Signaling by FGFR1 in disease"                                                        
## [129] "Inflammasomes"                                                                        
## [130] "Cell recruitment (pro-inflammatory response)"                                         
## [131] "Purinergic signaling in leishmaniasis infection"                                      
## [132] "G alpha (s) signalling events"                                                        
## [133] "Prolactin receptor signaling"                                                         
## [134] "SEMA3A-Plexin repulsion signaling by inhibiting Integrin adhesion"                    
## [135] "Adaptive Immune System"                                                               
## [136] "G alpha (q) signalling events"                                                        
## [137] "Death Receptor Signaling"                                                             
## [138] "Ca-dependent events"                                                                  
## [139] "Antimicrobial peptides"

Compare clusterprofiler to fora.

v1 <- list("CP up"=ora_ups, "CP fix up"=orafix_ups, "FORA up" = fora_up)
v2 <- list("CP dn"=ora_dns, "CP fix dn"=orafix_dns, "FORA dn" = fora_dn)
v3 <- list("CP"=union(ora_ups,ora_dns),
  "CP fix"=union(orafix_ups,orafix_dns),
  "FORA" = union(fora_up,fora_dn) )

par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2),main="upregulated Reactomes")

plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2),main="downregulated Reactomes")

plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated Reactomes")

Jaccard index comparing CP to FORA.

ora <- c(ora_ups,ora_dns)
orafix <- c(orafix_ups,orafix_dns)
fora <- c(fora_up,fora_dn)

jaccard(ora,fora)

## [1] 0.6207865

jaccard(orafix,fora)

## [1] 0.8944591

Session information

For reproducibility

Click HERE to show session info

sessionInfo()

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] gplots_3.1.3                eulerr_7.0.0               
##  [3] clusterProfiler_4.8.3       kableExtra_1.3.4           
##  [5] DESeq2_1.40.2               SummarizedExperiment_1.30.2
##  [7] Biobase_2.60.0              MatrixGenerics_1.12.3      
##  [9] matrixStats_1.0.0           GenomicRanges_1.52.0       
## [11] GenomeInfoDb_1.36.2         IRanges_2.34.1             
## [13] S4Vectors_0.38.1            BiocGenerics_0.46.0        
## [15] getDEE2_1.10.0              fgsea_1.26.0               
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3      rstudioapi_0.15.0       jsonlite_1.8.7         
##   [4] magrittr_2.0.3          farver_2.1.1            rmarkdown_2.24         
##   [7] zlibbioc_1.46.0         vctrs_0.6.3             memoise_2.0.1          
##  [10] RCurl_1.98-1.12         ggtree_3.8.2            webshot_0.5.5          
##  [13] htmltools_0.5.6         S4Arrays_1.0.6          gridGraphics_0.5-1     
##  [16] sass_0.4.7              KernSmooth_2.23-22      bslib_0.5.1            
##  [19] plyr_1.8.8              cachem_1.0.8            igraph_1.5.1           
##  [22] lifecycle_1.0.3         pkgconfig_2.0.3         gson_0.1.0             
##  [25] Matrix_1.6-1            R6_2.5.1                fastmap_1.1.1          
##  [28] GenomeInfoDbData_1.2.10 digest_0.6.33           aplot_0.2.0            
##  [31] enrichplot_1.20.1       colorspace_2.1-0        patchwork_1.1.3        
##  [34] AnnotationDbi_1.62.2    RSQLite_2.3.1           fansi_1.0.4            
##  [37] httr_1.4.7              polyclip_1.10-4         abind_1.4-5            
##  [40] compiler_4.3.1          bit64_4.0.5             withr_2.5.0            
##  [43] downloader_0.4          BiocParallel_1.34.2     viridis_0.6.4          
##  [46] DBI_1.1.3               highr_0.10              ggforce_0.4.1          
##  [49] MASS_7.3-60             DelayedArray_0.26.7     HDO.db_0.99.1          
##  [52] caTools_1.18.2          gtools_3.9.4            tools_4.3.1            
##  [55] scatterpie_0.2.1        ape_5.7-1               glue_1.6.2             
##  [58] nlme_3.1-163            GOSemSim_2.26.1         polylabelr_0.2.0       
##  [61] shadowtext_0.1.2        grid_4.3.1              reshape2_1.4.4         
##  [64] generics_0.1.3          gtable_0.3.4            tidyr_1.3.0            
##  [67] data.table_1.14.8       tidygraph_1.2.3         xml2_1.3.5             
##  [70] utf8_1.2.3              XVector_0.40.0          ggrepel_0.9.3          
##  [73] pillar_1.9.0            stringr_1.5.0           yulab.utils_0.0.9      
##  [76] splines_4.3.1           dplyr_1.1.3             tweenr_2.0.2           
##  [79] treeio_1.24.3           lattice_0.21-8          bit_4.0.5              
##  [82] tidyselect_1.2.0        GO.db_3.17.0            locfit_1.5-9.8         
##  [85] Biostrings_2.68.1       knitr_1.43              gridExtra_2.3          
##  [88] svglite_2.1.1           xfun_0.40               graphlayouts_1.0.0     
##  [91] stringi_1.7.12          lazyeval_0.2.2          ggfun_0.1.2            
##  [94] yaml_2.3.7              evaluate_0.21           codetools_0.2-19       
##  [97] ggraph_2.1.0            tibble_3.2.1            qvalue_2.32.0          
## [100] ggplotify_0.1.2         cli_3.6.1               systemfonts_1.0.4      
## [103] munsell_0.5.0           jquerylib_0.1.4         Rcpp_1.0.11            
## [106] png_0.1-8               parallel_4.3.1          ggplot2_3.4.3          
## [109] blob_1.2.4              DOSE_3.26.1             htm2txt_2.2.2          
## [112] bitops_1.0-7            viridisLite_0.4.2       tidytree_0.4.5         
## [115] scales_1.2.1            purrr_1.0.2             crayon_1.5.2           
## [118] rlang_1.1.1             cowplot_1.1.1           fastmatch_1.1-4        
## [121] KEGGREST_1.40.0         rvest_1.0.3

Examining background gene lists in clusterProfiler - RNA-seq example

Mark Ziemann & Anusuiya Bora

2023-09-08

Introduction

Data quality control

Differential expression analysis

KEGG pathway analysis

Get gene sets

ORA with Clusterprofiler custom background with otherwise default analysis

ORA with Clusterprofiler with custom background and special gene set list to fix the potential bug

Compare pathway results

Reactome Gene sets

ORA with Clusterprofiler custom background with otherwise default analysis

ORA with Clusterprofiler with custom background and special gene set list to fix the potential bug

ORA with FORA with custom background

Compare pathway results

Session information