Source: https://github.com/markziemann/enrichment_recipe
This guide is a Rmarkdown script that conducts differential expression and enrichment analysis, which are very popular workflows for transcriptome data.
The goal of this work is to understand how ClusterProfiler manages the background list, as we have observed some weird behaviour. In particular we see that when we provide a custom background, those genes do not appear in the final analysis unless they are also included in the gene list.
In the code chunk below called libs
, you can add and remove required R library dependancies. Check that the libraries listed here match the Dockerfile, otherwise you might get errors.
suppressPackageStartupMessages({
library("DESeq2")
library("kableExtra")
library("clusterProfiler")
library("fgsea")
library("eulerr")
library("gplots")
library("IlluminaHumanMethylation450kanno.ilmn12.hg19")
})
For this report I will be using proteomics data from Stukalov et al (https://pubmed.ncbi.nlm.nih.gov/33845483/), who profiled the effect of SARS-COV2 infection on A549 cells.
Here we used the 24hr profile, as there were relatively more changes found.
Probable contaminants were removed, as well as viral proteins. This left 5825 proteins, including 165 downregulated and 44 upregulated.
The data were obtained from Supplementary Table 5 and saved to the /data folder.
x <- read.table("../data/27651444_smoking_current_vs_never_smoking_16938071371108067.tsv",header=TRUE,sep="\t")
Get background gene list.
anno = getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
bg <- anno$UCSC_RefGene_Name
bg <- unique(unlist(strsplit(bg,";")))
length(bg)
## [1] 21231
Get up and down-methylated list.
defup <- subset(x,beta>0)$gene
defup <- defup[defup!="-"]
defup <- unique(unlist(strsplit(defup,";")))
length(defup)
## [1] 2137
defdn <- subset(x,beta<0)$gene
defdn <- defdn[defdn!="-"]
defdn <- unique(unlist(strsplit(defdn,";")))
length(defdn)
## [1] 2786
Gene sets. Get the gene sets loaded in R. These are KEGG gene sets
# from https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp 16th June 2023
genesets2 <- read.gmt("../ref/c2.cp.kegg.v2023.1.Hs.symbols.gmt")
gsets <- gmtPathways("../ref/c2.cp.kegg.v2023.1.Hs.symbols.gmt")
message(paste("number of genes described in the annotation set:",length(unique(genesets2$gene))))
## number of genes described in the annotation set: 5244
In the next sections I will run enrichment analysis with over-representation analysis (ORA) test and compare it to functional class scoring. I will also investigate some strange behviour of the ORA tool clusterprofiler.
I’ve compiled a reporting checklist:
Reporting criteria | Method/resource used |
---|---|
Origin of gene sets | KEGG (2023-06-16) |
Tool used | ClusterProfiler (check version at foot of report) |
Statistical test used | hypergeometric test |
P-value correction for multiple comparisons | FDR method |
Background list | Genes with >=10 reads per sample on average across all samples |
Gene Selection Criteria | DESeq2 FDR<0.05 |
ORA directionality | Separate tests for up- and down-regulation |
Data availability | Stukalov et al 2021: PMID 33845483 |
Other parameters | Min gene set size of 10 |
Here I provide a custom background gene list.
Enrichment firstly with upregulated genes
n_pw=10
ora_up <- as.data.frame(enricher(gene = defup ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
ora_up$geneID <- NULL
ora_up <- subset(ora_up,p.adjust<0.05 & Count >=5)
ora_ups <- rownames(ora_up)
gr <- as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",2))
ora_up$es <- gr/br
ora_up <- ora_up[order(-ora_up$es),]
ora_up$Description=NULL
head(ora_up) %>%
kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
KEGG_HEMATOPOIETIC_CELL_LINEAGE | 22/556 | 82/4904 | 0.0000742 | 0.0129138 | 0.0117185 | 22 | 2.366380 |
KEGG_FOCAL_ADHESION | 38/556 | 195/4904 | 0.0004755 | 0.0413714 | 0.0375421 | 38 | 1.718797 |
topup2 <- rev(head(ora_up$es,10))
names(topup2) <- rev(head(ora_up$ID,10))
The reported background size does not match the dataset.
Now repeat with the downregulated genes
ora_dn <- as.data.frame(enricher(gene = defdn ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
ora_dn$geneID <- NULL
ora_dn <- subset(ora_dn,p.adjust<0.05 & Count >=5)
ora_dns <- rownames(ora_dn)
gr <- as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",2))
ora_dn$es <- gr/br
ora_dn <- ora_dn[order(-ora_dn$es),]
ora_dn$Description=NULL
head(ora_dn,n_pw) %>%
kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
KEGG_RIBOSOME | 31/751 | 87/4904 | 0.0000020 | 0.0003566 | 0.0002965 | 31 | 2.326767 |
KEGG_PATHOGENIC_ESCHERICHIA_COLI_INFECTION | 18/751 | 53/4904 | 0.0005646 | 0.0330185 | 0.0274496 | 18 | 2.217722 |
KEGG_CHRONIC_MYELOID_LEUKEMIA | 22/751 | 73/4904 | 0.0009380 | 0.0330185 | 0.0274496 | 22 | 1.967933 |
KEGG_REGULATION_OF_ACTIN_CYTOSKELETON | 49/751 | 208/4904 | 0.0009285 | 0.0330185 | 0.0274496 | 49 | 1.538308 |
KEGG_PATHWAYS_IN_CANCER | 71/751 | 320/4904 | 0.0004751 | 0.0330185 | 0.0274496 | 71 | 1.448835 |
KEGG_MAPK_SIGNALING_PATHWAY | 58/751 | 262/4904 | 0.0016565 | 0.0485894 | 0.0403943 | 58 | 1.445564 |
topdn2 <- head(ora_dn$es,n_pw)
names(topdn2) <- head(ora_dn$ID,n_pw)
Make a barplot
par(mar=c(5,20,5,1))
cols <- c(rep("blue",n_pw),rep("red",n_pw))
top <- c(topdn2,topup2)
if ( length(top) > 1 ) {
barplot(c(top),
horiz=TRUE,las=1,cex.names=0.65,col=cols,
main="top DE KEGGs",
xlab="ES",
cex.main=0.9)
mtext("ORA test")
}
I’ve compiled a reporting checklist:
Reporting criteria | Method/resource used |
---|---|
Origin of gene sets | KEGG (2023-06-16) |
Tool used | ClusterProfiler (check version at foot of report) |
Statistical test used | hypergeometric test |
P-value correction for multiple comparisons | FDR method |
Background list | Genes with >=10 reads per sample on average across all samples |
Gene Selection Criteria | DESeq2 FDR<0.05 |
ORA directionality | Separate tests for up- and down-regulation |
Data availability | Data availability |
Other parameters | Min gene set size of 10 + BUG FIX |
Here I provide a background gene list for cluterprofiler to use, like a user should.
I have also done some modification to the KEGG gene sets, to en
Now filter the gene names into three lists, up-regulated, down-regulated and background. Background is simply all genes that were detected.
Adding all detected genes to the background appears to improve results!
bgdf <- data.frame("background",bg)
colnames(bgdf) <- c("term","gene")
genesets2 <- rbind(genesets2,bgdf)
Enrichment firstly with upregulated genes
n_pw=10
orafix_up <- as.data.frame(enricher(gene = defup ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
orafix_up$geneID <- NULL
orafix_up <- subset(orafix_up,p.adjust<0.05 & Count >=5)
orafix_ups <- rownames(orafix_up)
gr <- as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",2))
orafix_up$es <- gr/br
orafix_up <- orafix_up[order(-orafix_up$es),]
orafix_up$Description=NULL
head(orafix_up) %>%
kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION | 12/2137 | 44/21231 | 0.0010285 | 0.0255655 | 0.0219619 | 12 | 2.709533 |
KEGG_HEMATOPOIETIC_CELL_LINEAGE | 22/2137 | 82/21231 | 0.0000135 | 0.0023442 | 0.0020138 | 22 | 2.665476 |
KEGG_AXON_GUIDANCE | 26/2137 | 127/21231 | 0.0003362 | 0.0146231 | 0.0125619 | 26 | 2.033928 |
KEGG_FOCAL_ADHESION | 38/2137 | 195/21231 | 0.0000517 | 0.0045008 | 0.0038664 | 38 | 1.936043 |
KEGG_REGULATION_OF_ACTIN_CYTOSKELETON | 39/2137 | 208/21231 | 0.0000996 | 0.0057756 | 0.0049615 | 39 | 1.862804 |
KEGG_CHEMOKINE_SIGNALING_PATHWAY | 32/2137 | 185/21231 | 0.0016460 | 0.0358010 | 0.0307546 | 32 | 1.718479 |
topup2 <- rev(head(orafix_up$es,10))
names(topup2) <- rev(head(orafix_up$ID,10))
Now with the downregulated genes
orafix_dn <- as.data.frame(enricher(gene = defdn ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
orafix_dn$geneID <- NULL
orafix_dn <- subset(orafix_dn,p.adjust<0.05 & Count >=5)
orafix_dns <- rownames(orafix_dn)
gr <- as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",2))
orafix_dn$es <- gr/br
orafix_dn <- orafix_dn[order(-orafix_dn$es),]
orafix_dn$Description=NULL
head(orafix_dn,n_pw) %>%
kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
KEGG_RIBOSOME | 31/2786 | 87/21231 | 0.0000001 | 0.0000137 | 0.0000099 | 31 | 2.715387 |
KEGG_TYPE_I_DIABETES_MELLITUS | 14/2786 | 41/21231 | 0.0004743 | 0.0092754 | 0.0067125 | 14 | 2.602157 |
KEGG_PATHOGENIC_ESCHERICHIA_COLI_INFECTION | 18/2786 | 53/21231 | 0.0000838 | 0.0029482 | 0.0021336 | 18 | 2.588129 |
KEGG_GRAFT_VERSUS_HOST_DISEASE | 12/2786 | 37/21231 | 0.0019706 | 0.0165151 | 0.0119517 | 12 | 2.471547 |
KEGG_ALLOGRAFT_REJECTION | 11/2786 | 35/21231 | 0.0039553 | 0.0248619 | 0.0179922 | 11 | 2.395047 |
KEGG_CHRONIC_MYELOID_LEUKEMIA | 22/2786 | 73/21231 | 0.0001115 | 0.0032705 | 0.0023668 | 22 | 2.296620 |
KEGG_PANCREATIC_CANCER | 20/2786 | 69/21231 | 0.0003989 | 0.0087753 | 0.0063506 | 20 | 2.208870 |
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY | 19/2786 | 66/21231 | 0.0006059 | 0.0104657 | 0.0075738 | 19 | 2.193810 |
KEGG_TYPE_II_DIABETES_MELLITUS | 13/2786 | 46/21231 | 0.0050390 | 0.0305816 | 0.0221314 | 13 | 2.153649 |
KEGG_ADHERENS_JUNCTION | 19/2786 | 68/21231 | 0.0009063 | 0.0106334 | 0.0076952 | 19 | 2.129286 |
topdn2 <- head(orafix_dn$es,n_pw)
names(topdn2) <- head(orafix_dn$ID,n_pw)
Make a barplot
par(mar=c(5,20,5,1))
cols <- c(rep("blue",n_pw),rep("red",n_pw))
top <- c(topdn2,topup2)
if ( length(top) > 1 ) {
barplot(c(top),
horiz=TRUE,las=1,cex.names=0.65,col=cols,
main="top DE KEGGs",
xlab="ES",
cex.main=0.9)
mtext("ORA fix")
}
v1 <- list("ORA up"=ora_ups, "ORA fix up"=orafix_ups)
v2 <- list("ORA dn"=ora_dns, "ORA fix dn"=orafix_dns)
v3 <- list("ORA"=union(ora_ups,ora_dns),
"ORA fix"=union(orafix_ups,orafix_dns))
par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2),main="upregulated KEGG pathways")
plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2),main="downregulated KEGG pathways")
plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated KEGG pathways")
svg("prot_kegg_euler.svg")
plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated KEGG pathways")
dev.off()
## png
## 2
Jaccard index comparing ORA and ORA fix.
jaccard <- function(a, b) {
intersection = length(intersect(a, b))
union = length(a) + length(b) - intersection
return (intersection/union)
}
ora <- c(ora_ups,ora_dns)
orafix <- c(orafix_ups,orafix_dns)
jaccard(ora,orafix)
## [1] 0.195122
List gene sets identified only in the “fixed” analysis.
ora <- union(ora_ups,ora_dns)
orafix <- union(orafix_ups,orafix_dns)
setdiff(orafix,ora)
## [1] "KEGG_AXON_GUIDANCE"
## [2] "KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION"
## [3] "KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION"
## [4] "KEGG_CHEMOKINE_SIGNALING_PATHWAY"
## [5] "KEGG_ENDOCYTOSIS"
## [6] "KEGG_PANCREATIC_CANCER"
## [7] "KEGG_TYPE_I_DIABETES_MELLITUS"
## [8] "KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY"
## [9] "KEGG_NEUROTROPHIN_SIGNALING_PATHWAY"
## [10] "KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION"
## [11] "KEGG_ADHERENS_JUNCTION"
## [12] "KEGG_VIRAL_MYOCARDITIS"
## [13] "KEGG_INSULIN_SIGNALING_PATHWAY"
## [14] "KEGG_MELANOMA"
## [15] "KEGG_RENAL_CELL_CARCINOMA"
## [16] "KEGG_FC_GAMMA_R_MEDIATED_PHAGOCYTOSIS"
## [17] "KEGG_GRAFT_VERSUS_HOST_DISEASE"
## [18] "KEGG_T_CELL_RECEPTOR_SIGNALING_PATHWAY"
## [19] "KEGG_PYRIMIDINE_METABOLISM"
## [20] "KEGG_JAK_STAT_SIGNALING_PATHWAY"
## [21] "KEGG_ERBB_SIGNALING_PATHWAY"
## [22] "KEGG_NON_SMALL_CELL_LUNG_CANCER"
## [23] "KEGG_PHOSPHATIDYLINOSITOL_SIGNALING_SYSTEM"
## [24] "KEGG_ALLOGRAFT_REJECTION"
## [25] "KEGG_TYPE_II_DIABETES_MELLITUS"
## [26] "KEGG_COLORECTAL_CANCER"
## [27] "KEGG_TGF_BETA_SIGNALING_PATHWAY"
## [28] "KEGG_TIGHT_JUNCTION"
## [29] "KEGG_INOSITOL_PHOSPHATE_METABOLISM"
I’ve compiled a reporting checklist:
Reporting criteria | Method/resource used |
---|---|
Origin of gene sets | Reactome (2023-03-06) |
Tool used | ClusterProfiler (check version at foot of report) |
Statistical test used | hypergeometric test |
P-value correction for multiple comparisons | FDR method |
Background list | Genes with >=10 reads per sample on average across all samples |
Gene Selection Criteria | DESeq2 FDR<0.05 |
ORA directionality | Separate tests for up- and down-regulation |
Data availability | Data availability |
Other parameters | Min gene set size of 10 |
Here I provide a custom background gene list.
Get the gene sets loaded in R. These are Reactome gene sets
genesets2 <- read.gmt("../ref/ReactomePathways_2023-03-06.gmt")
gsets <- gmtPathways("../ref/ReactomePathways_2023-03-06.gmt")
message(paste("number of genes described in the annotation set:",length(unique(genesets2$gene))))
## number of genes described in the annotation set: 11457
Enrichment firstly with upregulated genes
n_pw=10
ora_up <- as.data.frame(enricher(gene = defup ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
ora_up$geneID <- NULL
ora_up <- subset(ora_up,p.adjust<0.05 & Count >=5)
ora_ups <- rownames(ora_up)
gr <- as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",2))
ora_up$es <- gr/br
ora_up <- ora_up[order(-ora_up$es),]
ora_up$Description=NULL
head(ora_up) %>%
kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
NOTCH3 Intracellular Domain Regulates Transcription | 10/1129 | 25/10174 | 0.0001856 | 0.0238237 | 0.0220051 | 10 | 3.604606 |
FCGR3A-mediated IL10 synthesis | 13/1129 | 38/10174 | 0.0001364 | 0.0196945 | 0.0181911 | 13 | 3.082887 |
Vasopressin regulates renal water homeostasis via Aquaporins | 14/1129 | 43/10174 | 0.0001395 | 0.0196945 | 0.0181911 | 14 | 2.933981 |
Signaling by NOTCH3 | 15/1129 | 48/10174 | 0.0001371 | 0.0196945 | 0.0181911 | 15 | 2.816098 |
RHOB GTPase cycle | 17/1129 | 63/10174 | 0.0003666 | 0.0398149 | 0.0367755 | 17 | 2.431678 |
G alpha (12/13) signalling events | 19/1129 | 73/10174 | 0.0002815 | 0.0331221 | 0.0305936 | 19 | 2.345463 |
topup2 <- rev(head(ora_up$es,10))
names(topup2) <- rev(head(ora_up$ID,10))
Now repeat with the downregulated genes
ora_dn <- as.data.frame(enricher(gene = defdn ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
ora_dn$geneID <- NULL
ora_dn <- subset(ora_dn,p.adjust<0.05 & Count >=5)
ora_dns <- rownames(ora_dn)
gr <- as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",2))
ora_dn$es <- gr/br
ora_dn <- ora_dn[order(-ora_dn$es),]
ora_dn$Description=NULL
head(ora_dn,n_pw) %>%
kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
Sema3A PAK dependent Axon repulsion | 9/1484 | 16/10174 | 0.0001263 | 0.0052774 | 0.0047241 | 9 | 3.856385 |
FGFR2 alternative splicing | 11/1484 | 25/10174 | 0.0003788 | 0.0126604 | 0.0113330 | 11 | 3.016550 |
Eukaryotic Translation Elongation | 35/1484 | 94/10174 | 0.0000000 | 0.0000329 | 0.0000295 | 35 | 2.552690 |
Peptide chain elongation | 32/1484 | 90/10174 | 0.0000006 | 0.0001049 | 0.0000939 | 32 | 2.437616 |
Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC) | 34/1484 | 96/10174 | 0.0000003 | 0.0000694 | 0.0000621 | 34 | 2.428094 |
RUNX1 interacts with co-factors whose precise effect on RUNX1 targets is not known | 13/1484 | 37/10174 | 0.0014930 | 0.0423680 | 0.0379260 | 13 | 2.408793 |
SARS-CoV-1 modulates host translation machinery | 13/1484 | 37/10174 | 0.0014930 | 0.0423680 | 0.0379260 | 13 | 2.408793 |
Eukaryotic Translation Termination | 33/1484 | 94/10174 | 0.0000005 | 0.0001049 | 0.0000939 | 33 | 2.406822 |
Viral mRNA Translation | 31/1484 | 90/10174 | 0.0000018 | 0.0001971 | 0.0001764 | 31 | 2.361441 |
Nonsense Mediated Decay (NMD) enhanced by the Exon Junction Complex (EJC) | 39/1484 | 114/10174 | 0.0000001 | 0.0000411 | 0.0000368 | 39 | 2.345404 |
topdn2 <- head(ora_dn$es,n_pw)
names(topdn2) <- head(ora_dn$ID,n_pw)
Make a barplot
par(mar=c(5,20,5,1))
cols <- c(rep("blue",n_pw),rep("red",n_pw))
top <- c(topdn2,topup2)
if ( length(top) > 1 ) {
barplot(c(top),
horiz=TRUE,las=1,cex.names=0.65,col=cols,
main="top DE Reactomes",
xlab="ES",
cex.main=0.9)
mtext("ORA test")
}
I’ve compiled a reporting checklist:
Reporting criteria | Method/resource used |
---|---|
Origin of gene sets | Reactome (2023-03-06) |
Tool used | ClusterProfiler (check version at foot of report) |
Statistical test used | hypergeometric test |
P-value correction for multiple comparisons | FDR method |
Background list | Genes with >=10 reads per sample on average across all samples |
Gene Selection Criteria | DESeq2 FDR<0.05 |
ORA directionality | Separate tests for up- and down-regulation |
Data availability | Data availability |
Other parameters | Min gene set size of 10 + BUG FIX |
Adding all detected genes to the background appears to improve results!
bgdf <- data.frame("background",bg)
colnames(bgdf) <- c("term","gene")
genesets2 <- rbind(genesets2,bgdf)
Enrichment firstly with upregulated genes
orafix_up <- as.data.frame(enricher(gene = defup ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
orafix_up$geneID <- NULL
orafix_up <- subset(orafix_up,p.adjust<0.05 & Count >=5)
orafix_ups <- rownames(orafix_up)
gr <- as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",2))
orafix_up$es <- gr/br
orafix_up <- orafix_up[order(-orafix_up$es),]
orafix_up$Description=NULL
head(orafix_up) %>%
kbl(row.names = FALSE, caption="Top upregulated pathways in LPS treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
FCGR activation | 6/2137 | 12/21231 | 0.0005580 | 0.0256074 | 0.0228126 | 6 | 4.967478 |
NOTCH3 Intracellular Domain Regulates Transcription | 10/2137 | 25/21231 | 0.0000824 | 0.0089505 | 0.0079736 | 10 | 3.973982 |
FCGR3A-mediated IL10 synthesis | 13/2137 | 38/21231 | 0.0000506 | 0.0064933 | 0.0057846 | 13 | 3.398801 |
Antigen activates B Cell Receptor (BCR) leading to generation of second messengers | 10/2137 | 30/21231 | 0.0004734 | 0.0256074 | 0.0228126 | 10 | 3.311652 |
Transcriptional regulation of granulopoiesis | 10/2137 | 30/21231 | 0.0004734 | 0.0256074 | 0.0228126 | 10 | 3.311652 |
Vasopressin regulates renal water homeostasis via Aquaporins | 14/2137 | 43/21231 | 0.0000491 | 0.0064933 | 0.0057846 | 14 | 3.234637 |
topup2 <- rev(head(orafix_up$es,10))
names(topup2) <- rev(head(orafix_up$ID,10))
Now with the downregulated genes
n_pw=10
orafix_dn <- as.data.frame(enricher(gene = defdn ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
orafix_dn$geneID <- NULL
orafix_dn <- subset(orafix_dn,p.adjust<0.05 & Count >=5)
orafix_dns <- rownames(orafix_dn)
gr <- as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",2))
orafix_dn$es <- gr/br
orafix_dn <- orafix_dn[order(-orafix_dn$es),]
orafix_dn$Description=NULL
head(orafix_dn,n_pw) %>%
kbl(row.names = FALSE, caption="Top downregulated pathways in LPS treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
Sema3A PAK dependent Axon repulsion | 9/2786 | 16/21231 | 0.0000545 | 0.0017439 | 0.0014072 | 9 | 4.286589 |
Signaling by Leptin | 6/2786 | 11/21231 | 0.0012988 | 0.0244175 | 0.0197042 | 6 | 4.156693 |
CD28 dependent Vav1 pathway | 6/2786 | 12/21231 | 0.0023099 | 0.0376382 | 0.0303729 | 6 | 3.810302 |
FGFR2 alternative splicing | 11/2786 | 25/21231 | 0.0001474 | 0.0043483 | 0.0035089 | 11 | 3.353065 |
Ephrin signaling | 8/2786 | 19/21231 | 0.0017106 | 0.0302683 | 0.0244257 | 8 | 3.208675 |
Growth hormone receptor signaling | 9/2786 | 24/21231 | 0.0023232 | 0.0376382 | 0.0303729 | 9 | 2.857726 |
TNFR1-induced proapoptotic signaling | 9/2786 | 24/21231 | 0.0023232 | 0.0376382 | 0.0303729 | 9 | 2.857726 |
Eukaryotic Translation Elongation | 35/2786 | 94/21231 | 0.0000000 | 0.0000011 | 0.0000009 | 35 | 2.837459 |
TNFs bind their physiological receptors | 10/2786 | 27/21231 | 0.0014960 | 0.0267850 | 0.0216147 | 10 | 2.822446 |
Signaling by Erythropoietin | 9/2786 | 25/21231 | 0.0032121 | 0.0460101 | 0.0371288 | 9 | 2.743417 |
topdn2 <- head(orafix_dn$es,n_pw)
names(topdn2) <- head(orafix_dn$ID,n_pw)
Make a barplot
par(mar=c(5,20,5,1))
cols <- c(rep("blue",n_pw),rep("red",n_pw))
top <- c(topdn2,topup2)
if ( length(top) > 1 ) {
barplot(c(top),
horiz=TRUE,las=1,cex.names=0.65,col=cols,
main="top DE Reactomes",
xlab="ES",
cex.main=0.9)
mtext("ORA fix")
}
v1 <- list("ORA up"=ora_ups, "ORA fix up"=orafix_ups)
v2 <- list("ORA dn"=ora_dns, "ORA fix dn"=orafix_dns)
v3 <- list("ORA"=union(ora_ups,ora_dns),
"ORA fix"=union(orafix_ups,orafix_dns))
par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2),main="upregulated Reactomes")
plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2),main="downregulated Reactomes")
plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated Reactomes")
svg("prot_reactome_euler.svg")
plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated Reactomes")
dev.off()
## png
## 2
Jaccard index comparing ORA and ORA fix.
jaccard <- function(a, b) {
intersection = length(intersect(a, b))
union = length(a) + length(b) - intersection
return (intersection/union)
}
ora <- c(ora_ups,ora_dns)
orafix <- c(orafix_ups,orafix_dns)
jaccard(ora,orafix)
## [1] 0.4038462
List gene sets identified only in the “fixed” analysis.
ora <- union(ora_ups,ora_dns)
orafix <- union(orafix_ups,orafix_dns)
setdiff(orafix,ora)
## [1] "Platelet activation, signaling and aggregation"
## [2] "Muscle contraction"
## [3] "Sensory processing of sound"
## [4] "Sensory processing of sound by inner hair cells of the cochlea"
## [5] "Cell death signalling via NRAGE, NRIF and NADE"
## [6] "NCAM signaling for neurite out-growth"
## [7] "Aquaporin-mediated transport"
## [8] "NRAGE signals death through JNK"
## [9] "Antigen activates B Cell Receptor (BCR) leading to generation of second messengers"
## [10] "Transcriptional regulation of granulopoiesis"
## [11] "NCAM1 interactions"
## [12] "Anti-inflammatory response favouring Leishmania parasite infection"
## [13] "Leishmania parasite growth and survival"
## [14] "FCGR activation"
## [15] "RHOC GTPase cycle"
## [16] "Extracellular matrix organization"
## [17] "O-glycosylation of TSR domain-containing proteins"
## [18] "GPCR ligand binding"
## [19] "RHOD GTPase cycle"
## [20] "G alpha (q) signalling events"
## [21] "Glucagon signaling in metabolic regulation"
## [22] "Integration of energy metabolism"
## [23] "Cardiac conduction"
## [24] "Fatty acyl-CoA biosynthesis"
## [25] "Adrenaline,noradrenaline inhibits insulin secretion"
## [26] "RHOF GTPase cycle"
## [27] "EPH-ephrin mediated repulsion of cells"
## [28] "p75 NTR receptor-mediated signalling"
## [29] "Regulation of insulin secretion"
## [30] "Metabolism of proteins"
## [31] "Innate Immune System"
## [32] "Translation"
## [33] "SUMO E3 ligases SUMOylate target proteins"
## [34] "Metabolism of RNA"
## [35] "Signaling by NOTCH"
## [36] "TNFR2 non-canonical NF-kB pathway"
## [37] "TCF dependent signaling in response to WNT"
## [38] "Signaling by FGFR2"
## [39] "Metabolism"
## [40] "Adaptive Immune System"
## [41] "SARS-CoV-1-host interactions"
## [42] "Signaling by NTRK1 (TRKA)"
## [43] "Signaling by ALK fusions and activated point mutants"
## [44] "Signaling by ALK in cancer"
## [45] "Signaling by TGFB family members"
## [46] "SUMOylation"
## [47] "Gene expression (Transcription)"
## [48] "Intracellular signaling by second messengers"
## [49] "Activation of anterior HOX genes in hindbrain development during early embryogenesis"
## [50] "Activation of HOX genes during differentiation"
## [51] "RHOV GTPase cycle"
## [52] "Signaling by FGFR"
## [53] "Transcriptional Regulation by TP53"
## [54] "RNA Polymerase II Transcription"
## [55] "Signaling by Leptin"
## [56] "Signaling by WNT"
## [57] "Netrin-1 signaling"
## [58] "Generic Transcription Pathway"
## [59] "TNFs bind their physiological receptors"
## [60] "Ephrin signaling"
## [61] "Notch-HLH transcription pathway"
## [62] "The role of Nef in HIV-1 replication and disease pathogenesis"
## [63] "TNF signaling"
## [64] "Metabolism of nucleotides"
## [65] "CD28 dependent Vav1 pathway"
## [66] "Growth hormone receptor signaling"
## [67] "TNFR1-induced proapoptotic signaling"
## [68] "Transcriptional regulation by RUNX3"
## [69] "CD28 co-stimulation"
## [70] "PIP3 activates AKT signaling"
## [71] "Interferon alpha/beta signaling"
## [72] "Post-translational protein modification"
## [73] "RHOU GTPase cycle"
## [74] "Nuclear Receptor transcription pathway"
## [75] "GPVI-mediated activation cascade"
## [76] "Ribosomal scanning and start codon recognition"
## [77] "Translation initiation complex formation"
## [78] "Signaling by Erythropoietin"
For reproducibility
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
## [2] minfi_1.46.0
## [3] bumphunter_1.42.0
## [4] locfit_1.5-9.8
## [5] iterators_1.0.14
## [6] foreach_1.5.2
## [7] Biostrings_2.68.1
## [8] XVector_0.40.0
## [9] gplots_3.1.3
## [10] eulerr_7.0.0
## [11] fgsea_1.26.0
## [12] clusterProfiler_4.8.3
## [13] kableExtra_1.3.4
## [14] DESeq2_1.40.2
## [15] SummarizedExperiment_1.30.2
## [16] Biobase_2.60.0
## [17] MatrixGenerics_1.12.3
## [18] matrixStats_1.0.0
## [19] GenomicRanges_1.52.0
## [20] GenomeInfoDb_1.36.2
## [21] IRanges_2.34.1
## [22] S4Vectors_0.38.1
## [23] BiocGenerics_0.46.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.3.1 BiocIO_1.10.0
## [3] bitops_1.0-7 ggplotify_0.1.2
## [5] filelock_1.0.2 tibble_3.2.1
## [7] polyclip_1.10-4 preprocessCore_1.62.1
## [9] XML_3.99-0.14 lifecycle_1.0.3
## [11] base64_2.0.1 lattice_0.21-8
## [13] MASS_7.3-60 scrime_1.3.5
## [15] magrittr_2.0.3 sass_0.4.7
## [17] limma_3.56.2 rmarkdown_2.24
## [19] jquerylib_0.1.4 yaml_2.3.7
## [21] askpass_1.2.0 doRNG_1.8.6
## [23] cowplot_1.1.1 DBI_1.1.3
## [25] RColorBrewer_1.1-3 abind_1.4-5
## [27] zlibbioc_1.46.0 quadprog_1.5-8
## [29] rvest_1.0.3 purrr_1.0.2
## [31] ggraph_2.1.0 RCurl_1.98-1.12
## [33] yulab.utils_0.0.9 tweenr_2.0.2
## [35] rappdirs_0.3.3 GenomeInfoDbData_1.2.10
## [37] enrichplot_1.20.1 ggrepel_0.9.3
## [39] tidytree_0.4.5 genefilter_1.82.1
## [41] annotate_1.78.0 svglite_2.1.1
## [43] DelayedMatrixStats_1.22.6 codetools_0.2-19
## [45] DelayedArray_0.26.7 DOSE_3.26.1
## [47] xml2_1.3.5 ggforce_0.4.1
## [49] tidyselect_1.2.0 aplot_0.2.0
## [51] farver_2.1.1 beanplot_1.3.1
## [53] viridis_0.6.4 BiocFileCache_2.8.0
## [55] illuminaio_0.42.0 webshot_0.5.5
## [57] GenomicAlignments_1.36.0 jsonlite_1.8.7
## [59] multtest_2.56.0 tidygraph_1.2.3
## [61] survival_3.5-7 systemfonts_1.0.4
## [63] polylabelr_0.2.0 tools_4.3.1
## [65] progress_1.2.2 treeio_1.24.3
## [67] Rcpp_1.0.11 glue_1.6.2
## [69] gridExtra_2.3 xfun_0.40
## [71] qvalue_2.32.0 dplyr_1.1.3
## [73] HDF5Array_1.28.1 withr_2.5.0
## [75] fastmap_1.1.1 rhdf5filters_1.12.1
## [77] fansi_1.0.4 openssl_2.1.0
## [79] caTools_1.18.2 digest_0.6.33
## [81] R6_2.5.1 gridGraphics_0.5-1
## [83] colorspace_2.1-0 GO.db_3.17.0
## [85] gtools_3.9.4 biomaRt_2.56.1
## [87] RSQLite_2.3.1 utf8_1.2.3
## [89] tidyr_1.3.0 generics_0.1.3
## [91] data.table_1.14.8 rtracklayer_1.60.1
## [93] prettyunits_1.1.1 graphlayouts_1.0.0
## [95] httr_1.4.7 S4Arrays_1.0.6
## [97] scatterpie_0.2.1 pkgconfig_2.0.3
## [99] gtable_0.3.4 blob_1.2.4
## [101] siggenes_1.74.0 shadowtext_0.1.2
## [103] htmltools_0.5.6 scales_1.2.1
## [105] png_0.1-8 ggfun_0.1.2
## [107] knitr_1.43 rstudioapi_0.15.0
## [109] tzdb_0.4.0 reshape2_1.4.4
## [111] rjson_0.2.21 nlme_3.1-163
## [113] curl_5.0.2 cachem_1.0.8
## [115] rhdf5_2.44.0 stringr_1.5.0
## [117] KernSmooth_2.23-22 HDO.db_0.99.1
## [119] AnnotationDbi_1.62.2 restfulr_0.0.15
## [121] GEOquery_2.68.0 reshape_0.8.9
## [123] pillar_1.9.0 grid_4.3.1
## [125] vctrs_0.6.3 dbplyr_2.3.3
## [127] xtable_1.8-4 evaluate_0.21
## [129] readr_2.1.4 GenomicFeatures_1.52.2
## [131] cli_3.6.1 compiler_4.3.1
## [133] Rsamtools_2.16.0 rlang_1.1.1
## [135] crayon_1.5.2 rngtools_1.5.2
## [137] nor1mix_1.3-0 mclust_6.0.0
## [139] plyr_1.8.8 stringi_1.7.12
## [141] viridisLite_0.4.2 BiocParallel_1.34.2
## [143] munsell_0.5.0 lazyeval_0.2.2
## [145] GOSemSim_2.26.1 Matrix_1.6-1
## [147] hms_1.1.3 patchwork_1.1.3
## [149] sparseMatrixStats_1.12.2 bit64_4.0.5
## [151] Rhdf5lib_1.22.0 ggplot2_3.4.3
## [153] KEGGREST_1.40.0 highr_0.10
## [155] igraph_1.5.1 memoise_2.0.1
## [157] bslib_0.5.1 ggtree_3.8.2
## [159] fastmatch_1.1-4 bit_4.0.5
## [161] downloader_0.4 ape_5.7-1
## [163] gson_0.1.0