Source: https://github.com/markziemann/enrichment_recipe
This guide is a Rmarkdown script that conducts differential expression and enrichment analysis, which are very popular workflows for transcriptome data.
The goal of this work is to understand how ClusterProfiler manages the background list, as we have observed some weird behaviour. In particular we see that when we provide a custom background, those genes do not appear in the final analysis unless they are also included in the gene list.
In the code chunk below called libs
, you can add and remove required R library dependancies. Check that the libraries listed here match the Dockerfile, otherwise you might get errors.
suppressPackageStartupMessages({
library("getDEE2")
library("DESeq2")
library("kableExtra")
library("clusterProfiler")
library("fgsea")
library("eulerr")
library("gplots")
})
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationDbi'
For this guide I will be using bulk RNA-seq data from a previous study, which is deposited at NCBI GEO and SRA under accession numbers: GSE55123 and SRP038101 (Lund et al, 2014). The experiment is designed to investigate the effect of Azacitidine treatment on AML3 cells.
The raw data have been processed by the DEE2 project, and the summary gene expression counts are available at the dee2.io website, and programmatically with the getDEE2 bioconductor package (Ziemann et al, 2019).
Alternatively, you could fetch data from another resource like NCBI GEO, Zenodo or from the host storage drive.
mdat <- getDEE2Metadata("hsapiens")
# get sample sheet
ss <- subset(mdat,SRP_accession=="SRP038101")
# fetch the whole set of RNA-seq data
x <- getDEE2("hsapiens",ss$SRR_accession , metadata=mdat, legacy=TRUE)
## For more information about DEE2 QC metrics, visit
## https://github.com/markziemann/dee2/blob/master/qc/qc_metrics.md
mx <- x$GeneCounts
rownames(mx) <- paste(rownames(mx),x$GeneInfo$GeneSymbol)
dim(mx)
## [1] 58302 6
# aza no filtering
ss$trt <- grepl("Treated",ss$Experiment_title)
ss %>%
kbl(caption="sample sheet for Aza treatment in AML3 cells") %>%
kable_paper("hover", full_width = F)
SRR_accession | QC_summary | SRX_accession | SRS_accession | SRP_accession | Experiment_title | GEO_series | trt | |
---|---|---|---|---|---|---|---|---|
156267 | SRR1171523 | PASS | SRX472607 | SRS559064 | SRP038101 | GSM1329859: Untreated.1; Homo sapiens; RNA-Seq | FALSE | |
156268 | SRR1171524 | WARN(3,4) | SRX472608 | SRS559066 | SRP038101 | GSM1329860: Untreated.2; Homo sapiens; RNA-Seq | FALSE | |
156269 | SRR1171525 | WARN(3,4) | SRX472609 | SRS559065 | SRP038101 | GSM1329861: Untreated.3; Homo sapiens; RNA-Seq | FALSE | |
156270 | SRR1171526 | WARN(3,4) | SRX472610 | SRS559068 | SRP038101 | GSM1329862: Treated.1; Homo sapiens; RNA-Seq | TRUE | |
156271 | SRR1171527 | WARN(3,4) | SRX472611 | SRS559067 | SRP038101 | GSM1329863: Treated.2; Homo sapiens; RNA-Seq | TRUE | |
156272 | SRR1171528 | WARN(3,4) | SRX472612 | SRS559069 | SRP038101 | GSM1329864: Treated.3; Homo sapiens; RNA-Seq | TRUE |
QC is important, even if you are using public transcriptome data. For RNA-seq it is a good idea to show the number of reads for each sample.
par(mar=c(5,7,5,1))
barplot(rev(colSums(mx)),horiz=TRUE,las=1,main="number of reads per sample in SRP038101")
Now make a MDS plot.
mds <- cmdscale(dist(t(mx)))
# expand plot area
XMIN=min(mds[,1])*1.3
XMAX=max(mds[,1])*1.3
YMIN=min(mds[,2])*1.3
YMAX=max(mds[,2])*1.3
cols <- as.character(grepl("Treated",ss$Experiment_title))
cols <- gsub("FALSE","lightblue",cols)
cols <- gsub("TRUE","pink",cols)
plot(mds, xlab="Coordinate 1", ylab="Coordinate 2",
xlim=c(XMIN,XMAX),ylim=c(YMIN,YMAX),
pch=19,cex=2,col=cols, main="MDS plot")
text(cmdscale(dist(t(mx))), labels=colnames(mx) )
I will be using DESeq2 for DE analysis, the count matrix is prefiltered using a detection threshold of 10 reads per sample across all samples. All genes that meet the detection threshold will comprise the background list. The first 6 rows of the count matrix are shows.
mxf <- mx[which(rowMeans(mx)>=10),]
dim(mxf)
## [1] 13168 6
head(mxf,6) %>%
kbl(caption="Count matrix format") %>%
kable_paper("hover", full_width = F)
SRR1171523 | SRR1171524 | SRR1171525 | SRR1171526 | SRR1171527 | SRR1171528 | |
---|---|---|---|---|---|---|
ENSG00000225630 MTND2P28 | 494 | 396 | 340 | 333 | 415 | 418 |
ENSG00000237973 MTCO1P12 | 52 | 39 | 40 | 30 | 37 | 29 |
ENSG00000248527 MTATP6P1 | 853 | 544 | 537 | 582 | 702 | 716 |
ENSG00000228327 AL669831.1 | 17 | 13 | 21 | 21 | 22 | 12 |
ENSG00000228794 LINC01128 | 42 | 27 | 30 | 32 | 40 | 23 |
ENSG00000230699 AL645608.3 | 20 | 11 | 13 | 10 | 15 | 22 |
dds <- DESeqDataSetFromMatrix(countData = mxf , colData = ss, design = ~ trt )
res <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
z <- results(res)
vsd <- vst(dds, blind=FALSE)
zz <-cbind(as.data.frame(z),mxf)
def <-as.data.frame(zz[order(zz$pvalue),])
head(def,10) %>%
kbl(caption="Top DE genes for Aza treatment") %>%
kable_paper("hover", full_width = F)
baseMean | log2FoldChange | lfcSE | stat | pvalue | padj | SRR1171523 | SRR1171524 | SRR1171525 | SRR1171526 | SRR1171527 | SRR1171528 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ENSG00000165949 IFI27 | 1960.1970 | -3.384492 | 0.0938869 | -36.04861 | 0 | 0 | 4689 | 3583 | 2758 | 309 | 384 | 334 |
ENSG00000090382 LYZ | 7596.0299 | -1.650342 | 0.0561143 | -29.41036 | 0 | 0 | 14212 | 10237 | 10789 | 3476 | 3764 | 3738 |
ENSG00000115461 IGFBP5 | 531.2217 | -5.071157 | 0.1795239 | -28.24781 | 0 | 0 | 1320 | 823 | 1025 | 23 | 26 | 43 |
ENSG00000157601 MX1 | 827.1511 | -2.877795 | 0.1047823 | -27.46450 | 0 | 0 | 1732 | 1501 | 1206 | 184 | 223 | 186 |
ENSG00000111331 OAS3 | 2127.2010 | -2.661214 | 0.0972124 | -27.37525 | 0 | 0 | 4204 | 3977 | 2972 | 562 | 614 | 560 |
ENSG00000070915 SLC12A3 | 424.5509 | -3.374852 | 0.1298671 | -25.98697 | 0 | 0 | 1012 | 721 | 653 | 63 | 85 | 76 |
ENSG00000234745 HLA-B | 3197.0159 | -1.431566 | 0.0604169 | -23.69479 | 0 | 0 | 6085 | 4256 | 4023 | 1590 | 1872 | 1719 |
ENSG00000137965 IFI44 | 409.0957 | -2.978581 | 0.1319352 | -22.57608 | 0 | 0 | 829 | 740 | 635 | 76 | 111 | 89 |
ENSG00000204525 HLA-C | 1631.6421 | -1.461550 | 0.0660214 | -22.13750 | 0 | 0 | 3112 | 2150 | 2106 | 791 | 923 | 891 |
ENSG00000110042 DTX4 | 524.1318 | -2.470219 | 0.1173182 | -21.05572 | 0 | 0 | 1166 | 883 | 688 | 166 | 168 | 145 |
Make a smear plot.
sigf <- subset(def,padj<=0.05)
DET=nrow(mxf)
NSIG=nrow(sigf)
NUP=nrow(subset(sigf,log2FoldChange>0))
NDN=nrow(subset(sigf,log2FoldChange<0))
HEADER=paste(DET,"detected genes;",NSIG,"w/FDR<0.05;",NUP,"up;",NDN,"down")
plot(log10(def$baseMean) ,def$log2FoldChange,
cex=0.6,pch=19,col="darkgray",
main="5-azacitidine treatment in AML3 cells",
xlab="log10(basemean)",ylab="log2(fold change)")
points(log10(sigf$baseMean) ,sigf$log2FoldChange,
cex=0.6,pch=19,col="red")
mtext(HEADER)
In the next sections I will run enrichment analysis with over-representation analysis (ORA) test and compare it to functional class scoring. I will also investigate some strange behviour of the ORA tool clusterprofiler.
I’ve compiled a reporting checklist:
Reporting criteria | Method/resource used |
---|---|
Origin of gene sets | KEGG (2023-06-16) |
Tool used | ClusterProfiler (check version at foot of report) |
Statistical test used | hypergeometric test |
P-value correction for multiple comparisons | FDR method |
Background list | Genes with >=10 reads per sample on average across all samples |
Gene Selection Criteria | DESeq2 FDR<0.05 |
ORA directionality | Separate tests for up- and down-regulation |
Data availability | via DEE2 at accession SRP038101 (human) |
Other parameters | Min gene set size of 10 |
Here I provide a custom background gene list.
Get the gene sets loaded in R. These are KEGG gene sets
# from https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp 16th June 2023
genesets2 <- read.gmt("c2.cp.kegg.v2023.1.Hs.symbols.gmt")
gsets <- gmtPathways("c2.cp.kegg.v2023.1.Hs.symbols.gmt")
message(paste("number of genes described in the annotation set:",length(unique(genesets2$gene))))
## number of genes described in the annotation set: 5244
Now filter the gene names into three lists, up-regulated, down-regulated and background. Background is simply all genes that were detected.
defup <- rownames(subset(def,padj<=0.05 & log2FoldChange>0))
defup <- unique(sapply(strsplit(defup," "),"[[",2))
defdn <- rownames(subset(def,padj<=0.05 & log2FoldChange<0))
defdn <- unique(sapply(strsplit(defdn," "),"[[",2))
bg <- rownames(def)
bg <- unique(sapply(strsplit(bg," "),"[[",2))
message(paste("number of genes in background:",length(bg)))
## number of genes in background: 13164
Enrichment firstly with upregulated genes
ora_up <- as.data.frame(enricher(gene = defup ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
ora_up$geneID <- NULL
ora_up <- subset(ora_up,p.adjust<0.05 & Count >=10)
ora_ups <- rownames(ora_up)
gr <- as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",2))
ora_up$es <- gr/br
ora_up <- ora_up[order(-ora_up$es),]
ora_up$Description=NULL
head(ora_up) %>%
kbl(row.names = FALSE, caption="Top upregulated pathways in Aza treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
KEGG_MISMATCH_REPAIR | 10/406 | 22/3134 | 0.0001811 | 0.0044696 | 0.0042686 | 10 | 3.508733 |
KEGG_DNA_REPLICATION | 15/406 | 35/3134 | 0.0000104 | 0.0005222 | 0.0004987 | 15 | 3.308234 |
KEGG_CELL_CYCLE | 44/406 | 115/3134 | 0.0000000 | 0.0000000 | 0.0000000 | 44 | 2.953438 |
KEGG_NUCLEOTIDE_EXCISION_REPAIR | 16/406 | 42/3134 | 0.0000314 | 0.0009481 | 0.0009054 | 16 | 2.940652 |
KEGG_BASE_EXCISION_REPAIR | 12/406 | 33/3134 | 0.0005182 | 0.0086936 | 0.0083027 | 12 | 2.806986 |
KEGG_PYRIMIDINE_METABOLISM | 26/406 | 86/3134 | 0.0000158 | 0.0005970 | 0.0005702 | 26 | 2.333715 |
topup2 <- rev(head(ora_up$es,10))
names(topup2) <- rev(head(ora_up$ID,10))
The reported background size does not match the dataset.
Now repeat with the downregulated genes
n_pw=10
ora_dn <- as.data.frame(enricher(gene = defdn ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
ora_dn$geneID <- NULL
ora_dn <- subset(ora_dn,p.adjust<0.05 & Count >=10)
ora_dns <- rownames(ora_dn)
gr <- as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",2))
ora_dn$es <- gr/br
ora_dn <- ora_dn[order(-ora_dn$es),]
ora_dn$Description=NULL
head(ora_dn,n_pw) %>%
kbl(row.names = FALSE, caption="Top downregulated pathways in Aza treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
KEGG_CELL_ADHESION_MOLECULES_CAMS | 30/608 | 59/3134 | 0.0000000 | 0.0000037 | 0.0000028 | 30 | 2.620986 |
KEGG_HEMATOPOIETIC_CELL_LINEAGE | 19/608 | 39/3134 | 0.0000311 | 0.0010268 | 0.0007992 | 19 | 2.511218 |
KEGG_COMPLEMENT_AND_COAGULATION_CASCADES | 10/608 | 21/3134 | 0.0031211 | 0.0245228 | 0.0190863 | 10 | 2.454574 |
KEGG_LEISHMANIA_INFECTION | 23/608 | 49/3134 | 0.0000100 | 0.0005478 | 0.0004264 | 23 | 2.419509 |
KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION | 47/608 | 105/3134 | 0.0000000 | 0.0000002 | 0.0000002 | 47 | 2.307300 |
KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION | 17/608 | 39/3134 | 0.0004433 | 0.0056262 | 0.0043789 | 17 | 2.246879 |
KEGG_DILATED_CARDIOMYOPATHY | 20/608 | 46/3134 | 0.0001460 | 0.0030068 | 0.0023402 | 20 | 2.241133 |
KEGG_ARRHYTHMOGENIC_RIGHT_VENTRICULAR_CARDIOMYOPATHY_ARVC | 16/608 | 37/3134 | 0.0007260 | 0.0073552 | 0.0057246 | 16 | 2.229019 |
KEGG_VIRAL_MYOCARDITIS | 16/608 | 37/3134 | 0.0007260 | 0.0073552 | 0.0057246 | 16 | 2.229019 |
KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION | 27/608 | 64/3134 | 0.0000199 | 0.0008226 | 0.0006403 | 27 | 2.174599 |
topdn2 <- head(ora_dn$es,n_pw)
names(topdn2) <- head(ora_dn$ID,n_pw)
Make a barplot
par(mar=c(5,20,5,1))
cols <- c(rep("blue",n_pw),rep("red",n_pw))
barplot(c(topdn2,topup2),
horiz=TRUE,las=1,cex.names=0.65,col=cols,
main="top DE KEGGs",
xlab="ES",
cex.main=0.9)
mtext("ORA test")
I’ve compiled a reporting checklist:
Reporting criteria | Method/resource used |
---|---|
Origin of gene sets | KEGG (2023-06-16) |
Tool used | ClusterProfiler (check version at foot of report) |
Statistical test used | hypergeometric test |
P-value correction for multiple comparisons | FDR method |
Background list | Genes with >=10 reads per sample on average across all samples |
Gene Selection Criteria | DESeq2 FDR<0.05 |
ORA directionality | Separate tests for up- and down-regulation |
Data availability | via DEE2 at accession SRP038101 (human) |
Other parameters | Min gene set size of 10 |
Here I provide a background gene list for cluterprofiler to use, like a user should.
I have also done some modification to the KEGG gene sets, to en
Now filter the gene names into three lists, up-regulated, down-regulated and background. Background is simply all genes that were detected.
defup <- rownames(subset(def,padj<=0.05 & log2FoldChange>0))
defup <- unique(sapply(strsplit(defup," "),"[[",2))
defdn <- rownames(subset(def,padj<=0.05 & log2FoldChange<0))
defdn <- unique(sapply(strsplit(defdn," "),"[[",2))
bg <- rownames(def)
bg <- unique(sapply(strsplit(bg," "),"[[",2))
message(paste("number of genes in background:",length(bg)))
## number of genes in background: 13164
Adding all detected genes to the background appears to improve results!
bgdf <- data.frame("background",bg)
colnames(bgdf) <- c("term","gene")
genesets2 <- rbind(genesets2,bgdf)
Enrichment firstly with upregulated genes
orafix_up <- as.data.frame(enricher(gene = defup ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
orafix_up$geneID <- NULL
orafix_up <- subset(orafix_up,p.adjust<0.05 & Count >=10)
orafix_ups <- rownames(orafix_up)
gr <- as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_up$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_up$BgRatio,"/"),"[[",2))
orafix_up$es <- gr/br
orafix_up <- orafix_up[order(-orafix_up$es),]
orafix_up$Description=NULL
head(orafix_up) %>%
kbl(row.names = FALSE, caption="Top upregulated pathways in Aza treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
KEGG_MISMATCH_REPAIR | 10/1672 | 22/13164 | 0.0001611 | 0.0039275 | 0.0037509 | 10 | 3.578730 |
KEGG_DNA_REPLICATION | 15/1672 | 35/13164 | 0.0000091 | 0.0004580 | 0.0004374 | 15 | 3.374231 |
KEGG_CELL_CYCLE | 44/1672 | 115/13164 | 0.0000000 | 0.0000000 | 0.0000000 | 44 | 3.012357 |
KEGG_NUCLEOTIDE_EXCISION_REPAIR | 16/1672 | 42/13164 | 0.0000275 | 0.0008309 | 0.0007935 | 16 | 2.999316 |
KEGG_BASE_EXCISION_REPAIR | 12/1672 | 33/13164 | 0.0004583 | 0.0076884 | 0.0073427 | 12 | 2.862984 |
KEGG_PYRIMIDINE_METABOLISM | 26/1672 | 86/13164 | 0.0000142 | 0.0005343 | 0.0005103 | 26 | 2.380272 |
topup2 <- rev(head(orafix_up$es,10))
names(topup2) <- rev(head(orafix_up$ID,10))
Now with the downregulated genes
n_pw=10
orafix_dn <- as.data.frame(enricher(gene = defdn ,
universe = bg, maxGSSize = 5000, TERM2GENE = genesets2,
pAdjustMethod="fdr", pvalueCutoff = 1, qvalueCutoff = 1 ))
orafix_dn$geneID <- NULL
orafix_dn <- subset(orafix_dn,p.adjust<0.05 & Count >=10)
orafix_dns <- rownames(orafix_dn)
gr <- as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_dn$GeneRatio,"/"),"[[",2))
br <- as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",1)) /
as.numeric(sapply(strsplit(orafix_dn$BgRatio,"/"),"[[",2))
orafix_dn$es <- gr/br
orafix_dn <- orafix_dn[order(-orafix_dn$es),]
orafix_dn$Description=NULL
head(orafix_dn,n_pw) %>%
kbl(row.names = FALSE, caption="Top downregulated pathways in Aza treatment") %>%
kable_paper("hover", full_width = F)
ID | GeneRatio | BgRatio | pvalue | p.adjust | qvalue | Count | es |
---|---|---|---|---|---|---|---|
KEGG_CELL_ADHESION_MOLECULES_CAMS | 30/1926 | 59/13164 | 0.0000000 | 0.0000000 | 0.0000000 | 30 | 3.475368 |
KEGG_HEMATOPOIETIC_CELL_LINEAGE | 19/1926 | 39/13164 | 0.0000005 | 0.0000108 | 0.0000069 | 19 | 3.329819 |
KEGG_COMPLEMENT_AND_COAGULATION_CASCADES | 10/1926 | 21/13164 | 0.0003293 | 0.0021731 | 0.0013864 | 10 | 3.254710 |
KEGG_LEISHMANIA_INFECTION | 23/1926 | 49/13164 | 0.0000001 | 0.0000034 | 0.0000022 | 23 | 3.208214 |
KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION | 47/1926 | 105/13164 | 0.0000000 | 0.0000000 | 0.0000000 | 47 | 3.059427 |
KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION | 17/1926 | 39/13164 | 0.0000123 | 0.0001559 | 0.0000995 | 17 | 2.979311 |
KEGG_DILATED_CARDIOMYOPATHY | 20/1926 | 46/13164 | 0.0000022 | 0.0000306 | 0.0000195 | 20 | 2.971692 |
KEGG_ARRHYTHMOGENIC_RIGHT_VENTRICULAR_CARDIOMYOPATHY_ARVC | 16/1926 | 37/13164 | 0.0000250 | 0.0002241 | 0.0001430 | 16 | 2.955628 |
KEGG_VIRAL_MYOCARDITIS | 16/1926 | 37/13164 | 0.0000250 | 0.0002241 | 0.0001430 | 16 | 2.955628 |
KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION | 27/1926 | 64/13164 | 0.0000001 | 0.0000034 | 0.0000022 | 27 | 2.883470 |
topdn2 <- head(orafix_dn$es,n_pw)
names(topdn2) <- head(orafix_dn$ID,n_pw)
Make a barplot
par(mar=c(5,20,5,1))
cols <- c(rep("blue",n_pw),rep("red",n_pw))
barplot(c(topdn2,topup2),
horiz=TRUE,las=1,cex.names=0.65,col=cols,
main="top DE KEGGs",
xlab="ES",
cex.main=0.9)
mtext("ORA fix test")
Reporting criteria | Method/resource used |
---|---|
Origin of gene sets | KEGG (2023-06-16) |
Tool used | FGSEA (check version at foot of report) |
Statistical test used | Kolmogorov-Smirnov test |
P-value correction for multiple comparisons | FDR method |
Background list | Genes with >=10 reads per sample on average across all samples |
Gene Selection Criteria | DESeq2 FDR<0.05 |
ORA directionality | Separate tests for up- and down-regulation |
Data availability | via DEE2 at accession SRP038101 (human) |
Other parameters | Min gene set size of 10 |
def2 <- def
def2$genename <- sapply(strsplit(rownames(def2)," "),"[[",2)
def2 <- def2[,c("stat","genename")]
def2 <- aggregate(. ~ genename, def2, sum)
stat <- def2$stat
names(stat) <- def2$genename
fgseaRes <- fgsea(pathways=gsets, stats=stat, minSize=10, nPermSimple = 10000)
fgseaRes <- fgseaRes[order(fgseaRes$pval),]
fgsea_up <- subset(fgseaRes,padj<0.05 & ES>0)
fgsea_ups <- fgsea_up$pathway
fgsea_dn <- subset(fgseaRes,padj<0.05 & ES<0)
fgsea_dns <- fgsea_dn$pathway
fgsea_up <- data.frame(fgsea_up[order(-fgsea_up$ES),])
fgsea_dn <- data.frame(fgsea_dn[order(fgsea_dn$ES),])
head(fgsea_up,n_pw) %>%
kbl(row.names = FALSE, caption="FGSEA:Top upregulated pathways in Aza treatment") %>%
kable_paper("hover", full_width = F)
pathway | pval | padj | log2err | ES | NES | size | leadingEdge |
---|---|---|---|---|---|---|---|
KEGG_MISMATCH_REPAIR | 0.0001846 | 0.0012071 | 0.5188481 | 0.6649184 | 2.048921 | 22 | RFC3 , PCNA , MSH2 , MLH3 , RFC1 , EXO1 , PMS2 , RFC5 , POLD3, RFC2 , RPA1 , LIG1 , MSH3 , MSH6 , RPA3 |
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM | 0.0025371 | 0.0123232 | 0.4317077 | 0.6299449 | 1.868184 | 19 | ASS1 , CAD , GPT2 , GOT2 , PPAT , GFPT1 , ALDH5A1, GLUL , ADSL , GLS |
KEGG_DNA_REPLICATION | 0.0000125 | 0.0001641 | 0.5933255 | 0.6209629 | 2.145116 | 35 | RFC3 , MCM6 , POLA1 , POLE3 , POLE , PCNA , MCM4 , POLE2 , RFC1 , DNA2 , RFC5 , FEN1 , POLD3 , RNASEH1 , RFC2 , RPA1 , LIG1 , RPA3 , PRIM2 , MCM7 , PRIM1 , POLD2 , POLA2 , RNASEH2C, MCM2 |
KEGG_NUCLEOTIDE_EXCISION_REPAIR | 0.0001962 | 0.0012351 | 0.5188481 | 0.5541739 | 2.000193 | 42 | RFC3 , GTF2H3, POLE3 , CUL4A , POLE , PCNA , CCNH , GTF2H1, ERCC2 , POLE2 , RFC1 , DDB1 , RAD23B, RFC5 , POLD3 , RFC2 , RPA1 , LIG1 , ERCC3 , GTF2H5, RPA3 , RAD23A, ERCC8 |
KEGG_SPLICEOSOME | 0.0000000 | 0.0000000 | 0.8140358 | 0.5164633 | 2.290255 | 123 | DDX46 , HNRNPA3 , SF3A3 , SRSF2 , TCERG1 , NCBP1 , CDC5L , EIF4A3 , SRSF3 , SRSF1 , SRSF7 , HNRNPA1 , DHX15 , DDX39B , PPIH , SNRNP200 , PRPF40A , RBMX , EFTUD2 , DDX5 , SRSF10 , PRPF8 , PRPF38B , SF3A1 , SNRPD3 , U2SURP , RBM25 , AQR , USP39 , SNRPF , MAGOHB , ZMAT2 , PPIL1 , SNRPA , SF3B4 , SF3B6 , NCBP2 , SMNDC1 , PRPF19 , ALYREF , SF3B3 , SNRNP27 , TRA2B , SNRPD1 , SNRPB2 , SRSF4 , TXNL4A , DDX23 , SNRPG , HNRNPA1L2, THOC2 , PRPF31 , ACIN1 , SNRPE , PRPF3 , LSM5 , RBM17 , HSPA8 , SF3B1 , HNRNPM , WBP11 , LSM3 , MAGOH |
KEGG_BASE_EXCISION_REPAIR | 0.0060938 | 0.0239623 | 0.3263516 | 0.4987333 | 1.698255 | 33 | NEIL3, LIG3 , POLE3, POLE , PCNA , POLE2, HMGB1, NEIL2, FEN1 , POLD3, UNG , PARP1, LIG1 , NTHL1 |
KEGG_CELL_CYCLE | 0.0000001 | 0.0000027 | 0.7049757 | 0.4885467 | 2.140806 | 115 | CCNA2 , PRKDC , TFDP1 , CDC25A, BUB1B , MAD2L1, MCM6 , CDC6 , CDKN2C, ANAPC1, SKP2 , ORC6 , MAD1L1, MYC , CDK1 , CDC20 , RBL1 , CCNB1 , CDK6 , PCNA , CDC27 , CDC45 , CCNH , MCM4 , SMC3 , SMC1A , CCNE1 , YWHAH , TP53 , YWHAG , CCNB2 , BUB3 , ORC1 , TFDP2 , PLK1 , ESPL1 , CDK4 , TTK , STAG1 , YWHAQ , ANAPC5, CDK2 , PKMYT1, BUB1 , DBF4 , ORC2 , CHEK1 , E2F3 , WEE1 , CCNE2 , ATR , ANAPC4, E2F1 |
KEGG_PYRIMIDINE_METABOLISM | 0.0029878 | 0.0133663 | 0.4317077 | 0.3814180 | 1.591658 | 86 | RRM1 , RRM2 , CAD , POLA1 , TXNRD1, POLE3 , POLR1B, POLE , POLR2B, CTPS1 , NME4 , POLR3G, POLR1E, POLE2 , POLR1A, TYMS , POLR3A, DCTD , DHODH , UMPS , DUT , ENTPD5, POLD3 , POLR3B, POLR3F, POLR2C, TK1 , POLR2D, DTYMK , POLR3H |
KEGG_UBIQUITIN_MEDIATED_PROTEOLYSIS | 0.0067194 | 0.0253843 | 0.4070179 | 0.3271205 | 1.448925 | 122 | ANAPC1, SKP2 , BRCA1 , CDC20 , SAE1 , CUL4A , BIRC6 , CDC27 , HUWE1 , UBE2O , UBA2 , DDB1 , HERC1 , SMURF2, UBE2L3, FBXW8 , HERC2 , CUL5 , UBE4B , FBXW7 , UBE2G2, ANAPC5, VHL , PIAS1 , UBE3A , ITCH , TRIP12, UBE3C , ELOC , PRPF19, UBE2G1, CUL2 , UBA6 , NEDD4L, UBE2N , ANAPC4, UBR5 , UBE2K , UBE2E2, UBE2A , WWP2 , DET1 , CDC23 , PPIL2 , PIAS2 |
head(fgsea_dn,n_pw) %>%
kbl(row.names = FALSE, caption="FGSEA:Top downregulated pathways in Aza treatment") %>%
kable_paper("hover", full_width = F)
pathway | pval | padj | log2err | ES | NES | size | leadingEdge |
---|---|---|---|---|---|---|---|
KEGG_AUTOIMMUNE_THYROID_DISEASE | 0.0000028 | 0.0000530 | 0.6272567 | -0.9114495 | -2.038934 | 10 | HLA-B, HLA-C, CD86 , HLA-A, HLA-E, HLA-F, HLA-G |
KEGG_ALLOGRAFT_REJECTION | 0.0000106 | 0.0001639 | 0.5933255 | -0.8877863 | -2.032313 | 11 | HLA-B, HLA-C, CD86 , HLA-A, HLA-E, HLA-F, HLA-G, TNF |
KEGG_GRAFT_VERSUS_HOST_DISEASE | 0.0000304 | 0.0003040 | 0.5756103 | -0.8727516 | -1.997895 | 11 | HLA-B, HLA-C, CD86 , HLA-A, HLA-E, HLA-F, HLA-G, TNF |
KEGG_INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION | 0.0001452 | 0.0010288 | 0.5188481 | -0.7948144 | -1.920980 | 14 | CD86 , TNFSF13B , ITGB7 , CXCR4 , TNFRSF13C, TNFSF13 |
KEGG_TYPE_I_DIABETES_MELLITUS | 0.0004050 | 0.0021517 | 0.4984931 | -0.7763294 | -1.876304 | 14 | HLA-B, HLA-C, CD86 , HLA-A, HLA-E, HLA-F, HLA-G, TNF , ICA1 |
KEGG_COMPLEMENT_AND_COAGULATION_CASCADES | 0.0000427 | 0.0003817 | 0.5573322 | -0.7613143 | -2.024488 | 21 | C3 , THBD , CFH , PLAU , PLAUR , SERPINA1, CD59 , F2R , A2M , F3 , F8 , F12 , SERPING1 |
KEGG_CELL_ADHESION_MOLECULES_CAMS | 0.0000000 | 0.0000000 | 0.8266573 | -0.7375765 | -2.413057 | 59 | HLA-B , HLA-C , CD86 , SIGLEC1, HLA-A , VCAN , ITGB7 , ITGAL , HLA-E , F11R , SDC3 , ALCAM , CDH2 , NCAM2 , CD226 , HLA-F , HLA-G , ICAM3 , ITGB2 , SDC2 , NECTIN2, PECAM1 , NEGR1 , CD276 , SDC1 , NCAM1 , MPZL1 , CLDN7 , SELPLG , NECTIN1 |
KEGG_STEROID_BIOSYNTHESIS | 0.0042630 | 0.0181177 | 0.2853134 | -0.7234786 | -1.783469 | 15 | TM7SF2 , EBP , SC5D , FDFT1 , HSD17B7, CYP51A1, SOAT1 , MSMO1 , LIPA , LSS , SQLE |
KEGG_VIRAL_MYOCARDITIS | 0.0000021 | 0.0000452 | 0.6272567 | -0.7142771 | -2.142700 | 37 | HLA-B, HLA-C, CD86 , HLA-A, ITGAL, HLA-E, CCND1, ACTG1, HLA-F, HLA-G, SGCB , ITGB2, FYN , BID , DMD |
KEGG_OTHER_GLYCAN_DEGRADATION | 0.0127005 | 0.0447909 | 0.1654064 | -0.7062814 | -1.679731 | 13 | NEU1 , FUCA1 , HEXB , MANBA , AGA , GLB1 , MAN2B2 |
fgsea_up <- head(fgsea_up,n_pw)
fgsea_up_vec <- fgsea_up$ES
names(fgsea_up_vec) <- fgsea_up$pathway
fgsea_up_vec <- rev(fgsea_up_vec)
fgsea_dn <- head(fgsea_dn,n_pw)
fgsea_dn_vec <- fgsea_dn$ES * -1
names(fgsea_dn_vec) <- fgsea_dn$pathway
Make a barplot
par(mar=c(5,20,5,1))
cols <- c(rep("blue",n_pw),rep("red",n_pw))
barplot(c(fgsea_dn_vec,fgsea_up_vec),
horiz=TRUE,las=1,cex.names=0.65,col=cols,
main="top DE KEGGs",
xlab="ES",
cex.main=0.9)
mtext("FCS test")
v1 <- list("ORA up"=ora_ups, "ORA fix up"=orafix_ups, "FCS up"=fgsea_ups)
v2 <- list("ORA dn"=ora_dns, "ORA fix dn"=orafix_dns, "FCS dn"=fgsea_dns)
v3 <- list("ORA"=union(ora_ups,ora_dns),
"ORA fix"=union(orafix_ups,orafix_dns),
"FCS"=union(fgsea_ups,fgsea_dns))
par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2),main="upregulated genes")
plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2),main="downregulated genes")
plot(euler(v3),quantities = list(cex = 2), labels = list(cex = 2),main="up- and down-regulated genes")
Jaccard index comparing ORA and FCS.
jaccard <- function(a, b) {
intersection = length(intersect(a, b))
union = length(a) + length(b) - intersection
return (intersection/union)
}
ora <- c(ora_ups,ora_dns)
fcs <- c(fgsea_ups,fgsea_dns)
jaccard(ora,fcs)
## [1] 0.5576923
Jaccard index comparing ORA and ORA fix.
ora <- c(ora_ups,ora_dns)
orafix <- c(orafix_ups,orafix_dns)
jaccard(ora,orafix)
## [1] 0.6037736
Jaccard index comparing ORA fix and FCS.
orafix <- c(orafix_ups,orafix_dns)
fcs <- c(fgsea_ups,fgsea_dns)
jaccard(orafix,fcs)
## [1] 0.6190476
In this case the top upregulated and downregulated pathways
upname <- head(fgsea_up,1)$pathway
plotEnrichment(gsets[[upname]], stat, gseaParam = 1, ticksSize = 0.2)
dnname <- head(fgsea_dn,1)$pathway
plotEnrichment(gsets[[dnname]], stat, gseaParam = 1, ticksSize = 0.2)
Now a heatmap of these gene sets.
# reads per million normalisation
rpm <- apply(mxf,2, function(x) { x/sum(x) * 1000000} )
colnames(rpm) <- c("UNTR1","UNTR2","UNTR3","TRT1","TRT2","TRT3")
gnames_up <- gsets[[which(names(gsets) == upname)]]
gnames_dn <- gsets[[which(names(gsets) == dnname)]]
gene_ids <- rownames(mxf)
gene_names <- sapply(strsplit(gene_ids," "),"[[",2)
rpm_up <- rpm[which(gene_names %in% gnames_up),]
rownames(rpm_up) <- sapply(strsplit(rownames(rpm_up)," "),"[[",2)
rpm_dn <- rpm[which(gene_names %in% gnames_dn),]
rownames(rpm_dn) <- sapply(strsplit(rownames(rpm_dn)," "),"[[",2)
colsidecols <- c("blue","blue","blue","red","red","red")
heatmap.2(rpm_up,scale="row",trace="none",margins=c(10,15),main=upname,ColSideColors=colsidecols)
heatmap.2(rpm_dn,scale="row",trace="none",margins=c(10,15),main=dnname,ColSideColors=colsidecols)
For reproducibility
sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gplots_3.1.3 eulerr_7.0.0
## [3] fgsea_1.22.0 clusterProfiler_4.4.4
## [5] kableExtra_1.3.4 DESeq2_1.36.0
## [7] SummarizedExperiment_1.26.1 Biobase_2.56.0
## [9] MatrixGenerics_1.8.1 matrixStats_0.63.0
## [11] GenomicRanges_1.48.0 GenomeInfoDb_1.32.2
## [13] IRanges_2.30.0 S4Vectors_0.34.0
## [15] BiocGenerics_0.42.0 getDEE2_1.6.0
##
## loaded via a namespace (and not attached):
## [1] RColorBrewer_1.1-3 rstudioapi_0.14 jsonlite_1.8.4
## [4] magrittr_2.0.3 farver_2.1.1 rmarkdown_2.21
## [7] zlibbioc_1.42.0 vctrs_0.6.2 memoise_2.0.1
## [10] RCurl_1.98-1.12 ggtree_3.4.1 webshot_0.5.4
## [13] htmltools_0.5.5 gridGraphics_0.5-1 sass_0.4.5
## [16] KernSmooth_2.23-20 bslib_0.4.2 plyr_1.8.8
## [19] cachem_1.0.7 igraph_1.4.2 lifecycle_1.0.3
## [22] pkgconfig_2.0.3 Matrix_1.5-4 R6_2.5.1
## [25] fastmap_1.1.1 GenomeInfoDbData_1.2.8 digest_0.6.31
## [28] aplot_0.1.10 enrichplot_1.16.1 colorspace_2.1-0
## [31] patchwork_1.1.2 AnnotationDbi_1.58.0 geneplotter_1.74.0
## [34] RSQLite_2.3.1 labeling_0.4.2 fansi_1.0.4
## [37] httr_1.4.5 polyclip_1.10-4 compiler_4.3.0
## [40] bit64_4.0.5 withr_2.5.0 downloader_0.4
## [43] BiocParallel_1.30.3 viridis_0.6.2 DBI_1.1.3
## [46] highr_0.10 ggforce_0.4.1 MASS_7.3-59
## [49] DelayedArray_0.22.0 caTools_1.18.2 gtools_3.9.4
## [52] tools_4.3.0 DO.db_2.9 scatterpie_0.1.9
## [55] ape_5.7-1 glue_1.6.2 nlme_3.1-162
## [58] GOSemSim_2.22.0 polylabelr_0.2.0 shadowtext_0.1.2
## [61] grid_4.3.0 reshape2_1.4.4 generics_0.1.3
## [64] gtable_0.3.3 tidyr_1.3.0 data.table_1.14.8
## [67] tidygraph_1.2.3 xml2_1.3.4 utf8_1.2.3
## [70] XVector_0.36.0 ggrepel_0.9.3 pillar_1.9.0
## [73] stringr_1.5.0 yulab.utils_0.0.6 genefilter_1.78.0
## [76] splines_4.3.0 dplyr_1.1.2 tweenr_2.0.2
## [79] treeio_1.20.1 lattice_0.21-8 survival_3.5-5
## [82] bit_4.0.5 annotate_1.74.0 tidyselect_1.2.0
## [85] GO.db_3.15.0 locfit_1.5-9.7 Biostrings_2.64.0
## [88] knitr_1.42 gridExtra_2.3 svglite_2.1.1
## [91] xfun_0.39 graphlayouts_0.8.4 stringi_1.7.12
## [94] lazyeval_0.2.2 ggfun_0.0.9 yaml_2.3.7
## [97] evaluate_0.20 codetools_0.2-19 ggraph_2.1.0
## [100] tibble_3.2.1 qvalue_2.28.0 ggplotify_0.1.0
## [103] cli_3.6.1 xtable_1.8-4 systemfonts_1.0.4
## [106] munsell_0.5.0 jquerylib_0.1.4 Rcpp_1.0.10
## [109] png_0.1-8 XML_3.99-0.14 parallel_4.3.0
## [112] ggplot2_3.4.2 blob_1.2.4 DOSE_3.22.0
## [115] htm2txt_2.2.2 bitops_1.0-7 tidytree_0.4.2
## [118] viridisLite_0.4.1 scales_1.2.1 purrr_1.0.1
## [121] crayon_1.5.2 rlang_1.1.1 fastmatch_1.1-3
## [124] KEGGREST_1.36.3 rvest_1.0.3