Source: https://github.com/markziemann/enrichment_recipe

Introduction

This guide is a Rmarkdown script that conducts differential expression and enrichment analysis, which are very popular workflows for transcriptome data.

This is meant to be a boilerplate template, which you can remix and modify to suit your analytical needs.

In the code chunk below called libs, you can add and remove required R library dependancies. Check that the libraries listed here match the Dockerfile, otherwise you might get errors.

suppressPackageStartupMessages({
  library("getDEE2")
  library("DESeq2")
  library("kableExtra")
  library("clusterProfiler")
  library("fgsea")
  library("eulerr")
  library("gplots")
})
## Warning: replacing previous import 'utils::findMatches' by
## 'S4Vectors::findMatches' when loading 'AnnotationDbi'

For this guide I will be using bulk RNA-seq data from a previous study, which is deposited at NCBI GEO and SRA under accession numbers: GSE55123 and SRP038101 (Lund et al, 2014). The experiment is designed to investigate the effect of Azacitidine treatment on AML3 cells.

The raw data have been processed by the DEE2 project, and the summary gene expression counts are available at the dee2.io website, and programmatically with the getDEE2 bioconductor package (Ziemann et al, 2019).

The gene counts have also been deposited to the /data folder in the example.Rdata file in case the DEE2 resource becomes unavailable. To import it, use the command: load("data/example.Rdata")

Alternatively, you could fetch data from another resource like NCBI GEO, Zenodo or from the host storage drive.

mdat <- getDEE2Metadata("hsapiens")

# get sample sheet
ss <- subset(mdat,SRP_accession=="SRP038101")

# fetch the whole set of RNA-seq data
x <- getDEE2("hsapiens",ss$SRR_accession , metadata=mdat, legacy=TRUE)
## For more information about DEE2 QC metrics, visit
##     https://github.com/markziemann/dee2/blob/master/qc/qc_metrics.md
mx <- x$GeneCounts
rownames(mx) <- paste(rownames(mx),x$GeneInfo$GeneSymbol)
dim(mx)
## [1] 58302     6
# aza no filtering
ss$trt <- grepl("Treated",ss$Experiment_title)

ss %>%
  kbl(caption="sample sheet for Aza treatment in AML3 cells") %>%
  kable_paper("hover", full_width = F)
sample sheet for Aza treatment in AML3 cells
SRR_accession QC_summary SRX_accession SRS_accession SRP_accession Experiment_title GEO_series trt
156267 SRR1171523 PASS SRX472607 SRS559064 SRP038101 GSM1329859: Untreated.1; Homo sapiens; RNA-Seq FALSE
156268 SRR1171524 WARN(3,4) SRX472608 SRS559066 SRP038101 GSM1329860: Untreated.2; Homo sapiens; RNA-Seq FALSE
156269 SRR1171525 WARN(3,4) SRX472609 SRS559065 SRP038101 GSM1329861: Untreated.3; Homo sapiens; RNA-Seq FALSE
156270 SRR1171526 WARN(3,4) SRX472610 SRS559068 SRP038101 GSM1329862: Treated.1; Homo sapiens; RNA-Seq TRUE
156271 SRR1171527 WARN(3,4) SRX472611 SRS559067 SRP038101 GSM1329863: Treated.2; Homo sapiens; RNA-Seq TRUE
156272 SRR1171528 WARN(3,4) SRX472612 SRS559069 SRP038101 GSM1329864: Treated.3; Homo sapiens; RNA-Seq TRUE

Data quality control

QC is important, even if you are using public transcriptome data. For RNA-seq it is a good idea to show the number of reads for each sample.

par(mar=c(5,7,5,1))
barplot(rev(colSums(mx)),horiz=TRUE,las=1,main="number of reads per sample in SRP038101")

Now make a MDS plot.

mds <- cmdscale(dist(t(mx)))

# expand plot area
XMIN=min(mds[,1])*1.3
XMAX=max(mds[,1])*1.3
YMIN=min(mds[,2])*1.3
YMAX=max(mds[,2])*1.3

cols <- as.character(grepl("Treated",ss$Experiment_title))
cols <- gsub("FALSE","lightblue",cols)
cols <- gsub("TRUE","pink",cols)
plot(mds, xlab="Coordinate 1", ylab="Coordinate 2",
  xlim=c(XMIN,XMAX),ylim=c(YMIN,YMAX),
  pch=19,cex=2,col=cols, main="MDS plot")
text(cmdscale(dist(t(mx))), labels=colnames(mx) )

Differential expression analysis

I will be using DESeq2 for DE analysis, the count matrix is prefiltered using a detection threshold of 10 reads per sample across all samples. All genes that meet the detection threshold will comprise the background list. The first 6 rows of the count matrix are shows.

mxf <- mx[which(rowMeans(mx)>=10),]
dim(mxf)
## [1] 13168     6
head(mxf,6) %>%
  kbl(caption="Count matrix format") %>%
  kable_paper("hover", full_width = F)
Count matrix format
SRR1171523 SRR1171524 SRR1171525 SRR1171526 SRR1171527 SRR1171528
ENSG00000225630 MTND2P28 494 396 340 333 415 418
ENSG00000237973 MTCO1P12 52 39 40 30 37 29
ENSG00000248527 MTATP6P1 853 544 537 582 702 716
ENSG00000228327 AL669831.1 17 13 21 21 22 12
ENSG00000228794 LINC01128 42 27 30 32 40 23
ENSG00000230699 AL645608.3 20 11 13 10 15 22
dds <- DESeqDataSetFromMatrix(countData = mxf , colData = ss, design = ~ trt )
res <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
z <- results(res)
vsd <- vst(dds, blind=FALSE)
zz <-cbind(as.data.frame(z),mxf)
def <-as.data.frame(zz[order(zz$pvalue),])

head(def,10) %>%
  kbl(caption="Top DE genes for Aza treatment") %>%
  kable_paper("hover", full_width = F)
Top DE genes for Aza treatment
baseMean log2FoldChange lfcSE stat pvalue padj SRR1171523 SRR1171524 SRR1171525 SRR1171526 SRR1171527 SRR1171528
ENSG00000165949 IFI27 1960.1970 -3.384492 0.0938869 -36.04861 0 0 4689 3583 2758 309 384 334
ENSG00000090382 LYZ 7596.0299 -1.650342 0.0561143 -29.41036 0 0 14212 10237 10789 3476 3764 3738
ENSG00000115461 IGFBP5 531.2217 -5.071157 0.1795239 -28.24781 0 0 1320 823 1025 23 26 43
ENSG00000157601 MX1 827.1511 -2.877795 0.1047823 -27.46450 0 0 1732 1501 1206 184 223 186
ENSG00000111331 OAS3 2127.2010 -2.661214 0.0972124 -27.37525 0 0 4204 3977 2972 562 614 560
ENSG00000070915 SLC12A3 424.5509 -3.374852 0.1298671 -25.98697 0 0 1012 721 653 63 85 76
ENSG00000234745 HLA-B 3197.0159 -1.431566 0.0604169 -23.69479 0 0 6085 4256 4023 1590 1872 1719
ENSG00000137965 IFI44 409.0957 -2.978581 0.1319352 -22.57608 0 0 829 740 635 76 111 89
ENSG00000204525 HLA-C 1631.6421 -1.461550 0.0660214 -22.13750 0 0 3112 2150 2106 791 923 891
ENSG00000110042 DTX4 524.1318 -2.470219 0.1173182 -21.05572 0 0 1166 883 688 166 168 145

Make a smear plot.

sigf <- subset(def,padj<=0.05)

DET=nrow(mxf)
NSIG=nrow(sigf)
NUP=nrow(subset(sigf,log2FoldChange>0))
NDN=nrow(subset(sigf,log2FoldChange<0))

HEADER=paste(DET,"detected genes;",NSIG,"w/FDR<0.05;",NUP,"up;",NDN,"down")

plot(log10(def$baseMean) ,def$log2FoldChange,
  cex=0.6,pch=19,col="darkgray",
  main="5-azacitidine treatment in AML3 cells",
  xlab="log10(basemean)",ylab="log2(fold change)")

points(log10(sigf$baseMean) ,sigf$log2FoldChange,
  cex=0.6,pch=19,col="red")

mtext(HEADER)

In the next sections I will run enrichment analysis with over-representation test and compare it to functional class scoring.

Enrichment using over-representation test

I’ve compiled a reporting checklist:

Reporting criteria Method/resource used
Origin of gene sets Reactome (2023-03-06)
Tool used ClusterProfiler (check version at foot of report)
Statistical test used hypergeometric test
P-value correction for multiple comparisons FDR method
Background list Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria DESeq2 FDR<0.05
ORA directionality Separate tests for up- and down-regulation
Data availability via DEE2 at accession SRP038101 (human)
Other parameters Min gene set size of 10

Get the gene sets loaded in R. They are located in the /ref folder of the repo.

# from https://reactome.org/download/current/ReactomePathways.gmt.zip
genesets2 <- read.gmt("ref/ReactomePathways_2023-03-06.gmt")
gsets <- gmtPathways("ref/ReactomePathways_2023-03-06.gmt")

Now filter the gene names into three lists, up-regulated, down-regulated and background. Background is simply all genes that were detected.

defup <- rownames(subset(def,padj<=0.05 & log2FoldChange>0))
defup <- unique(sapply(strsplit(defup," "),"[[",2))

defdn <- rownames(subset(def,padj<=0.05 & log2FoldChange<0))
defdn <- unique(sapply(strsplit(defdn," "),"[[",2))

bg <- rownames(def)
bg <- unique(sapply(strsplit(bg," "),"[[",2))

message("number of genes in each group")
## number of genes in each group
lapply(list("background"=bg,"up-regulated"=defup,"down-regulated"=defdn),length)
## $background
## [1] 13164
## 
## $`up-regulated`
## [1] 1672
## 
## $`down-regulated`
## [1] 1926

One of the quirky things I learned about clusterprofiler is that not all genes specified by the user as background are included in the enrichmentcalculation. It appears as though the genes not included in the annotation set are discarded. I believe this is a subtle but significant problem. To fix it is relatively easy from the user’s point of view. Just create a new gene set that consists of all detected genes and

Adding all detected genes to the background appears to improve results!

bgdf <- data.frame("background",bg)
colnames(bgdf) <- c("term","gene")
genesets2 <- rbind(genesets2,bgdf)

Enrichment firstly with upregulated genes. Note that the BgRatio contains the correct number of genes in the background now.

# show 10 pathways only
n_pw=10

ora_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 50000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_up$geneID <- NULL
ora_up <- subset(ora_up,p.adjust<0.05 & Count >=10)
ora_ups <- rownames(ora_up)

gr <- as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",2))

ora_up$es <- gr/br
ora_up <- ora_up[order(-ora_up$es),]
ora_up$Description=NULL

head(ora_up,n_pw) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)
Top upregulated pathways in Aza treatment
ID GeneRatio BgRatio pvalue p.adjust qvalue Count es
Interactions of Rev with host cellular proteins 24/1672 36/13164 0 0 0 24 5.248804
Nuclear import of Rev protein 22/1672 33/13164 0 0 0 22 5.248804
Postmitotic nuclear pore complex (NPC) reformation 18/1672 27/13164 0 0 0 18 5.248804
Rev-mediated nuclear export of HIV RNA 22/1672 34/13164 0 0 0 22 5.094427
Export of Viral Ribonucleoproteins from Nucleus 20/1672 31/13164 0 0 0 20 5.079488
NEP/NS2 Interacts with the Cellular Export Machinery 20/1672 31/13164 0 0 0 20 5.079488
Transport of Ribonucleoproteins into the Host Nucleus 20/1672 31/13164 0 0 0 20 5.079488
Nuclear Pore Complex (NPC) Disassembly 22/1672 35/13164 0 0 0 22 4.948872
Defective TPR may confer susceptibility towards thyroid papillary carcinoma (TPC) 18/1672 29/13164 0 0 0 18 4.886817
Regulation of Glucokinase by Glucokinase Regulatory Protein 18/1672 29/13164 0 0 0 18 4.886817
topup2 <- rev(head(ora_up$es,10))
names(topup2) <- rev(head(ora_up$ID,10))

Now with the downregulated genes

ora_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 50000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_dn$geneID <- NULL
ora_dn <- subset(ora_dn,p.adjust<0.05 & Count >=10)
ora_dns <- rownames(ora_dn)

gr <- as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",2))

ora_dn$es <- gr/br
ora_dn <- ora_dn[order(-ora_dn$es),]
ora_dn$Description=NULL

head(ora_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)
Top downregulated pathways in Aza treatment
ID GeneRatio BgRatio pvalue p.adjust qvalue Count es
Interleukin-20 family signaling 10/1926 12/13164 0.0000002 0.0000108 0.0000089 10 5.695742
Interferon alpha/beta signaling 41/1926 55/13164 0.0000000 0.0000000 0.0000000 41 5.095101
Interferon gamma signaling 40/1926 63/13164 0.0000000 0.0000000 0.0000000 40 4.339613
Interleukin-10 signaling 17/1926 27/13164 0.0000000 0.0000009 0.0000007 17 4.303450
Other interleukin signaling 10/1926 18/13164 0.0000622 0.0021881 0.0018112 10 3.797162
Growth hormone receptor signaling 10/1926 19/13164 0.0001141 0.0036972 0.0030603 10 3.597311
FGFR1 mutant receptor activation 10/1926 20/13164 0.0001984 0.0055504 0.0045943 10 3.417445
RHO GTPases Activate NADPH Oxidases 10/1926 20/13164 0.0001984 0.0055504 0.0045943 10 3.417445
Smooth Muscle Contraction 13/1926 26/13164 0.0000216 0.0008075 0.0006684 13 3.417445
Interleukin-4 and Interleukin-13 signaling 34/1926 70/13164 0.0000000 0.0000000 0.0000000 34 3.319804
topdn2 <- head(ora_dn$es,n_pw)
names(topdn2) <- head(ora_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

barplot(c(topdn2,topup2),
  horiz=TRUE,las=1,cex.names=0.65,col=cols,
  main="top DE Reactomes",
  xlab="ES",
  cex.main=0.9)

mtext("ORA test")

Enrichment analysis with functional class scoring

Reporting criteria Method/resource used
Origin of gene sets Reactome (2023-03-06)
Tool used FGSEA (check version at foot of report)
Statistical test used Kolmogorov-Smirnov test
P-value correction for multiple comparisons FDR method
Background list Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria DESeq2 FDR<0.05
ORA directionality Separate tests for up- and down-regulation
Data availability via DEE2 at accession SRP038101 (human)
Other parameters Min gene set size of 10
def2 <- def
def2$genename <- sapply(strsplit(rownames(def2)," "),"[[",2)
def2 <- def2[,c("stat","genename")]
def2 <- aggregate(. ~ genename, def2, sum)
stat <- def2$stat
names(stat) <- def2$genename
fgseaRes <- fgsea(pathways=gsets, stats=stat, minSize=10, nPermSimple = 10000)
fgseaRes <- fgseaRes[order(fgseaRes$pval),]
fgsea_up <- subset(fgseaRes,padj<0.05 & ES>0)
fgsea_ups <- fgsea_up$pathway
fgsea_dn <- subset(fgseaRes,padj<0.05 & ES<0)
fgsea_dns <- fgsea_dn$pathway

fgsea_up <-  data.frame(fgsea_up[order(-fgsea_up$ES),])

fgsea_dn <-  data.frame(fgsea_dn[order(fgsea_dn$ES),])

head(fgsea_up,n_pw) %>%
  kbl(row.names = FALSE, caption="FGSEA:Top upregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)
FGSEA:Top upregulated pathways in Aza treatment
pathway pval padj log2err ES NES size leadingEdge
Condensation of Prometaphase Chromosomes 0.0000299 0.0002850 0.5756103 0.8415269 2.126444 11 SMC4 , NCAPG , CSNK2A2, CDK1 , CCNB1 , SMC2 , NCAPH , CCNB2 , NCAPD2
Postmitotic nuclear pore complex (NPC) reformation 0.0000000 0.0000000 0.8012156 0.7968889 2.585608 27 KPNB1 , NDC1 , AHCTF1 , RCC1 , NUP155 , TNPO1 , NUP160 , POM121 , NUP205 , NUP54 , NUP93 , NUP188 , NUP107 , NUP98 , RAN , NUP85 , SEH1L , NUP43 , SUMO1 , NUP133 , NUP62 , RANGAP1
Interactions of Rev with host cellular proteins 0.0000000 0.0000000 0.9101197 0.7869546 2.731408 36 KPNB1 , NDC1 , NUP153 , NPM1 , RCC1 , NUP155 , TPR , RANBP1 , RANBP2 , NUP160 , POM121 , NUP50 , XPO1 , NUP205 , NUP54 , NUP93 , NUP188 , NUP214 , NUP107 , NUP98 , RAN , NUP85 , SEH1L , NUP43 , NUP133 , POM121C, NUP62 , RANGAP1, NUP88 , RAE1
Nuclear import of Rev protein 0.0000000 0.0000000 0.8390889 0.7824653 2.667908 33 KPNB1 , NDC1 , NUP153 , NPM1 , RCC1 , NUP155 , TPR , RANBP2 , NUP160 , POM121 , NUP50 , NUP205 , NUP54 , NUP93 , NUP188 , NUP214 , NUP107 , NUP98 , RAN , NUP85 , SEH1L , NUP43 , NUP133 , POM121C, NUP62 , NUP88 , RAE1
Rev-mediated nuclear export of HIV RNA 0.0000000 0.0000000 0.8753251 0.7821972 2.685095 34 NDC1 , NUP153 , RCC1 , NUP155 , TPR , RANBP1 , RANBP2 , NUP160 , POM121 , NUP50 , XPO1 , NUP205 , NUP54 , NUP93 , NUP188 , NUP214 , NUP107 , NUP98 , RAN , NUP85 , SEH1L , NUP43 , NUP133 , POM121C, NUP62 , RANGAP1, NUP88 , RAE1
SLBP Dependent Processing of Replication-Dependent Histone Pre-mRNAs 0.0005261 0.0034662 0.4772708 0.7809511 1.973376 11 NCBP1 , LSM11 , SNRPD3, SNRPF , ZNF473, NCBP2 , SLBP , SNRPG , SNRPE
SLBP independent Processing of Histone Pre-mRNAs 0.0007848 0.0048608 0.4772708 0.7801965 1.910005 10 NCBP1 , LSM11 , SNRPD3, SNRPF , ZNF473, NCBP2 , SNRPG , SNRPE
Transport of Ribonucleoproteins into the Host Nucleus 0.0000000 0.0000000 0.8012156 0.7776911 2.616341 31 KPNB1 , NDC1 , NUP153 , NUP155 , TPR , RANBP2 , NUP160 , POM121 , NUP50 , NUP205 , NUP54 , NUP93 , NUP188 , KPNA1 , NUP214 , NUP107 , NUP98 , NUP85 , SEH1L , NUP43 , NUP133 , POM121C, NUP62 , NUP88 , RAE1
Impaired BRCA2 binding to PALB2 0.0000000 0.0000008 0.7195128 0.7767850 2.449986 24 XRCC2 , BLM , BRIP1 , BRCA1 , RAD51AP1, RAD51D , WRN , BRCA2 , EXO1 , PALB2 , DNA2 , BARD1 , RAD51 , RBBP8 , MRE11 , RAD51C
Export of Viral Ribonucleoproteins from Nucleus 0.0000000 0.0000000 0.8012156 0.7760569 2.610843 31 NDC1 , NUP153 , NUP155 , TPR , RANBP2 , NUP160 , POM121 , NUP50 , XPO1 , NUP205 , NUP54 , NUP93 , NUP188 , NUP214 , NUP107 , NUP98 , RAN , NUP85 , SEH1L , NUP43 , NUP133 , POM121C, NUP62 , NUP88 , RAE1
head(fgsea_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="FGSEA:Top downregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)
FGSEA:Top downregulated pathways in Aza treatment
pathway pval padj log2err ES NES size leadingEdge
Endosomal/Vacuolar pathway 0.0000016 0.0000200 0.6435518 -0.8981315 -2.107505 12 HLA-B, HLA-C, B2M , HLA-A, HLA-E, CTSS , HLA-H, HLA-F, HLA-G
Interferon alpha/beta signaling 0.0000000 0.0000000 1.1512205 -0.8564838 -2.765692 55 IFI27 , MX1 , OAS3 , HLA-B , HLA-C , OAS2 , OAS1 , IFIT1 , IFITM3, IFIT3 , STAT1 , HLA-A , ISG15 , IRF7 , STAT2 , IFI6 , IFIT5 , IFIT2 , IFI35 , BST2 , IFITM1, IRF5 , RSAD2 , GBP2 , XAF1 , USP18 , HLA-E , ADAR , MX2 , HLA-H , OASL , PSMB8 , SAMHD1, HLA-F , HLA-G , TYK2 , PTPN6 , ISG20
TNFs bind their physiological receptors 0.0023131 0.0121912 0.4317077 -0.8166001 -1.839888 10 TNFSF13B, CD70 , TNFRSF1B, EDA2R , TNFSF13 , TNFSF9
Interleukin-20 family signaling 0.0003977 0.0026994 0.4984931 -0.8141392 -1.910413 12 STAT1 , STAT2 , STAT5A, JAK3 , JAK2 , TYK2 , IL10RB, STAT3 , STAT5B, STAT4
Regulation of IFNA/IFNB signaling 0.0029060 0.0147850 0.3548195 -0.8037272 -1.810884 10 STAT1, STAT2, USP18, TYK2 , PTPN6
Maturation of nucleoprotein_9683610 0.0017802 0.0096755 0.4550599 -0.7952852 -1.828141 11 PARP9 , PARP14, PARP10
Interferon gamma signaling 0.0000000 0.0000000 1.0276699 -0.7934868 -2.625995 63 OAS3 , HLA-B , HLA-C , OAS2 , OAS1 , B2M , STAT1 , HLA-A , IRF7 , TRIM22, IRF5 , GBP2 , SMAD7 , TRIM25, HLA-E , TRIM14, HLA-H , OASL , TRIM5 , GBP3 , PTAFR , TRIM8 , SP100 , HLA-F , TRIM21, PML , JAK2 , HLA-G , IFNGR2, GBP5 , PTPN6 , IFNGR1, MT2A , NCAM1 , GBP4 , PRKCD , TRIM26
MyD88 deficiency (TLR2/4) 0.0012073 0.0069045 0.4550599 -0.7914249 -1.896202 13 MYD88 , S100A9, S100A8, CD36 , TLR6 , CD14 , TLR2
RUNX3 regulates p14-ARF 0.0085470 0.0328838 0.2045441 -0.7641348 -1.721678 10 CDKN2A, CCND1 , RUNX3 , RUNX1
Transcriptional regulation of pluripotent stem cells 0.0038970 0.0184421 0.3037794 -0.7636281 -1.791887 12 EPAS1, EOMES, PBX1 , NR5A1, STAT3, KLF4 , HHEX
fgsea_up <- head(fgsea_up,n_pw)
fgsea_up_vec <- fgsea_up$ES
names(fgsea_up_vec) <- fgsea_up$pathway
fgsea_up_vec <- rev(fgsea_up_vec)

fgsea_dn <- head(fgsea_dn,n_pw)
fgsea_dn_vec <- fgsea_dn$ES * -1
names(fgsea_dn_vec) <- fgsea_dn$pathway

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

barplot(c(fgsea_dn_vec,fgsea_up_vec),
  horiz=TRUE,las=1,cex.names=0.65,col=cols,
  main="top DE Reactomes",
  xlab="ES",
  cex.main=0.9)

mtext("FCS test")

Compare pathway results

v1 <- list("ORA up"=ora_ups, "FCS up"=fgsea_ups)
v2 <- list("ORA dn"=ora_dns, "FCS dn"=fgsea_dns)
par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2))

plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2))

Jaccard index

jaccard <- function(a, b) {
    intersection = length(intersect(a, b))
    union = length(a) + length(b) - intersection
    return (intersection/union)
}

ora <- c(ora_ups,ora_dns)
fcs <- c(fgsea_ups,fgsea_dns)

jaccard(ora,fcs)
## [1] 0.5592841

Drill in to a specific pathway

In this case the top upregulated and downregulated pathways

upname <- head(fgsea_up,1)$pathway
plotEnrichment(gsets[[upname]], stat, gseaParam = 1, ticksSize = 0.2)

dnname <- head(fgsea_dn,1)$pathway
plotEnrichment(gsets[[dnname]], stat, gseaParam = 1, ticksSize = 0.2)

Now a heatmap of these gene sets.

# reads per million normalisation
rpm <- apply(mxf,2, function(x) { x/sum(x) * 1000000} )
colnames(rpm) <- c("UNTR1","UNTR2","UNTR3","TRT1","TRT2","TRT3")
gnames_up <- gsets[[which(names(gsets) == upname)]]
gnames_dn <- gsets[[which(names(gsets) == dnname)]]
gene_ids <- rownames(mxf)
gene_names <- sapply(strsplit(gene_ids," "),"[[",2)
rpm_up <- rpm[which(gene_names %in% gnames_up),]
rownames(rpm_up) <- sapply(strsplit(rownames(rpm_up)," "),"[[",2)
rpm_dn <- rpm[which(gene_names %in% gnames_dn),]
rownames(rpm_dn) <- sapply(strsplit(rownames(rpm_dn)," "),"[[",2)
colsidecols <- c("blue","blue","blue","red","red","red")
heatmap.2(rpm_up,scale="row",trace="none",margins=c(10,15),main=upname,ColSideColors=colsidecols)

heatmap.2(rpm_dn,scale="row",trace="none",margins=c(10,15),main=dnname,ColSideColors=colsidecols)

Session information

For reproducibility


Click HERE to show session info

sessionInfo()
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] gplots_3.1.3                eulerr_7.0.0               
##  [3] fgsea_1.22.0                clusterProfiler_4.4.4      
##  [5] kableExtra_1.3.4            DESeq2_1.36.0              
##  [7] SummarizedExperiment_1.26.1 Biobase_2.56.0             
##  [9] MatrixGenerics_1.8.1        matrixStats_0.63.0         
## [11] GenomicRanges_1.48.0        GenomeInfoDb_1.32.2        
## [13] IRanges_2.30.0              S4Vectors_0.34.0           
## [15] BiocGenerics_0.42.0         getDEE2_1.6.0              
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3     rstudioapi_0.14        jsonlite_1.8.4        
##   [4] magrittr_2.0.3         farver_2.1.1           rmarkdown_2.21        
##   [7] zlibbioc_1.42.0        vctrs_0.6.2            memoise_2.0.1         
##  [10] RCurl_1.98-1.12        ggtree_3.4.1           webshot_0.5.4         
##  [13] htmltools_0.5.5        gridGraphics_0.5-1     sass_0.4.5            
##  [16] KernSmooth_2.23-20     bslib_0.4.2            plyr_1.8.8            
##  [19] cachem_1.0.7           igraph_1.4.2           lifecycle_1.0.3       
##  [22] pkgconfig_2.0.3        Matrix_1.5-4           R6_2.5.1              
##  [25] fastmap_1.1.1          GenomeInfoDbData_1.2.8 digest_0.6.31         
##  [28] aplot_0.1.10           enrichplot_1.16.1      colorspace_2.1-0      
##  [31] patchwork_1.1.2        AnnotationDbi_1.58.0   geneplotter_1.74.0    
##  [34] RSQLite_2.3.1          labeling_0.4.2         fansi_1.0.4           
##  [37] httr_1.4.5             polyclip_1.10-4        compiler_4.3.0        
##  [40] bit64_4.0.5            withr_2.5.0            downloader_0.4        
##  [43] BiocParallel_1.30.3    viridis_0.6.2          DBI_1.1.3             
##  [46] highr_0.10             ggforce_0.4.1          MASS_7.3-59           
##  [49] DelayedArray_0.22.0    caTools_1.18.2         gtools_3.9.4          
##  [52] tools_4.3.0            DO.db_2.9              scatterpie_0.1.9      
##  [55] ape_5.7-1              glue_1.6.2             nlme_3.1-162          
##  [58] GOSemSim_2.22.0        polylabelr_0.2.0       shadowtext_0.1.2      
##  [61] grid_4.3.0             reshape2_1.4.4         generics_0.1.3        
##  [64] gtable_0.3.3           tidyr_1.3.0            data.table_1.14.8     
##  [67] tidygraph_1.2.3        xml2_1.3.4             utf8_1.2.3            
##  [70] XVector_0.36.0         ggrepel_0.9.3          pillar_1.9.0          
##  [73] stringr_1.5.0          yulab.utils_0.0.6      genefilter_1.78.0     
##  [76] splines_4.3.0          dplyr_1.1.2            tweenr_2.0.2          
##  [79] treeio_1.20.1          lattice_0.21-8         survival_3.5-5        
##  [82] bit_4.0.5              annotate_1.74.0        tidyselect_1.2.0      
##  [85] GO.db_3.15.0           locfit_1.5-9.7         Biostrings_2.64.0     
##  [88] knitr_1.42             gridExtra_2.3          svglite_2.1.1         
##  [91] xfun_0.39              graphlayouts_0.8.4     stringi_1.7.12        
##  [94] lazyeval_0.2.2         ggfun_0.0.9            yaml_2.3.7            
##  [97] evaluate_0.20          codetools_0.2-19       ggraph_2.1.0          
## [100] tibble_3.2.1           qvalue_2.28.0          ggplotify_0.1.0       
## [103] cli_3.6.1              xtable_1.8-4           systemfonts_1.0.4     
## [106] munsell_0.5.0          jquerylib_0.1.4        Rcpp_1.0.10           
## [109] png_0.1-8              XML_3.99-0.14          parallel_4.3.0        
## [112] ggplot2_3.4.2          blob_1.2.4             DOSE_3.22.0           
## [115] htm2txt_2.2.2          bitops_1.0-7           tidytree_0.4.2        
## [118] viridisLite_0.4.1      scales_1.2.1           purrr_1.0.1           
## [121] crayon_1.5.2           rlang_1.1.1            fastmatch_1.1-3       
## [124] KEGGREST_1.36.3        rvest_1.0.3