A second example of extremely reproducible enrichment analysis

Source: https://github.com/markziemann/enrichment_recipe

Introduction

This guide is a Rmarkdown script that conducts differential expression and enrichment analysis, which are very popular workflows for transcriptome data.

This is meant to be a boilerplate template, which you can remix and modify to suit your analytical needs.

In the code chunk below called libs, you can add and remove required R library dependancies. Check that the libraries listed here match the Dockerfile, otherwise you might get errors.

suppressPackageStartupMessages({
  library("getDEE2")
  library("DESeq2")
  library("kableExtra")
  library("clusterProfiler")
  library("fgsea")
  library("eulerr")
  library("gplots")
})

For this guide I will be using bulk RNA-seq data from a previous study, which is deposited at NCBI GEO with accession GSE93236.

The experiment is designed to investigate the effect of knowckdown of the histone methyltransferase set7.

We will be fetching the gene expression count matrix directly from NCBI GEO for downstream analysis. Link: https://ftp.ncbi.nlm.nih.gov/geo/series/GSE93nnn/GSE93236/suppl/GSE93236_Keating_Set7_KD_counts.tsv.gz

download.file("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE93nnn/GSE93236/suppl/GSE93236_Keating_Set7_KD_counts.tsv.gz",
  destfile="GSE93236_Keating_Set7_KD_counts.tsv.gz")

x <- read.table("GSE93236_Keating_Set7_KD_counts.tsv.gz")

dim(x)

## [1] 18545     6

# there is a row we need to remove
mx <- x[rownames(x)!="NOTINBED",]

head(mx)

##                                 FC045_3_NTC1 FC045_4_Set7KD1 FC045_5_NTC2
## ENSG00000165195.9_PIGA                   303             291          280
## ENSG00000249719.1_RP3-486L4.3             54              51           48
## ENSG00000146966.8_DENND2A               1171             390         1153
## ENSG00000129103.13_SUMF2                1782            1476         1673
## ENSG00000256304.1_RP11-512M8.3            31              35           28
## ENSG00000203945.3_RP11-274K13.2           15              11           13
##                                 FC045_6_Set7KD2 FC045_7_NTC3 FC045_8_Set7KD3
## ENSG00000165195.9_PIGA                      294          262             257
## ENSG00000249719.1_RP3-486L4.3                50           49              60
## ENSG00000146966.8_DENND2A                   493         1019             309
## ENSG00000129103.13_SUMF2                   1473         1700            1366
## ENSG00000256304.1_RP11-512M8.3               43           26              47
## ENSG00000203945.3_RP11-274K13.2              10           11              11

dim(mx)

## [1] 18544     6

The row names contains an underscore that we need to remove.

rownames(mx) <- sub("_"," ",rownames(mx))

head(mx)

##                                 FC045_3_NTC1 FC045_4_Set7KD1 FC045_5_NTC2
## ENSG00000165195.9 PIGA                   303             291          280
## ENSG00000249719.1 RP3-486L4.3             54              51           48
## ENSG00000146966.8 DENND2A               1171             390         1153
## ENSG00000129103.13 SUMF2                1782            1476         1673
## ENSG00000256304.1 RP11-512M8.3            31              35           28
## ENSG00000203945.3 RP11-274K13.2           15              11           13
##                                 FC045_6_Set7KD2 FC045_7_NTC3 FC045_8_Set7KD3
## ENSG00000165195.9 PIGA                      294          262             257
## ENSG00000249719.1 RP3-486L4.3                50           49              60
## ENSG00000146966.8 DENND2A                   493         1019             309
## ENSG00000129103.13 SUMF2                   1473         1700            1366
## ENSG00000256304.1 RP11-512M8.3               43           26              47
## ENSG00000203945.3 RP11-274K13.2              10           11              11

Curate sample sheet from the colnames.

ss <- as.data.frame(colnames(x))
ss$trt <- as.numeric(grepl("Set7KD",colnames(x)))

ss %>%
  kbl(caption="sample sheet for Set7KD experiment") %>%
  kable_paper("hover", full_width = F)

sample sheet for Set7KD experiment
colnames(x)	trt
FC045_3_NTC1	0
FC045_4_Set7KD1	1
FC045_5_NTC2	0
FC045_6_Set7KD2	1
FC045_7_NTC3	0
FC045_8_Set7KD3	1

Data quality control

QC is important, even if you are using public transcriptome data. For RNA-seq it is a good idea to show the number of reads for each sample.

par(mar=c(5,7,5,1))
barplot(rev(colSums(mx)),horiz=TRUE,las=1,main="number of reads per sample in SRP038101")

Now make a MDS plot.

mds <- cmdscale(dist(t(mx)))

# expand plot area
XMIN=min(mds[,1])*1.3
XMAX=max(mds[,1])*1.3
YMIN=min(mds[,2])*1.3
YMAX=max(mds[,2])*1.3

cols <- as.character(ss$trt)
cols <- gsub("0","lightblue",cols)
cols <- gsub("1","pink",cols)
plot(mds, xlab="Coordinate 1", ylab="Coordinate 2",
  xlim=c(XMIN,XMAX),ylim=c(YMIN,YMAX),
  pch=19,cex=2,col=cols, main="MDS plot")
text(cmdscale(dist(t(mx))), labels=colnames(mx) )

Differential expression analysis

I will be using DESeq2 for DE analysis, the count matrix is prefiltered using a detection threshold of 10 reads per sample across all samples. All genes that meet the detection threshold will comprise the background list. The first 6 rows of the count matrix are shows.

mxf <- mx[which(rowMeans(mx)>=10),]
dim(mxf)

## [1] 18544     6

head(mxf,6) %>%
  kbl(caption="Count matrix format") %>%
  kable_paper("hover", full_width = F)

Count matrix format
	FC045_3_NTC1	FC045_4_Set7KD1	FC045_5_NTC2	FC045_6_Set7KD2	FC045_7_NTC3	FC045_8_Set7KD3
ENSG00000165195.9 PIGA	303	291	280	294	262	257
ENSG00000249719.1 RP3-486L4.3	54	51	48	50	49	60
ENSG00000146966.8 DENND2A	1171	390	1153	493	1019	309
ENSG00000129103.13 SUMF2	1782	1476	1673	1473	1700	1366
ENSG00000256304.1 RP11-512M8.3	31	35	28	43	26	47
ENSG00000203945.3 RP11-274K13.2	15	11	13	10	11	11

dds <- DESeqDataSetFromMatrix(countData = mxf , colData = ss, design = ~ trt )

##   the design formula contains one or more numeric variables with integer values,
##   specifying a model with increasing fold change for higher values.
##   did you mean for this to be a factor? if so, first convert
##   this variable to a factor using the factor() function

res <- DESeq(dds)

## estimating size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

z <- results(res)
vsd <- vst(dds, blind=FALSE)
zz <-cbind(as.data.frame(z),mxf)
def <-as.data.frame(zz[order(zz$pvalue),])

head(def,10) %>%
  kbl(caption="Top DE genes for Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top DE genes for Aza treatment
	baseMean	log2FoldChange	lfcSE	stat	FC045_3_NTC1	FC045_4_Set7KD1	FC045_5_NTC2	FC045_6_Set7KD2	FC045_7_NTC3	FC045_8_Set7KD3
ENSG00000164692.13 COL1A2	7325.1780	2.3007932	0.0872906	26.35785	2653	13366	2392	10048	2540	12544
ENSG00000168542.8 COL3A1	814.3688	2.7408245	0.1068516	25.65076	208	1554	228	1233	214	1395
ENSG00000172531.10 PPP1CA	1571.0716	-1.7187849	0.0764171	-22.49216	2598	706	2353	748	2444	705
ENSG00000106484.8 MEST	2657.6245	1.1789946	0.0690145	17.08328	1664	3796	1696	3399	1634	3683
ENSG00000130508.6 PXDN	8180.4490	-0.9726259	0.0619784	-15.69298	11527	5249	10764	5616	10961	5419
ENSG00000125868.10 DSTN	6067.2939	1.0357844	0.0664676	15.58330	4020	8146	3866	7524	4321	8393
ENSG00000169439.6 SDC2	3002.0236	1.3181753	0.0865541	15.22950	1892	4504	1802	3728	1577	4416
ENSG00000242265.1 PEG10	3171.5304	1.0736020	0.0708803	15.14670	2023	4234	1976	4070	2270	4380
ENSG00000166734.13 CASC4	1107.0779	-1.4078612	0.0930505	-15.13007	1620	535	1685	671	1626	580
ENSG00000091986.10 CCDC80	6264.0879	1.3185353	0.0878996	15.00047	3625	9486	3152	8289	4233	8613

Make a smear plot.

sigf <- subset(def,padj<=0.05)

DET=nrow(mxf)
NSIG=nrow(sigf)
NUP=nrow(subset(sigf,log2FoldChange>0))
NDN=nrow(subset(sigf,log2FoldChange<0))

HEADER=paste(DET,"detected genes;",NSIG,"w/FDR<0.05;",NUP,"up;",NDN,"down")

plot(log10(def$baseMean) ,def$log2FoldChange,
  cex=0.6,pch=19,col="darkgray",
  main="Set7KD in endothelial cells",
  xlab="log10(basemean)",ylab="log2(fold change)")

points(log10(sigf$baseMean) ,sigf$log2FoldChange,
  cex=0.6,pch=19,col="red")

mtext(HEADER)

In the next sections I will run enrichment analysis with over-representation test and compare it to functional class scoring.

Enrichment using over-representation test

I’ve compiled a reporting checklist:

Reporting criteria	Method/resource used
Origin of gene sets	Reactome (2023-03-06)
Tool used	ClusterProfiler (check version at foot of report)
Statistical test used	hypergeometric test
P-value correction for multiple comparisons	FDR method
Background list	Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria	DESeq2 FDR<0.05
ORA directionality	Separate tests for up- and down-regulation
Data availability	GEO under accession GSE93236
Other parameters	Min gene set size of 10

Get the gene sets loaded in R. They are located in the /ref folder of the repo.

# from https://reactome.org/download/current/ReactomePathways.gmt.zip
genesets2 <- read.gmt("ref/ReactomePathways_2023-03-06.gmt")
gsets <- gmtPathways("ref/ReactomePathways_2023-03-06.gmt")

Now filter the gene names into three lists, up-regulated, down-regulated and background. Background is simply all genes that were detected.

defup <- rownames(subset(def,padj<=0.05 & log2FoldChange>0))
defup <- unique(sapply(strsplit(defup," "),"[[",2))

defdn <- rownames(subset(def,padj<=0.05 & log2FoldChange<0))
defdn <- unique(sapply(strsplit(defdn," "),"[[",2))

bg <- rownames(def)
bg <- unique(sapply(strsplit(bg," "),"[[",2))

message("number of genes in each group")

## number of genes in each group

lapply(list("background"=bg,"up-regulated"=defup,"down-regulated"=defdn),length)

## $background
## [1] 18504
## 
## $`up-regulated`
## [1] 2700
## 
## $`down-regulated`
## [1] 2682

One of the quirky things I learned about clusterprofiler is that not all genes specified by the user as background are included in the enrichmentcalculation. It appears as though the genes not included in the annotation set are discarded. I believe this is a subtle but significant problem. To fix it is relatively easy from the user’s point of view. Just create a new gene set that consists of all detected genes and

Adding all detected genes to the background appears to improve results!

bgdf <- data.frame("background",bg)
colnames(bgdf) <- c("term","gene")
genesets2 <- rbind(genesets2,bgdf)

Enrichment firstly with upregulated genes. Note that the BgRatio contains the correct number of genes in the background now.

# show 10 pathways only
n_pw=10

ora_up <- as.data.frame(enricher(gene = defup ,
  universe = bg,  maxGSSize = 50000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_up$geneID <- NULL
ora_up <- subset(ora_up,p.adjust<0.05 & Count >=10)
ora_ups <- rownames(ora_up)

gr <- as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_up$BgRatio,"/"),"[[",2))

ora_up$es <- gr/br
ora_up <- ora_up[order(-ora_up$es),]
ora_up$Description=NULL

head(ora_up,n_pw) %>%
  kbl(row.names = FALSE, caption="Top upregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top upregulated pathways in Aza treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
Cell-extracellular matrix interactions	12/2700	18/18504	7.00e-07	0.0000126	0.0000072	12	4.568889
Initiation of Nuclear Envelope (NE) Reformation	11/2700	19/18504	1.52e-05	0.0001297	0.0000741	11	3.967719
Formation of tubulin folding intermediates by CCT/TriC	12/2700	21/18504	7.40e-06	0.0000716	0.0000409	12	3.916190
RHOBTB1 GTPase cycle	12/2700	23/18504	2.55e-05	0.0001958	0.0001118	12	3.575652
RHOBTB2 GTPase cycle	12/2700	23/18504	2.55e-05	0.0001958	0.0001118	12	3.575652
Postmitotic nuclear pore complex (NPC) reformation	13/2700	25/18504	1.22e-05	0.0001093	0.0000624	13	3.563733
Syndecan interactions	14/2700	27/18504	5.90e-06	0.0000639	0.0000365	14	3.553580
Transport of Ribonucleoproteins into the Host Nucleus	15/2700	29/18504	2.80e-06	0.0000346	0.0000198	15	3.544828
Transport of the SLBP Dependant Mature mRNA	17/2700	33/18504	7.00e-07	0.0000119	0.0000068	17	3.530505
EPHB-mediated forward signaling	19/2700	37/18504	2.00e-07	0.0000035	0.0000020	19	3.519279

topup2 <- rev(head(ora_up$es,10))
names(topup2) <- rev(head(ora_up$ID,10))

Now with the downregulated genes

ora_dn <- as.data.frame(enricher(gene = defdn ,
  universe = bg,  maxGSSize = 50000, TERM2GENE = genesets2,
  pAdjustMethod="fdr",  pvalueCutoff = 1, qvalueCutoff = 1  ))

ora_dn$geneID <- NULL
ora_dn <- subset(ora_dn,p.adjust<0.05 & Count >=10)
ora_dns <- rownames(ora_dn)

gr <- as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$GeneRatio,"/"),"[[",2))

br <- as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",1)) /
  as.numeric(sapply(strsplit(ora_dn$BgRatio,"/"),"[[",2))

ora_dn$es <- gr/br
ora_dn <- ora_dn[order(-ora_dn$es),]
ora_dn$Description=NULL

head(ora_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="Top downregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

Top downregulated pathways in Aza treatment
ID	GeneRatio	BgRatio	pvalue	p.adjust	qvalue	Count	es
Interferon alpha/beta signaling	29/2682	55/18504	0.0e+00	0.0e+00	0.00e+00	29	3.637828
Peptide chain elongation	40/2682	87/18504	0.0e+00	0.0e+00	0.00e+00	40	3.172105
Eukaryotic Translation Elongation	42/2682	92/18504	0.0e+00	0.0e+00	0.00e+00	42	3.149694
Response of EIF2AK4 (GCN2) to amino acid deficiency	44/2682	98/18504	0.0e+00	0.0e+00	0.00e+00	44	3.097658
Selenocysteine synthesis	40/2682	90/18504	0.0e+00	0.0e+00	0.00e+00	40	3.066368
Viral mRNA Translation	38/2682	87/18504	0.0e+00	0.0e+00	0.00e+00	38	3.013500
FOXO-mediated transcription	22/2682	52/18504	1.1e-06	4.1e-05	3.56e-05	22	2.918947
Eukaryotic Translation Termination	38/2682	91/18504	0.0e+00	0.0e+00	0.00e+00	38	2.881038
Selenoamino acid metabolism	43/2682	103/18504	0.0e+00	0.0e+00	0.00e+00	43	2.880302
Formation of a pool of free 40S subunits	41/2682	99/18504	0.0e+00	0.0e+00	0.00e+00	41	2.857298

topdn2 <- head(ora_dn$es,n_pw)
names(topdn2) <- head(ora_dn$ID,n_pw)

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

barplot(c(topdn2,topup2),
  horiz=TRUE,las=1,cex.names=0.65,col=cols,
  main="top DE Reactomes",
  xlab="ES",
  cex.main=0.9)

mtext("ORA test")

Enrichment analysis with functional class scoring

Reporting criteria	Method/resource used
Origin of gene sets	Reactome (2023-03-06)
Tool used	FGSEA (check version at foot of report)
Statistical test used	Kolmogorov-Smirnov test
P-value correction for multiple comparisons	FDR method
Background list	Genes with >=10 reads per sample on average across all samples
Gene Selection Criteria	DESeq2 FDR<0.05
ORA directionality	Separate tests for up- and down-regulation
Data availability	via DEE2 at accession SRP038101 (human)
Other parameters	Min gene set size of 10

def2 <- def
def2$genename <- sapply(strsplit(rownames(def2)," "),"[[",2)
def2 <- def2[,c("stat","genename")]
def2 <- aggregate(. ~ genename, def2, sum)
stat <- def2$stat
names(stat) <- def2$genename
fgseaRes <- fgsea(pathways=gsets, stats=stat, minSize=10, nPermSimple = 10000)

## Warning in preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, : There are ties in the preranked stats (0.06% of the list).
## The order of those tied genes will be arbitrary, which may produce unexpected results.

fgseaRes <- fgseaRes[order(fgseaRes$pval),]

fgsea_up <- subset(fgseaRes,padj<0.05 & ES>0)
fgsea_ups <- fgsea_up$pathway
fgsea_dn <- subset(fgseaRes,padj<0.05 & ES<0)
fgsea_dns <- fgsea_dn$pathway

fgsea_up <-  data.frame(fgsea_up[order(-fgsea_up$ES),])

fgsea_dn <-  data.frame(fgsea_dn[order(fgsea_dn$ES),])

head(fgsea_up,n_pw) %>%
  kbl(row.names = FALSE, caption="FGSEA:Top upregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

FGSEA:Top upregulated pathways in Aza treatment
pathway	pval	padj	log2err	ES	NES	size	leadingEdge
Folding of actin by CCT/TriC	0.0000049	0.0001013	0.6105269	0.8964312	2.140545	10	CCT5 , TCP1 , CCT6A, ACTB , CCT3 , CCT2 , CCT7 , CCT4
GP1b-IX-V activation signalling	0.0001711	0.0015309	0.5188481	0.8341859	1.991913	10	COL1A2, PIK3R1, FLNA , COL1A1, RAF1
Platelet Adhesion to exposed collagen	0.0011459	0.0078804	0.4550599	0.7729294	1.891480	11	COL1A2, ITGA1 , COL1A1, ITGB1 , ITGA2
Cell-extracellular matrix interactions	0.0000247	0.0003444	0.5756103	0.7577980	2.118309	18	RSU1 , FERMT2, ACTN1 , FLNA , FBLIM1, VASP , ACTB , PARVA , ITGB1 , LIMS1 , ILK , FLNC
Formation of tubulin folding intermediates by CCT/TriC	0.0000062	0.0001184	0.6105269	0.7545947	2.200061	21	TUBB3 , CCT5 , TCP1 , TUBB2A, CCT6A , CCT3 , CCT2 , TUBA1C, CCT7 , CCT4 , TUBA1B
MASTL Facilitates Mitotic Progression	0.0080462	0.0408024	0.2320568	0.7262829	1.734256	10	PPP2CA , PPP2R1B, PPP2R1A, PPP2CB , ENSA
Prefoldin mediated transfer of substrate to CCT/TriC	0.0000056	0.0001107	0.6105269	0.7165120	2.199181	26	TUBB3 , CCT5 , TCP1 , TUBB2A, CCT6A , ACTB , CCT3 , CCT2 , TUBA1C, CCT7 , CCT4 , VBP1
Cooperation of Prefoldin and TriC/CCT in actin and tubulin folding	0.0000018	0.0000510	0.6435518	0.7130937	2.233069	28	TUBB3 , CCT5 , TCP1 , TUBB2A, CCT6A , ACTB , CCT3 , CCT2 , TUBA1C, CCT7 , CCT4 , TUBA1B, VBP1
CTNNB1 S33 mutants aren’t phosphorylated	0.0018180	0.0113301	0.4550599	0.7026577	1.841133	14	PPP2CA , PPP2R1B, CTNNB1 , PPP2R5A, APC , PPP2R1A, PPP2CB
CTNNB1 S37 mutants aren’t phosphorylated	0.0018180	0.0113301	0.4550599	0.7026577	1.841133	14	PPP2CA , PPP2R1B, CTNNB1 , PPP2R5A, APC , PPP2R1A, PPP2CB

head(fgsea_dn,n_pw) %>%
  kbl(row.names = FALSE, caption="FGSEA:Top downregulated pathways in Aza treatment") %>%
  kable_paper("hover", full_width = F)

FGSEA:Top downregulated pathways in Aza treatment
pathway	pval	padj	log2err	ES	NES	size	leadingEdge
Endosomal/Vacuolar pathway	0.0000648	0.0007066	0.5384341	-0.8564798	-2.028123	10	HLA-E, CTSS , HLA-A, HLA-C, HLA-H, B2M , HLA-F, HLA-B, HLA-G
CASP8 activity is inhibited	0.0046557	0.0257910	0.2972329	-0.7534651	-1.784187	10	TNFRSF10B, TNFRSF10A, FAS , CFLAR , TRADD , TRAF2 , TNFSF10
Dimerization of procaspase-8	0.0046557	0.0257910	0.2972329	-0.7534651	-1.784187	10	TNFRSF10B, TNFRSF10A, FAS , CFLAR , TRADD , TRAF2 , TNFSF10
Regulation by c-FLIP	0.0046557	0.0257910	0.2972329	-0.7534651	-1.784187	10	TNFRSF10B, TNFRSF10A, FAS , CFLAR , TRADD , TRAF2 , TNFSF10
ABC transporters in lipid homeostasis	0.0087549	0.0427471	0.2157829	-0.6668947	-1.733942	14	ABCA2 , ABCG1 , ABCA3 , ABCA7 , ABCA6 , ABCA9 , ABCA5 , ABCA12
Regulation of RUNX1 Expression and Activity	0.0102121	0.0478077	0.2005132	-0.6411896	-1.693097	15	TNRC6B, PTPN11, CCND2 , CBFB , PML
Eukaryotic Translation Elongation	0.0000000	0.0000000	0.8753251	-0.6273228	-2.468077	92	RPL3 , EEF1A2 , RPS14 , RPL13A , RPL19 , RPS11 , RPL13 , RPS13 , EEF2 , RPS16 , RPL28 , RPS20 , RPS9 , RPL4 , RPS8 , RPL10A , RPL22L1, RPL15 , RPS19 , RPS3 , RPL8 , RPL37 , RPL26 , RPL12 , RPS6 , RPL23A , RPL29 , RPL37A , RPL22 , RPL18 , RPS21 , RPL31 , RPL18A , EEF1G , RPLP2 , RPL27A , RPS4X , RPL23 , RPL5 , RPS18 , RPL35A , EEF1A1 , RPL34 , RPL7A , RPS15 , RPS29 , EEF1D , RPLP1 , RPL30 , RPL14 , RPL36
Peptide chain elongation	0.0000000	0.0000000	0.8634154	-0.6251293	-2.436687	87	RPL3 , RPS14 , RPL13A , RPL19 , RPS11 , RPL13 , RPS13 , EEF2 , RPS16 , RPL28 , RPS20 , RPS9 , RPL4 , RPS8 , RPL10A , RPL22L1, RPL15 , RPS19 , RPS3 , RPL8 , RPL37 , RPL26 , RPL12 , RPS6 , RPL23A , RPL29 , RPL37A , RPL22 , RPL18 , RPS21 , RPL31 , RPL18A , RPLP2 , RPL27A , RPS4X , RPL23 , RPL5 , RPS18 , RPL35A , EEF1A1 , RPL34 , RPL7A , RPS15 , RPS29 , RPLP1 , RPL30 , RPL14 , RPL36 , RPL10 , RPL32 , RPS15A , RPL35 , RPS27 , RPLP0
Selenocysteine synthesis	0.0000000	0.0000000	0.8513391	-0.6159046	-2.414271	90	RPL3 , RPS14 , RPL13A , RPL19 , RPS11 , RPL13 , RPS13 , RPS16 , RPL28 , RPS20 , RPS9 , RPL4 , RPS8 , RPL10A , RPL22L1 , RPL15 , SEPSECS , RPS19 , RPS3 , RPL8 , RPL37 , RPL26 , RPL12 , RPS6 , RPL23A , RPL29 , RPL37A , RPL22 , RPL18 , RPS21 , RPL31 , RPL18A , RPLP2 , RPL27A , RPS4X , RPL23 , RPL5 , RPS18 , SECISBP2, RPL35A , RPL34 , RPL7A , RPS15 , RPS29 , RPLP1 , RPL30 , RPL14 , RPL36 , RPL10 , RPL32 , RPS15A , RPL35 , RPS27 , RPLP0 , EEFSEC , RPL11 , RPL38 , RPS28 , RPS4Y1 , RPL36AL
Interferon alpha/beta signaling	0.0000007	0.0000262	0.6594444	-0.6136285	-2.205558	55	HLA-E , STAT2 , PTPN11, IFITM3, BST2 , ISG20 , HLA-A , IRF1 , HLA-C , IRF3 , IFITM1, HLA-H , IFITM2, IRF7 , HLA-F , MX2 , USP18 , SOCS3 , IRF9 , ISG15 , IFI27 , HLA-B , EGR1 , IFIT1 , OAS3 , HLA-G , RNASEL, MX1 , IFNAR2

fgsea_up <- head(fgsea_up,n_pw)
fgsea_up_vec <- fgsea_up$ES
names(fgsea_up_vec) <- fgsea_up$pathway
fgsea_up_vec <- rev(fgsea_up_vec)

fgsea_dn <- head(fgsea_dn,n_pw)
fgsea_dn_vec <- fgsea_dn$ES * -1
names(fgsea_dn_vec) <- fgsea_dn$pathway

Make a barplot

par(mar=c(5,20,5,1))

cols <- c(rep("blue",n_pw),rep("red",n_pw))

barplot(c(fgsea_dn_vec,fgsea_up_vec),
  horiz=TRUE,las=1,cex.names=0.65,col=cols,
  main="top DE Reactomes",
  xlab="ES",
  cex.main=0.9)

mtext("FCS test")

Compare pathway results

v1 <- list("ORA up"=ora_ups, "FCS up"=fgsea_ups)
v2 <- list("ORA dn"=ora_dns, "FCS dn"=fgsea_dns)
par(mar=c(10,10,10,10))
par(mfrow=c(2,1))
plot(euler(v1),quantities = list(cex = 2), labels = list(cex = 2))

plot(euler(v2),quantities = list(cex = 2), labels = list(cex = 2))

Jaccard index

jaccard <- function(a, b) {
    intersection = length(intersect(a, b))
    union = length(a) + length(b) - intersection
    return (intersection/union)
}

ora <- c(ora_ups,ora_dns)
fcs <- c(fgsea_ups,fgsea_dns)

jaccard(ora,fcs)

## [1] 0.5233463

Drill in to a specific pathway

In this case the top upregulated and downregulated pathways

upname <- head(fgsea_up,1)$pathway
plotEnrichment(gsets[[upname]], stat, gseaParam = 1, ticksSize = 0.2)

dnname <- head(fgsea_dn,1)$pathway
plotEnrichment(gsets[[dnname]], stat, gseaParam = 1, ticksSize = 0.2)

Now a heatmap of these gene sets.

# reads per million normalisation
rpm <- apply(mxf,2, function(x) { x/sum(x) * 1000000} )
colnames(rpm) <- c("UNTR1","UNTR2","UNTR3","TRT1","TRT2","TRT3")
gnames_up <- gsets[[which(names(gsets) == upname)]]
gnames_dn <- gsets[[which(names(gsets) == dnname)]]
gene_ids <- rownames(mxf)
gene_names <- sapply(strsplit(gene_ids," "),"[[",2)
rpm_up <- rpm[which(gene_names %in% gnames_up),]
rownames(rpm_up) <- sapply(strsplit(rownames(rpm_up)," "),"[[",2)
rpm_dn <- rpm[which(gene_names %in% gnames_dn),]
rownames(rpm_dn) <- sapply(strsplit(rownames(rpm_dn)," "),"[[",2)
colsidecols <- c("blue","blue","blue","red","red","red")
heatmap.2(rpm_up,scale="row",trace="none",margins=c(10,15),main=upname,ColSideColors=colsidecols)

heatmap.2(rpm_dn,scale="row",trace="none",margins=c(10,15),main=dnname,ColSideColors=colsidecols)

Session information

For reproducibility

Click HERE to show session info

sessionInfo()

## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] gplots_3.1.3                eulerr_7.0.0               
##  [3] fgsea_1.22.0                clusterProfiler_4.4.4      
##  [5] kableExtra_1.3.4            DESeq2_1.36.0              
##  [7] SummarizedExperiment_1.26.1 Biobase_2.56.0             
##  [9] MatrixGenerics_1.8.1        matrixStats_0.63.0         
## [11] GenomicRanges_1.48.0        GenomeInfoDb_1.32.2        
## [13] IRanges_2.30.0              S4Vectors_0.34.0           
## [15] BiocGenerics_0.42.0         getDEE2_1.6.0              
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3     jsonlite_1.8.4         rstudioapi_0.14       
##   [4] magrittr_2.0.3         farver_2.1.1           rmarkdown_2.21        
##   [7] zlibbioc_1.42.0        vctrs_0.6.2            memoise_2.0.1         
##  [10] RCurl_1.98-1.12        ggtree_3.4.1           webshot_0.5.4         
##  [13] htmltools_0.5.5        gridGraphics_0.5-1     sass_0.4.5            
##  [16] bslib_0.4.2            KernSmooth_2.23-20     plyr_1.8.8            
##  [19] cachem_1.0.7           igraph_1.4.2           lifecycle_1.0.3       
##  [22] pkgconfig_2.0.3        Matrix_1.5-4           R6_2.5.1              
##  [25] fastmap_1.1.1          GenomeInfoDbData_1.2.8 digest_0.6.31         
##  [28] aplot_0.1.10           enrichplot_1.16.1      colorspace_2.1-0      
##  [31] patchwork_1.1.2        AnnotationDbi_1.58.0   geneplotter_1.74.0    
##  [34] RSQLite_2.3.1          labeling_0.4.2         fansi_1.0.4           
##  [37] httr_1.4.5             polyclip_1.10-4        compiler_4.3.0        
##  [40] bit64_4.0.5            withr_2.5.0            downloader_0.4        
##  [43] BiocParallel_1.30.3    viridis_0.6.2          DBI_1.1.3             
##  [46] highr_0.10             ggforce_0.4.1          MASS_7.3-59           
##  [49] DelayedArray_0.22.0    caTools_1.18.2         gtools_3.9.4          
##  [52] tools_4.3.0            DO.db_2.9              scatterpie_0.1.9      
##  [55] ape_5.7-1              glue_1.6.2             nlme_3.1-162          
##  [58] GOSemSim_2.22.0        polylabelr_0.2.0       shadowtext_0.1.2      
##  [61] grid_4.3.0             reshape2_1.4.4         generics_0.1.3        
##  [64] gtable_0.3.3           tidyr_1.3.0            data.table_1.14.8     
##  [67] tidygraph_1.2.3        xml2_1.3.4             utf8_1.2.3            
##  [70] XVector_0.36.0         ggrepel_0.9.3          pillar_1.9.0          
##  [73] stringr_1.5.0          yulab.utils_0.0.6      genefilter_1.78.0     
##  [76] splines_4.3.0          dplyr_1.1.2            tweenr_2.0.2          
##  [79] treeio_1.20.1          lattice_0.21-8         survival_3.5-5        
##  [82] bit_4.0.5              annotate_1.74.0        tidyselect_1.2.0      
##  [85] GO.db_3.15.0           locfit_1.5-9.7         Biostrings_2.64.0     
##  [88] knitr_1.42             gridExtra_2.3          svglite_2.1.1         
##  [91] xfun_0.39              graphlayouts_0.8.4     stringi_1.7.12        
##  [94] yaml_2.3.7             lazyeval_0.2.2         ggfun_0.0.9           
##  [97] evaluate_0.20          codetools_0.2-19       ggraph_2.1.0          
## [100] tibble_3.2.1           qvalue_2.28.0          ggplotify_0.1.0       
## [103] cli_3.6.1              xtable_1.8-4           systemfonts_1.0.4     
## [106] jquerylib_0.1.4        munsell_0.5.0          Rcpp_1.0.10           
## [109] png_0.1-8              XML_3.99-0.14          parallel_4.3.0        
## [112] ggplot2_3.4.2          blob_1.2.4             DOSE_3.22.0           
## [115] htm2txt_2.2.2          bitops_1.0-7           tidytree_0.4.2        
## [118] viridisLite_0.4.1      scales_1.2.1           purrr_1.0.1           
## [121] crayon_1.5.2           rlang_1.1.1            fastmatch_1.1-3       
## [124] KEGGREST_1.36.3        rvest_1.0.3