---
title: "Example gene set analysis"
author: "Mark Ziemann & Kaumadi Wijesooriya"
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_float: true
code_folding: hide
fig_width: 7
fig_height: 5
theme: cosmo
---
Source: https://github.com/markziemann/SurveyEnrichmentMethods
## Intro
Here we are performing an analysis of some gene expression data to understand whether direction agnostic (DA) or direction informed (DI)
enrichment analysis is recommended.
This will be demonstrated by using two approaches: mitch and clusterprofiler.
Mitch is a FCS method while clusterprofiler is an ORA method.
The dataset being used is SRP128998 and we are comparing the cells grown in normal glucose condition (control) to the high glucose condition (case).
Data are obtained from http://dee2.io/
```{r,begin}
suppressPackageStartupMessages({
library("getDEE2")
library("DESeq2")
library("clusterProfiler")
library("mitch")
library("kableExtra")
library("eulerr")
})
dir.create("images")
```
## Get expression data and make an MDS plot
I'm using some RNA-seq data looking at the effect of hyperglycemia on hepatocytes.
```{r,getdata}
name="SRP128998"
mdat<-getDEE2Metadata("hsapiens")
samplesheet <- mdat[grep("SRP128998",mdat$SRP_accession),]
samplesheet<-samplesheet[order(samplesheet$SRR_accession),]
samplesheet$trt<-as.factor(c(1,1,1,1,1,1,0,0,0,0,0,0))
samplesheet$VPA<-as.factor(c(0,0,0,1,1,1,0,0,0,1,1,1))
s1 <- subset(samplesheet,VPA==0)
s1 %>% kbl(caption = "sample sheet") %>% kable_paper("hover", full_width = F)
w<-getDEE2("hsapiens",samplesheet$SRR_accession,metadata=mdat,legacy = TRUE)
x<-Tx2Gene(w)
x<-x$Tx2Gene
# save the genetable for later
gt<-w$GeneInfo[,1,drop=FALSE]
gt$accession<-rownames(gt)
# counts
x1<-x[,which(colnames(x) %in% s1$SRR_accession)]
```
Here show the number of genes in the annotation set, and those detected above the detection threshold.
```{r,filter}
# filter out lowly expressed genes
x1<-x1[which(rowSums(x1)/ncol(x1)>=(10)),]
nrow(x)
nrow(x1)
```
Now multidimensional scaling (MDS) plot to show the correlation between the datasets.
If the control and case datasets are clustered separately, then it is likely that there will be many differentially expressed genes with FDR<0.05.
```{r,mds}
plot(cmdscale(dist(t(x1))), xlab="Coordinate 1", ylab="Coordinate 2", pch=19, col=s1$trt, main="MDS")
```
## Differential expression
Now run DESeq2 for control vs case.
```{r,deseq2}
y <- DESeqDataSetFromMatrix(countData = round(x1), colData = s1, design = ~ trt)
y <- DESeq(y)
de <- results(y)
de<-as.data.frame(de[order(de$pvalue),])
rownames(de)<-sapply(strsplit(rownames(de),"\\."),"[[",1)
head(de) %>% kbl() %>% kable_paper("hover", full_width = F)
```
Now let's have a look at some of the charts showing differential expression.
In particular, an MA plot and volcano plot.
```{r,deplots,fig.width=7,fig.height=7}
maplot <- function(de,contrast_name) {
sig <-subset(de, padj < 0.05 )
up <-rownames(subset(de, padj < 0.05 & log2FoldChange > 0))
dn <-rownames(subset(de, padj < 0.05 & log2FoldChange < 0))
GENESUP <- length(up)
GENESDN <- length(dn)
DET=nrow(de)
SUBHEADER = paste(GENESUP, "up, ", GENESDN, "down", DET, "detected")
ns <-subset(de, padj > 0.05 )
plot(log2(de$baseMean),de$log2FoldChange,
xlab="log2 basemean", ylab="log2 foldchange",
pch=19, cex=0.5, col="dark gray",
main=contrast_name, cex.main=0.7)
points(log2(sig$baseMean),sig$log2FoldChange,
pch=19, cex=0.5, col="red")
mtext(SUBHEADER,cex = 0.7)
}
make_volcano <- function(de,name) {
sig <- subset(de,padj<0.05)
N_SIG=nrow(sig)
N_UP=nrow(subset(sig,log2FoldChange>0))
N_DN=nrow(subset(sig,log2FoldChange<0))
DET=nrow(de)
HEADER=paste(N_SIG,"@5%FDR,", N_UP, "up", N_DN, "dn", DET, "detected")
plot(de$log2FoldChange,-log10(de$padj),cex=0.5,pch=19,col="darkgray",
main=name, xlab="log2 FC", ylab="-log10 pval", xlim=c(-6,6))
mtext(HEADER)
grid()
points(sig$log2FoldChange,-log10(sig$padj),cex=0.5,pch=19,col="red")
}
maplot(de,name)
make_volcano(de,name)
```
## Gene sets from Reactome
In order to perform gene set analysis, we need some gene sets.
```{r,reactome}
if (! file.exists("ReactomePathways.gmt")) {
download.file("https://reactome.org/download/current/ReactomePathways.gmt.zip",
destfile="ReactomePathways.gmt.zip")
unzip("ReactomePathways.gmt.zip")
}
genesets<-gmt_import("ReactomePathways.gmt")
```
## FCS with Mitch
Mitch uses rank-ANOVA statistics for enrichment detection.
Here I'm using the standard approach
```{r,mitch1}
m <- mitch_import(de,DEtype = "DEseq2", geneTable = gt)
msep <- mitch_calc(m,genesets = genesets)
ms_up <- subset(msep$enrichment_result,p.adjustANOVA<0.05 & s.dist > 0)[,1]
ms_dn <- subset(msep$enrichment_result,p.adjustANOVA<0.05 & s.dist < 0)[,1]
message(paste("Number of up-regulated pathways:",length(ms_up) ))
message(paste("Number of down-regulated pathways:",length(ms_dn) ))
head(msep$enrichment_result,10) #%>% kbl() %>% kable_paper("hover", full_width = F)
#mitch_report(msep,outfile="mitch_separate.html",overwrite=TRUE)
```
Here I'm using the combined approach.
```{r,mitch2}
mcom <- mitch_calc(abs(m),genesets = genesets)
mc_up <- subset(mcom$enrichment_result,p.adjustANOVA<0.05 & s.dist > 0)[,1]
mc_dn <- subset(mcom$enrichment_result,p.adjustANOVA<0.05 & s.dist < 0)[,1]
message(paste("Number of up-regulated pathways:",length(mc_up) ))
message(paste("Number of down-regulated pathways:",length(mc_dn) ))
head(mcom$enrichment_result,10) #%>% kbl() %>% kable_paper("hover", full_width = F)
#mitch_report(mcom,outfile="mitch_combined.html",overwrite=TRUE)
```
Let's look at the significant ones based on the combined analysis.
There weren't many gene sets classed as significant.
Let's see how many have a direction agnostic enrichment score which is larger in magnitude than the direction informed enrichment score.
There are only 11 such sets which would benefit from such combined analysis.
```{r,mitchres1}
mres$enrichment_result$padj.abs <- p.adjust(mres$enrichment_result$p.abs)
mres2 <- subset(mres$enrichment_result, padj.abs < 0.05)
subset(mres2, s.abs > abs(s.x))
dim(subset(mres2, s.abs > abs(s.x)))
```
Euler diagram of the significant pathways found with each approach.
```{r,mitch_euler1}
l0 <- list("sep up"=ms_up,"sep dn"=ms_dn,"comb up"=mc_up,"comb dn"=mc_dn)
par(cex.main=0.5)
plot(euler(l0),quantities = TRUE, edges = "gray", main="FCS: combined vs separated")
length(ms_up)
length(ms_dn)
length(ms_up)+length(ms_dn)
length(mc_up)
length(mc_dn)
length(mc_up)+length(mc_dn)
( length(ms_up)+length(ms_dn) ) / ( length(mc_up)+length(mc_dn) )
```
List gene sets which are specific to each approach.
```{r,mitch_sets}
ms <- c(ms_up,ms_dn)
# in sep but not comb
setdiff(ms,mc_up)
# in comb but not sep
setdiff(mc_up,ms)
# intersection
intersect(mc_up,ms)
```
Here I'm doing both (separated and combined) at once.
```{r,mitch3}
m$abs <- abs(m$x)
mres <- mitch_calc(m,genesets = genesets)
head(mres$enrichment_result,10) #%>% kbl() %>% kable_paper("hover", full_width = F)
#mitch_report(mres,outfile="mitch_multi.html",overwrite=TRUE)
mm <- subset(mres$enrichment_result, p.adjustMANOVA<0.05)[,1]
```
Euler diagram to show the overlaps of significant results.
```{r,mitch_euler2}
l1 <- list("sep up"=ms_up,"sep dn"=ms_dn,"comb up"=mc_up,"comb dn"=mc_dn,"multi"=mm)
plot(euler(l1),quantities = TRUE, edges = "gray", main="FCS: combined vs separated vs multivariate")
```
## ORA with clusterprofiler
Clusterprofiler uses a hypergeometric test.
Firstly I will conduct the analysis separately for up and down regulated genes and with the correct backgound (as intended by the developers).
```{r,cp1}
genesets2 <- read.gmt("ReactomePathways.gmt")
de_up <- rownames(subset(de, padj<0.05 & log2FoldChange > 0))
de_up <- unique(gt[which(rownames(gt) %in% de_up),1])
de_dn <- rownames(subset(de, padj<0.05 & log2FoldChange < 0))
de_dn <- unique(gt[which(rownames(gt) %in% de_dn),1])
de_bg <- rownames(de)
de_bg <- unique(gt[which(rownames(gt) %in% de_bg),1])
o_up <- as.data.frame(enricher(gene = de_up, universe = de_bg, maxGSSize = 5000, TERM2GENE = genesets2, pAdjustMethod="fdr"))
o_up <- rownames(subset(o_up, p.adjust < 0.05))
o_dn <- as.data.frame(enricher(gene = de_dn, universe = de_bg, maxGSSize = 5000, TERM2GENE = genesets2, pAdjustMethod="fdr"))
o_dn <- rownames(subset(o_dn, p.adjust < 0.05))
o_com <- as.data.frame(enricher(gene = union(de_up,de_dn), universe = de_bg, maxGSSize = 5000, TERM2GENE = genesets2, pAdjustMethod="fdr"))
o_com <- rownames(subset(o_com, p.adjust < 0.05))
length(o_up)
length(o_dn)
length(o_up) + length(o_dn)
length(o_com)
( length(o_up) + length(o_dn) ) / length(o_com)
```
Euler diagram of the significant pathways found with each approach.
```{r,cp_euler1}
l2 <- list("sep up"=o_up,"sep dn"=o_dn,"comb"=o_com)
plot(euler(l2),quantities = TRUE, edges = "gray", main="ORA: combined vs separated")
```
List gene sets which are specific to each approach.
```{r,ora_sets}
o_sep <- c(o_up,o_dn)
# in sep but not comb
setdiff(o_sep,o_com)
# in comb but not sep
setdiff(o_com,o_sep)
# intersection
intersect(mc_up,ms)
```
## Euler diagrams comparing FCS and ORA methods
```{r,euler_comparison}
par(cex.main=0.5)
par(mar=c(2,2,2,2))
l3 <- list("ORA up"=o_up,"ORA dn"=o_dn,"ORA comb"=o_com,
"FCS up"=ms_up,"FCS dn"=ms_dn,"FCS comb"=mc_up)
plot(euler(l3),quantities = TRUE, edges = "gray", main="FCS compared to ORA")
```
## Conclusion
For mitch, it would appear that performing direction informed (DI) analysis clearly yields more differentially regulated pathways (413) as compared to the
direction agnostic (DA) method (55).
That being said, there were 18 pathways identified only in the DA method that appeared to be related to the physiology of the model.
These gene sets are likely to contain a mix of genes affected by the stimulus in different ways - for example a mix of up and downregulated genes.
Are these really real? Not sure.
This pattern was consistent with ORA, where 80 sets were identified with separate analysis and only 23 with the combined analysis.
When comparing ORA to FCS, we found that FCS identified many more sets than ORA.
In fact all gene sets that were identified by ORA were also identified by FCS, except for 3 that were specific to the ORA up set.
Let's look at those now.
```{r,conc}
myfcs <- c(ms_up, mc_up)
setdiff(o_up,myfcs)
```
## Session information
```{r,session}
sessionInfo()
```