Here we are looking at putting together the easiest possible workflow for the package.
Source code: https://github.com/markziemann/gmea/blob/main/example_workflow/example_workflow.Rmd
Load packages.
Important: ensure that the mitch version used is 1.15.1 or higher.
suppressPackageStartupMessages({
library("limma")
library("eulerr")
library("AnnotationHub")
library("IlluminaHumanMethylation450kanno.ilmn12.hg19")
library("IlluminaHumanMethylationEPICanno.ilm10b4.hg19")
library("HGNChelper")
library("tictoc")
library("mitch")
library("kableExtra")
library("beeswarm")
library("missMethyl")
library("gridExtra")
library("png")
})
if( packageVersion("mitch") < "1.15.1") {
warning("This workflow requires mitch version 1.15.1 or higher")
}
The limma1 data table is read in. This dataset is available from GEO (GSE49667)2. The limma data was obtained using a previously prepared workflow.3
dm <- read.csv("DMPs.csv.gz",header=TRUE)
head(dm)
## Name logFC AveExpr t P.Value adj.P.Val B
## 1 cg07499259 3.654104 2.46652171 18.73082 7.258963e-08 0.005062817 7.454265
## 2 cg26992245 4.450696 -0.09180715 18.32680 8.603867e-08 0.005062817 7.360267
## 3 cg09747445 -3.337299 -0.25201484 -18.24369 8.913981e-08 0.005062817 7.340431
## 4 cg18808929 -2.990263 0.77522878 -17.90079 1.033329e-07 0.005062817 7.256713
## 5 cg25015733 -3.054336 0.83280190 -17.32537 1.332317e-07 0.005062817 7.109143
## 6 cg21179654 2.859016 1.32460816 17.27702 1.361568e-07 0.005062817 7.096322
Note that this object has probes with their own column, not simply as row names.
Gene ontologies were downloaded in GMT format from MSigDB on 15th Jan 20244,5. The GMT file is read into R using the mitch function gmt_import()
.
gene_sets <- gmt_import("c5.go.v2023.2.Hs.symbols.gmt")
One of the critical parts of this workflow is the establishment of probe-gene relationships. This controls how the probe data is aggregated to make the gene level scores.
In this demonstration we are using all probes on the HM450K chip. Alternative approaches could use only promoters CpGs for example.
As these annotations are several years old, many of the annotated gene names are no longer current. To remedy this, the gene names are screened with the HGNChelper package and any defunct symbols get updated to the newer gene name, so they will be recognised properly in the gene sets.
The process for making a gene table for EPIC data is included down below.
tic()
anno <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt1 <- stack(gp2)
colnames(gt1) <- c("gene","probe")
gt1$probe <- as.character(gt1$probe)
toc() #6.0s
## 15.969 sec elapsed
dim(gt1)
## [1] 407090 2
str(gt1)
## 'data.frame': 407090 obs. of 2 variables:
## $ gene : chr "TSPY4" "FAM197Y2" "TTTY14" "TMSB4Y" ...
## $ probe: chr "cg00050873" "cg00050873" "cg00212031" "cg00214611" ...
length(unique(gt1$gene))
## [1] 21231
tic()
#new.hgnc.table <- getCurrentHumanMap()
new.hgnc.table <- readRDS("../figs/new.hgnc.table.rds")
fix <- checkGeneSymbols(gt1$gene,map=new.hgnc.table)
## Warning in checkGeneSymbols(gt1$gene, map = new.hgnc.table): Human gene symbols
## should be all upper-case except for the 'orf' in open reading frames. The case
## of some letters was corrected.
## Warning in checkGeneSymbols(gt1$gene, map = new.hgnc.table): x contains
## non-approved gene symbols
fix2 <- fix[which(fix$x != fix$Suggested.Symbol),]
length(unique(fix2$x))
## [1] 2788
gt1$gene <- fix$Suggested.Symbol
toc() # 36 sec (2788 gene names updated)
## 141.394 sec elapsed
head(gt1)
## gene probe
## 1 TSPY4 cg00050873
## 2 FAM197Y2 cg00050873
## 3 TTTY14 cg00212031
## 4 TMSB4Y cg00214611
## 5 TBL1Y cg01707559
## 6 TMSB4Y cg02004872
The first part is to import the data into mitch.
This only works properly when mitch’s mapGeneIds function uses mean
to aggregate (line 69 of mitch.R).
tic()
m <- mitch_import(x=dm,DEtype="limma",geneTable=gt1,geneIDcol="Name")
## The input is a single dataframe; one contrast only. Converting
## it to a list for you.
## Note: Mean no. genes in input = 439918
## Note: no. genes in output = 20092
## Warning in mitch_import(x = dm, DEtype = "limma", geneTable = gt1, geneIDcol = "Name"): Warning: less than half of the input genes are also in the
## output
toc() #2.5 sec
## 5.143 sec elapsed
head(m) %>%
kbl(caption = "Example of differential gene methylation scores used for pathway analysis") %>%
kable_paper("hover", full_width = F)
x | |
---|---|
A1BG | 0.7093783 |
A1BG-AS1 | 0.5498603 |
A1CF | 1.0133746 |
A2M | 0.2846473 |
A2ML1 | 0.1595816 |
A4GALT | 0.0441000 |
Now run the enrichment analysis.
Note that the results are not sorted by p-value, rather S.distance, an enrichment score. I think this works better for interpretation.
tic()
mres <- mitch_calc(x=m,genesets=gene_sets,minsetsize=5, priority="effect")
## Note: Enrichments with large effect sizes may not be
## statistically significant.
toc() #7 secs
## 251.525 sec elapsed
mtable <- mres$enrichment_result
up <- subset(mtable,s.dist>0 & p.adjustANOVA<0.05)
dn <- subset(mtable,s.dist<0 & p.adjustANOVA<0.05)
nrow(up)
## [1] 8
nrow(dn)
## [1] 10
head(up,10) %>%
kbl(caption = "Top significant pathways with higher methylation") %>%
kable_paper("hover", full_width = F)
set | setSize | pANOVA | s.dist | p.adjustANOVA | |
---|---|---|---|---|---|
10188 | GOMF_STRUCTURAL_CONSTITUENT_OF_SKIN_EPIDERMIS | 35 | 1.53e-05 | 0.4223663 | 0.0155125 |
2100 | GOBP_KERATINIZATION | 78 | 4.00e-07 | 0.3329977 | 0.0019071 |
2016 | GOBP_INTERMEDIATE_FILAMENT_ORGANIZATION | 73 | 1.20e-06 | 0.3286556 | 0.0041505 |
7143 | GOBP_STRIATED_MUSCLE_ADAPTATION | 55 | 3.76e-05 | 0.3212865 | 0.0222716 |
8073 | GOCC_KERATIN_FILAMENT | 85 | 3.30e-06 | 0.2919890 | 0.0052689 |
2014 | GOBP_INTERMEDIATE_FILAMENT_BASED_PROCESS | 95 | 2.13e-05 | 0.2523268 | 0.0159130 |
1964 | GOBP_INORGANIC_ION_IMPORT_ACROSS_PLASMA_MEMBRANE | 136 | 2.14e-05 | 0.2110562 | 0.0159130 |
10191 | GOMF_STRUCTURAL_MOLECULE_ACTIVITY | 769 | 1.76e-05 | 0.0911287 | 0.0155125 |
head(dn,10) %>%
kbl(caption = "Top significant pathways with lower methylation") %>%
kable_paper("hover", full_width = F)
set | setSize | pANOVA | s.dist | p.adjustANOVA | |
---|---|---|---|---|---|
9210 | GOMF_GLUTATHIONE_PEROXIDASE_ACTIVITY | 22 | 3.73e-05 | -0.5077864 | 0.0222716 |
6144 | GOBP_REGULATION_OF_PLATELET_AGGREGATION | 27 | 3.86e-05 | -0.4575555 | 0.0222716 |
7750 | GOCC_CENTROSOME | 630 | 1.79e-05 | -0.1002339 | 0.0155125 |
8135 | GOCC_MICROTUBULE_ORGANIZING_CENTER | 829 | 2.30e-06 | -0.0967837 | 0.0052689 |
8236 | GOCC_NUCLEAR_PROTEIN_CONTAINING_COMPLEX | 1184 | 0.00e+00 | -0.0944773 | 0.0004838 |
4779 | GOBP_POSITIVE_REGULATION_OF_TRANSCRIPTION_BY_RNA_POLYMERASE_II | 1226 | 3.50e-06 | -0.0789663 | 0.0052689 |
1182 | GOBP_DNA_METABOLIC_PROCESS | 1016 | 2.82e-05 | -0.0778273 | 0.0195531 |
8132 | GOCC_MICROTUBULE_CYTOSKELETON | 1339 | 3.50e-06 | -0.0757001 | 0.0052689 |
743 | GOBP_CELL_CYCLE_PROCESS | 1287 | 1.43e-05 | -0.0721735 | 0.0155125 |
736 | GOBP_CELL_CYCLE | 1791 | 4.20e-06 | -0.0657249 | 0.0055179 |
Now make a barplot of these top findings.
top <- rbind(up[1:10,],dn[1:10,])
top <- top[order(top$s.dist),]
barnames <- gsub("_"," ",top$set)
cols <- as.character(((sign(top$s.dist)+1)/2)+1)
cols <- gsub("1","blue",gsub("2","red",cols))
par(mar = c(5.1, 29.1, 4.1, 2.1))
barplot(abs(top$s.dist),horiz=TRUE,las=1,names.arg=barnames, cex.names=0.8,
cex.axis=0.8,col=cols, xlab="Enrichment score",main="Gene Ontologies")
mtext("whole gene")
grid()
par( mar = c(5.1, 4.1, 4.1, 2.1) )
There are some interesting results there that give hints to the differences between naive T and Treg cells.
Make a html report and some charts.
mitch_report(res=mres,outfile="example_mitchreport.html",overwrite=TRUE)
## Dataset saved as " /tmp/Rtmp85T27l/example_mitchreport.rds ".
##
##
## processing file: mitch.Rmd
## 1/34
## 2/34 [checklibraries]
## 3/34
## 4/34 [peek]
## 5/34
## 6/34 [metrics]
## 7/34
## 8/34 [scatterplot]
## 9/34
## 10/34 [contourplot]
## 11/34
## 12/34 [input_geneset_metrics1]
## 13/34
## 14/34 [input_geneset_metrics2]
## 15/34
## 16/34 [input_geneset_metrics3]
## 17/34
## 18/34 [echart1d]
## 19/34 [echart2d]
## 20/34
## 21/34 [heatmap]
## 22/34
## 23/34 [effectsize]
## 24/34
## 25/34 [results_table]
## 26/34
## 27/34 [results_table_complete]
## 28/34
## 29/34 [detailed_geneset_reports1d]
## 30/34
## 31/34 [detailed_geneset_reports2d]
## 32/34
## 33/34 [session_info]
## 34/34
## output file: /mnt/data/mdz/projects/gmea/example_workflow/mitch.knit.md
## /home/mdz/anaconda3/bin/pandoc +RTS -K512m -RTS /mnt/data/mdz/projects/gmea/example_workflow/mitch.knit.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash --output /tmp/Rtmp85T27l/mitch_report.html --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/pagebreak.lua --lua-filter /usr/local/lib/R/site-library/rmarkdown/rmarkdown/lua/latex-div.lua --self-contained --variable bs3=TRUE --section-divs --template /usr/local/lib/R/site-library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable theme=bootstrap --mathjax --variable 'mathjax-url=https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --include-in-header /tmp/Rtmp85T27l/rmarkdown-str97bf277041e14.html
##
## Output created: /tmp/Rtmp85T27l/mitch_report.html
## [1] TRUE
mitch_plots(res=mres,outfile="example_mitchcharts.pdf")
## png
## 2
In case you are working with EPIC array data, the probe-gene table can be make with this snippet of code.
tic()
anno <- getAnnotation(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt2 <- stack(gp2)
colnames(gt2) <- c("gene","probe")
gt2$probe <- as.character(gt2$probe)
toc() #9.0s
## 22.065 sec elapsed
dim(gt2)
## [1] 684970 2
str(gt2)
## 'data.frame': 684970 obs. of 2 variables:
## $ gene : chr "YTHDF1" "EIF2S3" "PKN3" "CCDC57" ...
## $ probe: chr "cg18478105" "cg09835024" "cg14361672" "cg01763666" ...
length(unique(gt2$gene))
## [1] 27364
head(gt2)
## gene probe
## 1 YTHDF1 cg18478105
## 2 EIF2S3 cg09835024
## 3 PKN3 cg14361672
## 4 CCDC57 cg01763666
## 5 INF2 cg12950382
## 6 CDC16 cg02115394
tic()
new.hgnc.table <- readRDS("../figs/new.hgnc.table.rds")
fix <- checkGeneSymbols(gt2$gene,map=new.hgnc.table)
## Warning in checkGeneSymbols(gt2$gene, map = new.hgnc.table): Human gene symbols
## should be all upper-case except for the 'orf' in open reading frames. The case
## of some letters was corrected.
## Warning in checkGeneSymbols(gt2$gene, map = new.hgnc.table): x contains
## non-approved gene symbols
fix2 <- fix[which(fix$x != fix$Suggested.Symbol),]
length(unique(fix2$x))
## [1] 3253
gt2$gene <- fix$Suggested.Symbol
toc()
## 183.881 sec elapsed
head(gt2)
## gene probe
## 1 YTHDF1 cg18478105
## 2 EIF2S3 cg09835024
## 3 PKN3 cg14361672
## 4 CCDC57 cg01763666
## 5 INF2 cg12950382
## 6 CDC16 cg02115394
tic()
ah <- AnnotationHub()
## snapshotDate(): 2024-04-30
EPICv2manifest <- ah[["AH116484"]]
## loading from cache
myann <- data.frame(EPICv2manifest[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","UCSC_CpG_Islands_Name","Relation_to_UCSC_CpG_Island","posrep_RepNum")])
myann <- subset(myann,posrep_RepNum==1)
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt3 <- stack(gp2)
colnames(gt3) <- c("gene","probe")
gt3$probe <- as.character(gt3$probe)
dim(gt3)
## [1] 360697 2
str(gt3)
## 'data.frame': 360697 obs. of 2 variables:
## $ gene : chr "RBL2" "ANKFN1" "VDAC3" "ACTN1" ...
## $ probe: chr "cg00000029_TC21" "cg00000221_BC21" "cg00000236_TC21" "cg00000289_TC21" ...
toc()
## 58.14 sec elapsed
tic()
new.hgnc.table <- readRDS("../figs/new.hgnc.table.rds")
fix <- checkGeneSymbols(gt3$gene,map=new.hgnc.table)
## Warning in checkGeneSymbols(gt3$gene, map = new.hgnc.table): Human gene symbols
## should be all upper-case except for the 'orf' in open reading frames. The case
## of some letters was corrected.
## Warning in checkGeneSymbols(gt3$gene, map = new.hgnc.table): x contains
## non-approved gene symbols
fix2 <- fix[which(fix$x != fix$Suggested.Symbol),]
length(unique(fix2$x))
## [1] 237
gt3$gene <- fix$Suggested.Symbol
toc()
## 9.701 sec elapsed
head(gt3)
## gene probe
## 1 RBL2 cg00000029_TC21
## 2 ANKFN1 cg00000221_BC21
## 3 VDAC3 cg00000236_TC21
## 4 ACTN1 cg00000289_TC21
## 5 ATP2A1 cg00000292_BC21
## 6 SFRP1 cg00000321_BC21
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
## [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] pkgload_1.3.4
## [2] GGally_2.2.1
## [3] ggplot2_3.5.1
## [4] reshape2_1.4.4
## [5] gplots_3.1.3.1
## [6] gtools_3.9.5
## [7] tibble_3.2.1
## [8] dplyr_1.1.4
## [9] echarts4r_0.4.5
## [10] png_0.1-8
## [11] gridExtra_2.3
## [12] missMethyl_1.38.0
## [13] beeswarm_0.4.0
## [14] kableExtra_1.4.0
## [15] mitch_1.16.0
## [16] tictoc_1.2.1
## [17] HGNChelper_0.8.14
## [18] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
## [19] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1
## [20] minfi_1.50.0
## [21] bumphunter_1.46.0
## [22] locfit_1.5-9.10
## [23] iterators_1.0.14
## [24] foreach_1.5.2
## [25] Biostrings_2.72.1
## [26] XVector_0.44.0
## [27] SummarizedExperiment_1.34.0
## [28] Biobase_2.64.0
## [29] MatrixGenerics_1.16.0
## [30] matrixStats_1.3.0
## [31] GenomicRanges_1.56.1
## [32] GenomeInfoDb_1.40.1
## [33] IRanges_2.38.0
## [34] S4Vectors_0.42.0
## [35] eulerr_7.0.2
## [36] limma_3.60.3
## [37] AnnotationHub_3.12.0
## [38] BiocFileCache_2.12.0
## [39] dbplyr_2.5.0
## [40] BiocGenerics_0.50.0
##
## loaded via a namespace (and not attached):
## [1] splines_4.4.1 later_1.3.2
## [3] BiocIO_1.14.0 bitops_1.0-7
## [5] filelock_1.0.3 preprocessCore_1.66.0
## [7] XML_3.99-0.17 lifecycle_1.0.4
## [9] lattice_0.22-6 MASS_7.3-61
## [11] base64_2.0.1 scrime_1.3.5
## [13] magrittr_2.0.3 sass_0.4.9
## [15] rmarkdown_2.27 jquerylib_0.1.4
## [17] yaml_2.3.8 httpuv_1.6.15
## [19] doRNG_1.8.6 askpass_1.2.0
## [21] DBI_1.2.3 RColorBrewer_1.1-3
## [23] abind_1.4-5 zlibbioc_1.50.0
## [25] quadprog_1.5-8 purrr_1.0.2
## [27] RCurl_1.98-1.14 rappdirs_0.3.3
## [29] GenomeInfoDbData_1.2.12 genefilter_1.86.0
## [31] annotate_1.82.0 svglite_2.1.3
## [33] DelayedMatrixStats_1.26.0 codetools_0.2-20
## [35] DelayedArray_0.30.1 xml2_1.3.6
## [37] tidyselect_1.2.1 UCSC.utils_1.0.0
## [39] beanplot_1.3.1 illuminaio_0.46.0
## [41] GenomicAlignments_1.40.0 jsonlite_1.8.8
## [43] multtest_2.60.0 survival_3.7-0
## [45] systemfonts_1.1.0 tools_4.4.1
## [47] Rcpp_1.0.12 glue_1.7.0
## [49] SparseArray_1.4.8 xfun_0.45
## [51] HDF5Array_1.32.0 withr_3.0.0
## [53] BiocManager_1.30.23 fastmap_1.2.0
## [55] rhdf5filters_1.16.0 fansi_1.0.6
## [57] openssl_2.2.0 caTools_1.18.2
## [59] digest_0.6.36 R6_2.5.1
## [61] mime_0.12 colorspace_2.1-0
## [63] RSQLite_2.3.7 utf8_1.2.4
## [65] tidyr_1.3.1 generics_0.1.3
## [67] data.table_1.15.4 rtracklayer_1.64.0
## [69] httr_1.4.7 htmlwidgets_1.6.4
## [71] S4Arrays_1.4.1 ggstats_0.6.0
## [73] pkgconfig_2.0.3 gtable_0.3.5
## [75] blob_1.2.4 siggenes_1.78.0
## [77] htmltools_0.5.8.1 scales_1.3.0
## [79] rstudioapi_0.16.0 knitr_1.47
## [81] tzdb_0.4.0 rjson_0.2.21
## [83] nlme_3.1-165 curl_5.2.1
## [85] org.Hs.eg.db_3.19.1 cachem_1.1.0
## [87] rhdf5_2.48.0 stringr_1.5.1
## [89] KernSmooth_2.23-24 BiocVersion_3.19.1
## [91] AnnotationDbi_1.66.0 restfulr_0.0.15
## [93] GEOquery_2.72.0 pillar_1.9.0
## [95] grid_4.4.1 reshape_0.8.9
## [97] vctrs_0.6.5 promises_1.3.0
## [99] xtable_1.8-4 evaluate_0.24.0
## [101] readr_2.1.5 GenomicFeatures_1.56.0
## [103] cli_3.6.3 compiler_4.4.1
## [105] Rsamtools_2.20.0 rlang_1.1.4
## [107] crayon_1.5.3 rngtools_1.5.2
## [109] nor1mix_1.3-3 mclust_6.1.1
## [111] plyr_1.8.9 stringi_1.8.4
## [113] viridisLite_0.4.2 BiocParallel_1.38.0
## [115] munsell_0.5.1 Matrix_1.7-0
## [117] hms_1.1.3 sparseMatrixStats_1.16.0
## [119] bit64_4.0.5 Rhdf5lib_1.26.0
## [121] KEGGREST_1.44.1 statmod_1.5.0
## [123] shiny_1.8.1.1 highr_0.11
## [125] memoise_2.0.1 bslib_0.7.0
## [127] bit_4.0.5 splitstackshape_1.4.8