Applying mitch to pathway analysis of Infinium Methylation array data

Mark Ziemann

2024-02-19

Background

Applying enrichment analysis to methylation array data is difficult due to the presence of a variable number of probes per gene and the fact that a probe could belong to overlapping genes. There are existing over-representation based approaches to this, however they appear to lack sensitivity. To address this, we have developed a simple approach to aggregating differential methylation data to make it suitable for downstream use by mitch. The process begins with the differential probe methylation results from limma. Here, we summarise the limma t-statistics by gene using the arithmetic mean. The resulting gene level differential methylation scores then undergo mitch as usual.

Requirements

In addition to mitch v1.15.0 of higher, you will need an annotation set for the array you are using. These are conveniently provided as Bioconductor packages for HM450K and EPIC arrays.

HM450k: IlluminaHumanMethylation450kanno.ilmn12.hg19

EPIC: IlluminaHumanMethylationEPICanno.ilm10b4.hg19

One issue with these is that the annotations are quite old, which means that some of the gene names have changed (~12%), so it is a good idea to screen the list of gene names and update obsolete names using the HGNChelper package.

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("mitch")
library("mitch")
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library("HGNChelper")
library("IlluminaHumanMethylation450kanno.ilmn12.hg19")
## Loading required package: minfi
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
## Loading required package: Biostrings
## Loading required package: XVector
## 
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
## 
##     strsplit
## Loading required package: bumphunter
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
## Loading required package: locfit
## locfit 1.5-9.8    2023-06-11
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
library("IlluminaHumanMethylationEPICanno.ilm10b4.hg19")
## 
## Attaching package: 'IlluminaHumanMethylationEPICanno.ilm10b4.hg19'
## The following objects are masked from 'package:IlluminaHumanMethylation450kanno.ilmn12.hg19':
## 
##     Islands.UCSC, Locations, Manifest, Other, SNPs.132CommonSingle,
##     SNPs.135CommonSingle, SNPs.137CommonSingle, SNPs.138CommonSingle,
##     SNPs.141CommonSingle, SNPs.142CommonSingle, SNPs.144CommonSingle,
##     SNPs.146CommonSingle, SNPs.147CommonSingle, SNPs.Illumina

Gene sets

In this cut down example we will be using a sample of 200 Reactome gene sets:

data(genesetsExample)
head(genesetsExample,3)
## $`2-LTR circle formation`
##  [1] "Reactome Pathway" "BANF1"            "HMGA1"            "LIG4"            
##  [5] "PSIP1"            "XRCC4"            "XRCC5"            "XRCC6"           
##  [9] "gag"              "gag-pol"          "rev"              "vif"             
## [13] "vpr"              "vpu"             
## 
## $`5-Phosphoribose 1-diphosphate biosynthesis`
## [1] "Reactome Pathway" "PRPS1"            "PRPS1L1"          "PRPS2"           
## 
## $`A tetrasaccharide linker sequence is required for GAG synthesis`
##  [1] "Reactome Pathway" "AGRN"             "B3GALT6"          "B3GAT1"          
##  [5] "B3GAT2"           "B3GAT3"           "B4GALT7"          "BCAN"            
##  [9] "BGN"              "CSPG4"            "CSPG5"            "DCN"             
## [13] "GPC1"             "GPC2"             "GPC3"             "GPC4"            
## [17] "GPC5"             "GPC6"             "HSPG2"            "NCAN"            
## [21] "SDC1"             "SDC2"             "SDC3"             "SDC4"            
## [25] "VCAN"             "XYLT1"            "XYLT2"

Probe-gene relationship

In order to get mitch working, we need a 2 column table that maps probes to genes. The workflow shown here is for the HM450k array, and an EPIC example is show at the end of the report.

anno <- getAnnotation(IlluminaHumanMethylation450kanno.ilmn12.hg19)
myann <- data.frame(anno[,c("UCSC_RefGene_Name","UCSC_RefGene_Group","Islands_Name","Relation_to_Island")])
gp <- myann[,"UCSC_RefGene_Name",drop=FALSE]
gp2 <- strsplit(gp$UCSC_RefGene_Name,";")
names(gp2) <- rownames(gp)
gp2 <- lapply(gp2,unique)
gt1 <- stack(gp2)
colnames(gt1) <- c("gene","probe")
gt1$probe <- as.character(gt1$probe)
dim(gt1)
## [1] 407090      2
str(gt1)
## 'data.frame':    407090 obs. of  2 variables:
##  $ gene : chr  "TSPY4" "FAM197Y2" "TTTY14" "TMSB4Y" ...
##  $ probe: chr  "cg00050873" "cg00050873" "cg00212031" "cg00214611" ...
head(gt1)
##       gene      probe
## 1    TSPY4 cg00050873
## 2 FAM197Y2 cg00050873
## 3   TTTY14 cg00212031
## 4   TMSB4Y cg00214611
## 5    TBL1Y cg01707559
## 6   TMSB4Y cg02004872
length(unique(gt1$gene))
## [1] 21231

Update deprecated names

Update old gene symbols using HGNChelper (13% of 21k names).

new.hgnc.table <- getCurrentHumanMap()
## Fetching gene symbols from ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
fix <- checkGeneSymbols(gt1$gene,map=new.hgnc.table)
## Warning in checkGeneSymbols(gt1$gene, map = new.hgnc.table): Human gene symbols
## should be all upper-case except for the 'orf' in open reading frames. The case
## of some letters was corrected.
## Warning in checkGeneSymbols(gt1$gene, map = new.hgnc.table): x contains
## non-approved gene symbols
fix2 <- fix[which(fix$x != fix$Suggested.Symbol),]
length(unique(fix2$x))
## [1] 2788
gt1$gene <- fix$Suggested.Symbol

Importing profiling data

Here we will read in a table of differential probe methylation data generated by limma. We will use the t-statistics for downstream analysis.

x <- read.table("https://ziemann-lab.net/public/gmea/dma3a.tsv",header=TRUE,row.names=1)
head(x)
##                 logFC   AveExpr         t      P.Value adj.P.Val        B
## cg04905210 -0.2926788 -2.812451 -5.380966 2.372336e-07 0.1002013 5.906039
## cg09338148 -0.2938597 -1.155475 -5.114443 8.261969e-07 0.1744820 4.880538
## cg04247967 -0.2070451  3.180593 -4.986000 1.484616e-06 0.2090211 4.399278
## cg06458106 -0.1906572  1.675741 -4.793141 3.511038e-06 0.2413324 3.693145
## cg26425904 -0.2540917 -4.019515 -4.761482 4.034832e-06 0.2413324 3.579161
## cg19590707 -0.2471850 -1.237702 -4.716409 4.912691e-06 0.2413324 3.417844

Now that the profiling data is loaded, we need to import with mitch package, which establishes the probe-gene relationships and aggregates the data to gene level scores. As you can see, the input was a table of 422k probes and the output is 19,380 gene scores. Many probes not annotated to genes are discarded.

y<-mitch_import(x,DEtype="limma",geneTable=gt1)
## The input is a single dataframe; one contrast only. Converting
##         it to a list for you.
## Note: Mean no. genes in input = 422374
## Note: no. genes in output = 19380
## Warning in mitch_import(x, DEtype = "limma", geneTable = gt1): Warning: less than half of the input genes are also in the
##         output
head(y)
##                   x
## A1BG     -0.3505339
## A1BG-AS1 -0.1904379
## A1CF     -0.8443613
## A2M      -0.6794687
## A2ML1     0.2157592
## A4GALT   -0.4001430
dim(y)
## [1] 19380     1

Calculating enrichment

The mitch_calc function performs an enrichment test. If you imported multiple data tables in the previous step, mitch will conduct a multivariate enrichment test. The results can be prioritised by significance or effect size. My recommendation is to discard results with FDR>0.05 then prioritise by effect size, which for us is the mitch enrichment score called S distance. In this example I also set the minimum gene set size to 5.

res<-mitch_calc(y,genesetsExample,priority="effect",cores=2,minsetsize=5)
## Note: Enrichments with large effect sizes may not be
##             statistically significant.
head(res$enrichment_result,10)

Downstream presentation

For presentation of the results you could consider making a volcano plot and/or making a barplot of S.dist values for selected gene sets that meet the FDR cutoff.

You can also use the built in functions to make a set of charts and html report.

mitch_plots(res,outfile="methcharts.pdf")
mitch_report(res,"methreport.html")
## Dataset saved as " /tmp/Rtmpgt5nh4/methreport.rds ".
## 
## 
## processing file: mitch.Rmd
## output file: /mnt/md0/mitch/repo/mitch/vignettes/mitch.knit.md
## 
## Output created: /tmp/Rtmpgt5nh4/mitch_report.html

Session Info

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] kableExtra_1.3.4                                   
##  [2] pkgload_1.3.3                                      
##  [3] GGally_2.2.0                                       
##  [4] ggplot2_3.4.4                                      
##  [5] reshape2_1.4.4                                     
##  [6] beeswarm_0.4.0                                     
##  [7] gplots_3.1.3                                       
##  [8] gtools_3.9.5                                       
##  [9] tibble_3.2.1                                       
## [10] dplyr_1.1.4                                        
## [11] echarts4r_0.4.5                                    
## [12] IlluminaHumanMethylationEPICanno.ilm10b4.hg19_0.6.0
## [13] IlluminaHumanMethylation450kanno.ilmn12.hg19_0.6.1 
## [14] minfi_1.48.0                                       
## [15] bumphunter_1.44.0                                  
## [16] locfit_1.5-9.8                                     
## [17] iterators_1.0.14                                   
## [18] foreach_1.5.2                                      
## [19] Biostrings_2.70.1                                  
## [20] XVector_0.42.0                                     
## [21] SummarizedExperiment_1.32.0                        
## [22] Biobase_2.62.0                                     
## [23] MatrixGenerics_1.14.0                              
## [24] matrixStats_1.2.0                                  
## [25] GenomicRanges_1.54.1                               
## [26] GenomeInfoDb_1.38.5                                
## [27] IRanges_2.36.0                                     
## [28] S4Vectors_0.40.2                                   
## [29] BiocGenerics_0.48.1                                
## [30] HGNChelper_0.8.1                                   
## [31] mitch_1.15.0                                       
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.2             later_1.3.2              
##   [3] BiocIO_1.12.0             bitops_1.0-7             
##   [5] filelock_1.0.3            preprocessCore_1.64.0    
##   [7] XML_3.99-0.16             lifecycle_1.0.4          
##   [9] lattice_0.22-5            MASS_7.3-60              
##  [11] base64_2.0.1              scrime_1.3.5             
##  [13] magrittr_2.0.3            limma_3.58.1             
##  [15] sass_0.4.8                rmarkdown_2.25           
##  [17] jquerylib_0.1.4           yaml_2.3.8               
##  [19] httpuv_1.6.13             doRNG_1.8.6              
##  [21] askpass_1.2.0             DBI_1.2.0                
##  [23] RColorBrewer_1.1-3        abind_1.4-5              
##  [25] zlibbioc_1.48.0           quadprog_1.5-8           
##  [27] rvest_1.0.3               purrr_1.0.2              
##  [29] RCurl_1.98-1.14           rappdirs_0.3.3           
##  [31] GenomeInfoDbData_1.2.11   genefilter_1.84.0        
##  [33] annotate_1.80.0           svglite_2.1.3            
##  [35] DelayedMatrixStats_1.24.0 codetools_0.2-19         
##  [37] DelayedArray_0.28.0       xml2_1.3.6               
##  [39] tidyselect_1.2.0          beanplot_1.3.1           
##  [41] BiocFileCache_2.10.1      illuminaio_0.44.0        
##  [43] webshot_0.5.5             GenomicAlignments_1.38.1 
##  [45] jsonlite_1.8.8            multtest_2.58.0          
##  [47] ellipsis_0.3.2            survival_3.5-7           
##  [49] systemfonts_1.0.5         tools_4.3.2              
##  [51] progress_1.2.3            Rcpp_1.0.12              
##  [53] glue_1.7.0                gridExtra_2.3            
##  [55] SparseArray_1.2.3         xfun_0.41                
##  [57] HDF5Array_1.30.0          withr_2.5.2              
##  [59] fastmap_1.1.1             rhdf5filters_1.14.1      
##  [61] fansi_1.0.6               openssl_2.1.1            
##  [63] caTools_1.18.2            digest_0.6.33            
##  [65] R6_2.5.1                  mime_0.12                
##  [67] colorspace_2.1-0          biomaRt_2.58.0           
##  [69] RSQLite_2.3.4             utf8_1.2.4               
##  [71] tidyr_1.3.0               generics_0.1.3           
##  [73] data.table_1.14.10        rtracklayer_1.62.0       
##  [75] prettyunits_1.2.0         httr_1.4.7               
##  [77] htmlwidgets_1.6.4         S4Arrays_1.2.0           
##  [79] ggstats_0.5.1             pkgconfig_2.0.3          
##  [81] gtable_0.3.4              blob_1.2.4               
##  [83] siggenes_1.76.0           htmltools_0.5.7          
##  [85] scales_1.3.0              png_0.1-8                
##  [87] knitr_1.45                rstudioapi_0.15.0        
##  [89] tzdb_0.4.0                rjson_0.2.21             
##  [91] nlme_3.1-164              curl_5.2.0               
##  [93] cachem_1.0.8              rhdf5_2.46.1             
##  [95] stringr_1.5.1             KernSmooth_2.23-22       
##  [97] AnnotationDbi_1.64.1      restfulr_0.0.15          
##  [99] GEOquery_2.70.0           pillar_1.9.0             
## [101] grid_4.3.2                reshape_0.8.9            
## [103] vctrs_0.6.5               promises_1.2.1           
## [105] dbplyr_2.4.0              xtable_1.8-4             
## [107] evaluate_0.23             readr_2.1.5              
## [109] GenomicFeatures_1.54.1    cli_3.6.2                
## [111] compiler_4.3.2            Rsamtools_2.18.0         
## [113] rlang_1.1.3               crayon_1.5.2             
## [115] rngtools_1.5.2            nor1mix_1.3-2            
## [117] mclust_6.0.1              plyr_1.8.9               
## [119] stringi_1.8.3             viridisLite_0.4.2        
## [121] BiocParallel_1.36.0       munsell_0.5.0            
## [123] Matrix_1.6-4              hms_1.1.3                
## [125] sparseMatrixStats_1.14.0  bit64_4.0.5              
## [127] Rhdf5lib_1.24.1           KEGGREST_1.42.0          
## [129] statmod_1.5.0             shiny_1.8.0              
## [131] highr_0.10                memoise_2.0.1            
## [133] bslib_0.6.1               bit_4.0.5