Introduction

Here we will look at the contents of MSigDB, and compare the coverage of KEGG and other databases.

library("mitch")

msigdb <- gmt_import("msigdb.v2023.2.Hs.symbols.gmt")

There are two types of KEGG pathways in the MSigDB dataset, KEGG Legacy and KEGG Medicus. Let’s take a look at these side by side with Reactome and GO Biological Process.

kegg_all <- msigdb[grep("KEGG_",names(msigdb))]
kegg_med <- msigdb[grep("KEGG_MED",names(msigdb))]
kegg_leg <- setdiff(kegg_all,kegg_med)

message("number of pathways in KEGG Legacy:")
## number of pathways in KEGG Legacy:
length(kegg_leg)
## [1] 186
message("number of pathways in KEGG Medicus:")
## number of pathways in KEGG Medicus:
length(kegg_med)
## [1] 619
reactome <- msigdb[grep("REACTOME_",names(msigdb))]
message("number of pathways in Reactome:")
## number of pathways in Reactome:
length(reactome)
## [1] 1692
gobp <- msigdb[grep("GOBP_",names(msigdb))]
message("number of pathways in GOBP:")
## number of pathways in GOBP:
length(gobp)
## [1] 7647

So KEGG legacy has 186 pathways and the most up to date KEGG Medicus has 619, but this is dwarfed by Reactome with 1692 pathways and GO BP with 7649. Granted there is some redundancy in Reactome and GO BP, but the difference is stark.

Let’s dig a bit deeper, and find out the coverage, that is, what proportion of human genes are mentioned in these three databases.

message("Total number of unique gene symbols in MSigDB 2023.2:")
## Total number of unique gene symbols in MSigDB 2023.2:
length(unique(unlist(msigdb)))
## [1] 42416
message("Total number of unique gene symbols in KEGG Legacy:")
## Total number of unique gene symbols in KEGG Legacy:
length(unique(unlist(kegg_leg)))
## [1] 5245
message("Proportion of genes covered with KEGG Legacy:")
## Proportion of genes covered with KEGG Legacy:
paste(signif(length(unique(unlist(kegg_leg))) / length(unique(unlist(msigdb))) * 100, 3),"%")
## [1] "12.4 %"
message("Total number of unique gene symbols in KEGG Medicus:")
## Total number of unique gene symbols in KEGG Medicus:
length(unique(unlist(kegg_med)))
## [1] 2727
message("Proportion of genes covered with KEGG Medicus:")
## Proportion of genes covered with KEGG Medicus:
paste(signif(length(unique(unlist(kegg_med))) / length(unique(unlist(msigdb))) * 100, 3), "%")
## [1] "6.43 %"
message("Total number of unique gene symbols in Reactome:")
## Total number of unique gene symbols in Reactome:
length(unique(unlist(reactome)))
## [1] 11155
message("Proportion of genes covered with Reactome:")
## Proportion of genes covered with Reactome:
paste(signif(length(unique(unlist(reactome))) / length(unique(unlist(msigdb))) * 100, 3), "%")
## [1] "26.3 %"
message("Total number of unique gene symbols in GO BP:")
## Total number of unique gene symbols in GO BP:
length(unique(unlist(gobp)))
## [1] 17959
message("Proportion of genes covered with GO BP:")
## Proportion of genes covered with GO BP:
paste(signif(length(unique(unlist(gobp))) / length(unique(unlist(msigdb))) * 100, 3), "%")
## [1] "42.3 %"

A very interesting result. Firstly, GO BP and Reactome dominate coverage, with 42.3% and 26.3% respectively which is waaaay better than KEGG legacy with 12.4%. What surprised me was the paucity of KEGG Medicus data, which covers only 6.4% of genes, which is very disappointing.

Next, I wanted to look at the total number of annotations that are included in these databases.

message("Total number of functional annotations in MsigDB:")
## Total number of functional annotations in MsigDB:
length(unlist(msigdb))
## [1] 4033408
message("Total number of functional annotations in KEGG Legacy:")
## Total number of functional annotations in KEGG Legacy:
length(unlist(kegg_leg))
## [1] 12797
message("Total number of functional annotations in KEGG Medicus:")
## Total number of functional annotations in KEGG Medicus:
length(unlist(kegg_med))
## [1] 9651
message("Total number of functional annotations in Reactome:")
## Total number of functional annotations in Reactome:
length(unlist(reactome))
## [1] 95226
message("Total number of functional annotations in GO BP:")
## Total number of functional annotations in GO BP:
length(unlist(gobp))
## [1] 627225

In conclusion, KEGG coverage is very poor as compared to the alternatives. Reactome is a better source for canonical pathways but Gene Ontology Biological Process appears to be the most comprehensive.

Session information

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Australia/Melbourne
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] mitch_1.14.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.4       beeswarm_0.4.0     xfun_0.41          bslib_0.6.1       
##  [5] ggplot2_3.4.4      htmlwidgets_1.6.4  caTools_1.18.2     GGally_2.2.0      
##  [9] vctrs_0.6.5        tools_4.3.2        bitops_1.0-7       generics_0.1.3    
## [13] parallel_4.3.2     tibble_3.2.1       fansi_1.0.6        pkgconfig_2.0.3   
## [17] KernSmooth_2.23-22 RColorBrewer_1.1-3 webshot_0.5.5      lifecycle_1.0.4   
## [21] compiler_4.3.2     stringr_1.5.1      gplots_3.1.3       munsell_0.5.0     
## [25] httpuv_1.6.13      htmltools_0.5.7    sass_0.4.8         yaml_2.3.8        
## [29] later_1.3.2        pillar_1.9.0       jquerylib_0.1.4    tidyr_1.3.0       
## [33] MASS_7.3-60        ellipsis_0.3.2     cachem_1.0.8       mime_0.12         
## [37] ggstats_0.5.1      gtools_3.9.5       tidyselect_1.2.0   rvest_1.0.3       
## [41] digest_0.6.33      stringi_1.8.3      dplyr_1.1.4        reshape2_1.4.4    
## [45] purrr_1.0.2        fastmap_1.1.1      grid_4.3.2         colorspace_2.1-0  
## [49] cli_3.6.2          magrittr_2.0.3     utf8_1.2.4         scales_1.3.0      
## [53] promises_1.2.1     rmarkdown_2.25     httr_1.4.7         gridExtra_2.3     
## [57] kableExtra_1.3.4   shiny_1.8.0        evaluate_0.23      knitr_1.45        
## [61] viridisLite_0.4.2  rlang_1.1.2        Rcpp_1.0.11        xtable_1.8-4      
## [65] glue_1.6.2         echarts4r_0.4.5    xml2_1.3.6         svglite_2.1.3     
## [69] rstudioapi_0.15.0  jsonlite_1.8.8     R6_2.5.1           plyr_1.8.9        
## [73] systemfonts_1.0.5