Abstract

   The value of metabolomics in translational research is undeniable, and
   metabolomics data are increasingly generated in large cohorts. The
   functional interpretation of disease-associated metabolites, however,
   is difficult, and the biological mechanisms that underlie cell
   type-specific or disease-specific metabolomics profiles are often
   unknown. To fully exploit metabolomics data and aid in their
   interpretation, it is helpful to analyze them together with
   complementary omics data, including transcriptomics. To facilitate such analyses at a
   pathway level, we have developed RaMP (Relational database of
   Metabolomics Pathways), which combines biological pathways from the
   Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, WikiPathways,
   and the Human Metabolome DataBase (HMDB). To the best of our knowledge,
   an off-the-shelf, public database that maps genes and metabolites to
   biochemical/disease pathways and can readily be integrated into other
   existing software is currently lacking. For consistent and
   comprehensive analysis, RaMP enables batch and complex queries (e.g.,
   list all metabolites involved in glycolysis and lung cancer), can
   readily be integrated into pathway analysis tools, and supports pathway
   overrepresentation analysis given a list of genes and/or metabolites of
   interest. For usability, we have developed a RaMP R package
   ([34]https://github.com/Mathelab/RaMP-DB), including a user-friendly
   RShiny web application, that supports simple and batch queries,
   pathway overrepresentation analysis given a list of genes or
   metabolites of interest, and network visualization of gene-metabolite
   relationships. The package also includes the raw database file (mysql
   dump), thereby providing a stand-alone downloadable framework for
   public use and integration with other tools. In addition, the Python
   code needed to recreate the database on another system is publicly
   available ([35]https://github.com/Mathelab/RaMP-BackEnd). Updates for
   databases in RaMP will be checked multiple times a year and RaMP will
   be updated accordingly.

   Keywords: pathway analysis, metabolomics, transcriptomics, pathway
   database

1. Introduction

   Metabolomics is undeniably powerful for uncovering disease biomarkers
   [[36]1,[37]2,[38]3]. Beyond biomarker discovery, though, metabolomics
   data can provide information on biological mechanisms that are
   disrupted in diseases. From an analysis point of view, identifying
   these biological roles is very challenging, and typically requires
   integration of additional molecular information, such as other omics
   and biological pathway annotations [[39]4]. Nonetheless, analysis of
   metabolomics data with other omics data, such as transcriptomics, has
   uncovered relevant gene-metabolite associations and disease-relevant
   metabolic functions and pathways [[40]5,[41]6,[42]7,[43]8,[44]9].
   Finding genes associated with metabolite levels, or whose products
   catalyze reactions involving disease-related metabolites, or their
   associated pathways, can generate hypotheses on how these metabolic
   phenotypes are regulated. In turn, these hypotheses could elucidate
   functional mechanisms that could be targeted to generate a desired
   metabolomics phenotype. Understanding the regulation of metabolic
   phenotypes will expand knowledge of disease biology, and could
   contribute to finding successful interventions, including accurate
   predictions of diagnosis, prognosis, and treatment outcomes.

   While numerous methods and approaches that integrate gene expression
   and metabolomics data have been reported [[45]10,[46]11,[47]12], public
   and web-accessible software packages that integrate these data are
   generally sparse. Furthermore, these tools are often tailored to
   specific analysis types, such as pathway visualization, pathway
   enrichment and overrepresentation analysis, network analysis or
   reaction-level/metabolic flux analysis. Of these, MetaboAnalyst 3.0
   [[48]13], IMPaLA [[49]14], XCMS [[50]15,[51]16], and Metabox [[52]17]
   integrate metabolomics and gene expression for pathway enrichment
   and/or network analysis ([53]Table 1). In addition, Pathway Commons
   [[54]18] integrates many sources of pathway annotations and includes
   functionalities for pathway analysis on genes ([55]Table 1). While
   Metabox, XCMS, and MetaboAnalyst primarily use KEGG annotations, the
   other tools combine multiple databases. Combining multiple databases is
   advantageous as it broadens the scope of genes and metabolites that
   have pathway annotations. However, these combined databases are not
   readily accessible, making it difficult, if not impossible, to query
   and to integrate with improved analysis tools. Furthermore, the
   statistics used in these software tools assume that pathways are
   independent of each other. This assumption is false, since the
   hierarchical nature of many databases (e.g., KEGG, Reactome) yields
   pathways that overlap each other in terms of the genes and metabolites
   they contain. Also, there
   are content overlaps between pathways that are drawn from various
   database sources.

Table 1.

   Tools that support over-representation and pathway enrichment analysis
   of genes and/or metabolites. These tools include a user-friendly web
   interface. ORA–Overrepresentation analysis.

   RaMP ([56]https://github.com/Mathelab/RaMP-DB/)
      Databases included: KEGG, Reactome, WikiPathways, HMDB/SMPDB
      Access and availability: R package; MySQL Dump; Python code to
         build MySQL Dump
      Batch queries: Yes
      Pathway analysis: ORA
      Network visualization/analysis: Yes
      Pathway clustering: Yes
      Output: Interactive tables of query results; interactive tables of
         pathway analysis results; clustering of enriched pathways by
         pathway similarity

   IMPaLA ([57]https://impala.molgen.mpg.de)
      Databases included: KEGG, Reactome, BioCyC, PID, BioCarta, NetPath,
         INOH, EHMN, PharmGKB, WikiPathways, SMPDB
      Access and availability: Web services programming interface
      Batch queries: No
      Pathway analysis: ORA; Wilcoxon enrichment analysis
      Network visualization/analysis: Yes
      Pathway clustering: No
      Output: Interactive tables of pathway analysis results with
         clickable links

   MetaboAnalyst ([58]https://www.metaboanalyst.ca)
      Databases included: KEGG, HMDB, SMPDB
      Access and availability: R package
      Batch queries: No
      Pathway analysis: ORA; metabolite set enrichment analysis;
         integrated topology and enrichment analysis (metabolites only);
         integrated gene and metabolite pathway analysis
      Network visualization/analysis: Yes
      Pathway clustering: No
      Output: Interactive tables and plots of pathway analysis results
         with clickable links; interactive pathway viewer

   Metabox ([59]https://kwanjeeraw.github.io/metabox)
      Databases included: KEGG, PubChem, UniProt, ENSEMBL, miRTarBase,
         BioGRID, Pathway Commons
      Access and availability: R package
      Batch queries: No
      Pathway analysis: ORA; set enrichment analysis
      Network visualization/analysis: Yes
      Pathway clustering: No
      Output: Interactive tables of pathway analysis results; interactive
         visualization of networks (with table of nodes/edges) with
         clickable links

   XCMS ([60]https://xcmsonline.scripps.edu/)
      Databases included: METLIN, KEGG, HMDB, Lipid Maps, NIST, MassBank
      Access and availability: R package; web interface
      Batch queries: No
      Pathway analysis: Predictive Pathway Analysis
      Network visualization/analysis: No
      Pathway clustering: No
      Output: Interactive tables of pathway results with clickable links;
         interactive pathway cloud plot for visualization

   Pathway Commons ([61]http://www.pathwaycommons.org)
      Databases included: Reactome, NCI PID, PhosphoSitePlus, HumanCyc,
         HPRD, PANTHER, DIP, BioGRID, intAct, BIND, CORUM, MSigDB,
         miRTarBase, DrugBank, Recon X, CTG, KEGG, SMPD, INOH, NetPath,
         WikiPathways, ChEBI, SwissProt, UniChem
      Access and availability: R package; web services programming
         interface
      Batch queries: No
      Pathway analysis: Gene set enrichment analysis
      Network visualization/analysis: Yes
      Pathway clustering: No
      Output: Interactive pathway visualization

   To address these limitations, we developed RaMP (Relational database of
   Metabolomics Pathways), a publicly available, comprehensive database of
   gene and metabolite pathways. RaMP is carefully designed to enable
   complex searches across genes and metabolites (e.g., find genes
   involved in regulating key metabolites), and across distinct types of
   annotations, such as biofluid location, disease, and biological
   pathways (e.g., find metabolites detected in urine and involved in
   cancer). This design also allows analysis of pathway content overlap
   for development of improved pathway enrichment statistics. RaMP is
   publicly available at [63]https://github.com/mathelab/RaMP-DB/ and can
   be used in two different ways: (1) it can be downloaded as a mysql dump
   ([64]https://github.com/mathelab/RaMP-DB/inst/extdata/), for
   integration into any other tool; (2) it can be accessed via a
   user-friendly R Shiny web interface that supports basic queries,
   enrichment analysis given a list of genes and metabolites, and network
   visualization of gene-metabolite relationships. Overall, RaMP provides
   up-to-date, comprehensive gene and metabolite pathway annotations
   that can be used as a stand-alone resource or can readily be
   incorporated into other tools. It is our hope that this resource will
   improve biological interpretation of metabolomics phenotypes, will
   guide data-driven hypothesis generation on the modulation of these
   phenotypes, and will thus advance scientific knowledge of metabolic
   phenotypes.

2. Results

2.1. RaMP Design

   A multi-database integration approach has been successfully applied for
   gene/metabolite enrichment analysis [[65]14,[66]19,[67]20,[68]21], yet
   the databases underlying these tools are not downloadable, do not allow
   complex or batch queries, or do not account for pathway redundancy in
   their statistical enrichment metrics. To facilitate development of improved
   pathway analysis methods and tools, RaMP is publicly available and
   incorporates the following publicly available databases: KEGG
   [[69]22,[70]23,[71]24], Reactome [[72]25,[73]26], HMDB
   [[74]27,[75]28,[76]29], and WikiPathways [[77]30,[78]31,[79]32]. The
   KEGG database was chosen because it is one of the most widely used and
   complete pathway databases. The KEGG “Human maps”, which represent
   manually curated human diseases and molecular interactions from various
   organisms (experimental evidence in specific organisms is generalized
   to others), are incorporated into RaMP. HMDB is the largest collection
   of annotations for small molecules found in humans, and is thus the
   most complete resource for metabolite annotations. HMDB provides links
   to SMPDB [[80]33,[81]34] and KEGG pathway databases. Only the SMPDB
   pathways from HMDB are incorporated into RaMP, since KEGG pathways are
   integrated directly through the KEGG REST API. HMDB information about
   diseases, biospecimen location, and synonyms is also input into RaMP.
   We further included information about genes and metabolite pairs that
   are involved in the same reaction (e.g., “enzymes” section in HMDB
   entries).

   Reactome pathways were included because they are derived from published
   experimental evidence and are curated by expert molecular biologists.
   Reactome also contains relevant disease pathways. The hierarchy in
   Reactome is such that the lowest level pathways represent single
   reactions, which is important for retrieving the gene(s) that catalyze
   reactions involving metabolites of interest. Finally, we incorporated
   WikiPathways because it is one of the largest human pathway collections
   to date and has recently undergone considerable growth in metabolic
   pathway annotations [[82]31,[83]32]. Importantly, WikiPathways content
   is updated through the wiki by both individual users and groups from
   the general scientific community. WikiPathways are curated for
   quality, and only those pathways that pass the curators’ quality metric
   are included in RaMP.

   Because the intent of RaMP is to retrieve biological pathways that
   relate genes and metabolites, the logical relationship between genes,
   metabolites, and associated pathways can be identified upfront and
   naturally yields a relational structure. RaMP is thus written in MySQL.
   The Python code used to pull in the data from each individual database
   is publicly available at [84]https://github.com/Mathelab/RaMP-BackEnd.
   Importantly, the design of the database ([85]Figure 1) is centered on
   the analytes (genes or metabolites), not on the pathways. The main
   reason for this design is to readily retrieve genes and metabolites
   that belong to the same pathway or reactions. This design also
   facilitates complex queries across multiple annotations (genes,
   metabolites, pathways). Equally important, an internal RaMP ID is
   attributed to each gene, metabolite, and pathway (see Methods). One
   issue with metabolite and gene names is that there are many synonyms
   for individual analyte names. Creating unique IDs based on synonyms is
   not possible, because there are synonym names that are commonly used
   for many different metabolites and genes. For example, the synonym
   “triglyceride” is used for all the triglycerides in HMDB, of which
   there are 13,919. When populating the RaMP database, a unique RaMP ID
   is attributed to database compound IDs that are linked to each other.
   To help ensure that each metabolite maps to a single RaMP ID (i.e.,
   no metabolite receives multiple RaMP IDs), we check, for every new
   database compound ID that is processed, whether that ID is already
   attributed to a RaMP ID. For example, glucose has one
   unique RaMP ID, but is found in multiple databases and is thus linked
   to multiple database IDs: ChEBI ID 4167, PubChem Compound ID 3333, KEGG
   ID [86]C00031, and HMDB ID HMDB0000122. A similar procedure is applied
   for internal pathway RaMP IDs. The list of IDs and other information
   (e.g., synonyms) retrieved from each database is listed in [87]Table
   S1. See Methods for information regarding the mapping of IDs from
   different databases.
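
   The ID assignment described above can be sketched in a few lines of
   Python. This is a minimal illustration and not the code in
   RaMP-BackEnd: the class name, the "source:id" key convention, and the
   "RAMP_C_" prefix are hypothetical, and the sketch simply reuses an
   existing RaMP ID whenever any linked database ID has already been seen.

```python
from itertools import count


class RampIdRegistry:
    """Assign one internal ID to each group of linked database compound IDs."""

    def __init__(self, prefix="RAMP_C_"):
        self._prefix = prefix        # hypothetical ID prefix
        self._counter = count(1)
        self._by_source_id = {}      # e.g. "kegg:C00031" -> internal ID

    def assign(self, linked_ids):
        """Return the internal ID for linked_ids, creating one if needed."""
        # Reuse an existing internal ID if any linked database ID has one.
        existing = sorted({self._by_source_id[i] for i in linked_ids
                           if i in self._by_source_id})
        if existing:
            # Simplification: if several internal IDs match, keep the
            # lowest one and do not merge the others.
            ramp_id = existing[0]
        else:
            ramp_id = f"{self._prefix}{next(self._counter):09d}"
        for source_id in linked_ids:
            self._by_source_id.setdefault(source_id, ramp_id)
        return ramp_id


registry = RampIdRegistry()
# Glucose IDs quoted in the text, seen first in one source, then another.
first = registry.assign(["hmdb:HMDB0000122", "kegg:C00031", "chebi:4167"])
second = registry.assign(["kegg:C00031", "pubchem:3333"])
other = registry.assign(["kegg:C00022"])
```

   In this sketch, the two glucose records share one internal ID because
   they have the KEGG ID in common, while the unrelated compound receives
   a new one.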

Figure 1.


   Schema of the database, depicting the tables included in the database
   and how they are related.

2.2. RaMP Content

   The numbers of genes, metabolites, and pathways in each database are
   shown in [90]Table 2. In total, RaMP integrates 51,526 pathways (from
   KEGG, Reactome, SMPDB, and WikiPathways), 23,077 genes, and 113,725
   metabolites. Furthermore, 157 ontologies from HMDB have been
   incorporated, including biofluid type (e.g., blood, urine, etc.),
   cellular location (e.g., nucleus, mitochondria, etc.), origins (e.g.,
   drug, food, microbial, etc.), and tissue location (e.g., teeth, lung,
   etc.). Gene and metabolite pairs that are involved in the same
   reactions are retrieved from the HMDB database.

Table 2.

   Databases incorporated into RaMP, including the number of metabolites,
   genes, and pathways.

   Database                    # Metabolites   # Genes   # Pathways   Access
   Human Metabolome Database   111,005         5,645     48,623 *     [91]http://www.hmdb.ca/
   KEGG                        3,653           7,298     323          [92]http://www.genome.jp/kegg/pathway.html
   Reactome                    1,771           11,035    2,169        [93]https://reactome.org/
   WikiPathways                1,421           7,727     411          [94]https://www.wikipathways.org/

   * Pathways imported from the HMDB database include SMPDB and KEGG
   pathways.

   Importantly, integration of the four databases into RaMP widens the
   coverage and variety of metabolites and genes that have pathway
   annotations. [96]Figure 2a,b depict the number of overlapping
   metabolites and genes, respectively, among the four databases
   integrated into RaMP. Only a small fraction, 0.05% of metabolites and
   13.2% of genes, overlap between all four databases. This relatively low
   overlap is not surprising given the fact that the four databases were
   constructed using varying input resources and for different purposes,
   as described above. Nonetheless, the low overlap exemplifies the
   strength in integrating annotation databases to increase the number of
   metabolites and genes of interest that map to pathways. In fact, each
   database has a high percentage of analytes that are unique to that
   database: 42% metabolites and 8.9% genes in KEGG, 36.7% metabolites and
   35% genes in Reactome, 26.4% metabolites and 32.6% genes in
   WikiPathways, and 97.9% metabolites and 20.7% genes in HMDB. It is
   important to note that HMDB contains many metabolites that do not map
   to pathways (of the 111,105 metabolites incorporated into our RaMP
   database, 48,623, or 43.8%, are mapped to a KEGG or SMPDB pathway).

Figure 2.


   Overlap of (a) metabolites and (b) genes within each database
   integrated into RaMP.

   When assessing the number of pathways each metabolite is involved in, a
   few hundred metabolites are involved in many pathways ([99]Figure S1).
   For example, 5′ (Tetrahydrogentriphosphate) Adenosine (i.e., ATP),
   Adenosindiphosphorsaeure (ADP), and dihydrogenoxide (water) are
   involved in over 600 pathways in the Reactome database. This
   promiscuity may render
   interpretation of pathway analysis more complicated because many more
   hits could be returned if a promiscuous metabolite is involved, yet it
   is unlikely that all these pathways are involved simultaneously.
   Flagging these metabolites when performing pathway enrichment analysis
   could be beneficial, unless the specific context of the system under
   study is well defined (e.g., specific cells, cellular localization,
   disease, etc.).

2.3. Pathway Redundancy and Clustering of Enriched Pathways

   Integration of databases enables redundancy analysis, where the goal is
   to evaluate how much overlap in genes or metabolites exists between
   pathways that are present in different databases. [100]Figure 3 depicts
   the metabolite percent overlap (number of metabolites in common divided
   by the number of metabolites in the union of the two pathways being
   compared; see Methods) for all
   pairwise comparisons of pathways from KEGG, Reactome, and WikiPathways
   incorporated into RaMP. Pathways within Reactome and KEGG show the
   largest number of overlapping pathways. For Reactome, these overlaps
   are likely to reflect the hierarchical structure of pathways. As an
   example, the “Formation of COPII vesicle” pathway in Reactome is a
   subpathway of “MHC class II antigen presentation”, which is a
   subpathway of the “Adaptive Immune System” pathway. In contrast, the
   overlap in gene content between pathways is much lower than the
   overlap in metabolite content (data not shown).
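
   The percent overlap used here (analytes in common divided by the union
   of analytes in the two pathways) can be computed as in the following
   Python sketch. The function names and the toy pathways are
   illustrative only and are not taken from the RaMP code base.

```python
from itertools import combinations


def percent_overlap(analytes_a, analytes_b):
    """Percent overlap: 100 * |A intersect B| / |A union B|."""
    a, b = set(analytes_a), set(analytes_b)
    union = a | b
    if not union:
        return 0.0  # two empty pathways share nothing
    return 100.0 * len(a & b) / len(union)


def pairwise_overlaps(pathways):
    """Percent overlap for every unordered pair of pathways."""
    return {(p, q): percent_overlap(pathways[p], pathways[q])
            for p, q in combinations(sorted(pathways), 2)}


# Toy pathways with made-up membership, for illustration only.
toy = {
    "pathway_A": {"glucose", "pyruvate", "ATP", "ADP"},
    "pathway_B": {"pyruvate", "ATP", "ADP", "citrate"},
    "pathway_C": {"urea"},
}
overlaps = pairwise_overlaps(toy)
```

   Here pathway_A and pathway_B share three of five distinct analytes,
   giving a 60% overlap, whereas pathway_C overlaps with neither.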

Figure 3.


   Percentage of metabolite overlap in each pathway from all databases
   that are integrated in RaMP. (WP–WikiPathways).

   Content overlaps of pathways within or between databases can make
   interpretation of pathway enrichment analyses confusing. To address
   this, we have implemented a clustering approach, based on a heuristic
   fuzzy multiple-linkage partitioning algorithm [[103]35], to group
   findings by functional homology (see Methods for further details). To
   demonstrate this utility, we have analyzed a list of altered
   metabolites and genes between breast tumor tissue and adjacent
   non-tumor tissue from a previously published study [[104]5] (see
   Methods, [105]Figure 4). When performing pathway overrepresentation
   analysis, the RaMP package outputs enriched pathways that can be sorted
   by p-value or database source (e.g., all significant pathways from KEGG
   are grouped, then pathways from Reactome, etc.). Next, we clustered
   these pathways and identified high levels of overlap between
   significant pathways. This clustering thus allows the user to quickly
   sort through redundant results and identify functionally relevant
   pathways. In the altered breast cancer metabolite data set, our
   clustering algorithm identified a relevant cluster of pathways involved
   in nucleic acid metabolism ([106]Figure 4a). It is well documented that
   various cancer types induce shifts in de novo nucleotide synthesis,
   catabolism, and nucleoside salvage [[107]36]. When both genes and
   metabolites were input into our algorithm, clusters of glucose
   metabolism and transcriptional pathways were significant ([108]Figure
   4b,c). These enriched clusters are concordant with previous work
   reporting that cancer cells undergo higher rates of aerobic glycolysis
   (“Warburg effect”) [[109]37] and alterations of the transcriptional
   machinery with TP53 being among the most mutated in cancers [[110]38].
   As the pathways identified in one cluster contain >50% overlap in their
   metabolite/gene composition, it is clear that enrichment of these
   pathways is driven by their common metabolites. This pathway clustering
   thus offers a flexible way to improve interpretability of results by
   identifying groups of pathways with many genes and metabolites in
   common, allowing users to quickly and efficiently identify functional
   groups of interest.

Figure 4.


   Output from pathway overrepresentation analysis using the RaMP R
   package web application. Significant pathways are derived from a list
   of metabolites and genes that are altered in breast tumor tissue
   relative to adjacent non-tumor tissue in a publicly available breast
   cancer dataset (see Methods). (a) Nucleic acid metabolism cluster of
   statistically significant pathways resulting from analysis using
   metabolites as input. (b) Glucose metabolism and (c) transcriptional
   regulation pathway clusters resulting from analysis using metabolites
   and genes as input.

2.4. RaMP Access and User Interface

   Access to the code used to build the RaMP MySQL database, the RaMP
   database itself (mysql dump), and the associated R package are publicly
   accessible on our GitHub site [113]https://github.com/mathelab/RaMP-DB.
   Instructions for creating the MySQL database locally and running the R
   package are detailed on the front page of the GitHub site. For users
   that want to perform basic queries and pathway enrichment analysis
   without programming overhead, we have developed an R package that
   includes an R Shiny web interface (see [114]Supplementary Material for
   installation instructions). The package can be readily installed using
   the devtools R package with the command
   install_github("mathelab/RaMP-DB").

   Once installed, the application is launched by typing
   runRaMPapp(password = "mysqlpassword") in the R console. The interface supports
   4 basic types of queries ([115]Table 3) that can be run in batch: (1)
   Given a list of pathway(s), retrieve all analytes involved; (2) Given a
   list of analyte(s), retrieve the pathways that each analyte is
   involved in; (3) Given a list of analytes, return the analytes that are
   involved at a reaction level (e.g., return metabolites catalyzed by
   user-input genes, based on HMDB database); (4) Given a list of
   ontologies or metabolites, retrieve the corresponding metabolites or
   ontologies, respectively. In addition to queries, the web application
   supports pathway overrepresentation analysis on genes, metabolites, or
   genes and metabolites combined, and results can be grouped by database
   type or clustered by pathway overlap, as described above. This pathway
   analysis is embedded in the second query (retrieve pathways from a
   user-input list of analytes). Furthermore, the web application provides
   network visualization of gene-metabolite relationships that are
   retrieved from a user-input list of genes or metabolites (query 3,
   [116]Figure S2). The [117]Supplementary Materials provide details on
   how to utilize the web app, and includes snapshots of each query.
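
   To illustrate how queries (1) and (2) map onto a relational,
   analyte-centered layout, the sketch below builds a toy database and
   runs both queries in batch. It rests on several assumptions: Python's
   built-in sqlite3 module stands in for MySQL so that the example runs
   without a server, and the table names, column names, and rows are
   hypothetical rather than the actual RaMP schema shown in Figure 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical three-table layout: analytes, pathways, and a link table.
conn.executescript("""
    CREATE TABLE analyte (ramp_id TEXT PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE pathway (pathway_id TEXT PRIMARY KEY, name TEXT, source TEXT);
    CREATE TABLE analyte_pathway (ramp_id TEXT, pathway_id TEXT);
    INSERT INTO analyte VALUES ('C1', 'glucose', 'metabolite'),
                               ('C2', 'pyruvate', 'metabolite'),
                               ('G1', 'HK1', 'gene');
    INSERT INTO pathway VALUES ('P1', 'Glycolysis', 'toy_source_1'),
                               ('P2', 'Pyruvate metabolism', 'toy_source_2');
    INSERT INTO analyte_pathway VALUES ('C1', 'P1'), ('C2', 'P1'),
                                       ('G1', 'P1'), ('C2', 'P2');
""")


def pathways_for_analytes(conn, names):
    """Query type 2: pathways that contain each input analyte (batch)."""
    placeholders = ",".join("?" for _ in names)
    sql = f"""
        SELECT a.name, p.name
        FROM analyte a
        JOIN analyte_pathway ap ON ap.ramp_id = a.ramp_id
        JOIN pathway p ON p.pathway_id = ap.pathway_id
        WHERE a.name IN ({placeholders})
        ORDER BY a.name, p.name
    """
    return conn.execute(sql, list(names)).fetchall()


def analytes_for_pathway(conn, pathway_name):
    """Query type 1: all analytes (genes and metabolites) in a pathway."""
    sql = """
        SELECT a.name, a.type
        FROM pathway p
        JOIN analyte_pathway ap ON ap.pathway_id = p.pathway_id
        JOIN analyte a ON a.ramp_id = ap.ramp_id
        WHERE p.name = ?
        ORDER BY a.name
    """
    return conn.execute(sql, (pathway_name,)).fetchall()


batch_hits = pathways_for_analytes(conn, ["glucose", "pyruvate"])
members = analytes_for_pathway(conn, "Glycolysis")
```

   Because genes and metabolites sit in the same analyte table, a single
   join returns both analyte types for a pathway, which is the property
   the analyte-centered design is meant to provide.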

Table 3.

   Types of queries that are supported by the web interface.

   Query: Retrieve analytes for a given pathway
      Input: Pathway name(s) or pathway id(s)
      Tabular output: Analytes that are within input pathway
      Analysis/visualization: -

   Query: Retrieve pathway(s) for one or more analytes
      Input: Analyte name(s) or id(s)
      Tabular output: Pathways that contain input analytes
      Analysis/visualization: Pathway enrichment analysis and clustering
         of enriched pathways

   Query: Retrieve analytes that are in the same reaction
      Input: Analyte name(s)
      Tabular output: Analytes catalyzing or catalyzed by input analytes
      Analysis/visualization: Network visualization of gene-metabolite
         relationships

   Query: Retrieve ontologies from given metabolites
      Input: Metabolite name(s) or id(s), or ontology name
      Tabular output: List of ontologies or metabolites that pertain to
         input
      Analysis/visualization: -

3. Discussion

   One of the first steps in statistical analysis of metabolomics data is
   to identify metabolites that are altered between disease states or
   conditions under study. This step, however, is often insufficient to
   fully leverage the data and understand the underlying biological
   mechanisms at play. To provide such further insights, one can combine
   metabolomics data with other data, such as gene expression and pathway
   annotations. To facilitate such integration at a pathway level, we have
   developed the relational database RaMP, which incorporates gene and
   metabolite pathway annotations from four large and commonly leveraged
   databases: HMDB, KEGG, Reactome, and WikiPathways. RaMP was designed to
   allow complex and batch queries, to facilitate integration with other
   tools, and to provide improved pathway overrepresentation
   functionality. The relational structure supports complex and batch
   queries, and the publicly available MySQL dump
   ([119]https://github.com/mathelab/RaMP-DB/inst/extdata/) enables
   advanced users to easily set up the database locally. We have improved
   interpretation of pathway enrichment analysis by calculating pathway
   overrepresentation using 3 databases (KEGG, Reactome, WikiPathways) in
   RaMP, and by providing different groupings of enriched pathways (by
   database origin or pathway overlap). Furthermore, all the underlying
   Python code used to create the RaMP MySQL file is publicly available
   ([120]https://github.com/Mathelab/RaMP-BackEnd), thereby ensuring full
   transparency of the database construction and complying with
   reproducibility best practices. Lastly, we have wrapped RaMP into an R
   package that contains a user-friendly web interface for performing
   several queries and pathway overrepresentation analysis. The R package
   is publicly available on GitHub at
   [121]https://github.com/mathelab/RaMP-DB/, where detailed installation
   instructions are provided.

   As with any research endeavor, RaMP has limitations. One current issue
   is the integrity of mapping metabolite names to an appropriate compound
   ID. Mapping can be hampered because there are synonyms that are
   generalized compound names and thus map to a large number of
   metabolites. One extreme example is “triglyceride”, which maps to
   13,719 different compound IDs. Further, some synonyms have different
   IDs because they correspond to different levels of structural
   resolution, which depends heavily on the platform. For example, some
   platforms can distinguish isomeric structures (2,3-Dimethylphenol vs.
   2,5-Dimethylphenol) while others cannot. One existing solution to this
   problem is the Metabolomics Workbench Refmet resource [[122]19], which
   provides a translation service that retrieves a common, “lowest
   denominator” name for each compound, thereby facilitating
   harmonization of names across platforms. This type of
   harmonization could be integrated into RaMP for improved metabolite
   mapping when the metabolites under study are present in Refmet.
   Ultimately though, it is important for the users to check that the
   mapping of IDs is correct.

   In addition, the background number of metabolites used to calculate
   pathway enrichment is based on the number of metabolites represented in
   each pathway database (e.g., 4134 metabolites mappable to KEGG
   pathways). The default number of genes used for background is set to
   20,000. In the future, users will have the option to provide a list of
   genes or metabolites assayed to build a custom contingency table for
   the test. This capability is particularly relevant for analysis of
   metabolites, where the number of metabolites measured in a given
   experiment is variable. Because RaMP is continuously being developed,
   we anticipate expansion of the RaMP functionalities to increase utility
   and usability. In addition to the aforementioned pathway enrichment
   changes, we also plan to develop more query capabilities. Furthermore,
   while overrepresentation analysis can be useful for uncovering
   disrupted biological pathways, we recognize the existence of improved,
   second and third generation methods that take into account topology
   [[123]39,[124]40,[125]41], and pathway dependency and crosstalk
   [[126]42]. With the accessibility and organization of RaMP, it is our
   hope that incorporation of up-to-date and comprehensive annotation of
   genes and metabolites into improved pathway analysis methods will be
   facilitated. Future developments of RaMP will include expansion of RaMP
   pathway analysis approaches and functionalities to increase utility and
   usability.
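
   The role of the background size in overrepresentation analysis can be
   made concrete with a short Python sketch. It assumes a one-sided
   hypergeometric (Fisher-type) test, which this section does not spell
   out; the function name and the example counts are hypothetical, and
   only the default gene background of 20,000 comes from the text.

```python
from math import comb


def ora_p_value(hits, pathway_size, query_size, background_size):
    """One-sided P(X >= hits) under the hypergeometric distribution.

    hits            -- query analytes that fall in the pathway
    pathway_size    -- analytes annotated to the pathway
    query_size      -- analytes in the user's list
    background_size -- all analytes that could have been observed
    """
    max_hits = min(pathway_size, query_size)
    total = comb(background_size, query_size)
    tail = sum(
        comb(pathway_size, k)
        * comb(background_size - pathway_size, query_size - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical gene example using the default background of 20,000 genes.
p_default = ora_p_value(hits=5, pathway_size=50, query_size=100,
                        background_size=20000)
# Same counts with a smaller, user-supplied background of assayed genes.
p_custom = ora_p_value(hits=5, pathway_size=50, query_size=100,
                       background_size=2000)
```

   With identical hit counts, shrinking the background makes the overlap
   less surprising and the p-value larger, which is why a custom
   background matters most for metabolomics experiments that measure a
   variable and often small number of metabolites.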

   While RaMP is currently focused on human pathways, we plan to expand
   the database to other organisms. In particular, with the increasing
   appreciation of the impact of microbial metabolites on human
   metabolism, microbial pathway databases could be integrated into RaMP
   to further expand its utility for integrative pathway analysis. With
   this in mind, it is important to note that the content of RaMP revolves
   around analytes (genes and metabolites) and how they are related
   (pathway involvement, reaction-level relationships). Therefore, when
   information from source databases (HMDB, KEGG, Reactome, WikiPathways)
   is included, only information that pertains to downstream pathway
   enrichment analysis is retained. With this mindset, we hope to retain
   the simplicity of our database design ([127]Figure 1).

   In conclusion, RaMP is a standalone database and application, usable
   through a web interface that was developed to facilitate gene and
   metabolite pathway analysis. RaMP can be used independently as a MySQL
   database that can be readily integrated with other tools, or can be
   accessed through our R package and web interface. RaMP is thus a first
   step toward a comprehensive integration of genes and metabolites at a
   pathway level, and it is our hope that our transparent approach, with
   all code publicly available, will generate further developments and
   improvements toward more complete interpretation of metabolomics data.

4. Materials and Methods

4.1. Parsing Raw Database Files

   All metabolite and pathway data were downloaded from HMDB, KEGG,
   Reactome, and WikiPathways over HTTP using Python scripts that rely on
   the Python library urllib. All the code is available at
   [128]https://github.com/Mathelab/RaMP-BackEnd. Because the format of
   the data varies by database, individual classes and parsing procedures
   were created for each database.

   The HMDB data, in Extensible Markup Language (XML) format, was parsed
   using the Python built-in parser from the ElementTree XML API. First,
   the HMDB ID is retrieved through the “metabolite” tag of the XML file.
   Next, for each “metabolite” tag, information from other tags is
   retrieved, including gene names and IDs, pathway names, and other
   ontologies (biofluid location, cellular location, origin, and tissue
   location). While parsing, dictionaries are created where the keys are
   HMDB IDs and the associated values are all available attributes (e.g.,
   synonyms, genes involved in metabolite reactions, pathways, etc.)
   pertaining to that metabolite.
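
   A minimal sketch of this dictionary-building step, using the
   ElementTree API on an inline sample, is given below. The sample is a
   simplified stand-in for the HMDB file: apart from "metabolite", the tag
   names and nesting are assumptions made for illustration, and the real
   file may use XML namespaces and a deeper structure.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical excerpt; not the actual HMDB file layout.
SAMPLE_XML = """<hmdb>
  <metabolite>
    <accession>HMDB0000122</accession>
    <name>D-Glucose</name>
    <synonyms>
      <synonym>Dextrose</synonym>
      <synonym>Glucose</synonym>
    </synonyms>
    <pathways>
      <pathway><name>Glycolysis</name></pathway>
    </pathways>
    <biofluid_locations>
      <biofluid>Blood</biofluid>
      <biofluid>Urine</biofluid>
    </biofluid_locations>
  </metabolite>
</hmdb>"""


def parse_metabolites(xml_text):
    """Return {HMDB ID: attributes} for every metabolite element."""
    root = ET.fromstring(xml_text)
    records = {}
    for met in root.iter("metabolite"):
        hmdb_id = met.findtext("accession")
        records[hmdb_id] = {
            "name": met.findtext("name"),
            "synonyms": [s.text for s in met.iterfind("synonyms/synonym")],
            "pathways": [p.text
                         for p in met.iterfind("pathways/pathway/name")],
            "biofluids": [b.text for b in
                          met.iterfind("biofluid_locations/biofluid")],
        }
    return records


records = parse_metabolites(SAMPLE_XML)
```

   For a file of realistic size, ET.iterparse would avoid loading the
   whole document into memory; ET.fromstring is used here only to keep
   the example short.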

   The KEGG data was retrieved through the REST API as “txt” files, and
   each file type was parsed in the following order: pathways,
   metabolites, metabolite synonyms, genes, and gene synonyms. To use the
   REST API, the complete list of human pathway IDs
   (http://rest.kegg.jp/list/pathway/hsa) was used to retrieve
   information on the pathways and associated genes and metabolites. For
   example, information on the first pathway in the complete list of human
   pathway IDs, “hsa00010”, is accessible through the link
   http://rest.kegg.jp/get/hsa00010. Parsing compound and gene IDs
   from this pathway entry allows us to retrieve further information on
   the compounds and genes related to that pathway (e.g., metabolite
   http://rest.kegg.jp/get/C00022 and gene
   http://rest.kegg.jp/get/hsa:3101).
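
   The retrieval pattern described above might look roughly like the
   following sketch. The function names and the abbreviated sample entry
   are hypothetical; the parser assumes that section keywords start at
   the beginning of a line and that continuation lines are indented.
   fetch_entry() needs network access and is not called in the example.

```python
import urllib.request

KEGG_REST = "http://rest.kegg.jp"


def fetch_entry(entry_id):
    """Download one KEGG entry as text, e.g. 'hsa00010' or 'C00022'."""
    with urllib.request.urlopen(f"{KEGG_REST}/get/{entry_id}") as response:
        return response.read().decode("utf-8")


def parse_ids(flat_text, section):
    """Collect the first token of every line belonging to `section`."""
    ids, current = [], None
    for line in flat_text.splitlines():
        if line.startswith("///"):          # end-of-entry marker
            break
        if not line.strip():
            continue
        if line[0].isspace():               # continuation of current section
            content = line.strip()
        else:                               # a new section keyword
            current, _, content = line.partition(" ")
            content = content.strip()
        if current == section and content:
            ids.append(content.split()[0])
    return ids


# Abbreviated, illustrative excerpt written in the style of a pathway entry.
SAMPLE = """\
ENTRY       hsa00010                    Pathway
NAME        Glycolysis / Gluconeogenesis - Homo sapiens (human)
GENE        3101  HK3; hexokinase 3
            3098  HK1; hexokinase 1
COMPOUND    C00022  Pyruvate
            C00031  D-Glucose
///
"""

compounds = parse_ids(SAMPLE, "COMPOUND")
genes = ["hsa:" + gene_id for gene_id in parse_ids(SAMPLE, "GENE")]
print(compounds, genes)
```

   Each ID returned this way can then be passed back to fetch_entry() to
   retrieve the compound or gene record itself.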

   For WikiPathways, the data are stored in a GenMAPP Pathway Markup
   Language (GPML) format, which is a custom XML format compatible with
   pathway analysis tools such as Cytoscape, GenMAPP, and PathVisio. This
   file format retains all the characteristics of XML, so we apply the
   same procedure used for parsing the HMDB database.

   Finally, the physical entity identifier mapping files that map compound
   (ChEBI) IDs and gene (UniProt) IDs to Reactome pathways were downloaded
   from Reactome. Each file is tab-delimited and four fields are retrieved:
   (1) compound/gene identifiers; (2) Reactome pathway ID; (3) Reactome
   pathway name; (4) genes and species. As with the other databases, only
   human pathways were selected. The Python library “libChEBI” is used to
   retrieve the ChEBI common name from each ChEBI ID retrieved from
   Reactome. Similarly, the gene common names are retrieved through the
   UniProt REST API.
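
   A minimal sketch of the filtering step is given below. The column
   order (identifier, pathway ID, URL, pathway name, evidence code,
   species) is an assumption about the mapping files, and the three rows
   are illustrative placeholders rather than lines copied from Reactome.
   The name lookups through libChEBI and UniProt are not shown.

```python
import csv
import io

# Illustrative rows only; the column order is an assumption:
# identifier, pathway ID, URL, pathway name, evidence code, species.
SAMPLE = (
    "15361\tR-HSA-70171\t(url)\tGlycolysis\tTAS\tHomo sapiens\n"
    "15361\tR-MMU-70171\t(url)\tGlycolysis\tIEA\tMus musculus\n"
    "17234\tR-HSA-70171\t(url)\tGlycolysis\tTAS\tHomo sapiens\n"
)


def human_mappings(handle, species="Homo sapiens"):
    """Return (identifier, pathway ID, pathway name) for one species."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if len(record) < 6:                 # skip malformed or short lines
            continue
        identifier, pathway_id, _url, name, _evidence, rec_species = record[:6]
        if rec_species == species:
            rows.append((identifier, pathway_id, name))
    return rows


mappings = human_mappings(io.StringIO(SAMPLE))
print(mappings)
```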

4.2. Creating Unique RaMP IDs

   Metabolite and gene names have many synonyms, and the same synonym is
   sometimes shared by different molecules. Furthermore,
   different databases use different identifiers. To properly map
   identifiers from one database to the next, we (1) created dictionaries
   of IDs for each database source and (2) ensured that identifiers linked
   to common IDs had the same RaMP ID. In the first step, source IDs were
   used as the key in the dictionaries and the values were the other
   identifiers present in the source database (see Supplementary
   Table S1). In the second step, the dictionaries are parsed and a RaMP
   ID is created for each new ID that is encountered. A two-column table
   that relates RaMP IDs with source IDs (one RaMP ID to many source IDs)
   is created. For each new key (source ID) in the dictionaries, the
   associated values and the value of the key itself are searched against
   the RaMP ID/source ID table. If there is a match, then all values for
   that key (including the key itself) are assigned to the matching RaMP
   ID. If there is no match, then a new RaMP ID is created and all values
   are assigned to the new RaMP ID. An analogous approach is used for
   pathways and ontologies. Of note, it is possible that ID mappings from
   different databases for the same metabolite or gene do not have any
   overlap. In such cases, these ID mappings receive different RaMP
   IDs.

   RaMP IDs have a prefix, followed by a unique number. The prefix
   “RAMP_C” is used for compounds, “RAMP_G” for genes, “RAMP_P” for
   pathways, and “RAMP_OL” for ontologies. Prefixes are then concatenated
   to a number (from “000000001” to “999999999”). While RaMP IDs are
   created to map metabolites and genes appropriately across the different
   databases, these IDs are internal and are not returned to the user
   through the R package.
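
   The matching rule and ID format described in this section can be
   sketched as follows. This is a simplified reading of the text, not the
   RaMP back-end code: the function name and example dictionaries are
   hypothetical, the first matching RaMP ID is reused, groups that match
   several existing RaMP IDs are not merged, and no separator is placed
   between prefix and number because the text does not specify one.

```python
def assign_ramp_ids(source_dicts, prefix="RAMP_C"):
    """Map every identifier to a RaMP ID.

    source_dicts: list of {source ID: [other identifiers]} dictionaries,
    one per source database. Returns {identifier: RaMP ID}, i.e. the
    two-column RaMP ID/source ID table keyed by identifier.
    """
    ramp_of = {}
    counter = 0
    for source in source_dicts:
        for key, values in source.items():
            group = [key, *values]
            # Search the key and its values against the existing table.
            match = next((ramp_of[i] for i in group if i in ramp_of), None)
            if match is None:
                # No match: create a new nine-digit RaMP ID.
                counter += 1
                match = f"{prefix}{counter:09d}"
            for identifier in group:
                # Identifiers that already have a RaMP ID keep it.
                ramp_of.setdefault(identifier, match)
    return ramp_of


# Hypothetical dictionaries for two sources.
hmdb = {"HMDB0000243": ["C00022", "CHEBI:15361"]}
kegg = {"C00022": ["pubchem:1060"], "C00031": []}

table = assign_ramp_ids([hmdb, kegg])
for identifier, ramp_id in table.items():
    print(ramp_id, identifier)
```

   In this example the KEGG entry C00022 is found in the table built from
   the HMDB dictionary, so its additional identifier inherits the same
   RaMP ID, whereas C00031 has no match and receives a new one.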

4.3. R Package

   The R package for RaMP is available online via GitHub
   (https://github.com/mathelab/RaMP-DB/). Instructions on how to set up
   MySQL and the RaMP database are provided on this GitHub site. The
   RaMP R package can be installed via the install_github() command from
   the devtools package and requires R (≥3.2.0). Questions and concerns
   can be raised as issues on the GitHub site. Further documentation on
   how to run the application is provided in the Supplementary Material.

4.4. Pathway Overrepresentation Analysis

   RaMP supports pathway overrepresentation analysis of user-supplied
   lists of metabolites and/or genes. Fisher’s exact tests are performed
   to calculate pathway overrepresentation p-values for metabolites (P[m])
   and genes (P[g]), independently. Of note, if pathways contain only
   genes or only metabolites, then P[m] or P[g], respectively, cannot be
   computed. A combined p-value (P[comb]) is then calculated for pathways
   that are annotated with both genes and metabolites, using Fisher’s
   method (Fisher’s combined probability test) [43], in which the test
   statistic, T[comb], is calculated as:
   \[ T_{\mathrm{comb}} = -2 \times \left[ \ln(P_{m}) + \ln(P_{g}) \right] \]
   (1)

   T[comb] follows a χ^2 distribution with 2 degrees of freedom and the
   associated p-value, P[comb], is calculated using the R function
   pchisq() and 2 degrees of freedom. When P[m] is missing, P[comb] =
   P[g]. Conversely, when P[g] is missing, P[comb] = P[m]. The resulting
   P[comb] values are adjusted for multiple comparisons using the
   Benjamini and Hochberg method, which controls the false discovery rate,
   and the Holm method, which controls the family-wise error rate.
   Similar to other approaches [13], the default
   total number of metabolites to be used as background is set to the
   number of metabolites mappable to pathways in each database (3603 for
   KEGG, 1771 for Reactome, and 1421 for WikiPathways). In the future, we
   will support a user-input list of metabolites to be used as background.
   For genes, the total number of genes used as background is 20,000.
   Pathways derived from KEGG, Reactome, and WikiPathways are used for
   pathway enrichment analysis, and pathways with <10 or >1000 analytes
   are removed since those are either too narrow or too broad for
   meaningful interpretation.
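
   The calculations in this section can be sketched in plain Python as
   shown below. This is an illustration under stated assumptions, not the
   code of the RaMP R package: the function names are hypothetical, the
   Fisher’s exact test is taken to be one-sided (overrepresentation
   only), and the χ^2 tail is computed with a closed form that holds for
   even degrees of freedom instead of calling pchisq(). The default df =
   2 follows the text; the textbook form of Fisher’s method uses 2k
   degrees of freedom for k combined p-values, so df is left as a
   parameter.

```python
import math


def overrep_pvalue(hits, pathway_size, list_size, background):
    """One-sided Fisher's exact test: P(X >= hits), hypergeometric tail."""
    total = math.comb(background, list_size)
    upper = min(pathway_size, list_size)
    tail = sum(
        math.comb(pathway_size, k)
        * math.comb(background - pathway_size, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


def chi2_upper_tail(x, df):
    """Upper tail of the chi-square distribution for even `df`."""
    half, term, total = x / 2.0, 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def combine(p_m=None, p_g=None, df=2):
    """Combine metabolite and gene p-values; pass through if one is missing."""
    if p_m is None:
        return p_g
    if p_g is None:
        return p_m
    t_comb = -2.0 * (math.log(p_m) + math.log(p_g))
    return chi2_upper_tail(t_comb, df)


def holm(pvals):
    """Holm step-down adjustment."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini and Hochberg step-up adjustment."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted, running = [0.0] * m, 1.0
    for offset, i in enumerate(order):
        running = min(running, pvals[i] * m / (m - offset))
        adjusted[i] = running
    return adjusted


# Hypothetical example: 5 of 50 input metabolites fall in a 40-metabolite
# pathway, against the KEGG background of 3603 metabolites.
p_m = overrep_pvalue(hits=5, pathway_size=40, list_size=50, background=3603)
p_comb = combine(p_m, 0.05)
print(p_m, p_comb, holm([0.01, 0.04, 0.03]), benjamini_hochberg([0.01, 0.04, 0.03]))
```

   Note that with df = 2 the upper tail reduces to exp(−T[comb]/2), so
   P[comb] in this sketch equals the product of P[m] and P[g].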

4.5. Clustering of Pathway Enrichment Analysis Results

   By default, pathway enrichment analysis results are returned for each
   database (KEGG, Reactome, WikiPathways), ordered by the database the
   enriched pathway was found in. To improve interpretability of pathway
   analysis results, enriched pathways are placed in groups according to
   the proportion of analytes they share in common, allowing the user to
   more efficiently navigate through redundant pathways. To accomplish
   this, we implemented an agglomerative clustering algorithm based on the
   heuristic fuzzy multiple-linkage partitioning algorithm, which is used
   by the DAVID gene functional annotation tool [35]. The algorithm
   consists of the following five basic steps:
    1. Calculating analyte overlap: The degree of analyte overlap is
       calculated for all possible pairs of pathways. Gene and metabolite
       overlaps are calculated separately. Given two pathways, m and n,
       the overlap score O[mn] is the Jaccard index, which is calculated
       as:

       \[ O_{mn} = \frac{I_{mn}}{L_{m} + L_{n} - I_{mn}} \]
       (2)
       where I[mn] is the number of analytes (genes or metabolites)
       present in both pathways, and L[m] and L[n] are the number of total
       analytes in pathways m and n, respectively. When no analytes are in
       common between two pathways, O[mn] = 0. Conversely, O[mn] = 1 if
       all analytes overlap between two pathways.
    2. Identifying seeds: The overlap scores O[mn] are used to identify
       cluster seeds. Pathways with a high degree of overlap with multiple
       other pathways (e.g., ≥30% overlap with at least 2 other pathways)
       are considered “seeds”. Thresholds for percent overlap and number
       of pathways to overlap with can be defined by the user.
    3. Initial pathway clustering: Once seeds are identified, pathways are
       clustered to the seeds based on the overlap scores. Pathways that
       have overlap scores with seed pathways greater than or equal to a
       user-defined threshold (e.g., 30%) are clustered with the
       corresponding seed pathway. Of note, this approach allows for a
       single pathway to belong to multiple clusters, as long as it is
       sufficiently similar to the seed pathway of those clusters.
    4. Calculating cluster overlap: Overlap scores between clusters are
       calculated with the same formula as Equation (2), with the
       following definitions for I and L: I[mn] is now the number of
       pathways in common (based on their names) between clusters m and n,
       and L[m] and L[n] are now the number of pathways in clusters m and
       n, respectively. All pairwise cluster similarities (i.e., cluster
       overlap scores) are ranked, and the cluster pair with the highest
       overlap score is merged into a single cluster, provided that its
       overlap score is greater than a user-defined merge threshold (e.g.,
       30%).
    5. Repeating cluster merging: Step 4 is repeated until there are no
       cluster overlap scores above the merge threshold.

   With this clustering approach, large and complex lists of enriched
   pathways are grouped into clusters of highly similar pathways. This
   feature is important as it allows users to more easily interpret
   functional implications of pathway enrichment results.
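
   The five steps above can be sketched as follows. This is a simplified
   illustration, not the implementation in the RaMP R package: the
   function and parameter names are hypothetical, each pathway is
   represented by a single set of analytes (whereas the text computes
   gene and metabolite overlaps separately), and pathways that join no
   cluster are simply left out of the result.

```python
from itertools import combinations


def jaccard(a, b):
    """Equation (2): shared items divided by the size of the union."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def cluster_pathways(pathways, overlap=0.3, min_neighbors=2, merge=0.3):
    """pathways: {name: set of analytes}. Returns a list of sets of names."""
    names = list(pathways)
    # Step 1: overlap score for every pair of pathways.
    score = {
        frozenset(pair): jaccard(pathways[pair[0]], pathways[pair[1]])
        for pair in combinations(names, 2)
    }
    neighbors = {
        n: {o for o in names if o != n and score[frozenset((n, o))] >= overlap}
        for n in names
    }
    # Step 2: seeds overlap sufficiently with enough other pathways.
    seeds = [n for n in names if len(neighbors[n]) >= min_neighbors]
    # Step 3: each seed pulls in its neighbors; a pathway may appear in
    # several clusters.
    clusters = [{seed} | neighbors[seed] for seed in seeds]
    # Steps 4 and 5: merge the most similar pair of clusters until no
    # pair scores above the merge threshold.
    while len(clusters) > 1:
        (i, j), best = max(
            (
                ((i, j), jaccard(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best <= merge:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters


# Hypothetical pathways: A, B, and C share three of their four analytes,
# while D shares none.
example = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {1, 2, 3, 6},
    "D": {7, 8, 9},
}
clusters = cluster_pathways(example)
print(clusters)
```

   In this example A, B, and C each qualify as seeds and produce
   identical initial clusters, which the merge loop then collapses into
   one.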

4.6. Pathway Analysis in Breast Cancer Dataset

   Metabolite data was obtained from a previously published breast cancer
   study comparing tumor and adjacent non-tumor breast tissue [5].
   Metabolites with more than 80% imputed values were filtered out. A
   t-test was performed on tumor and non-tumor samples and the resulting
   p-values were adjusted using the False Discovery Rate (FDR) method.
   Metabolites, mappable to KEGG or HMDB IDs, that had an absolute
   fold-change greater than 1.5 and an FDR-adjusted p-value <0.05 were
   then input
   into the RaMP web application using the “Return pathway from given
   analytes” tab and the “Input Multiple Metabolites (batch query)”
   subtab. Overrepresentation analysis was performed on the list of
   metabolites and pathways were retained if their Holm-adjusted p-values
   were <0.01. Clustering of these pathways was performed using the
   following parameters: overlap threshold for medoid establishment = 0.2,
   number of similar neighbors = 2, overlap threshold for cluster merge =
   0.75. Overrepresentation analysis was repeated with a list of
   metabolites and genes as input (Holm-adjusted p-values <0.01).
   Parameters for clustering these pathways were: overlap threshold for
   medoid establishment = 0.2, number of similar neighbors = 2, overlap
   threshold for cluster merge = 0.5.

Acknowledgments