Abstract

   Pathway enrichment analysis represents a key technique for analyzing
   high-throughput omic data, and it can help to link individual genes or
   proteins found to be differentially expressed under specific conditions
   to well-understood biological pathways. We present here a computational
   tool, SEAS, for pathway enrichment analysis over a given set of genes
   in a specified organism against the pathways (or subsystems) in the
   SEED database, a popular pathway database for bacteria. SEAS maps a
   given set of genes of a bacterium to pathway genes covered by SEED
   through gene ID and/or orthology mapping, and then calculates the
   statistical significance of the enrichment of each relevant SEED
   pathway by the mapped genes. Our evaluation of SEAS indicates that the
   program provides highly reliable pathway mapping results and identifies
   more organism-specific pathways than similar existing programs. SEAS is
   publicly released under the GPL license agreement and freely available
   at http://csbl.bmb.uga.edu/~xizeng/research/seas/.

Introduction

   High-throughput omic techniques are increasingly widely used by large
   research centers as well as by individual labs because of the rapidly
   decreasing costs and the increasing quality of the data generated.
   The rapid accumulation of omic data has provided unprecedented
   opportunities for biologists to study substantially more complex
   problems at a systems level [1], [2] than just a few years ago. As a
   key technique for linking individual genes/proteins to biological
   processes, pathway enrichment analysis is widely used to study
   pathway-level activities based on the activities of individual
   genes/proteins observed using omic techniques [3], [4]. A number of
   computational tools have been developed to provide
   pathway enrichment analyses against different pathway databases. As of
   now, the majority of the existing tools have been designed for pathway
   analyses for human or eukaryotes in general, including ArrayXPath
   [5], GenMAPP [6], DAVID [7], PathwayExplorer [8], PathExpress [9] and
   Pathway Miner [10]. In all these analysis tools, gene mapping from a
   specified organism to the pathway genes covered by the underlying
   (pathway) database is typically done through gene ID [5], [6], [7] or
   orthology mapping [11], [12]. A pathway is considered enriched by a
   set of genes if the genes cover a substantially higher fraction of
   the pathway genes than expected by chance. Statistical enrichment
   analysis methods fall into three classes according to their
   enrichment algorithms [13]: (i) singular enrichment analysis (SEA),
   which calculates an enrichment P-value for each pathway, based on the
   hypergeometric distribution [14], the Fisher exact test [15], [16] or
   one of a few other methods [17], [18], and lists the enriched
   pathways in a linear table; (ii) gene set enrichment analysis [19],
   which considers the entire gene set encoded in a genome (without
   pre-selection) together with the associated experimental values (for
   instance, expression fold change); and (iii) modular enrichment
   analysis [20], which uses the key idea of SEA but considers
   pathway-pathway or gene-gene relations in its enrichment P-value
   calculation. In this paper, we use the SEA method because of its
   simplicity and popularity, and may consider the other two classes of
   enrichment analysis methods in our future work.
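
   As an illustration of the SEA calculation described above, the
   following minimal Python sketch computes a one-sided hypergeometric
   P-value for a single pathway. It is not the SEAS implementation; the
   function name and the parameter names (k, n, K, N) are assumptions
   introduced here for illustration only.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """One-sided enrichment P-value, P(X >= k).

    Hypothetical parameter names (not taken from SEAS):
      N: number of genes in the background (e.g. the whole genome)
      K: number of background genes that belong to the pathway
      n: number of query genes
      k: number of query genes that fall in the pathway
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)  # all ways to draw n genes from the background
    upper = min(n, K)   # largest possible overlap
    # Sum the probabilities of observing k, k+1, ..., upper pathway genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


if __name__ == "__main__":
    # Toy numbers: 10 background genes, 4 in the pathway, 3 query genes,
    # 2 of which are in the pathway.
    print(hypergeom_pvalue(2, 3, 4, 10))
```

   With these toy numbers the tail sum is (6 x 6 + 4 x 1) / 120 = 1/3. A
   pathway would be reported as enriched when this P-value falls below a
   chosen significance threshold, usually after correction for multiple
   testing.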

   Currently there are a few popular pathway databases in the public
   domain, none of which is predominant [21], as each has its own
   strengths and limitations that make it suitable for different
   application scenarios. For example, the KEGG Pathway database [22]
   has a collection of generic pathways derived mostly from known
   biochemical reactions rather than from how individual organisms
   execute the reactions. Hence these generic pathways can be considered
   a superset of the corresponding pathways specific to individual
   organisms, i.e., not every reaction in a KEGG pathway is encoded in
   every organism [23]. Mapping these generic pathways to specific
   organisms therefore generally requires manual examination to ensure
   the mapping quality. The SEED Subsystem database is another pathway
   resource; each subsystem (pathway) for a specific organism in SEED is
   constructed by a group of domain experts [24], making its pathway
   genes more organism-specific and generally more reliable than those
   of KEGG pathways. Its limitation is that its coverage might not be as
   high as that of KEGG. For example, the KEGG pathways cover 2,983 E.
   coli genes while SEED covers only 2,181, although exceptions exist:
   KEGG covers 2,296 B. subtilis genes while SEED covers 2,303.

   We have previously developed a software tool, KOBAS [11], for
   enrichment analyses of KEGG pathways, which has been widely used since
   its publication [25]. Here we present a new tool for enrichment
   analyses against SEED subsystems, called SEAS (SEED-based Enrichment
   Analysis System). SEAS offers three ways to map genes to subsystems
   (gene ID, orthology or homology mapping), chosen according to the
   availability of the relevant information, and identifies the
   statistically enriched pathways in SEED. We have extensively tested the
   performance of SEAS by re-annotating known pathways of E. coli and B.
   subtilis in SEED, and found that the mapped pathways are highly
   reliable, achieving 79% precision and 95% coverage for E. coli and 66%
   precision and 74% coverage for B. subtilis. Our additional evaluation
   results on microarray data and a newly sequenced genome suggest that
   SEAS can identify more organism-specific pathways than KEGG-based
   pathway annotation. To the best of our knowledge, SEAS is the first
   software tool for SEED pathway enrichment analysis.

Results and Discussion

   The workflow of SEAS consists of two main steps, as shown in Figure
   1: (a) it first maps the query genes to SEED subsystems based on
   sequence similarity search or ID mapping; and (b) it then compares,
   for each mapped subsystem, the fraction of the subsystem's genes that
   are query genes with the fraction of query genes in the whole gene
   set of the query genome, or with some other background ratio supplied
   by the user, and identifies significantly enriched subsystems.
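
   Step (b) can be sketched in a few lines of Python. This is only an
   illustration under stated assumptions, not the SEAS code: the function
   names, the toy gene and subsystem names, and the use of a one-sided
   binomial tail against the background ratio are all choices made here
   for the example.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def enrichment_table(query_genes, subsystems, background_ratio):
    """Rank subsystems by how strongly the query genes are over-represented.

    query_genes:      set of mapped query gene IDs (hypothetical IDs)
    subsystems:       dict mapping subsystem name -> set of its gene IDs
    background_ratio: fraction of query genes in the whole genome, or any
                      other background ratio supplied by the user
    """
    rows = []
    for name, genes in subsystems.items():
        hits = len(query_genes & genes)
        if hits == 0:
            continue  # subsystem not touched by the query genes
        ratio = hits / len(genes)  # fraction of the subsystem that is query
        pval = binomial_tail(hits, len(genes), background_ratio)
        rows.append((name, hits, len(genes), ratio, pval))
    rows.sort(key=lambda row: row[4])  # most significant first
    return rows


if __name__ == "__main__":
    query = {"g1", "g2", "g3"}
    subsystems = {
        "A": {"g1", "g2", "g3", "g4"},
        "B": {"g3", "g9", "g10", "g11", "g12", "g13", "g14", "g15"},
        "C": {"x"},
    }
    # Background: 3 query genes out of a hypothetical genome of 100 genes.
    for row in enrichment_table(query, subsystems, 3 / 100):
        print(row)
```

   In this toy input, subsystem A (3 of its 4 genes are query genes)
   ranks ahead of subsystem B (1 of 8), and subsystem C is skipped
   because it contains no query gene.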

Figure 1. A schematic representation of the SEAS workflow.


   Each rectangle represents a program, each cylinder represents a
   database, and the others are flat text files for input, output or
   intermediate results.

Gene mapping to pathways by multiple strategies

   Mapping the query genes to pathways involves searching the
   well-annotated gene database in SEED, which currently covers 1,414
   organisms. We have implemented three strategies in SEAS, one of which
   is used depending on the availability of the relevant information.
   When the query genes are already in SEED, we use the original
   (pathway) annotation in SEED, either directly if the SEED ID is
   available for the query or through ID mapping using the NCBI GI
   number as the universal ID. When the query genes are not covered but
   their genome is available in SEED, we use the mapping results between
   the query genes and the pathway genes in SEED given by the official
   RAST server using Bi-Directional Best Hit (BDBH) [26], or the mapping
   results of our own P-MAP program [27] when operons for the query
   genome are available. P-MAP uses both high sequence similarity and
   operon information for orthologous gene mapping, and hence tends to
   give more accurate mapping results than BDBH when it is applicable.
   When neither of these two methods provides useful mapping results,
   which could be the case for partially sequenced genomes and
   meta-genomes, we use NCBI BLAST (blastx for DNA, blastp for protein;
   see Material and Methods on E-value cutoff) to compare the query
   genes/proteins against one or more reference genomes in SEED
   specified by the user, and select the top hit with known annotation
   in SEED. SEAS lets the user choose which of these strategies to use
   for gene mapping.
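
   The third strategy, selecting the top BLAST hit that has a known SEED
   annotation, can be sketched as follows. This is a minimal illustration
   rather than the SEAS implementation: the function name, the tuple
   layout of the hits, the toy identifiers and the default E-value cutoff
   of 1e-5 are assumptions made here (the cutoff actually used is the one
   given in Material and Methods).

```python
def map_by_top_annotated_hit(hits, annotation, evalue_cutoff=1e-5):
    """Assign each query the subsystems of its best annotated BLAST hit.

    hits:          iterable of (query_id, subject_id, evalue, bitscore),
                   e.g. parsed from BLAST tabular output
    annotation:    dict mapping subject gene ID -> list of subsystem names
    evalue_cutoff: placeholder threshold, an assumption for this sketch
    """
    best = {}  # query_id -> (subject_id, bitscore) of the best usable hit
    for query, subject, evalue, bitscore in hits:
        # Skip weak hits and hits to genes without a known annotation.
        if evalue > evalue_cutoff or subject not in annotation:
            continue
        current = best.get(query)
        if current is None or bitscore > current[1]:
            best[query] = (subject, bitscore)
    return {query: annotation[subject] for query, (subject, _) in best.items()}


if __name__ == "__main__":
    hits = [
        ("q1", "s1", 1e-50, 200.0),  # strongest hit, but s1 is unannotated
        ("q1", "s2", 1e-40, 180.0),  # best hit with a known annotation
        ("q2", "s3", 1e-3, 40.0),    # fails the E-value cutoff
    ]
    annotation = {"s2": ["Glycolysis"], "s3": ["TCA cycle"]}
    print(map_by_top_annotated_hit(hits, annotation))
```

   Here q1 is mapped through s2 rather than through its strongest hit s1,
   because s1 has no annotation, and q2 is left unmapped because its only
   hit does not pass the cutoff.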

   The first two strategies have been well evaluated in the original
   papers on SEED [24], RAST [26] and P-MAP [27], so we focus on the
   assessment of the third strategy. Specifically, we will
   re-annotate the pathways of E. coli and B. subtilis (already in SEED)
   based on SEED pathways encoded by other genomes (as references). The