Abstract

Pathway enrichment analysis is a key technique for analyzing high-throughput omic data: it helps link individual genes or proteins found to be differentially expressed under specific conditions to well-understood biological pathways. We present here a computational tool, SEAS, for pathway enrichment analysis over a given set of genes in a specified organism against the pathways (or subsystems) in the SEED database, a popular pathway database for bacteria. SEAS maps a given set of genes of a bacterium to pathway genes covered by SEED through gene ID and/or orthology mapping, and then calculates the statistical significance of the enrichment of each relevant SEED pathway by the mapped genes. Our evaluation of SEAS indicates that the program provides highly reliable pathway mapping results and identifies more organism-specific pathways than similar existing programs. SEAS is publicly released under the GPL license and is freely available at http://csbl.bmb.uga.edu/~xizeng/research/seas/.

Introduction

High-throughput omic techniques are increasingly used by large research centers as well as by individual labs because of the rapidly decreasing costs and the increasing quality of the data generated. The rapid accumulation of omic data has provided unprecedented opportunities for biologists to study substantially more complex problems at a systems level [1], [2] than was possible just a few years ago. As a key technique in linking individual genes/proteins to biological processes, pathway enrichment analysis is widely used to study pathway-level activities based on the activities of individual genes/proteins observed using omic techniques [3], [4]. A number of computational tools have been developed to provide pathway enrichment analyses against different pathway databases.
As of now, the majority of the existing tools have been designed for pathway analyses of human or eukaryotes in general, including ArrayXPath [5], GenMAPP [6], DAVID [7], PathwayExplorer [8], PathExpress [9] and Pathway Miner [10]. Among these analysis tools, gene mapping from a specified organism to the pathway genes covered by the underlying (pathway) database is typically done through gene ID [5], [6], [7] or orthology mapping [11], [12]. A pathway is considered enriched by a set of genes if they overlap the pathway at a substantially higher percentage of the pathway genes than expected by chance. Statistical enrichment analysis methods fall into three classes according to their enrichment algorithms [13]: (i) singular enrichment analysis (SEA), which calculates an enrichment P-value for each pathway and lists the enriched pathways in a linear table, based on the hypergeometric distribution [14] or Fisher's exact test [15], [16], among a few other methods [17], [18]; (ii) gene set enrichment analysis [19], which considers an entire gene set (without pre-selection) encoded in a genome and the associated experimental values (for instance, expression fold change); and (iii) modular enrichment analysis [20], which uses the key idea of SEA but considers pathway-pathway or gene-gene relations in its enrichment P-value calculation. In this paper, we use the SEA method because of its simplicity and popularity, and may consider the other two classes of enrichment analysis methods in future work.

Currently there are a few popular pathway databases in the public domain, none of which is predominant [21]; each has its own strengths and limitations, making it suitable for different application scenarios.
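As an illustration of the singular enrichment analysis (SEA) described above, the following is a minimal Python sketch of an upper-tail hypergeometric test applied to each pathway independently. It is not the SEAS implementation: the function names, the `alpha` threshold and the toy gene identifiers are hypothetical, and multiple-testing correction is omitted for brevity.

```python
from math import comb


def hypergeom_enrichment_pvalue(overlap, pathway_size, query_size, background_size):
    """Upper-tail hypergeometric P-value P(X >= overlap).

    X is the number of pathway genes obtained when drawing `query_size` genes
    without replacement from a background of `background_size` genes, of which
    `pathway_size` belong to the pathway.
    """
    if not 0 <= pathway_size <= background_size:
        raise ValueError("pathway_size must lie in [0, background_size]")
    if not 0 <= query_size <= background_size:
        raise ValueError("query_size must lie in [0, background_size]")
    max_overlap = min(pathway_size, query_size)
    if overlap > max_overlap:
        return 0.0
    total = comb(background_size, query_size)
    tail = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, query_size - i)
        for i in range(max(overlap, 0), max_overlap + 1)
    )
    return tail / total


def enriched_pathways(query_genes, pathways, background_genes, alpha=0.05):
    """Return (pathway name, P-value) pairs with P-value <= alpha, smallest first.

    `pathways` maps a pathway name to a collection of gene IDs (hypothetical
    input format). Genes outside the background are ignored.
    """
    background = set(background_genes)
    query = set(query_genes) & background
    results = []
    for name, genes in pathways.items():
        members = set(genes) & background
        p = hypergeom_enrichment_pvalue(
            len(query & members), len(members), len(query), len(background)
        )
        if p <= alpha:
            results.append((name, p))
    return sorted(results, key=lambda item: item[1])


if __name__ == "__main__":
    # Toy data only: 20 background genes and two made-up pathways.
    background = [f"g{i}" for i in range(1, 21)]
    pathways = {
        "pathway_A": [f"g{i}" for i in range(1, 6)],
        "pathway_B": [f"g{i}" for i in range(6, 16)],
    }
    query = ["g1", "g2", "g3", "g4", "g16"]
    for name, p in enriched_pathways(query, pathways, background):
        print(name, p)
```

The sketch uses exact integer binomial coefficients, so it needs no external libraries; for genome-scale backgrounds a statistics library would likely be the more practical choice.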
Among the public pathway databases, the KEGG Pathway database [22], for example, has a collection of generic pathways mostly derived from known biochemical reactions rather than from how individual organisms execute the reactions. Hence these generic pathways can be considered a superset of the corresponding pathways specific to individual organisms, i.e., not every reaction in a KEGG pathway is encoded in every organism [23]. Mapping these generic pathways to specific organisms therefore generally requires manual examination to ensure the mapping quality. The SEED Subsystem database is another pathway resource; each subsystem (pathway) for a specific organism in SEED is constructed by a group of domain experts [24], making its pathway genes more organism-specific and generally more reliable than those of KEGG pathways. Its limitation is that its coverage might not be as high as that of KEGG. For example, the KEGG pathways cover 2,983 E. coli genes while SEED covers only 2,181, although exceptions exist: KEGG covers 2,296 B. subtilis genes while SEED covers 2,303.

We have previously developed a software tool, KOBAS [11], for enrichment analyses of KEGG pathways, which has been widely used since its publication [25]. Here we present a new tool for enrichment analyses against SEED subsystems, called SEAS (SEED-based Enrichment Analysis System). SEAS provides three ways of mapping genes to subsystems, namely gene ID, orthology or homology mapping, chosen based on the availability of the relevant information, and identifies the statistically enriched pathways in SEED. We have extensively tested the performance of SEAS by re-annotating known pathways of E. coli and B. subtilis in SEED, and found that the mapped pathways are highly reliable, achieving 79% precision and 95% coverage for E. coli and 66% precision and 74% coverage for B. subtilis.
Our additional evaluation results on microarray data and a newly sequenced genome suggest that SEAS can identify more organism-specific pathways than KEGG-based pathway annotation. To the best of our knowledge, SEAS is the first software for SEED pathway enrichment analysis.

Results and Discussion

The workflow of SEAS consists of two main steps, as shown in Figure 1: (a) it first maps the query genes to SEED subsystems based on sequence similarity search or ID mapping; and (b) it then compares the ratio of the query genes out of all the genes in each mapped subsystem versus the ratio of the query genes out of the whole gene set of the query genome, or some other background ratio prepared by the user, and identifies significantly enriched subsystems.

Figure 1. A schematic representation of the SEAS workflow. Each rectangle represents a program, each cylinder represents a database, and the others are flat text files for input, output or intermediate results.

Gene mapping to pathways by multiple strategies

Mapping the query genes to pathways involves searching the well-annotated gene database in SEED, which currently has 1,414 organisms. We have implemented three strategies in SEAS, one of which is used depending on the availability of the relevant information. When the query genes are already in SEED, we use the original (pathway) annotation in SEED, either directly if the SEED ID is available for the query, or through ID mapping using the NCBI GI number as the universal ID. When the genes are not covered but their genome is available in SEED, we use the mapping results between the query genes and the pathway genes in SEED given by the official RAST server using Bi-Directional Best Hit (BDBH) [26], or the mapping results from our own P-MAP program [27] when operons for the query genome are available.
P-MAP uses both high sequence similarity and operon information for orthologous gene mapping, and hence tends to give more accurate mapping results than BDBH when it is applicable. When neither of these two methods provides useful mapping results, which could be true for partially sequenced genomes and metagenomes, we use NCBI BLAST (blastx for DNA queries, blastp for protein queries; see Material and Methods for the E-value cutoff) to compare the query genes/proteins against one or more reference genomes in SEED specified by the user, and select the top hit with known annotation in SEED. SEAS lets the user choose which of these options to use for gene mapping. The first two strategies have been well evaluated in the original papers on SEED [24], RAST [26] and P-MAP [27], so we focus on the assessment of the third strategy. Specifically, we re-annotate the pathways of E. coli and B. subtilis (already in SEED) based on SEED pathways encoded by other genomes (as references). The