Abstract

   Pathway enrichment analysis provides a knowledge-driven approach to
   interpret differentially expressed genes associated with disease
   status. Many tools have been developed to analyze a single study.
   However, when multiple studies of different conditions are jointly
   analyzed, novel integrative tools are needed. In addition, pathway
   redundancy introduced by combining multiple public pathway databases
   hinders interpretation and knowledge discovery. We present a
   meta-analytic integration tool, Comparative Pathway Integrator (CPI),
   to address these issues using adaptively weighted Fisher’s method to
   discover consensual and differential enrichment patterns, a tight
   clustering algorithm to reduce pathway redundancy, and a text mining
   algorithm to assist interpretation of the pathway clusters. We applied
   CPI to jointly analyze six psychiatric disorder transcriptomic studies
   to demonstrate its effectiveness, and found functions confirmed by
   previous biological studies as well as novel enrichment patterns. CPI’s
   R package is accessible online on Github metaOmics/MetaPath.

   Keywords: pathway, meta-analysis, text mining

1. Introduction

   In a typical transcriptomic study, a set of candidate genes associated
   with diseases or other outcomes are first identified through
   differential expression analysis. Then, to gain more insight into the
   underlying biological mechanism, pathway analysis (also known as gene
   set analysis) is usually applied to pursue functional annotation of the
   candidate biomarker list. The goal behind pathway analysis is to
   determine whether the detected biomarkers are enriched in pre-defined
   biological functional domains. These functional domains might come from
   one of the publicly available databases such as GO [[38]1], Reactome
   [[39]2] and KEGG [[40]3], or one of the integrated pathway collection
   such as MSigDB [[41]4] and Pathway Commons [[42]5]. Three main
   categories of pathway analysis methods have been developed in the past
   decade. The first category of methods called “over-representation
   analysis" considers biomarkers under a certain cutoff of differential
   express (DE) evidence and statistically evaluates the fraction of DE
   genes in a particular pathway found among the background genes. Without
   a hard threshold, the second category “functional class scoring" takes
   the DE evidence scores of all genes in a pathway into account and
   aggregates them into a single pathway-specific statistics. The third
   category “pathway topology" further incorporates the information of
   gene-gene interaction and their cellular location in addition to the
   pathway database. Details of general pathway enrichment analysis review
   can be found in [[43]6].

   Many transcriptomic datasets have been generated with the rapid
   advances of high-throughput experimental technologies in the past
   decade. Meta-analysis, a set of statistical methods for combining
   multiple studies of a related hypothesis, has thus become popular
   [[44]7]. Some methods have been developed for the pathway
   meta-analysis. Shen and Tseng [[45]8] developed two approaches of
   meta-analysis for pathway enrichment by combining DE evidence at the
   gene level (MAPE_G) or at the pathway level (MAPE_P). Nguyen et al.
   [[46]9] proposed a robust bi-level pathway meta-analysis by adding an
   intra-experiment level analysis and another data-driven meta-analysis
   approach, DANUBE [[47]10], using unbiased empirical distribution.
   However, in many real applications, when multiple datasets for a common
   biological hypothesis are available but possibly performed under
   different conditions (e.g., different tissues, different cell
   composition or different experimental platforms), it becomes necessary
   to detect both pathways enriched consistently in all studies
   (consensually enriched pathways) and pathways enriched in partial
   studies (differentially enriched pathways). One naïve way is to
   identify the enriched pathways in each study individually and manually
   check whether a certain pathway is enriched in one or multiple studies
   (e.g., using Venn diagram for enriched pathways in each study under a
   certain FDR threshold). This approach is sensitive to the choice of FDR
   threshold and is ad hoc in drawing a final conclusion. To avoid an
   arbitrary significance threshold, Plaisieret et al. [[48]11] and Cahill
   et al. [[49]12] have proposed rank-rank hypergeometric overlap (RRHO)
   plot to visualize contrasting enrichment significance under all
   continuous significance level of two studies. This approach is,
   however, limited to comparing two studies. To the best of our
   knowledge, there is currently no available statistical tool that can
   achieve the goal of integrating pathway enrichment of multiple studies
   in an automated and systematic manner for characterizing consensually
   and differentially enriched pathways.

   A second issue emerges with pathway enrichment analysis is the pathway
   redundancy across pathway databases. Researchers often have difficult
   time to infer and interpret the underlying biological mechanism without
   presumed bias due to the large number of pathways identified. This kind
   of redundancy frequently occurs in a regular pathway enrichment
   analysis since different pathway databases contain similar annotated
   pathways with highly overlapped genes. The DAVID Bioinformatics
   Resources [[50]13] partially resolved this issue by clustering pathways
   based on a kappa statistic representing the pathway similarity.
   However, the users still had to manually inspect each pathway in a
   cluster. Due to the long and vague descriptions in many pathways, users
   can still struggle to reach a solid conclusion from the results.

   In light of the aforementioned drawbacks in existing tools, we propose
   a meta-analytic integrative framework to combine multiple
   transcriptomic studies for identifying consensual and differential
   pathway enrichment, wrapped in a tool named Comparative Pathway
   Integrator (CPI). CPI incorporates 25 pathway databases, including GO
   [[51]1], Reactome [[52]2], KEGG [[53]3] and MSigDB [[54]4] or
   user-defined gene set lists, as reference of pathway analysis. In order
   to identify both consensual and differential enriched pathways across
   studies, we applied the adaptively weighted Fisher’s method [[55]14],
   which was originally developed to combine p-values from multiple omics
   studies for detecting homogeneous and heterogeneous differentially
   expressed genes. Next, cluster analysis based on pathway similarity
   (defined by gene overlap) is applied to remove the level of pathway
   redundancy. But unlike DAVID, we adopt a tight clustering algorithm
   similar to [[56]15] to allow scattered pathways without being clustered
   and derive tight pathway clusters. Subsequently, we developed a text
   mining algorithm to automate the annotation of the tight pathway
   clusters by extracting keywords from pathway descriptions, which also
   offers more statistically valid summarization compared to leaving user
   to manually explore pathways in a cluster. Lastly, CPI provide users
   with both spreadsheets and graphical outputs for intuitive
   visualization, statistically solid presentation and insightful
   interpretation. CPI has a standalone R package as well as being
   disseminated into MetaOmics, an analysis pipeline and browser-based
   software suite for transcriptomic meta-analysis [[57]16].

2. Materials and Methods

2.1. Workflow of Comparative Pathway Integrator (CPI)

   CPI is a comprehensive tool incorporating several widely accepted
   mature methods as well as multiple novel algorithms/approaches. It is
   mainly composed of three steps ([58]Figure 1). The first step
   ([59]Section 2.2) performs meta-analytic pathway analysis, which
   integrates pathway enrichment analysis and meta-analysis. This step
   partially resembles the previous in-house work of R package MetaPath
   [[60]8] but with advanced features. While MetaPath focuses on detecting
   consensually enriched pathways, CPI will detect both consensual and
   differentially enriched pathways, providing valuable information on how
   the patterns of pathway enrichment differ across studies. The second
   step ([61]Section 2.3) implements pathway clustering. This step aims to
   reduce redundancy of the pathway information from commonly hundreds of
   enriched pathways to only handful (usually 5–10) pathway clusters. The
   results are more succinct and interpretable. The third step
   ([62]Section 2.4) includes text mining based on pathway names and
   descriptions to find keywords characterizing the intrinsic biological
   functions of each pathway cluster. A permutation-based statistical test
   is performed to assess if a specific biological noun phrase appears
   significantly more than by chance. Without this step, it would be
   difficult to avoid subjective biases from users and to objectively
   identify the representative biological mechanisms for each pathway
   cluster since clustering of the pathways does not fundamentally reduce
   the total number of pathways under investigation. Finally, we generate
   graphical and spreadsheet outputs of pathway p-value matrices and
   pathway clustering details including gene composition and functional
   keywords.

Figure 1.

   [63]Figure 1
   [64]Open in a new tab

   Workflow of Comparative Pathway Integrator (CPI).

2.2. Meta-Analytic Pathway Analysis

   Compared to differential expression analysis, pathway enrichment
   analysis provides more biological insight in a more systematic and
   comprehensive manner. In CPI, we allow multiple methods of over
   representation analysis since recent comparative studies
   [[65]17,[66]18] have shown little additional advantages using
   sophisticated functional class scoring or pathway topology methods. But
   if such advanced pathway enrichment analysis is preferred, users can
   externally implement the pathway analysis and CPI can accept lists of
   significant pathways and the corresponding p-values as alternative
   input (Input 2 in [67]Figure 1). Given pathway enrichment results, we
   perform adaptively-weighted Fisher’s (AW-Fisher) method [[68]14,[69]19]
   for meta-analysis, to identify pathways significant in one or more
   studies/conditions. AW-Fisher not only increases statistical power, but
   also provides a 0/1 binary weight for each study, indicating whether a
   study contributes to the meta-analytic significance. Given a
   user-specified q-value cutoff, we obtain a list of significant
   pathways, with 0/1 binary weights indicating whether a pathway is
   significantly enriched across most or all studies/conditions (i.e.,
   consensually enriched pathways) or only in partial studies (i.e.,
   differentially enriched pathways). For example, in [70]Section 3, the
   “GO:MF kinase activity” pathway has raw enrichment p-values
   [MATH:
   <mrow><mrow><mo>(</mo><mn>0.26922</mn><mo>,</mo><mn>0.17773</mn><mo>,</
   mo><mn>0.06485</mn><mo>,</mo><mn>2.04</mn><mo>×</mo><msup><mn>10</mn><m
   row><mo>−</mo><mn>5</mn></mrow></msup><mo>,</mo><mn>0.00449</mn><mo>,</
   mo><mn>0.018922</mn><mo>)</mo></mrow></mrow> :MATH]
   for the six studies. AW-Fisher meta-analysis generates combined p-value
   =
   [MATH:
   <mrow><mrow><mn>5.52</mn><mo>×</mo><msup><mn>10</mn><mrow><mo>−</mo><mn
   >6</mn></mrow></msup></mrow></mrow> :MATH]
   (q-value =
   [MATH: <mrow><mrow><mn>0.00014</mn></mrow></mrow> :MATH]
   ) with adaptive weights = (0,0,1,1,1,1), showing enrichment in the last
   four bipolar studies or major depressive disorder studies but not in
   the first two schizophrenia studies.

2.3. Pathway Clustering for Reducing Redundancy and Enhancing Interpretation

   Because of the nature of pathway definitions (e.g., hierarchy structure
   or overlapping functions), many genes are shared among different
   pathways. Similar pathways can also repeat in different pathway
   databases with slightly different gene composition, annotation or
   description. Such redundancies often stumble interpretation of pathway
   analysis results. In CPI, we perform pathway clustering to reduce the
   redundancy among detected pathways. The similarity between different
   pathways is calculated based on kappa statistics [[71]20], which
   depends on how many genes are mutually identical or exclusive among
   those pathways. The kappa statistics represents the dissimilarity
   between two pathways based on the genes composing each pathway. Based
   on the dissimilarity matrix of all pathway pairs, consensus clustering
   [[72]21] is used to estimate the number of clusters. Following the
   original consensus clustering method, an elbow plot and consensus CDF
   plot are generated to assist users to decide the number of clusters.

   We next assign detected pathways into clusters. For most clustering
   algorithms including the aforementioned consensus clustering, all
   pathways are forced into clusters although it is well-known that
   leaving scattered subjects out of clusters often generate tighter
   clusters and improve the clustering performance in such
   high-dimensional data [[73]15,[74]22,[75]23]. In CPI, we allow
   scattered pathways to form singletons, when its gene composition is
   largely different from representative pathway clusters, to avoid adding
   outliers to the pathway clusters. To improve the tightness of the
   clusters, we further calculated for each pathway the silhouette width
   [[76]24], a measure of how tightly each pathway is grouped in its
   cluster, and removed the scattered pathways with low silhouette width
   iteratively until all pathways’ silhouette widths are above a certain
   cutoff. The removing cutoff for silhouette width is estimated
   empirically based on its distribution as in our multi-disease
   application in [77]Section 3 (we choose 0.1 in this paper). For the
   identified singleton pathways, we collected them to form a scattered
   pathway set instead of filtering them out. In general applications, we
   recommend users to investigate the identified tight pathway clusters
   first with the subsequent text mining tool introduced in the next
   subsection since these pathway clusters are better annotated in the
   pathway databases and are likely better studied in the current
   biological knowledge domain. We, however, do not discard pathways in
   the scattered set but recommend them in the secondary investigation.

2.4. Text Mining for Automated Annotation and Knowledge Retrieval of Pathway
Clusters

2.4.1. Motivation and Problem Setting

   Although the pathway cluster analysis in [78]Section 2.3 can greatly
   reduce redundancy structure of the detected pathways, users still have
   to manually scan through all pathways in a cluster to grasp its major
   content and biologically driving mechanism, which can be
   labor-intensive and subject to the user’s biased presumption.
   Therefore, we need a more rigorous and statistically meaningful summary
   of the pathway cluster to guide an unbiased interpretation. The above
   goal is expressed here as the text mining for key noun phrases of each
   pathway cluster: which noun phrase appears more frequently in a certain
   pathway cluster than by chance in statistical sense? We will therefore
   treat these noun phrases as the potentially representative entities
   (mechanism) for the pathway cluster. The entity is counted based on the
   number of pathways containing it, rather than the frequency of it
   appearing in all pathway descriptions in a cluster. For instance, in a
   certain cluster, if "T cell" occurs six times in 3 pathway
   descriptions: three times in pathway #1, twice in pathway #2 and once
   in pathway #3, T cell" is counted 3 occurrences even though it appears
   6 times in total.

2.4.2. Pathway-Phrase Matrix

   For each pathway description, we firstly extracted unique noun phrases
   from it. This step was done using the spacy_extract _nounphrases
   function from R package spacyr [[79]25] which is an R wrapper around
   the Python spaCy package [[80]26]. spaCy is an industrial strength
   text-mining package employing a large library database as well as some
   machine learning algorithms to detect information from texts. The stop
   words in English, such as “the”, “a”, “that”, which are common and
   carry no important information, are removed from those noun phrases by
   using the English stop words database from R package tm [[81]27]. After
   removing all stop words, the last word of each noun phrases, i.e., the
   central noun of a noun phrase, is lemmatized (converting plural form to
   singular form) by the lemmatize_words function in textstem [[82]28].
   The top 5000 common English words [[83]29] were then filtered out from
   the result noun phrases of length one. A text mining process of an
   example sentence is shown in [84]Figure 2.

Figure 2.

   [85]Figure 2
   [86]Open in a new tab

   Workflow of noun phrase extraction.

   In total, we provide 25 pathway databases (GO, KEGG, BioCarta,
   Reactome, Phenocarta, etc.) with 26,801 pathways in CPI for users to
   select in the analysis. The above preprocessing and filtering steps
   were repeated for each pathway to generate standard noun phrases for
   all pathways. Based on the results, we constructed a binary matrix
   where each row being a noun phrase and each column being a pathway with
   element
   [MATH:
   <mrow><mrow><msub><mi>w</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo
   >=</mo><mn>1</mn></mrow></mrow> :MATH]
   indicating the pathway description j contains the noun phrase i and 0
   otherwise.

   Once the matrix was constructed, R package wordnet [[87]30] was used to
   identify synonyms from row names of the matrix (noun phrases). When a
   pair of synonyms are identified, the phrase with lower occurrence in
   all pathways are combined with the phrase with higher occurrence. Then
   the row of less occurred phrase was deleted. Since in later text mining
   of pathway clusters, a phrase needs to at least occur in two pathways
   to be considered, all rows of phrases which occurred only once in the
   26,801 pathways were deleted. As a result, a matrix of 36,037 rows and
   26,801 columns was constructed.

   For later penalized permutation test, the above text mining matrix
   construction procedure was also applied to pathway names of 26,801
   pathways, producing a similar matrix of 36,037 rows and 26,801 columns
   with
   [MATH:
   <mrow><mrow><msub><mi>v</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo
   >=</mo><mn>1</mn></mrow></mrow> :MATH]
   indicating the pathway name j contains the noun phrase i and 0
   otherwise. For a given pathway (e.g., GO:0030964), the pathway name
   (“NADH dehydrogenase complex”) is usually more concise while the
   pathway description (“An integral membrane complex that possesses NADH
   oxidoreductase activity. The complex is one of the components of the
   electron transport chain. It catalyzes the transfer of a pair of
   electrons from NADH to a quinone”) gives a more detailed illustration
   of the pathway.

2.4.3. Test Statistics for Noun Phrase Enrichment Analysis

   A simple strategy to test for the significance of a phrase frequently
   appearing in a cluster is by simple counting and conducting Fisher
   exact test. Yet we found this method to be less powerful and less
   biologically justifiable from real data analysis, because the phrases
   in the term name or a shorter description of a pathway are deemed to be
   more representative than those in a full or longer description. In
   other words, phrases appearing in a pathway with shorter description
   should statistically contribute more weights than in a pathway with
   lengthy description because the latter is more likely to happen by
   chance. Therefore, we down-weighted the phrase count with long
   description by assigning a score between 0 and 1 to each pathway j to
   indicate whether it contains phrase i:
   [MATH:
   <mrow><mrow><msub><mi>x</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo
   >=</mo><mfenced separators="" open="{" close=""><mtable
   columnalign="left"><mtr><mtd><mn>1</mn></mtd><mtd><msub><mi>v</mi><mrow
   ><mi>i</mi><mi>j</mi></mrow></msub><mo>=</mo><mn>1</mn><mtext> </mtext>
   <mo
   stretchy="false">(</mo><mi>phrase</mi><mtext> </mtext><mi>i</mi><mtext>
    </mtext><mi>appeared</mi><mtext> </mtext><mi>in</mi></mtd></mtr><mtr><
   mtd></mtd><mtd><mi>the</mi><mi>j</mi><mtext>-th</mtext><mo> </mo><mi>pa
   thway</mi><mtext> </mtext><mi>name</mi><mo
   stretchy="false">)</mo><mo>,</mo></mtd></mtr><mtr><mtd
   columnalign="left"><mrow><mi>e</mi><mi>x</mi><mi>p</mi><mo
   stretchy="false">(</mo><mo>−</mo><mi>α</mi><mo>·</mo><mo>|</mo><msub><m
   i>w</mi><mi>j</mi></msub><mo>|</mo><mo>)</mo></mrow></mtd><mtd
   columnalign="left"><msub><mi>v</mi><mrow><mi>i</mi><mi>j</mi></mrow></m
   sub><mo>=</mo><mn>0</mn><mtext> </mtext><mi>and</mi><mtext> </mtext><ms
   ub><mi>w</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo>=</mo><mn>1</m
   n><mtext> </mtext><mo
   stretchy="false">(</mo><mi>phrase</mi></mtd></mtr><mtr><mtd></mtd><mtd>
   <mi>i</mi><mtext> </mtext><mi>appeared</mi><mtext> </mtext><mi>only</mi
   ><mtext> </mtext><mi>in</mi><mtext> </mtext><mi>the</mi><mtext> </mtext
   ><mi>j</mi><mtext>-th</mtext></mtd></mtr><mtr><mtd></mtd><mtd><mi>pathw
   ay</mi><mtext> </mtext><mi>description</mi><mo
   stretchy="false">)</mo><mo>,</mo></mtd></mtr><mtr><mtd
   columnalign="left"><mn>0</mn></mtd><mtd
   columnalign="left"><mrow><msub><mi>v</mi><mrow><mi>i</mi><mi>j</mi></mr
   ow></msub><mtext> </mtext><mo>=</mo><mtext> </mtext><msub><mi>w</mi><mr
   ow><mi>i</mi><mi>j</mi></mrow></msub><mtext> </mtext><mo>=</mo><mtext>
   </mtext><mn>0</mn></mrow></mtd></mtr></mtable></mfenced></mrow></mrow>
   :MATH]

   where
   [MATH:
   <mrow><mrow><mrow><mo>|</mo></mrow><msub><mi>w</mi><mi>j</mi></msub><mr
   ow><mo>|</mo><mo>=</mo></mrow><msub><mo>∑</mo><mi>i</mi></msub><msub><m
   i>w</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow></mrow> :MATH]
   is the number of unique noun phrases in the description of pathway j
   and
   [MATH: <mrow><mi>α</mi></mrow> :MATH]
   is a parameter controlling the degree of penalty. The greater
   [MATH: <mrow><mi>α</mi></mrow> :MATH]
   is, the greater the penalty is on longer description. When
   [MATH: <mrow><mi>α</mi></mrow> :MATH]
   equals to 0, there is no penalty and our test simplifies to Fisher’s
   exact test with equal weight to pathway names and pathway descriptions.
   Based on evaluation in real data,
   [MATH: <mrow><mrow><mi>α</mi><mo>=</mo><mn>0.05</mn></mrow></mrow>
   :MATH]
   is used in our package. Next, we define cluster score
   [MATH:
   <mrow><mrow><msub><mi>T</mi><mi>i</mi></msub><mrow><mo>(</mo><mi>C</mi>
   <mo>)</mo></mrow></mrow></mrow> :MATH]
   to be the sum of scores of pathways in the cluster, i.e., for phrase i
   in a pathway cluster C, we define the test statistics:
   [MATH:
   <mrow><mrow><msub><mi>T</mi><mi>i</mi></msub><mrow><mo>(</mo><mi>C</mi>
   <mo>)</mo></mrow><mo>=</mo><munder><mo>∑</mo><mrow><mi>j</mi><mo>∈</mo>
   <mi>C</mi></mrow></munder><msub><mi>x</mi><mrow><mi>i</mi><mi>j</mi></m
   row></msub></mrow></mrow> :MATH]

2.4.4. Permutation Test

   To test for the null hypothesis that a phrase is not enriched in a
   certain cluster, we adopt a permutation analysis. For each phrase i in
   the b-th permutation, pathways are randomly sampled to form subset
   [MATH: <mrow><msub><mi>S</mi><mi>b</mi></msub></mrow> :MATH]
   with the same cluster size as C. Test statistics
   [MATH:
   <mrow><mrow><msub><mi>T</mi><mi>i</mi></msub><mrow><mo>(</mo><msub><mi>
   S</mi><mi>b</mi></msub><mo>)</mo></mrow></mrow></mrow> :MATH]
   is recomputed at the end of each permutation. The operation is then
   repeated for a large number of times (say,
   [MATH: <mrow><mrow><mi>B</mi><mo>=</mo><mn>10,000</mn></mrow></mrow>
   :MATH]
   times). Finally, all
   [MATH:
   <mrow><mrow><msub><mi>T</mi><mi>i</mi></msub><mrow><mo>(</mo><msub><mi>
   S</mi><mi>b</mi></msub><mo>)</mo></mrow></mrow></mrow> :MATH]
   ’s form a null distribution and are compared to the observed statistics
   [MATH:
   <mrow><mrow><msub><mi>T</mi><mi>i</mi></msub><mrow><mo>(</mo><mi>C</mi>
   <mo>)</mo></mrow></mrow></mrow> :MATH]
   . And the p-value could be calculated by
   [MATH:
   <mrow><mrow><mi>p</mi><mrow><mo>(</mo><msub><mi>T</mi><mi>i</mi></msub>
   <mrow><mo>(</mo><mi>C</mi><mo>)</mo></mrow><mo>)</mo></mrow><mo>=</mo><
   mfrac><mrow><msubsup><mo>∑</mo><mrow><mi>b</mi><mo>=</mo><mn>1</mn></mr
   ow><mi>B</mi></msubsup><mi>I</mi><mrow><mo>(</mo><msub><mi>T</mi><mi>i<
   /mi></msub><mrow><mo>(</mo><mi>C</mi><mo>)</mo></mrow><mo>≥</mo><msub><
   mi>T</mi><mi>i</mi></msub><mrow><mo>(</mo><msub><mi>S</mi><mi>b</mi></m
   sub><mo>)</mo></mrow><mo>)</mo></mrow></mrow><mi>B</mi></mfrac></mrow><
   /mrow> :MATH]
   , indicating how extremely frequent phrase i is seen in cluster C.
   Multiple comparison is then corrected by Benjamini–Hochberg procedure
   [[88]31] to control false discovery rate (FDR).

2.4.5. Graphical and Spreadsheet Output

   In the final step, CPI outputs visualization tools, including (1)
   heatmap of kappa statistics matrix for pair-wise pathways (see
   [89]Figure 3) (2) heatmap of pathway enrichment p-value matrix
   (pathways sorted by clusters on the rows and studies on the columns)
   (see [90]Figure 4) (3) multi-dimensional scaling (MDS) plot of pathways
   and cluster assignment distributed by kappa statistics (see
   [91]Supplementary Figure S3c) and (4) dendrograms of hierachical
   clustering (distance measured by pathway enrichment p-values) of
   studies in each cluster (see [92]Figure 5). CPI also provides
   diagnostic tools such as CDF plot and scree plot to determine the
   number of clusters in consensus clustering.

Figure 3.

   [93]Figure 3
   [94]Open in a new tab

   Heatmap of kappa statistics of pair-wise pathways in all clusters.

Figure 4.

   [95]Figure 4
   [96]Open in a new tab

   Heatmap of log10-scale pathway enrichment p-values of pathways
   annotated by eight pathway clusters (I-VIII) and a scattered pathway
   set (black).

Figure 5.

   [97]Figure 5
   [98]Open in a new tab

   Hierarchical clustering of psychiatric studies in each cluster with
   distance defined by the log10-scale pathway enrichment p-values.

2.4.6. Datasets and Databases

   We provide 22 Homo sapiens pathway databases in CPI, including 14
   pathway databases from MsigDB (containing GO, KEGG, Reactome, BioCarta,
   and others), 2 databases from Connectivity Map, transcription factor
   target database JASPAR, Protein-Protein interaction database and 3
   microRNA target databases as options for enrichment analysis. In
   addition, GO and KEGG for Mus musculus and Saccharomyces cerevisiae,
   and JASPAR database for Mus musculus are also provided (see
   [99]Supplementary Table S2). Users may choose to apply their own
   pathway databases with the extra computing cost of re-calculating the
   pathway-phrase matrix in [100]Section 2.4.2.

3. Results

3.1. Application to Transcriptomic Data of Multiple Psychiatric Disorders

   To demonstrate its utility, we applied CPI to an integrative analysis
   of a postmortem microarray dataset from three psychiatric disorders
   [[101]32]. Briefly laser-microdissection was used to isolate pools of
   100 pyramidal neurons from layers 3 (L3) and 5 (L5) from dorsolateral
   prefrontal cortex (DLPFC). Samples were collected from schizophrenia
   (SCZ), bipolar disorder (BP), and major depressive disorder (MDD)
   subjects and unaffected comparison subjects matched for age and sex.
   Samples consisted of both cell types from all subjects except for the
   L5 SCZ sample where two subjects were removed due to quality control
   issues. Following identification of differentially-expressed genes in
   each of the 6 diagnostic categories (cells from two layers and three
   different diagnoses) four default pathway databases in CPI (Gene
   Ontology, KEGG, Reactome and BioCarta) were used for this application.

   Low expression and non-informative genes were first filtered out by
   quantile filtering and then differential expression (DE) analysis was
   conducted by limma embedded in the MAPE2.0 function from our package
   which allows both raw transcriptomic data input as well as DE p-value
   matrix provided by users. Pathway enrichment analysis was performed in
   each study using the top 400 DE genes and the results were meta
   analyzed by adaptively weighted Fisher’s method in CPI to obtain the
   final integrative p-values and q-values in each pathway. We filtered
   out pathways containing less than 15 genes or more than 500 genes in
   the pathway databases. Of the 1901 pathways analyzed, 96 pathways had
   meta-analyzed q-values smaller than 0.0005 and were entered for pathway
   cluster analysis. The number of pathway clusters were selected to be 8
   which was justified by the elbow plot and consensus CDF plot (see
   [102]Supplementary Figure S2a,b) and 18 pathways were left out as
   scattered pathways.

   [103]Figure 3 displays heatmap of kappa statistics of pair-wise
   pathways, sorted by the 8 identified pathway clusters and a scattered
   pathway set (black color). [104]Figure 4 shows heatmap of
   log10-transformed pathway enrichment p-values with pathways on the rows
   and six studies on the columns. Dendrograms of hierachical clustering
   of studies in each cluster are shown in [105]Figure 5. [106]Table 1
   contains 10 functionally annotated phrases identified from the
   penalized text mining algorithm for each pathway cluster. We note that,
   by our algorithm, heatmap pattern of pathways in the same cluster in
   [107]Figure 4 may not visualize similarly since the pathway clusters
   are obtained by kappa statistics, representing similarity of gene
   content of any pair of pathways, rather than pathway enrichment
   p-values. But in general, we do observe clear pattern in almost all 8
   clusters. For example, cluster VI, VII and VIII contain highly enriched
   pathways in SCZ-L3 and SCZ-L5, marginal enrichment in BP-L3 and BP-L5
   but almost no significance in MDD-L3 and MDD-L5. Based on text mining
   results, clusters VI and VIII contain pathways related to
   mitochondrion, ATP synthesis, NAD, etc. Our results also suggest these
   alternations across DLPFC layer 3 and layer 5 are mainly related to ATP
   production rather than other aspects of mitochondrial function. Cluster
   VII with keywords degradation, multiubiquitination, ubiquitin 26s
   proteasome system, etc. is significantly altered in schizophrenia DLPFC
   layers and to a lesser extent in bipolar disorder. Similar results have
   been reported in a blood-based microarray investigation of both
   schizophrenia and bipolar disorder [[108]33].

Table 1.

   Ten significant keywords with q-value < 0.05 in each pathway cluster.
   Cluster Keywords
   I insulin, NGF, focal adhesion, BDNF, neurotrophins, Trk tyrosine
   kinase receptor, insulin receptor substrate, insulin receptor tyrosine
   kinase, Ras MAPK pathway, FAK
   II neuron
   III transcription, nucleoplasm, chromosome, nuclear content, nucleolus,
   RNA
   IV metabolism, mRNA, ribosome, replication, chemical reaction, cRNA,
   vRNA, viral protein, NUMB, nucleus
   V cell death, apoptotic process, activation, endogenous cellular
   process, programmed cell death, apoptosis
   VI mitochondrion, organelle, mitochondrial envelope, organelle
   envelope, lipid bilayer, inner elumen facing lipid bilayer,
   semiautonomous self replicating organelle, tissue respiration,
   virtually eukaryotic cell, cytoplasm
   VII degradation, APC/C, apoptosis, CDC20, CDH1, mitotic protein, MHC,
   multiubiquitination, ubiquitin 26s proteasome system, exogenous antigen
   VIII respiratory electron transport, ATP synthesis, inner mitochondrial
   membrane, chemiosmotic gradient, brown fat, rotenone, FAD,
   mitochondrial matrix, body temperature, NAD
   [109]Open in a new tab

   Abbreviation. NGF: Nerve growth factor, BDNF: Brain-derived
   neurotrophic factor, MAPK: Mitogen-activated protein kinase, FAK: Focal
   adhesion kinase, RNA: Ribonucleic acid, APC/C: Anaphase-promoting
   complex, MHC: Major histocompatibility complex, ATP: Adenosine
   triphosphate, FAD: Flavin adenine dinucleotide, NAD: Nicotinamide
   adenine dinucleotide, NUMB, CDC20, CDH1: Gene names.

   The result of cluster VI, VII and VIII is consistent to several
   biological findings in the literature. Firstly, our results are highly
   consistent with the original publication in [[110]32]. Secondly, the
   paper [[111]34] analyzed a mostly non-overlapping schizophrenia cohort
   and also showed that the differential expression genes at the layer 3
   and/or layer 5 pyramidal cells in the DLPFC of schizophrenia subjects
   are mainly related to mitochondrial (MT) and ubiquitin-proteasome
   system (UPS) functions. The findings were followed up with qPCR
   validation in selected target genes. This is again consistent to our
   findings in cluster VI, VII and VIII. Finally, it has been shown that
   the synaptic area is particularly sensitive to MT and UPS deficits due
   to the high demand for ATP and for UPS activity at pre- and post-
   synaptic terminals [[112]35], which is consistent with our results
   suggesting ATP production as the main aspect of the mitochondrial
   dysfunction in schizophrenia diseases.

   Cluster I and cluster IV had a different pattern of pathway
   alterations. Cluster IV with keywords metabolism, mRNA, ribosome, viral
   protein, NUMB, etc. is significantly altered only in schizophrenia
   DLPFC layers, which indicates protein synthesis dysfunction. However,
   similar altered expression of gene sets related to protein synthesis
   has been found in postmortem hippocampus and orbitofrontal cortex of
   patients with major depression, bipolar disorder, and schizophrenia
   consensually [[113]36]. This implies different degrees of protein
   synthesis pathway alterations in different brain tissues. Cluster I
   with keywords insulin, NGF, BDNF, neurotrophins, Trk tyrosine kinase
   receptor, etc. shows an enrichment pattern mainly in MDD and BP-L5 with
   little enrichment in SCZ of BP-L3. This suggests some similarity
   between the expression of these gene in layer 5 between BP and MDD
   subjects.

   Pathways in cluster III are enriched in layer 3 of all three diseases
   (SCZ-L3, BP-L3 and MDD-L3) but to a lesser extent in layer 5 (SCZ-L5,
   BPL5 and MDD-L5). This cluster is annotated with keywords such as
   transcription, nucleoplasm, chromosome, nuclear content, nucleolus,
   RNA, indicating alterations in general aspects of nuclear function.
   Finally, cluster II and V with moderate enrichment in all six studies
   contain general neural disease related pathways with keywords such as
   neuron, cell death and apoptotic process and likely represent processes
   involved in neuronal survival.

   Interestingly hierarchical clustering of the six categories of samples
   (cells from two layers and three different diagnoses) for each of the
   eight identified pathway clusters shows a similar pattern.
   Specifically, each of the 8 pathway clusters are most tightly clustered
   by diagnosis across layers suggesting similar alterations in layers 3
   and 5 within a disease. Furthermore, across diagnoses BP and SCZ
   cluster more strongly with one another than with MDD. This is
   consistent with transcriptome [[114]37,[115]38,[116]39] and genomic
   [[117]40] findings suggesting similarities between SCZ and BP.

3.2. Justification to Penalize Pathway Description by Length in Text Mining

   In pathway databases, some pathways come with only pathway names
   (usually less than 15 words) and some contains both pathway names and
   pathway descriptions (can be up to 1500 words). In Fisher’s exact test
   by simple count, occurrences of a noun phrase are treated the same when
   appearing in these two extreme cases. In this case, signals of
   important mechanism terms in pathway names can be masked due to its
   frequent occurrence in pathway descriptions, while some non-informative
   terms can be falsely detected from long pathway descriptions. The
   penalization by pathway description length in [118]Section 2.4.3 helps
   improve the sensitivity and reduce false positives. For example, in the
   real application in [119]Section 3, the noun phrase apoptosis in
   cluster 7 was ranked low in the Fisher’s exact test (r = 37, p = 0.051)
   while prioritized in permutation analysis of the penalized test
   statistics (r = 4, p <
   [MATH:
   <mrow><msup><mn>10</mn><mrow><mo>−</mo><mn>4</mn></mrow></msup></mrow>
   :MATH]
   ) (see [120]Supplementary Table S1). Some meaningless words such as et
   al in cluster 7 was ranked high in Fisher’s exact test (r=11, p =
   [MATH:
   <mrow><mrow><mn>5.08</mn><mo>×</mo><msup><mn>10</mn><mrow><mo>−</mo><mn
   >7</mn></mrow></msup></mrow></mrow> :MATH]
   ), but ranked low by penalized statistics (r = 27 and p = 0.005). In
   our experience,
   [MATH: <mrow><mrow><mi>B</mi><mo>=</mo><mn>10,000</mn></mrow></mrow>
   :MATH]
   permutations are sufficient to obtain accurate p-value assessment while
   under a reasonable computing time. The entire permutation analysis for
   the example in [121]Section 3 required only 1.5 min under parallel
   computing using ten cores.

4. Discussion

   CPI has three advantages compared to existing methods. Firstly, CPI
   explores consensual and differential pathway enrichment pattern
   simultaneously when combining multiple related studies. To our
   knowledge, CPI is the first method for this purpose. Secondly, CPI
   clusters pathways by gene composition similarity (i.e., kappa
   statistics) to reduce pathway redundancy. Finally, CPI uses a
   statistically evaluated text mining method to annotate mechanisms of
   each pathway cluster automatically without subjective human
   interpretation. In addition, the proposed penalized text mining
   algorithm by permutation test was shown to outperform conventional
   Fisher’s exact test in text mining. We applied the tool to six
   transcriptomic datasets spanning on three psychiatric disorders (SCZ,
   BP and MDD) and two layers in DLPFC (L3 and L5). The result identified
   multiple pathway clusters with enrichment patterns consistent with
   previous findings, such as mitochondrial ATP dysfunction in
   schizophrenia DLPFC layers, as well as other new findings.

   The current CPI package has several limitations. Firstly, for single
   study pathway analysis, our tool currently only provides Fisher’s exact
   test and Kolmogorov–Smirnov test. Users, however, can externally apply
   advanced methods such as GSEA [[122]41] or others and take the result
   as alternative input. Secondly, our text mining algorithm relies on the
   descriptions provided by pathway databases. For pathway databases
   without detailed descriptions (e.g., in some KEGG pathways), text
   mining algorithm cannot annotate them well. Thirdly, computation time
   is not ignorable, especially in the text mining step. To incorporate
   more studies or more pathway databases, scalable computing algorithms
   will be needed.

5. Conclusions

   In this article, we developed an integrative framework for combining
   and comparing pathway analyses from multiple transcriptomic studies,
   namely Comparative Pathway Integrator (CPI). CPI performs meta-analytic
   pathway analysis, reduces pathway redundancy to condense knowledge
   discovered from the results and conducts text mining to provide
   statistically solid inference on interpreting results. CPI has three
   major steps. In the first step, users can input either gene-based
   differential expression p-value matrix or pathway-based enrichment
   p-value matrix for each study to start with. If p-values of genes are
   entered, pathway enrichment analysis is applied first within each
   study. Enriched pathways are passed down to meta-analysis where
   AW-Fisher is applied to discover consensually and differentially
   enriched pathways. In the second step, significant pathways from
   AW-Fisher meta-analysis are clustered using consensus clustering, with
   consensus CDF plot and elbow plot to assist users to choose the number
   of clusters [[123]21]. Silhouette information is used to achieve
   cluster tightness by removing scattered pathways without being
   clustered. In the third step, a penalized text mining algorithm is used
   to annotate each pathway cluster for an unbiased knowledge learning
   from the experimental data and pathway database. With the penalized
   matrix provided in package, CPI requires 2.8 min under parallel
   computing using ten cores to integrate six studies and 9888 genes in
   the psychiatric example with 1901 pathways in default pathway databases
   and B = 10,000 in permutation analysis. An R package is available at
   Github metaOmics/MetaPath.

   In summary, CPI is a meta-analytic tool for integrating multiple
   related transcriptomic studies in pathway enrichment analysis. As more
   and more transcriptomic datasets accumulate in the public domain, the
   need of such integrative analysis will become more and more prevalent.
   CPI can fill the gap and provide biological insight in such comparative
   and integrative tasks. In addition to transcriptomic studies, the
   framework is readily extensible to integrate pathway analysis of
   multiple related proteomic studies or multiple metabolomics studies.

Supplementary Materials

   The following are available online at
   [124]https://www.mdpi.com/2073-4425/11/6/696/s1, Figure S1: p-values
   and q-values from individual pathway analysis and meta-analysis and
   AW-fisher weight for each pathway in each study, Figure S2: CDF plot of
   consensus clustering and delta plot of consensus clustering, Figure S3:
   Heatmap for Kappa statistic of pair-wise pathways and heatmap of
   log10-scale pathway enrichment p-values of pathways and MDS plot of
   pathways and cluster assignment distributed by kappa statistics and
   Hierarchical clustering of psychiatric studies in each cluster, Table
   S1: Fisher-exact test 2 × 2 table for “apoptosis” in cluster 7, Table
   S2: 22 Homo sapiens pathway database including 14 pathway database from
   MsigDB, 2 databases from Connectivity map, transcription factor target
   database JASPAR, Protein-Protein interaction database, 3 microRNA
   databases and 1 phenotype database as options for enrichment analysis.
   [125]Click here for additional data file.^ (2.1MB, zip)

Author Contributions

   Conceptualization, C.-W.L., T.M. and G.C.T.; Methodology, X.Z., W.Z.,
   C.-W.L., Z.F., T.M. and G.C.T.; Software, X.Z., W.Z. and C.-W.L.;
   Validation, W.Z., D.A.L., J.F.E. and G.C.T.; Data Curation, X.Z.,
   D.A.L. and J.F.E.; Writing—Original Draft Preparation, X.Z., C.-W.L.,
   Z.F., T.M. and G.C.T.; Writing—Review & Editing, W.Z., D.A.L., J.F.E.
   and G.C.T.; Supervision, G.C.T.; Funding Acquisition, D.A.L. and G.C.T.
   All authors have read and agreed to the published version of the
   manuscript.

Funding

   This work was funded by the National Institute of Health R01CA190766
   and R21LM012752 to X.Z., W.Z., C.-W.L., Z.F., T.M. and G.C.T.

Conflicts of Interest

   The authors declare no conflict of interest.

References