Abstract

Background

   In silico functional genomics have become a driving force in the way we
   interpret and use gene expression data, enabling researchers to
   understand which biological pathways are likely to be affected by the
   treatments or conditions being studied. There are many approaches to
   functional genomics, but a number of popular methods determine if a set
   of modified genes has a higher than expected overlap with genes known
   to function as part of a pathway (functional enrichment testing).
   Recently, researchers have started to apply such analyses in a new way:
   to ask if the data they are collecting show similar disruptions to
   biological functions compared to reference data. Examples include
   studying whether similar pathways are perturbed in smokers vs. users of
   e-cigarettes, or whether a new mouse model of schizophrenia is
   justified, based on its similarity in cytokine expression to a
   previously published model. However, there is a dearth of robust
   statistical methods for testing hypotheses related to these questions
   and most researchers resort to ad hoc approaches. The goal of this work
   is to develop a statistical approach to identifying gene pathways that
   are equivalently (or inversely) changed across two experimental
   conditions.

Results

   We developed Equivalent Change Enrichment Analysis (ECEA). This is a
   new type of gene enrichment analysis based on a statistic that we call
   the equivalent change index (ECI). An ECI of 1 represents a gene that
   was over or under-expressed (compared to control) to the same degree
   across two experiments. Using this statistic, we present an approach to
   identifying pathways that are changed in similar or opposing ways
   across experiments. We compare our approach to current methods on
   simulated data and show that ECEA is able to recover pathways
   exhibiting such changes even when they exhibit complex patterns of
   regulation, which other approaches are unable to do. On biological
   data, our approach recovered pathways that appear directly connected to
   the condition being studied.

Conclusions

   ECEA provides a new way to perform gene enrichment analysis that allows
   researchers to compare their data to existing datasets and determine if
   a treatment will cause similar or opposing genomic perturbations.

   Keywords: Functional genomics, Enrichment, Equivalent change

Background

   In silico functional genomics have become a standard approach in
   enabling researchers to use transcriptomics to understand biological
   pathways or molecular functions affected by the treatments or
   conditions they are studying. For example, a published protocol for the
   DAVID Bioinformatics Resources has over 15,000 citations [[27]1]. A
   paper discussing another popular approach, called Gene Set Enrichment
   Analysis (GSEA), has been cited over 13,000 times [[28]2]. Most methods
   determine if a set of modified genes has a higher than expected overlap
   with genes known to function as part of a pathway (functional
   enrichment testing) [[29]1–[30]3]. Perhaps the simplest way of doing
   such an analysis is to perform overrepresentation analysis (ORA) and
   test if a set of genes (perhaps those that were statistically
   significantly differentially expressed between the comparator groups)
   overlaps a list of genes in a biological pathway more than what would
   be expected by chance [[31]1, [32]4, [33]5]. Several important
   annotations of biological processes or pathways have been developed to
   facilitate such analyses. One of the best known is the Gene Ontology
   [[34]6] (GO), although The Kyoto Encyclopedia of Genes and Genomes
   (KEGG) [[35]7] and Reactome [[36]8] have additional annotations for the
   relationships between genes in a pathway. The statistical significance
   of overrepresentation of a significant set of genes in the genes of a
   pathway can be tested simply using the hypergeometric test. Other
   methods have been developed to improve the power and reliability of
   these tests, including GSEA [[37]2]. GSEA identifies sets of genes that
   group together near the top or bottom of a list of genes ranked by
   degree of differential expression (typically log[2]-fold change) more
   than one would expect by chance. There is no requirement that
   individual genes be statistically significant by whatever metric is
   used.

   Now, some researchers are asking a different version of this question:
   i.e., they want to know if the data they are collecting show similar
   functional disruptions as compared to some reference data. For example,
   Shen, et al. studied whether similar pathways were perturbed in smokers
   vs. users of e-cigarettes [[38]9]. Gil-Pisa, et al. justified the use
   of their mouse model of schizophrenia based on its similarity in
   cytokine expression to a previously published model [[39]10].
   Martins-de-Souza, et al. showed that responders showed the same
   pathways were affected, but in opposite directions, in poor vs. good
   responders to anti-psychotics [[40]11]. Clearly, this idea has many
   potential applications. Unfortunately, we currently lack statistically
   sound approaches for most such analyses.

   One possible approach is perform enrichment analysis separately for
   genes that are up and down regulated in each treatment and then find
   the intersection of pathways that move in the same or opposite
   directions [[41]12, [42]13]. GSEA can test for pathways that are
   significantly up or down regulated. However, typically, researchers
   using this or similar approaches do not attempt to determine the
   probability of this occurring by chance, bringing the interpretability
   and reproducibility of such results into question. Also, these types of
   approaches do not indicate the degree to which pathways are changed in
   similar or opposing ways. We are not aware of any methods that can
   specifically address these questions. Furthermore, a substantial
   limitation of similar approaches is the underlying assumption that
   biological pathways depend on the co-expression of genes in them. Some
   pathways likely function in this manner, and one may be able to detect
   sub-pathways with equivalent or inverse changes, but the results will
   likely be biased to simple pathways.

   For the case of drug-repurposing, Connectivity Mapping was introduced
   to address the need for detecting when genes are disrupted in similar
   ways. This approach assesses the correlation in ranked lists of genes,
   with the intent of identifying gene profiles for drugs that are
   correlated, or anti-correlated to a researcher’s own gene signature
   [[43]14], but this approach is not designed to identify specific
   biological functions that are similar across experiments. The same is
   true of other methods, such as openSesame [[44]15] or the extreme
   cosine method (XCos) [[45]16], which were developed later. Even for
   drug repurposing, this may be an important point. That is, it may be
   important to be similar only in terms of certain pathways but not
   others. Therefore, a method that can systematically identify pathways
   with similar (or inverted) perturbations could be of great use. In this
   work, we propose a novel functional genomics approach called Equivalent
   Change Enrichment Analysis (ECEA) that seeks to accomplish this goal.
   We further introduce a novel metric called the Equivalent Change Index
   (ECI), which plays a key role in our proposed methodology. There are
   many potential applications of the proposed methodology, including the
   ability to focus on genes that may be more directly relevant to the
   experimental question, drug screening for treatments that have similar
   effects, demonstrating the viability of a new mouse model, and many
   other potential uses.

Results

   We present two key developments for identifying biological pathways
   that exhibit equivalent or inverse changes across experiments and/or
   treatments: i) the Equivalent Change Index (ECI), and ii) Equivalent
   Change Enrichment Analysis (ECEA). The ECI is a measure that is
   calculated at the level of individual genes from gene expression assays
   that ranges from [− 1,1]. A value of 1 indicates that a gene was
   changed to the same degree by both treatments (e.g. a 2-fold change by
   each treatment compared to its respective control). A value of − 1
   indicates a gene was change in completely the opposite way (e.g.
   downregulated 2-fold by one treatment and upregulated 2-fold by the
   other). ECEA is a functional genomics approach that identifies pathways
   with a non-random distribution of equivalently or inversely changed
   genes.

   We evaluated our approach on three datasets: one simulated, one
   biological data set with expected inverse changes, and a second
   biological data set with expected equivalent changes.

   We benchmarked ECEA to the current approach that involves simply
   performing pathway enrichment analysis on the datasets from each
   treatment separately and then intersecting the results. We used both
   Gene Set Enrichment Analysis (GSEA) and over-representation analysis
   (ORA) for comparison. With GSEA, it is possible to get some idea of
   equivalent or inverse changes as well, because it tests for enrichment
   of up or downregulated genes.

Simulations

   The results of the simulations are shown for equivalent change in Fig.
   [46]1, and for inverse change in Fig. [47]2, using a probability of
   equivalent change or inverse change of 0.5 (i.e., each gene in a
   pathway chosen to have equivalent differential expression would have a
   probability of 0.5 of that equivalent change). Results at other
   probability levels are shown in the supplemental material. We did 100
   simulations for each set of parameters. Each simulation involved 1
   equivalently changed pathway, 1 inversely changed pathway, 1 pathway
   with differentially expressed genes but no relationship between
   experiments, and 7 pathways not affected by these simulated treatments.
   In each case, there were 5 samples for each treatment and 5 controls
   (N = 20). The results show the proportion of times the equivalent or
   inversely changed pathway was detected. The inversely changed pathway
   was more difficult to detect by GSEA, because creating inverse changes
   will also tend to increase the number of genes that are not regulated
   in the same direction, which is a limitation of GSEA. For most levels
   of probability of differential expression and across levels of
   symmetry, ECEA outperformed GSEA or ORA. For equivalently changed
   pathways, GSEA outperformed ECEA only for low levels of probability of
   differential expression (PDE) and when the symmetry was extreme. In
   Fig. [48]3, the false positive rate (FPR) for detecting the pathway
   with differential expression but without enforced equivalent or inverse
   change is shown. For all levels of probability of differential
   expression ORA had the lowest FPR, but it is also very low for ECEA.
   When the symmetry is .1 or .9 the FPR is nearly linear for FGSEA,
   meaning that the greater the probability of differential expression,
   the greater the likelihood of identifying a pathway as having
   equivalent change by chance, using this approach.

Fig. 1.

   [49]Fig. 1
   [50]Open in a new tab

   The proportion of equivalently changed pathways detected by each
   method, when the probability of equivalent change was 0.5. The x-axis
   displays the symmetry, which shows the probability of genes being
   up-regulated in the pathway as opposed to down-regulated (when they
   were differentially expressed). With a symmetry of 0.5, approximately
   half of the differentially expressed genes would be up-regulated and
   the rest would be down-regulated. The different lines show the
   sensitivity for different levels of probability of a gene being
   differentially expressed. The results are shown for (A) ECEA, (B) GSEA,
   and (C) ORA. For nearly all levels of probability of differential
   expression, ECEA was more sensitive than ORA. ECEA was also more
   sensitive than GSEA whenever the genes were not highly co-expressed
   (symmetry near .1 or .9)

Fig. 2.

   [51]Fig. 2
   [52]Open in a new tab

   The proportion of inversely changed pathways detected by each method,
   when the probability of inverse change was 0.5. The x-axis displays the
   symmetry, which shows the probability of genes being up-regulated in
   the pathway as opposed to down-regulated (when they were differentially
   expressed). With a symmetry of 0.5, approximately half of the
   differentially expressed genes would be up-regulated and the rest would
   be down-regulated. The different lines show the sensitivity for
   different levels of probability of a gene being differentially
   expressed. The results are shown for (A) ECEA, (B) GSEA, and (C) ORA.
   For nearly all levels of probability of differential expression, ECEA
   was the most sensitive. ECEA was also more sensitive than GSEA whenever
   the genes were not highly co-expressed (symmetry near .1 or .9)

Fig. 3.

   [53]Fig. 3
   [54]Open in a new tab

   The proportion of times a pathway with differentially expressed genes
   was erroneously identified as being equivalently or inversely enriched
   by method with probability of equivalent change at 0.5. The sub-figures
   show the results at (A) Symmetry = 0.1, (B) Symmetry = 0.5, and (C)
   Symmetry = 0.9

Glut4 data

   ECEA was run on the Glut4 data to determine pathways enriched for genes
   that are equivalently or inversely changed when Glut4 is knocked out or
   overexpressed in mice. The data were collected to determine the effect
   of Glut4 on insulin sensitivity in white adipose tissue. The log[2]
   fold change for each treatment vs. its respective control was
   calculated using the limma package for R [[55]17]. The log[2] fold
   change was used to calculate the ECI, which was then used to perform
   the ECEA. First, we performed this analysis on the Kyoto Encyclopedia
   of Genes and Genomes (KEGG), with a false discovery rate cut-off of
   0.25, which is the recommended threshold for GSEA [[56]2]. We found
   enrichment in equivalent or inverse change for 8 pathways using this
   approach. Of these, 5 were enriched with inversely changed genes and
   are listed in Table [57]1. It is important to note that the number of
   pathways equivalently or inversely changed do not represent a picture
   of overall equivalent or inverse change as pathways have many
   overlapping genes. For each enriched pathway, we have included the top
   5 genes in the results, which are the genes with the greatest inverse
   change for the respective pathway.

Table 1.

   ECEA identified inversely changed KEGG pathways in the Glut4 data
   Pathway FDR NES Size Top 5 Genes
   mmu00280 Valine, leucine and isoleucine degradation 1.85 × 10^− 2 −2.08
   34 Bckdha,Pccb,Acaa2,Mcee,Hmgcs1
   mmu00071 Fatty acid metabolism 2.44 × 10^− 1 −1.62 32
   Acaa2,Eci1,Echs1,Acsl4,Acsl1
   mmu00640 Propanoate metabolism 3.24 × 10^−2 −1.83 21
   Pccb,Mcee,Echs1,Aldh7a1,Aldh1a1
   Mmu04080 Neuroactive ligand-receptor interaction 2.44 × 10^−1 − 1.38
   177 Tshr,Sstr1,Pth1r,Vipr2,Fpr1
   mmu00310 Lysine degradation 2.44 × 10^−1 −1.64 25
   Echs1,Suv39h2,Aldh7a1,Aldh1a1,Ehmt2
   [58]Open in a new tab

   Next, we applied the GSEA intersection approach. This involves finding
   pathway enriched for differentially expressed genes in each experiment
   separately and then intersecting the results. Using GSEA, for the Glut4
   KO, there were 32 significantly enriched pathways. For the
   overexpressed data, there were 29. One way to search for inversely
   changed pathways would be to find those enriched in upregulated genes
   using GSEA in one dataset and enriched for downregulated genes in the
   other (for equivalent changes we would simply intersect those changed
   in the same direction). Between these two results, there were 11 shared
   pathways. Of these, 6 had an inverse relationship and 5 and an
   equivalently changed relationship. Using this approach there is no way
   to assess the statistical significance of this relationship, but the
   pathways with inverse regulation are shown in Table [59]2.

Table 2.

   GSEA identified inversely changed KEGG pathways in the Glut4 data
   Pathway
   mmu00280 Valine, leucine and isoleucine degradation
   mmu00071 Fatty acid metabolism
   mmu03320 PPAR signaling pathway
   mmu00640 Propanoate metabolism
   mmu04146 Peroxisome
   mmu04610 Complement and coagulation cascades
   [60]Open in a new tab

   Three of the pathways identified as being inversely regulated by ECEA
   were also identified by this GSEA approach. However, the total number
   of pathways available in the KEGG database is relatively limited at
   225. Therefore, we also tried these approaches using the larger
   Reactome database, which has 1647 total pathways, which might allow for
   a more granular picture. Using ECEA, we found 27 pathways with
   significant enrichment, eight of which were inversely enriched (Table
   [61]3).

Table 3.

   ECEA identified inversely changed Reactome pathways in the Glut4 data
   Pathways FDR NES Size Top 5 Genes
   GPCR ligand binding 1.72 × 10^− 1 −1.40 219 Tshr,Ccr7,Ece1,Sstr1,Pth1r
   Protein localization 2.29 × 10^−1 −1.52 68
   Dhrs4,Gstk1,Ech1,Slc25a17,Tysnd1
   SLC-mediated transmembrane transport 6.69 × 10^−2 −1.57 108
   Slc26a2,Slc31a1,Slc2a4,Slco3a1,Slc39a8
   Transport of inorganic cations/anions and amino acids/oligopeptides
   7.00 × 10^−2 −1.69 43 Slc26a2,Calm2,Slc7a8,Slc3a2,Slc4a8
   Peroxisomal protein import 1.15 × 10^−1 − 1.68 40
   Dhrs4,Gstk1,Ech1,Tysnd1,Pex5
   VEGFR2 mediated vascular permeability 1.29 × 10^− 1 −1.73 21
   Akt2,Calm2,Pak2,Ctnnb1,Nos3
   Transcriptional Regulation by E2F6 4.01 × 10^−2 −1.79 15
   Phc1,Rbbp4,Ezh2,Ehmt2,Suz12
   SHC-mediated cascade: FGFR2 2.03 × 10^−1 − 1.68 17
   Fgf4,Fgf8,Grb2,Fgf6,Fgf18
   [62]Open in a new tab

   Using the GSEA intersection approach, we found 14 pathways with inverse
   changes in regulation, two of which was also found by ECEA. These are
   shown in Table [63]4.

Table 4.

   GSEA identified inversely changed Reactome pathways in the Glut4 data
   Pathway
   Plasma lipoprotein assembly, remodeling, and clearance
   Condensation of Prophase Chromosomes
   Metabolism of vitamins and cofactors
   Protein localization
   Glucocorticoid biosynthesis
   Hemostasis
   Platelet activation, signaling and aggregation
   Peroxisomal protein import
   Platelet degranulation
   Response to elevated platelet cytosolic Ca2+
   Branched-chain amino acid catabolism
   Regulation of Tp53 Degradation
   Regulation of Tp53 Expression and Degradation
   Laminin interactions
   [64]Open in a new tab

   An additional 31 pathways were identified using GSEA with the
   intersection approach that were enriched for differential expression
   with equivalent directional change in regulation.

   It is worth noting that ECEA looks for enrichment in equivalent or
   inverse changes in the same genes across treatments, while the
   intersection approach will simply find overall changes in genes in the
   pathway that tend to be in the same direction. This might also be a
   useful thing to do, but the goals are slightly different. However, with
   the overlap approach, there is no indication as to whether the
   functional impact is likely to be similar, such as there is for ECEA
   (i.e. different parts of a complex pathway might be affected and thus
   not result in similar functional impacts).

   Figure [65]4 shows part of the VEGFR2 mediated vascular permeability
   pathway from Reactome. On the left, we can see the effect of the Glut4
   knockout compared to the controls, on the right the effect of Glut4
   overexpression. This pathway was identified by ECEA as inversely
   changed but not by GSEA. The figure illustrates a possible explanation
   why, due to the assumptions in GSEA about co-expression. There are
   clear inverse changes in Nos3, Akt2, Calm2, and Pdpk1 but some of these
   genes are upregulated and some downregulated in each treatment.

Fig. 4.

   [66]Fig. 4
   [67]Open in a new tab

   Differentially expressed genes in part of the VEGFR mediated vascular
   permeability pathway from Reactome. There are clear inverse changes
   between the two treatments pictured and, importantly, not all genes
   with inverse changes show the same direction in regulation (some are
   upregulated and others downregulated). (A) The log[2]-fold change in
   expression between Glut4 knockout and control. (B) The log[2]-fold
   change in expression between Glut4 overexpressed and control

Antidepressant data

   We analyzed the antidepressant data in much the same way as the Glut4
   data, except in these data we expected to capture equivalent changes.
   These data were collected to investigate the effect of two different
   antidepressant drugs, ketamine and imipramine, on a mouse model of
   depression. For the KEGG pathways, in this case, there were 6 pathways
   with significant enrichment for equivalent change across treatments
   (Table [68]5) and none for inverse change.

Table 5.

   ECEA identified equivalently changed KEGG pathways in the
   antidepressant data
   Pathway FDR NES Size Top 5 Genes
   mmu03010 Ribosome 1.01 × 10^−2 1.52 95 Rpl7a,Rpsa,Rpl34,mt-Rnr1,Rplp1
   mmu00230 Purine metabolism 2.12 × 10^−1 1.22 144
   Ada,Nt5e,Adcy4,Ak7,Pde1c
   mmu00500 Starch and sucrose metabolism 2.12 × 10^−1 1.47 22
   Gaa,Amy1,Gys1,Pgm1,Pgm2
   mmu00250 Alanine, aspartate and glutamate metabolism 2.12 × 10^−1 1.44
   28 Gad2,Nit2,Aldh4a1,Gad1,Ass1
   mmu04080 Neuroactive ligand-receptor interaction 5.36 × 10^−2 1.24 185
   Mc3r,Sstr5,Crhr2,Glp1r,Gabrq
   mmu00340 Histidine metabolism 1.01 × 10^−2 1.67 20
   Aldh3b1,Aldh7a1,Hdc,Aldh3a1,Ddc
   [69]Open in a new tab

   When we performed the GSEA intersection approach, we found 2 pathways
   that were significantly enriched after treatment with either drug and
   both had changes in the same direction (Table [70]6). Both pathways
   were also identified by the ECEA approach.

Table 6.

   GSEA identified equivalently changed KEGG pathways in the
   antidepressant data
   Pathway
   mmu03010 Ribosome
   mmu04080 Neuroactive ligand-receptor interaction
   [71]Open in a new tab

   Next, we applied ECEA, using these data, to the Reactome database. The
   results are shown in Table [72]7. A total of 17 pathways were found to
   be enriched for equivalent changes across the two experiments and none
   for inverse changes.

Table 7.

   ECEA identified equivalently changed Reactome pathways in the
   antidepressant data
   Pathway FDR NES Size Top 5 Genes
   Regulation of Insulin-like Growth Factor (IGF) transport and uptake by
   Insulin-like Growth Factor Binding Proteins (IGFBPs) 2.76 × 10^−2 1.38
   78 Gpc3,Trf,Scg3,Penk,F5
   Post-translational protein phosphorylation 3.02 × 10^−2 1.38 73
   Gpc3,Trf,Scg3,Penk,F5
   GPCR ligand binding 1.14 × 10^−1 1.22 200 Mc3r,Sstr5,Crhr2,Glp1r,Nts
   L13a-mediated translational silencing of Ceruloplasmin expression
   8.07 × 10^−3 1.39 100 Rpsa,Rpl34,Rplp1,Rps29,Rpl8
   Eukaryotic Translation Initiation 8.07 × 10^− 3 1.35 108
   Rpsa,Rpl34,Rplp1,Rps29,Rpl8
   Formation of a pool of free 40S subunits 8.07 × 10^−3 1.45 90
   Rpsa,Rpl34,Rplp1,Rps29,Rpl8
   GTP hydrolysis and joining of the 60S ribosomal subunit 1.48 × 10^−2
   1.38 101 Rpsa,Rpl34,Rplp1,Rps29,Rpl8
   Cap-dependent Translation Initiation 8.07 × 10^−3 1.35 108
   Rpsa,Rpl34,Rplp1,Rps29,Rpl8
   Translation 3.02 × 10^−2 1.23 211 Mrpl10,Mrps16,Rpsa,Rpl34,Rplp1
   SRP-dependent cotranslational protein targeting to membrane
   8.07 × 10^−3 1.53 81 Rpsa,Rpl34,Rplp1,Rps29,Rpl8
   Major pathway of rRNA processing in the nucleolus and cytosol
   8.07 × 10^−3 1.36 156 Exosc10,Rpsa,Rpl34,Rplp1,Rps29
   rRNA processing 8.07 × 10^−3 1.36 156 Exosc10,Rpsa,Rpl34,Rplp1,Rps29
   rRNA processing in the nucleus and cytosol 8.07 × 10^−3 1.36 156
   Exosc10,Rpsa,Rpl34,Rplp1,Rps29
   Nonsense-Mediated Decay (NMD) 8.07 × 10^−3 1.44 102
   Upf2,Rpsa,Rpl34,Rplp1,Rps29
   Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex
   (EJC) 8.07 × 10^−3 1.51 83 Rpsa,Rpl34,Rplp1,Rps29,Rpl8
   Nonsense Mediated Decay (NMD) enhanced by the Exon Junction Complex
   (EJC) 8.07 × 10^−3 1.44 102 Upf2,Rpsa,Rpl34,Rplp1,Rps29
   Sulfur amino acid metabolism 8.07 × 10^−3 1.74 19
   Slc25a10,Gm4737,Ahcy,Cdo1,Adi1
   [73]Open in a new tab

   Finally, we used the GSEA intersection approach to examine these same
   data. A total of 20 pathways were found to be enriched for
   differentially expressed genes and regulated in the same direction by
   both drugs. These are shown in Table [74]8. Eight of these pathways
   were also identified by ECEA.

Table 8.

   GSEA identified equivalently changed pathways in the antidepressant
   data
   Pathway
   Signaling by GPCR
   Class A/1 (Rhodopsin-like receptors)
   Peptide ligand-binding receptors
   GPCR downstream signaling
   G alpha (i) signalling events
   GPCR ligand binding
   L13a-mediated translational silencing of Ceruloplasmin expression
   Formation of a pool of free 40S subunits
   GTP hydrolysis and joining of the 60S ribosomal subunit
   G alpha (q) signalling events
   G alpha (s) signalling events
   SRP-dependent cotranslational protein targeting to membrane
   Nonsense-Mediated Decay (NMD)
   Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex
   (EJC)
   Nonsense Mediated Decay (NMD) enhanced by the Exon Junction Complex
   (EJC)
   Protein-protein interactions at synapses
   Dopamine Neurotransmitter Release Cycle
   Serotonin Neurotransmitter Release Cycle
   Glutamate Neurotransmitter Release Cycle
   Synaptic adhesion-like molecules
   [75]Open in a new tab

Discussion

   In this work we have presented a new approach to functional genomic
   analysis that can identify key biological pathways that are changed in
   similar (equivalent) or opposing (inverse) ways across diverse
   experiments. Due to our unique approach, data collected at different
   times, by different groups can be used, because there is no comparison
   of gene expression values, just the effect sizes, and there is no
   dependence on exact estimates of those effects. This has the potential
   to allow researchers to capitalize on publicly available data in new
   ways. However, it works just as well when two treatments are run as
   part of the same project. Equivalent change enrichment analysis (ECEA)
   allows researchers to determine the specific pathways and functions
   that are regulated in close to the same fashion by different
   treatments, or pathways for which changes can be reversed by one
   treatment compared to another.

   As we demonstrate, a similar type of analysis can be done by
   intersecting the results of two separate enrichment analyses, and this
   is undoubtedly a useful technique. However, such approaches cannot
   determine if a pathway is changed in the same way (i.e. the same parts
   of the pathway) or to the same extent. For larger pathways, the
   differences may be critical.

   On the simulation data, all approaches were able to capture a useful
   proportion of equivalently and inversely changed pathways in most
   situations. However, ECEA typically had the most consistent
   performance. Also, for the GSEA intersection approach, a number of its
   results are likely to be false positives, particularly if there are
   many genes that are differentially expressed. For pathways with mostly
   co-expressed genes, the GSEA intersection approach may have somewhat
   more power than ECEA, however, this approach will be unable to identify
   pathways with equivalent or inverse changes when there are genes that
   are both up and down-regulated by a treatment. It is important to note
   that this is not a limitation of GSEA, given it is its intended
   behavior. It is simply a limitation of applying GSEA in this context.

   Our approach is invariant to the direction of change in gene
   expression, because we are determining enrichment in similar or inverse
   changes across experiments. Therefore, a change can be equivalent for
   multiple genes, even if they are up or down-regulated in the same
   pathway. Furthermore, our ECEA approach outlined here can calculate the
   statistical significance of enrichment in genes that are dysregulated
   in similar or opposing ways across experiments. This should be
   particularly useful when applied to larger pathways, because enrichment
   will only be found when the same parts of the pathway are affected
   similarly (rather than a similar trend in expression on average for the
   pathway overall).

   One potential limitation of the ECEA approach is the somewhat
   restrictive assumption that equivalent change in a pathway means that
   the same genes change to the same degree by different treatments. This
   will undoubtedly be more or less useful of an assumption in different
   contexts and should be kept in mind when using our approach.

   In the Glut4 dataset, both ECEA and GSEA identified the “mmu00280
   Valine, leucine and isoleucine degradation” KEGG pathway as one with
   inverse changes. Indeed, prior research has shown that increases in
   circulating branched chain amino acids are associated with insulin
   resistance in obese patients [[76]18], which is relevant for this
   dataset investigating the effect of Glut4 on insulin sensitivity in
   white adipose tissue. ECEA also identified the “VEGFR2 mediated
   vascular permeability” Reactome pathway as having inverse changes
   across the treatments. Interestingly, this pathway is specifically
   related to the experimental question, as research has shown that VEGFR2
   can modulate insulin sensitivity in white adipose tissue [[77]19].
   However, the GSEA intersection approach identified the “Response to
   elevated platelet cytosolic Ca2+” Reactome pathway as having inverse
   changes, and Ca2+ has also been linked to regulation of insulin
   signaling [[78]20]. Thus, the ECEA approach has some statistical
   advantages, but there are specific circumstances for which the GSEA
   intersection approach will work well. It is seldom the case in
   functional genomics that a single method can be claimed to have every
   advantage. Also, our ECEA approach will identify when pathways are
   change in the same or specifically opposing ways (i.e. same genes)
   while the intersection approach will more generally identify pathways
   that experience overall changes in gene expression that are similar or
   opposing. Depending on one’s needs, this is a key difference that
   should be kept in mind. Nevertheless, these results indicate our
   approach can at least identify inversely changed pathways across
   treatments that are relevant to the target disease, and importantly,
   assign a statistical significance to the results.

   In the antidepressant data, both ECEA and GSEA identified equivalent
   regulation by antidepressants of ribosome-related genes and indeed this
   association has been observed in patients with depression compared to
   healthy controls [[79]21]. This suggests that both ketamine and
   imipramine have similar influence on the regulation of genes involved
   that might serve as biomarkers of depression and suggest the utility of
   our approach in a precision medicine context. ECEA identified the
   “Regulation of Insulin-like Growth Factor (IGF) transport and uptake by
   Insulin-like Growth Factor Binding Proteins (IGFBPs)” Reactome pathway
   as equivalently changed by ketamine and imipramine and there is
   research linking this pathway with depression [[80]22]. However, the
   GSEA approach identified the “Glutamate Neurotransmitter Release Cycle”
   Reactome pathway as equivalently changed and this pathway has also been
   linked to depression [[81]23]. Thus, we can see that both approaches
   can lead to biologically meaningful, yet different, results. Although
   we highlight these specific results as examples, there are other
   pathways that seem to have a direct connection to both datasets
   identified by both methods.

   The results for the antidepressant data are particularly exciting,
   because they demonstrate an important use case for our approach. Data
   for a new drug can be collected and commonalities in functional effects
   in comparison with existing drugs can be predicted, using a model
   organism. This has clear implications for the field of drug
   repositioning. One could imagine inverse enrichment could play a
   similarly important role, by allowing the prediction of a drug
   reversing the changes in genes of pathways disrupted in a disease.
   Although the GSEA intersection approach will work, the ECEA approach
   will particularly identify pathways where the changes are equivalent or
   inverted at the gene level within the pathway, which may be more useful
   when considering targeted treatments.

Conclusions

   ECEA is not a general-purpose functional genomics approach that will
   supplant existing computational functional genomics methods. Rather, we
   have demonstrated that it is a useful new tool that can allow
   researchers to garner relevant new insights into certain kinds of data.
   It allows for statistical rigor to be brought to research questions
   that are already being investigated in other ways, and potentially
   opens new avenues of inquiry.

Methods

Equivalent change index

   Our goal is to be able to identify genomic pathways that are changed
   equivalently or inversely given two sets of experiments, each with a
   treatment and control. As a first step, we need a metric for the degree
   of equivalent change for a single gene. With such a metric, we can then
   search for pathways that have an unusual degree of equivalently or
   inversely changed genes. Therefore, we introduce the idea of an
   Equivalent Change Index (ECI).

   Let
   [MATH: <msub><mover accent="true"><mi>β</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>i</mi><msub><mi>j</mi><mn>1</mn
   ></msub></mrow></msub> :MATH]
   and
   [MATH: <msub><mover accent="true"><mi>β</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>i</mi><msub><mi>j</mi><mn>2</mn
   ></msub></mrow></msub> :MATH]
   be the estimated effect sizes (ES) of two experiments for gene i on
   experiment j. The ES could be a log[2] fold change, a standardized mean
   difference, or a simple mean difference. Then we define the equivalent
   change index (ECI) for gene i to be:
   [MATH: <mi
   mathvariant="italic">EC</mi><msub><mi>I</mi><mi>i</mi></msub><mo>=</mo>
   <mfrac><mrow><mo>sign</mo><mfenced close=")"
   open="("><mrow><msub><mover accent="true"><mi>β</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>i</mi><msub><mi>j</mi><mn>1</mn
   ></msub></mrow></msub><mo>×</mo><mo>−</mo><msub><mover
   accent="true"><mi>β</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>i</mi><msub><mi>j</mi><mn>2</mn
   ></msub></mrow></msub></mrow></mfenced><mfenced close="}"
   open="{"><mrow><mo>max</mo><mfenced close=")" open="("
   separators=","><mfenced close="|" open="|"><msub><mover
   accent="true"><mi>β</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>i</mi><msub><mi>j</mi><mn>1</mn
   ></msub></mrow></msub></mfenced><mfenced close="|"
   open="|"><msub><mover accent="true"><mi>β</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>i</mi><msub><mi>j</mi><mn>2</mn
   ></msub></mrow></msub></mfenced></mfenced><mo>−</mo><mfenced close=")"
   open="("><mrow><mfenced close="|" open="|"><msub><mover
   accent="true"><mi>β</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>i</mi><msub><mi>j</mi><mn>1</mn
   ></msub></mrow></msub></mfenced><mo>+</mo><mfenced close="|"
   open="|"><msub><mover accent="true"><mi>β</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>i</mi><msub><mi>j</mi><mn>2</mn
   ></msub></mrow></msub></mfenced></mrow></mfenced></mrow></mfenced></mro
   w><mrow><mo>max</mo><mfenced close=")" open="(" separators=","><mfenced
   close="|" open="|"><msub><mover accent="true"><mi>β</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>i</mi><msub><mi>j</mi><mn>1</mn
   ></msub></mrow></msub></mfenced><mfenced close="|"
   open="|"><msub><mover accent="true"><mi>β</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>i</mi><msub><mi>j</mi><mn>2</mn
   ></msub></mrow></msub></mfenced></mfenced></mrow></mfrac> :MATH]

   This is simply the ratio of the smaller ES to the larger ES in terms of
   absolute value of both effects and with a sign reflecting whether the
   effects were in the same direction (positive) or opposite directions
   (negative). Thus, ECI[i] ∈ [−1, 1]. An ECI[i] of 1 means that the ES
   was exactly the same for gene i in both experiments. Likewise, an
   ECI[i]of −1 means the ES was exactly opposite in both experiments.
   Therefore, ECI[i] indicates either the degree of equivalence or
   inverseness for a gene in one experiment compared to a separate
   experiment, depending on its sign.

Pathway-level equivalent and inverse change

   The ECI gives us a way to find pathways enriched for genes changed in
   equivalent or inverse ways across experiments. A high ECI indicates
   equivalent change, not the directionality of the change. Therefore, if
   some genes in a pathway are up-regulated and others down-regulated by
   one treatment, but the reverse happens in a second treatment, the ECI
   will be low for all genes in the pathway. This is a critical property
   for our approach.

   As an example, we can consider an experiment to determine the genomic
   influence of Drug B. In particular, we want to know biological
   functions for which Drug B has similar effects as Drug A. We measure
   gene expression for patients on Drug B and a control. We already have
   similar data for Drug A. Therefore, we calculate the ECI for Drug B and
   A on each gene. Next, we rank all genes by their ECI. We can now
   consider a hypothetical functional pathway, with a set of genes P. The
   full list of genes in the experiment is set G. The function g(x) yields
   the index of a gene in G with rank x, with genes ranked by ECI. We
   further define S = {i| G[i] ∈ P} and R = {i| G[i] ∉ P}, where S is the
   indices of the genes in the pathway, and R is the indices of the genes
   not in the pathway. Now, we will consider a gene of rank x. Suppose:
   [MATH: <msub><mi>I</mi><mi>h</mi></msub><mo>=</mo><mfenced close=""
   open="{"><mtable><mtr><mtd><maligngroup></maligngroup><mn>1</mn><mspace
   width="0.25em"></mspace><mtext mathvariant="italic">if</mtext><mspace
   width="0.25em"></mspace><mi>g</mi><mfenced close=")"
   open="("><mi>h</mi></mfenced><mo>∈</mo><mi>S</mi></mtd></mtr><mtr><mtd>
   <maligngroup></maligngroup><mn>0</mn><mspace
   width="0.25em"></mspace><mtext mathvariant="italic">if</mtext><mspace
   width="0.25em"></mspace><mi>g</mi><mfenced close=")"
   open="("><mi>h</mi></mfenced><mo>∈</mo><mi>R</mi></mtd></mtr></mtable><
   /mfenced> :MATH]
   [MATH: <msub><mi>F</mi><mi>R</mi></msub><mfenced close=")"
   open="("><mi>x</mi></mfenced><mo>=</mo><mfrac><mn>1</mn><mrow><mo>∣</mo
   ><mi>R</mi><mo>∣</mo></mrow></mfrac><msubsup><mo>∑</mo><mrow><mi>h</mi>
   <mo>=</mo><mn>1</mn></mrow><mi>x</mi></msubsup><mn>1</mn><mo>−</mo><msu
   b><mi>I</mi><mi>h</mi></msub> :MATH]

   where ∣R∣ represents the number of genes not in the pathway under
   consideration. This gives us the proportion of genes that are not in
   the pathway and have a rank of x or less. We also have:
   [MATH: <msub><mover accent="true"><mi mathvariant="italic">ECI</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>g</mi><mfenced close=")"
   open="("><mi>h</mi></mfenced></mrow></msub><mo>=</mo><msub><mi
   mathvariant="italic">ECI</mi><mrow><mi>g</mi><mfenced close=")"
   open="("><mi>h</mi></mfenced></mrow></msub><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><mo>max</mo><mfenced close=")"
   open="("
   separators=","><msub><mi>p</mi><mrow><mn>1</mn><mo>,</mo><mi>g</mi><mfe
   nced close=")"
   open="("><mi>h</mi></mfenced></mrow></msub><msub><mi>p</mi><mrow><mn>2<
   /mn><mo>,</mo><mi>g</mi><mfenced close=")"
   open="("><mi>h</mi></mfenced></mrow></msub></mfenced></mrow></mfenced>
   :MATH]

   where p[1, g(h)] and p[2, g(h)] are the p-values for the test used to
   determine the effect sizes for g(h) in experiments 1 and 2. This yields
   a weighted ECI, based on the maximum p-value of the ES from each
   dataset for a given gene. This weight is useful to separate genes with
   a high or low ECI based on our confidence in their differential
   expression results.

   Next, suppose:
   [MATH: <msub><mi>F</mi><mi>S</mi></msub><mfenced close=")"
   open="("><mi>x</mi></mfenced><mo>=</mo><mfrac><mrow><msubsup><mo>∑</mo>
   <mrow><mi>h</mi><mo>=</mo><mn>1</mn></mrow><mi>x</mi></msubsup><msub><m
   i>I</mi><mi>h</mi></msub><mspace width="0.25em"></mspace><msup><mfenced
   close="|" open="|"><msub><mover accent="true"><mi>λ</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>g</mi><mfenced close=")"
   open="("><mi>h</mi></mfenced></mrow></msub></mfenced><mi>ω</mi></msup><
   /mrow><mrow><msubsup><mo>∑</mo><mrow><mi>h</mi><mo>=</mo><mn>1</mn></mr
   ow><mrow><mfenced close="|"
   open="|"><mi>R</mi></mfenced><mo>+</mo><mo>∣</mo><mi>S</mi><mo>∣</mo></
   mrow></msubsup><msub><mi>I</mi><mi>h</mi></msub><mspace
   width="0.25em"></mspace><msup><mfenced close="|" open="|"><msub><mover
   accent="true"><mi>λ</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>g</mi><mfenced close=")"
   open="("><mi>h</mi></mfenced></mrow></msub></mfenced><mi>ω</mi></msup><
   /mrow></mfrac> :MATH]

   When ω = 0, this gives the proportion of genes that are in the pathway
   and have a rank of x or less. When ω ≠ 0, we will get a weighted ratio,
   depending on
   [MATH: <msub><mover accent="true"><mi>λ</mi><mo
   stretchy="true">^</mo></mover><mrow><mi>g</mi><mfenced close=")"
   open="("><mi>h</mi></mfenced></mrow></msub> :MATH]
   , which we will discuss more in a minute.

   Now, we can determine a quantity D:
   [MATH:
   <mi>D</mi><mo>=</mo><munder><mo>sup</mo><mi>x</mi></munder><mfenced
   close="|" open="|"><mrow><msub><mi>F</mi><mi>S</mi></msub><mfenced
   close=")"
   open="("><mi>x</mi></mfenced><mo>−</mo><msub><mi>F</mi><mi>R</mi></msub
   ><mfenced close=")" open="("><mi>x</mi></mfenced></mrow></mfenced>
   :MATH]

   When ω = 0, this D is the Kolmogorov-Smirnov statistic. Therefore, we
   could use it for a hypothesis test of whether the distribution of ranks
   is different for a particular pathway vs. all other pathways.
   Unfortunately, this is problematic. The K-S test depends on an
   assumption of independence, which gene expression data cannot claim.
   The K-S test has been shown to be sensitive to violations of this
   assumption, so it is important to consider. Therefore, we will set
   ω = 1, which will provide a weighted Kolmogorov-Smirnov statistic. Note
   that this is exactly the approach taken by GSEA. The difference is, we
   have substituted the ECI for the local statistic used by GSEA (which is
   based on effect size) [[82]2]. Thus, F[S](x) is adjusted for
   correlation between genes that is not associated with the treatments
   (this assumes that genes in a pathway would tend be correlated),
   because it is higher when the genes tend to be more equivalently
   expressed as the result of two treatments and is not a simple
   proportion. D is nevertheless still dependent on the size of the gene
   set. Therefore, we will scale D, getting
   [MATH: <mover accent="true"><mi>D</mi><mo
   stretchy="true">^</mo></mover> :MATH]
   , and perform permutation testing for enrichment in the same manner as
   GSEA, by using the fgsea package for R. We call this approach
   equivalent change enrichment analysis or ECEA.

   One convenient aspect of the statistic
   [MATH: <mover accent="true"><mi>D</mi><mo
   stretchy="true">^</mo></mover> :MATH]
   , is that it represents a directional effect size for the entire
   pathway. Therefore, it can be used to judge the overall equivalent or
   inverse change of genes in a pathway, which is a particularly useful
   result for functional genomics. Unlike in standard GSEA, ECEA makes no
   assumption about the directionality of the change for gene effects.
   That is, genes that are up-regulated across treatments receive a high
   ECI, as do genes that are down-regulated across treatments. Therefore,
   there is no implicit assumption of co-expression for a pathway, meaning
   that ECEA can be used to investigate a wider array of pathways than
   typical GSEA, although for an entirely different use case.

Data

   In order to assess our method in both controlled and realistic
   conditions, we examined the performance of our approach using both
   simulated and biological data.

Simulation

   The simulated gene expression data were created using an approach
   similar to the one created by Dembele (2013) [[83]24]. This approach
   simulates correlation structure between genes, like might occur in a
   biological pathway in real data and allows us to simulate different
   treatments that affect those simulated pathways to varying degrees. We
   modified the approach to create datasets in which genes perturbed by
   one treatment have a chance of being similarly perturbed (or inversely)
   by a second treatment.

   We simulated 72,900 gene expression data sets, with correlation
   structure (the reason for this number is explained shortly). Each
   dataset was constructed from different runs of the algorithm, using
   different parameters for the proportion of equivalent or inversely
   changed genes and then combined, each subset thus representing a
   pathway, with its own correlation structure. Thus, we are making the
   simplifying assumption that there is no correlation between pathways in
   the simulation and there are no overlapping genes between pathways. For
   each dataset, we created seven pathways with no treatment effect (to
   provide a background), one pathway with equivalent change, one with
   inverse change, and finally a pathway without equivalent or inverse
   change but that was still enriched for differentially expressed genes.
   Varying the probability of genes being differentially expressed we ran
   100 simulations at each probability level (from 0.1 to 0.9 in
   increments of 0.1). We also varied the symmetry of changes in a pathway
   (from 0.1 to 0.9 in increments of 0.1), and the probability of
   equivalent or inverse change (from 0.1 to 0.9). By symmetry, we mean
   proportion that were up vs down regulated. Thus, a symmetry of .5 means
   there was an equal probability of up vs. down regulation for
   differentially expressed genes. For each resulting dataset, we then
   calculated the number of times each equivalently or inversely changed
   pathway was detected by our method and calculated the number of true
   positives (TP), false positives (FP), true negatives (TN), and false
   negatives (FN), which were used to calculate other statistics, such as
   the sensitivity. We applied ECEA, GSEA, and ORA to these data and
   determined sensitivity and false positive rate for the recovery of
   pathways with varying levels of equivalent or inversely changed genes.

Biological data

   The first biological dataset was created by Kraus, et al. [[84]25], to
   study the effect of the Glut4 gene in adipose tissue on insulin
   sensitivity in a mouse model. These data were specifically created to
   have opposing effects and represent a dataset with expected inverse
   changes between two treatments. These samples are available in the
   National Center for Biotechnology Information’s (NCBI’s) Gene
   Expression Ominbus (GEO) [[85]26, [86]27] under accession [87]GSE35378.
   This study involved 12 mice: 3 were adipose-Glut4−/−, 3 were aP2-Cre
   transgenic mice (controls for the Glut4−/−), 3 were adipose-Glut4-Tg
   mice with Glut4 transgenically overexpressed, and 3 were FVB mice
   (controls for the adipose-Glut4-Tg mice). Gene expression was assayed
   using the Affymetrix Murine Genome U74A Version 2 Array. These were
   background subtracted and normalized using the rma function of the
   oligo package [[88]28] for the R statistical environment [[89]29].
   Differential expression was assessed using the limma package [[90]17]
   for R.

   The second biological dataset was created by Bagot, et al. [[91]30] to
   investigate the effect of two antidepressants on the transcriptome in a
   mouse model of depression. Therefore, we expect to find some equivalent
   changes in this dataset. Gene expression was assayed using RNA
   sequencing on the Illumina HiSeq 2500 platform. Two drugs were
   examined, ketamine and imipramine, and various brain regions were
   examined. We limited our analysis to mice that were susceptible to
   depression and only used samples from the prefrontal cortex (PFC), in
   order to control confounding and because PFC had the greatest number of
   these samples available. Differential expression was assessed using the
   DESeq2 [[92]31] package for R.

   For the biological data, we do not have a ground truth. Here, our focus
   will be on determining whether our approach can detect equivalent or
   inverse changes that appear to make sense given the models and
   treatments. For each dataset, we will be examining the ability of ECEA
   and GSEA to identify disease relevant pathways.

Supplementary information

   [93]12864_2020_6589_MOESM1_ESM.pdf^ (874.6KB, pdf)

   Additional file 1: Supplemental Material: contains additional figures
   showing the results of the simulation when using varying probabilities
   of equivalent or inverse change.

Acknowledgements