Abstract

   Pathway enrichment analysis is a ubiquitous computational biology
   method to interpret a list of genes (typically derived from the
   association of large-scale omics data with phenotypes of interest) in
   terms of higher-level, predefined gene sets that share biological
   function, chromosomal location, or other common features. Among many
   tools developed so far, Gene Set Enrichment Analysis (GSEA) stands out
   as one of the pioneering and most widely used methods. Although
   originally developed for microarray data, GSEA is nowadays extensively
   utilized for RNA-seq data analysis. Here, we quantitatively assessed
   the performance of a variety of GSEA modalities and provide guidance in
   the practical use of GSEA in RNA-seq experiments. We leveraged
   harmonized RNA-seq datasets available from The Cancer Genome Atlas
   (TCGA) in combination with large, curated pathway collections from the
   Molecular Signatures Database to obtain cancer-type-specific target
   pathway lists across multiple cancer types. We carried out a detailed
   analysis of GSEA performance using both gene-set and phenotype
   permutations combined with four different choices for the
   Kolmogorov-Smirnov enrichment statistic. Based on our benchmarks, we
   conclude that the classic/unweighted gene-set permutation approach
   offered comparable or better sensitivity-vs-specificity tradeoffs
   across cancer types compared with other, more complex and
   computationally intensive permutation methods. Finally, we analyzed
   other large cohorts for thyroid cancer and hepatocellular carcinoma. We
   utilized a new consensus metric, the Enrichment Evidence Score (EES),
   which showed a remarkable agreement between pathways identified in TCGA
   and those from other sources, despite differences in cancer etiology.
   This finding suggests an EES-based strategy to identify a core set of
   pathways that may be complemented by an expanded set of pathways for
   downstream exploratory analysis. This work fills the existing gap in
   current guidelines and benchmarks for the use of GSEA with RNA-seq data
   and provides a framework to enable detailed benchmarking of other
   RNA-seq-based pathway analysis tools.

1 Introduction

   Since its inception two decades ago, pathway enrichment analysis has
   grown in popularity to become an essential data analysis tool in
   omics-based studies. By casting gene-level measurements in a broader
   biological context, this approach allows researchers to interpret their
   data in terms of a great variety of gene sets that may represent
   different functions, processes, components, or associations with
   disease. The success of pathway enrichment analysis, however, led to a
   very complex state of affairs with more than 70 different methods and
   hundreds of variants published to date [[26]1–[27]4] compounded by more
   than 33,000 human gene sets available from the Molecular Signatures
   Database (MSigDB) [[28]5] alone, a scenario that has become extremely
   difficult to navigate.

   Although pathway enrichment analysis methods share a common goal, their
   approach and underlying statistical models vary substantially. Existing
   methods have been categorized in various ways; for the purpose of our
   work, we are primarily concerned with over-representation analysis
   (ORA), whose input is an unranked list of genes (selected from the full
   list of measured genes by imposing a threshold criterion), and
   functional class scoring (FCS), whose input is the full list of
   measured genes (ranked based on quantitative scores) [[29]6]. In recent
   years, several groups have discussed and implemented benchmarks to
   compare findings from different tools as well as to evaluate multiple
   factors that affect the results of enrichment analysis [[30]1,
   [31]7–[32]15]. Typically, these benchmarking efforts were aimed at
   comparing many different tools belonging to one or more broad
   categories, such as ORA and FCS. In contrast, our aim is to focus on
   the pioneering FCS-type method known as Gene Set Enrichment Analysis
   (GSEA) [[33]16, [34]17], which arguably remains the most highly cited
   and widely used pathway analysis method, thereby comprehensively and
   thoroughly exploring its performance under different modalities by
   means of high-quality, curated RNA-seq datasets and pathway
   collections. Similarly to other early-generation pathway analysis
   methods, GSEA was originally developed for use with microarray data,
   but nowadays it is extensively used for analysis of RNA-seq data.
   Different omics technologies, however, are known to introduce a variety
   of potential sources of bias that may confound functional enrichment
   analysis [[35]18, [36]19]. Thus, here we aim to assess GSEA’s
   performance specifically using the type of data on which GSEA is most
   commonly being currently utilized.

   To that end, we leveraged an approach first proposed by Tarca et al.
   [[37]7] and followed by several studies [[38]1, [39]8, [40]12, [41]14,
   [42]15], which built a set of positive-control pathways using
   annotations that matched the conditions under study. In our approach,
   we scanned 11,274 harmonized RNA-seq samples across 33 cancer types
   from The Cancer Genome Atlas (TCGA) [[43]20]. We applied filters to
   remove technical sources of variability and, in order to control for
   intertumor transcriptional heterogeneity, we focused on TCGA projects
   with available paired primary-tumor / non-tumor tissue samples. The
   final analysis involved 1,219 paired samples from 604 subjects across
   12 cancer types. To select positive-control pathways, we scanned 33,591
   gene sets from MSigDB (which includes KEGG [[44]21], Reactome [[45]22],
   Gene Ontology [[46]23], and many other public resources) and
   implemented a sequence of filters leading to a total of 253 pathways
   associated to those 12 cancer types. GSEA was run using the latest
   available version 4.3.2 (build 13, October 2022) [[47]24]. We generated
   random gene set ensembles to quantitatively assess the sensitivity and
   specificity of different GSEA modalities. Finally, we analyzed other
   large cohorts (a thyroid cancer study consisting of 780 paired
   tumor/nontumor RNA-seq tissue samples from 390 subjects [[48]25] and a
   hepatocellular carcinoma study consisting of 140 paired tumor/nontumor
   RNA-seq tissue samples from 70 subjects [[49]26]). By comparing results
   from these cohorts and their TCGA counterparts, we propose a new
   consensus metric termed Enrichment Evidence Score (EES) that allows the
   identification of a core set of pathways (those showing a high degree
   of consensus across enrichment metrics) complemented by an expanded set
   of pathways for exploratory analysis. Moreover, we show how the EES
   metric may be utilized to identify high-consensus leading-edge genes
   and integrate pathway- and gene-level information.

   In summary, our primary goal is to leverage large, harmonized RNA-seq
   data repositories and large, curated pathway repositories in order to
   quantitatively assess the performance of a variety of GSEA modalities
   and provide guidance in the practical use of GSEA for the analysis of
   RNA-seq data. Moreover, a secondary goal of this study is to set up a
   framework capable of benchmarking other RNA-seq-based pathway analysis
   tools. To enable such secondary aim, we made all source code openly and
   publicly available along with detailed step-by-step documentation
   [[50]27].

2 Methods

2.1 RNA-seq datasets

2.1.1 The Cancer Genome Atlas (TCGA)

   From NCI’s Genomic Data Commons portal [[51]28] we downloaded all of
   TCGA’s gene expression quantification files and associated metadata. In
   total, these are 11,274 harmonized RNA-seq expression data files across
   33 cancer types. In order to control for intertumor transcriptional
   heterogeneity, we only considered individuals with paired
   primary-tumor/non-tumor solid tissue samples available. Moreover, to
   remove technical sources of variability [[52]29, [53]30], we excluded
   Formalin-Fixed Paraffin-Embedded (FFPE) tissue samples and considered
   only TCGA studies with at least 10 subjects available after these
   exclusions. This process left us with 15 TCGA projects. However, 3 of
   them (KICH, KIRP and UCEC) were later excluded because we could not
   identify enough positive-control pathways based on the procedure
   explained below (see Sect 2.3.1). The final analysis involved 1,219
   paired primary-tumor/non-tumor solid tissue samples from 604 subjects
   across 12 cancer types ([54]Table 1).

Table 1. Summary of TCGA projects and associated cancer-type-specific
pathways from MSigDB.

   TCGA Project RNA-seq Data MSigDB Pathways
   ID Description Samples Subjects Preselected Target Positive-control
   BLCA Bladder urothelial carcinoma 39 19 16 15 12
   BRCA Breast invasive carcinoma 228 113 137 109 72
   COAD Colon adenocarcinoma 84 41 30 25 15
   ESCA Esophageal carcinoma 26 13 10 7 5
   HNSC Head and neck squamous cell carcinoma 86 43 9 9 8
   KIRC Kidney renal clear cell carcinoma 144 72 8 6 3
   LIHC Liver hepatocellular carcinoma 100 50 114 92 70
   LUAD Lung adenocarcinoma 122 58 39 29 16
   LUSC Lung squamous cell carcinoma 102 51 39 29 16
   PRAD Prostate adenocarcinoma 104 52 27 24 16
   STAD Stomach adenocarcinoma 66 33 12 11 8
   THCA Thyroid carcinoma 118 59 30 21 12
   [55]Open in a new tab

2.1.2 Thyroid cancer from the Chernobyl Tissue Bank (REBC-THYR)

   Following a similar procedure as described above, we used NCI’s Genomic
   Data Commons portal [[56]28] to download gene expression quantification
   files and associated metadata for REBC-THYR, a radiation-induced
   thyroid cancer study from the Chernobyl Tissue Bank [[57]25]. Based on
   the same exclusion criteria followed for TCGA data, we obtained a total
   of 780 paired primary-tumor/non-tumor solid tissue samples from 390
   subjects.

2.1.3 Hepatocellular carcinoma from Mongolia (MO-HCC)

   From NCBI’s Gene Expression Omnibus repository (accession
   [58]GSE144269) [[59]31], we downloaded gene expression quantification
   files and associated metadata from a study of hepatocellular carcinoma
   in Mongolia, which comprises a total of 140 paired
   primary-tumor/non-tumor solid tissue samples obtained from 70 subjects
   [[60]26].

2.2 Differential Gene Expression (DGE) analysis

   All of TCGA’s gene expression quantification files are openly available
   to download from NCI’s Genomic Data Commons (GDC) portal [[61]28]; we
   provide step-by-step details in our code and documentation repository
   [[62]27]. The GDC mRNA quantification analysis pipeline measures gene
   level expression with STAR as raw read counts. Subsequently, the counts
   are augmented with several transformations including Fragments per
   Kilobase of transcript per Million mapped reads (FPKM), upper quartile
   normalized FPKM (FPKM-UQ), and Transcripts per Million (TPM). These
   values are additionally annotated with the gene symbol and gene
   bio-type. These data are generated through this pipeline by first
   aligning reads to the GRCh38 reference genome and then by quantifying
   the mapped reads. To facilitate harmonization across samples, all
   RNA-Seq reads are treated as unstranded during analyses. Further
   details are provided in the GDC documentation [[63]32].

   Gene expression count matrices consist of 60,660 quantified genes, out
   of which we kept 19,962 protein-coding genes. After correcting for gene
   name aliases, we found that 19,321 of these genes were included in
   pathways from MSigDB. We then performed further filtering and DGE
   analysis separately for each TCGA project. Using TPM quantification, we
   removed low-abundance genes with TPM > 1 in less than 40% of the
   samples, transformed the count matrix using log[2](TPM + 1) and fitted
   the expression values against tissue type (primary tumor or non-tumor
   tissue) and subject (to take into account the paired sample design)
   using functions lmFit, eBayes and topTable from the limma [[64]33] R
   package v.3.56.1. [65]S1 Table shows the number of measured genes in
   each TCGA project after all filters were applied and how many of them
   were found differentially expressed in either direction, i.e.
   over-expressed in primary-tumor (fold change FC > 1) or non-tumor (FC <
   1) tissue, using Bonferroni or Benjamini-Hochberg (B-H) q—value < 0.05
   significance thresholds adjusted for multiple testing. An illustration
   of these results is shown in [66]S1 Fig with a Volcano plot of
   significant genes based on Bonferroni-adjusted q—value < 0.05 for
   TCGA-BRCA (breast cancer).

   In order to explore an alternative procedure for count normalization
   and gene filtering feeding into the DGE analysis, we re-analyzed all
   paired primary-tumor vs non-tumor cancer samples from TCGA using raw
   counts. We implemented a standard analysis pipeline using the edgeR
   package v.3.42.2 by applying filterByExpr to filter genes, followed by
   voom quantile normalization and limma for assessing differentially
   expressed genes with variance stabilization. Using standard gene
   ranking scores, defined by [67]Eq (3) below, [68]S2 Fig illustrates the
   excellent agreement between the two procedures for TCGA-BRCA samples.
   Furthermore, in order to extend the comparison to the pathway level, we
   re-ran GSEA with gene set permutations (described below, see Sect. 2.4
   for details) using the voom-normalized data and determined
   pathway-level ranking scores based on GSEA p-values and enrichment
   directions. [69]S3 Fig shows comparisons of these pathway-level ranking
   scores derived from the two alternative count normalization procedures
   for unweighted (classic) and weighted (p = 1, 1.5 and 2) enrichment
   statistics across all TCGA-BRCA target pathways (see below for further
   details). Taken together, [70]S2 and [71]S3 Figs show that differential
   expression/enrichment analyses derived from these different count
   normalization and filtering procedures lead to highly concordant
   results at both gene and pathway levels.

2.3 Gene sets

2.3.1 Cancer-type-specific pathways

   [72]Fig 1 illustrates the process to obtain cancer-type-specific
   pathways as positive controls. We downloaded 33,591 human gene sets
   from MSigDB v2023.1.Hs (released on March 2023) [[73]34]. Using
   functionality provided by the R package fgsea v.1.26.20, we ran a
   custom script to identify, for each cancer type, gene sets whose names
   matched a curated list of search terms. The preselected pathways were
   then ran through a size filter procedure (following GSEA
   recommendations, we excluded gene sets outside the 15–500 size range,
   considering only genes that passed all filters in the quantification
   step). After running GSEA in multiple modalities, we applied a
   subsequent filter to keep target pathways that were found significant
   (unadjusted p—value < 0.05) at least once after running GSEA with
   gene-set and phenotype permutations using all available enrichment
   statistic settings. [74]Table 1 shows the number of pathways selected
   at each stage. Full details are provided in [75]S2 Table.

Fig 1. Cancer-type-specific pathways.

   [76]Fig 1
   [77]Open in a new tab

   Procedure to identify lists of preselected, target, and
   positive-control pathways using TCGA-BRCA (breast cancer) as example.
   See [78]S2 Table for full details.

2.3.2 Gene set overlap

   An important challenge of pathway enrichment analysis is that of gene
   set overlap, where some genes participate in multiple gene sets
   [[79]35, [80]36]. This phenomenon occurs in the presence of
   multifunctional genes (i.e., genes that play a role in several
   biological functions or molecular processes) but is also prevalent in
   some gene set collections with redundancies or a strong hierarchical
   structure. The latter case is most notably present, by design, in the
   three lineages of Gene Ontology (GO) [[81]23]. A useful metric used to
   quantify similarities between gene sets
   [MATH: <msub><mi mathvariant="script">G</mi><mi>i</mi></msub> :MATH]
   and
   [MATH: <msub><mi mathvariant="script">G</mi><mi>j</mi></msub> :MATH]
   is the Jaccard index, given by
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><msub><mi
   mathvariant="script">J</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo>
   =</mo><mfrac><mrow><mrow><mo>∥</mo></mrow><msub><mi
   mathvariant="script">G</mi><mi>i</mi></msub><mo>∩</mo><msub><mi
   mathvariant="script">G</mi><mi>j</mi></msub><mrow><mo>∥</mo></mrow></mr
   ow><mrow><mrow><mo>∥</mo></mrow><msub><mi
   mathvariant="script">G</mi><mi>i</mi></msub><mo>∪</mo><msub><mi
   mathvariant="script">G</mi><mi>j</mi></msub><mrow><mo>∥</mo></mrow></mr
   ow></mfrac><mspace
   width="4pt"></mspace><mo>,</mo></mrow></mtd></mtr></mtable> :MATH]
   (1)

   which defines a symmetric matrix normalized to the 0 − 1 range.
   Alternatively, the fractional pairwise overlap matrix, given by
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><msub><mi
   mathvariant="script">O</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo>
   =</mo><mfrac><mrow><mrow><mo>∥</mo></mrow><msub><mi
   mathvariant="script">G</mi><mi>i</mi></msub><mo>∩</mo><msub><mi
   mathvariant="script">G</mi><mi>j</mi></msub><mrow><mo>∥</mo></mrow></mr
   ow><mrow><mrow><mo>∥</mo></mrow><msub><mi
   mathvariant="script">G</mi><mi>i</mi></msub><mrow><mo>∥</mo></mrow></mr
   ow></mfrac><mo>,</mo></mrow></mtd></mtr></mtable> :MATH]
   (2)

   defines a non-symmetric matrix normalized to the 0 − 1 range that
   better captures hierarchical relations such as those present in GO.

2.4 Gene Set Enrichment Analysis (GSEA)

   Gene Set Enrichment Analysis is a well-established tool based on a
   weighted, normalized Kolmogorov-Smirnov statistic to assess pathway
   enrichment of ranked gene lists [[82]16, [83]17]. Originally developed
   for microarray data, GSEA is a threshold-free method that analyzes all
   genes on the basis of their ranking score, without prior gene
   filtering, and is thus recommended when ranks are available for all or
   most of the genes in the genome (e.g., transcriptomics) [[84]37].

   Let r[k] be the ranking score associated with the k − th gene, g[k],
   with k = 1, …, n an index running through all measured genes. Without
   loss of generality, we assume genes to be indexed in decreasing order
   of their ranking scores. Typically, it is assumed that ranking scores
   span positive and negative values; the absolute value of a ranking
   score is a measure of effect size and/or statistical significance,
   whereas the sign indicates the direction of change. For a case-control
   differential gene expression design with fold change FC, one popular
   choice is a ranking score of the form
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><msub><mi>r</mi><mi>k</mi></msub><mo>=</mo><m
   row><mtext>sign</mtext><mo>(</mo><mtext>log</mtext><mrow><mo>(</mo><msu
   b><mtext mathvariant="italic">FC</mtext><mi
   mathvariant="italic">k</mi></msub><mo>)</mo></mrow><mo>)</mo></mrow><mo
   >×</mo><mo>(</mo><mo>-</mo><mrow><msub><mtext>log</mtext><mn>10</mn></m
   sub><mrow><mo>(</mo><mi
   mathvariant="normal">p</mi><mo>-</mo><msub><mtext>value</mtext><mi
   mathvariant="italic">k</mi></msub><mo>)</mo></mrow></mrow><mo>)</mo><ms
   pace width="4pt"></mspace><mo>,</mo></mrow></mtd></mtr></mtable> :MATH]
   (3)

   where p–value[k] is an estimate of the statistical significance of the
   observed FC[k] for the k − th gene. Based on the DGE analysis outlined
   above (Sect. 2.2), we obtained standard ranking scores, defined by
   [85]Eq (3), which are provided in [86]S3 Table for all TCGA projects
   analyzed in our study.

   Let
   [MATH: <msub><mi mathvariant="script">G</mi><mi>i</mi></msub> :MATH]
   be a gene set (after removing any genes not included in the universe of
   measured genes). The step variable x[k] is defined as
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><msub><mi>x</mi><mi>k</mi></msub><mo>=</mo><m
   o>{</mo><mtable><mtr><mtd
   columnalign="left"><mfrac><mrow><mrow><mo>|</mo></mrow><msub><mi>r</mi>
   <mi>k</mi></msub><msup><mrow><mo>|</mo></mrow><mi>p</mi></msup></mrow><
   mrow><msub><mo>∑</mo><mrow><mi>k</mi><mo>∈</mo><msub><mi
   mathvariant="script">G</mi><mi>i</mi></msub></mrow></msub><msup><mrow><
   mo>|</mo><msub><mi>r</mi><mi>k</mi></msub><mo>|</mo></mrow><mi>p</mi></
   msup></mrow></mfrac></mtd><mtd
   columnalign="left"><mrow><mtext>if</mtext><mspace
   width="4pt"></mspace><msub><mi>g</mi><mi>k</mi></msub><mo>∈</mo><msub><
   mi
   mathvariant="script">G</mi><mi>i</mi></msub></mrow></mtd></mtr><mtr><mt
   d
   columnalign="left"><mfrac><mrow><mo>-</mo><mn>1</mn></mrow><mrow><mrow>
   <mi>n</mi><mo>-</mo><mo>∥</mo></mrow><msub><mi
   mathvariant="script">G</mi><mi>i</mi></msub><mrow><mo>∥</mo></mrow></mr
   ow></mfrac></mtd><mtd columnalign="left"><mrow><mtext>if</mtext><mspace
   width="4pt"></mspace><msub><mi>g</mi><mi>k</mi></msub><mo>∉</mo><msub><
   mi
   mathvariant="script">G</mi><mi>i</mi></msub></mrow></mtd></mtr></mtable
   ><mspace width="4pt"></mspace><mo>,</mo></mrow></mtd></mtr></mtable>
   :MATH]
   (4)

   where p > 0 is a weight parameter introduced in GSEA’s follow-up paper
   [[87]17] to emphasize the role of genes at the top and bottom of the
   gene list (i.e. those with larger absolute values of the ranking score
   in either direction) in detriment of genes in the center of the list
   (i.e. those with lower absolute values of the ranking score in either
   direction). The denominators are chosen for normalization purposes. The
   p = 0 case is akin to the standard, unweighted formulation from GSEA’s
   original paper [[88]16], although with a different normalization.

   The cumulative enrichment of gene set
   [MATH: <msub><mi mathvariant="script">G</mi><mi>i</mi></msub> :MATH]
   is defined as the running sum of the step variable x[k] from top to
   bottom, i.e.
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><msub><mi>E</mi><mi>m</mi></msub><mo>=</mo><m
   underover><mo>∑</mo><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow><mi>m</m
   i></munderover><mspace
   width="2pt"></mspace><msub><mi>x</mi><mi>k</mi></msub><mspace
   width="4pt"></mspace><mo>,</mo><mspace
   width="4pt"></mspace><mtext>for</mtext><mspace
   width="4pt"></mspace><mi>m</mi><mo>=</mo><mn>1</mn><mo>,</mo><mo>.</mo>
   <mo>.</mo><mo>.</mo><mo>,</mo><mi>n</mi><mspace
   width="4pt"></mspace><mo>.</mo></mrow></mtd></mtr></mtable> :MATH]
   (5)

   In this way, starting at E[0] ≡ 0, GSEA progressively examines genes
   from the top to the bottom of the ranked list, increasing the
   cumulative enrichment if the gene at that position is part of the
   pathway and decreasing the running sum otherwise. The step variable is
   normalized to ensure that E[n] = 0 at the bottom of the list. Then, the
   target gene set’s enrichment score (ES) is defined as the maximum
   departure from zero (in either direction, i.e. positive or negative)
   along the cumulative enrichment profile; its sign indicates whether the
   observed enrichment is associated with the top (positive ES) or bottom
   (negative ES) of the ranked gene list. Genes in the target gene set
   that contribute to the maximum departure from zero in the cumulative
   enrichment running sum are termed leading-edge or core-enrichment
   genes. By definition of the step variable normalization, ES is
   restricted to the [−1, 1] range; the extreme ES = ±1 values may only
   occur when all genes in a gene set appear strictly at the top (ES = 1)
   or bottom (ES = −1) of the ranked gene list. Because, by virtue of Eqs
   ([89]4) and ([90]5), GSEA’s weighted Kolmogorov-Smirnov (global)
   statistic cannot be expressed in terms of a simple function of the
   ranking score (local statistic), formal analytical solutions to express
   the significance of a pathway’s enrichment score based on parametric
   modeling are not known. Instead, non-parametric methods based on
   empirical distributions are computationally generated via random
   permutations. GSEA offers two options: (i) gene-set permutations, in
   which a target gene set’s enrichment score is compared against a
   null-model distribution of enrichment scores from an ensemble of
   randomly generated gene sets of equal size, and (ii) phenotype
   permutations, in which the null-model distribution is generated by
   randomly shuffling phenotype labels among the samples while keeping the
   original target gene set.

   GSEA’s GUI applications offer a limited choice of ranking statistics
   and design matrices [[91]37], which are not appropriate in the context
   of our work (see Sect. 4 below for a detailed discussion). For gene-set
   permutations, it is possible to upload the ranked list of genes into
   GSEA’s GUI in Preranked mode; however, GSEA does not offer suitable
   functionality to perform phenotype permutations on RNA-seq data
   [[92]37]. Because our study involved both gene-set and phenotype
   permutations, we implemented GSEA in the command line via custom
   scripts. In this work, GSEA was run using gsea-cli.sh in a macOS
   environment (Monterey v.12.6.9) from the latest available GSEA version
   4.3.2 (build 13, October 2022) [[93]24]. Importantly, it must be
   pointed out that gsea-3.0.jar, utilized in protocols published by
   Reimand et al [[94]37], is affected by serious security vulnerabilities
   due to the use of the Java-based logging utility Apache Log4j in GSEA
   versions earlier than 4.2.3. Moreover, as reported by the GSEA Team,
   version 3.0 contained microarray-specific code (mostly related to
   Affymetrix) that may cause issues with RNA-seq data analysis, which was
   removed in later GSEA updates.

   Our analysis of phenotype permutations followed an approach similar to
   that outlined by Reimand et al. [[95]37]. For each TCGA project under
   study, we generated n[rdm] = 1000 phenotype randomizations and obtained
   n[rdm] randomized list of ranks, which were distributed among parallel
   command-line GSEA executions. The statistical significance for each
   pathway was then empirically estimated as
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mrow><mi
   mathvariant="normal">p</mi><mo>-</mo><mtext>value</mtext></mrow><mo>≃</
   mo><mfrac><mn>1</mn><mrow><mn>1</mn><mo>+</mo><msub><mi
   mathvariant="normal">n</mi><mtext>rdm</mtext></msub></mrow></mfrac><mo>
   (</mo><mn>1</mn><mo>+</mo><munderover><mo>∑</mo><mrow><mi
   mathvariant="normal">i</mi><mo>=</mo><mn>1</mn></mrow><msub><mi
   mathvariant="normal">n</mi><mtext>rdm</mtext></msub></munderover><mo>Θ<
   /mo><mo>(</mo><mrow><mo>|</mo></mrow><msubsup><mtext>ES</mtext><mrow><m
   text>rdm</mtext></mrow><mi
   mathvariant="normal">i</mi></msubsup><mrow><mo>|</mo><mo>-</mo><mo>|</m
   o><mtext>ES</mtext><mo>|</mo></mrow><mo>)</mo><mo>)</mo><mspace
   width="4pt"></mspace><mo>,</mo></mrow></mtd></mtr></mtable> :MATH]
   (6)

   where Θ is the Heaviside step function. This expression computes the
   fraction of randomized ranked gene lists that yield enrichment scores
   [MATH: <msubsup><mtext>ES</mtext><mrow><mtext>rdm</mtext></mrow><mi
   mathvariant="normal">i</mi></msubsup> :MATH]
   larger (in absolute value) than the enrichment score ES obtained in the
   original, non-randomized ranked gene list, correcting for random
   sampling [[96]38]. It should be noticed that, because our matrix design
   was paired, the phenotype permutation procedure randomized tumor vs
   non-tumor labels while preserving subject identifiers [[97]39].

   In this work, we introduce a new consensus metric termed Enrichment
   Evidence Score (EES). At the pathway level, EES is defined as the
   signed sum of GSEA runs in which the pathway was significantly
   enriched, summed over the four enrichment statistics (classic and
   weighted with p = 1, 1.5, and 2). For instance, EES = 3 would
   correspond to a pathway found significantly enriched in tumor samples
   (positive sign) in three out of the four GSEA runs, whereas EES = −2
   would correspond to a pathway found significantly enriched in non-tumor
   samples (negative sign) in two out of the four GSEA runs. Similarly, we
   define a gene-level EES by summing over GSEA runs in which a given gene
   was identified as leading-edge.

2.5 Over-representation analysis (ORA)

   ORA is the pioneer approach for pathway enrichment analysis, originally
   designed for microarray data [[98]40]. Despite its simplicity, it is a
   robust and well-established method that has been implemented in more
   than 40 tools [[99]41], which differ in their various components such
   as gene set database, data visualization, and user interface. Unlike
   GSEA, ORA requires a selection criterion to define a list of top
   ranking genes,
   [MATH: <mi mathvariant="script">L</mi> :MATH]
   , and treats all selected genes equally, ignoring measurable
   quantitative differences (e.g. the magnitude of gene expression
   differences). ORA tests the null hypothesis of no association between
   genes in
   [MATH: <mi mathvariant="script">L</mi> :MATH]
   and those contained in a target gene set
   [MATH: <msub><mi mathvariant="script">G</mi><mi>i</mi></msub> :MATH]
   . The probability for the observed intersection between these sets,
   [MATH:
   <mrow><msub><mi>n</mi><mi>i</mi></msub><mo>≡</mo><mrow><mo>∥</mo><mi
   mathvariant="script">L</mi><mo>∩</mo><msub><mi
   mathvariant="script">G</mi><mi>i</mi></msub><mo>∥</mo></mrow></mrow>
   :MATH]
   , is given by the hypergeometric distribution, i.e.
   [MATH: <msub><mi
   mathvariant="script">P</mi><mrow><msub><mi>n</mi><mi>i</mi></msub></mro
   w></msub><mo>=</mo><mfrac><mrow><mfenced><mrow><munder><mrow><mo>∥</mo>
   <msub><mi
   mathvariant="script">G</mi><mi>i</mi></msub><mo>∥</mo></mrow><mrow><msu
   b><mi>n</mi><mi>i</mi></msub></mrow></munder></mrow></mfenced><mo>×</mo
   ><mfenced><mrow><munder><mrow><mi>n</mi><mo>−</mo><mo>∥</mo><msub><mi
   mathvariant="script">G</mi><mi>i</mi></msub><mo>∥</mo></mrow><mrow><mo>
   ∥</mo><mi
   mathvariant="script">L</mi><mo>∥</mo><mo>−</mo><msub><mi>n</mi><mi>i</m
   i></msub></mrow></munder></mrow></mfenced></mrow><mrow><mfenced><mrow><
   munder><mi>n</mi><mrow><mo>∥</mo><mi
   mathvariant="script">L</mi><mo>∥</mo></mrow></munder></mrow></mfenced><
   /mrow></mfrac><mo>,</mo> :MATH]
   (7)

   and the cumulative probability of an intersection equal or larger than
   that observed is given by
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mi
   mathvariant="script">P</mi><mo>(</mo><mo>≥</mo><msub><mi>n</mi><mi>i</m
   i></msub><mo>)</mo><mo>=</mo><mn>1</mn><mo>-</mo><munderover><mo>∑</mo>
   <mrow><mi>m</mi><mo>=</mo><mn>0</mn></mrow><mrow><msub><mi>n</mi><mi>i<
   /mi></msub><mo>-</mo><mn>1</mn></mrow></munderover><msub><mi
   mathvariant="script">P</mi><mi>m</mi></msub><mo>,</mo></mrow></mtd></mt
   r></mtable> :MATH]
   (8)

   which matches the one-sided Fisher’s exact test of the association
   between genes in
   [MATH: <mi mathvariant="script">L</mi> :MATH]
   and
   [MATH: <msub><mi mathvariant="script">G</mi><mi>i</mi></msub> :MATH]
   . This approach lies at the core of many popular tools for pathway
   analysis, both public and proprietary, such as PathwayStudio [[100]42],
   DAVID [[101]43], g:Profiler [[102]44] and Ingenuity Pathway Analysis
   [[103]45], among many others [[104]41].

   We generated custom implementations of ORA using multiple modalities.
   After generating DGE tables with tumor/non-tumor FC and p—value
   estimates (see Sect. 2.2), we proceeded according to two different
   approaches, namely (i) signed ORA and (ii) unsigned ORA. In the signed
   ORA approach, we performed two analyses, one using significant genes
   with FC > 1 and another one using significant genes with FC < 1; a
   target pathway was thus assigned two enrichment p—values, which were
   merged by choosing the most significant between the two. In the
   unsigned ORA approach, in contrast, FC was ignored and genes were
   selected solely based on their p—values. Pan et al [[105]46] pointed
   out one shortcoming of ORA, namely that pathways identified as
   statistically significant may strongly depend on the cutoffs used to
   select the input lists of differentially expressed genes. To take this
   effect into consideration, we implemented the q—value < 0.05 selection
   criterion applied to two multiple-testing correction scenarios, namely:
   (i) Bonferroni-adjusted p-values (stricter and therefore leading to
   shorter input gene lists), and (ii) Benjamini-Hochberg (also known as
   false discovery rate, FDR), as an alternative, more lenient approach
   used to select lists for ORA. In combination with the signed/unsigned
   ORA approaches mentioned above, these gene selection criteria yield a
   total of four ORA-based scenarios.

2.6 Analysis of sensitivity and specificity

   To perform an analysis of sensitivity and specificity for each TCGA
   project in this study, we generated an ensemble of 1000
   negative-control random pathways. Pathway sizes in the 15 − 500 range
   were sampled from a uniform distribution; genes were randomly chosen
   from the set of measured genes that passed the filters (see Sect. 2.2
   for data processing details). After ordering positive- and
   negative-control pathways by enrichment p—value, we used the roc
   function from R package pROC [[106]47] v.1.18.4 to generate
   one-directional receiver operating characteristic (ROC) curves to test
   the assumption that positive-control pathways are more significantly
   enriched than negative-control ones. The Area Under the Curve (AUC) was
   calculated with function auc and DeLong’s 95% confidence interval with
   function ci.auc from the same package. For visualizations, we used R
   package VennDiagram v.1.7.3, R package gplots v.3.1.3, and Cytoscape
   v.3.9.1.

3 Results

   We begin to explore our ideas using TCGA-BRCA (breast cancer), the most
   common cancer type in the U.S. and worldwide, for which we have 228
   paired primary-tumor/non-tumor RNA-seq samples available from 113
   subjects. Following the procedure outlined in Sect. 2.3.1, a sequential
   series of filters was applied on MSigDB gene sets to identify 72
   positive-control pathways ([107]Fig 1). The analysis of gene set
   overlap metrics, namely the Jaccard index ([108]Eq (1), see [109]S4
   Fig) and the fractional pairwise overlap matrix ([110]Eq (2), see
   [111]S5 Fig), revealed that positive-control pathways are mostly
   non-overlapping. We then ran GSEA under eight different modalities,
   choosing gene-set or phenotype permutations combined with four
   different choices for the enrichment statistic, either classic
   (unweighted) or weighted (p = 1, 1.5, 2), which comprise all the
   options available in the current GSEA release (see Sect. 2.4 for
   further details).

   [112]Fig 2(a) shows the overlap of positive-control pathways found
   significant among different weight parameter choices using gene-set
   permutations and p—value < 0.05 as significance criterion. We observe
   that, while half (36) of the pathways were found significant by all
   four runs and an additional set of 8 were found significant by three
   runs, 9 pathways were identified only by two methods and 19 of them
   were found by only one method. Among the latter, most of them (17) were
   identified by the classic, unweighted method. By tightening the
   significance criterion to p—value < 0.01, shown in [113]Fig 2(b), the
   percentage of pathways found significant by all methods increases only
   slightly to 56% (32 out of 57 pathways). We find that stronger
   consensus cannot be enforced by tightening the significance criterion
   even further; using p—value < 0.001, still only 56% (26 out of 46)
   pathways agree among all methods (see [114]S6 Fig). Similarly, [115]Fig
   2(c) and 2(d) show results for phenotype permutations using different
   p-value cutoffs, as indicated. It is clearly apparent that the
   phenotype permutation approach identifies fewer pathways and shows
   poorer agreement among different weight parameter choices. For stricter
   selection criteria (p—value < 0.001), none of the 72 positive control
   pathways are identified.

Fig 2. Significant TCGA-BRCA positive control pathways across different
weight parameter choices.

   [116]Fig 2
   [117]Open in a new tab

   (a) Gene-set permutation with p—value < 0.05. (b) Gene-set permutation
   with p—value < 0.01. (c) Phenotype permutation with p—value < 0.05. (d)
   Phenotype permutation with p—value < 0.01. GSEA enrichment statistics:
   classic (“cl”), weight parameter p = 1 (“p1”), weight parameter p = 1.5
   (“p1.5”), and weight parameter p = 2 (“p2”).

   To complement these findings, [118]Fig 3 shows the overlap between
   gene-set and phenotype permutation methods at significance p—value <
   0.05 for each enrichment statistic. As shown in this figure, phenotype
   permutation is significantly less sensitive than gene-set permutation,
   since the former approach identifies subsets of less than half the
   number of positive control pathways obtained by the latter.
   Furthermore, to explore the impact of tumor/non-tumor sample
   pairedness, which affects the magnitude and statistical significance of
   differential gene expression estimates (and, therefore, gene ranking
   metrics) as well as phenotype randomization procedures (Sect. 2.4), we
   re-ran our pipeline as unpaired data. As shown by [119]S7 Fig (for
   gene-set permutations) and [120]S8 Fig (for phenotype permutations),
   differences between paired and unpaired analyses appear to be minor.

Fig 3. Significant TCGA-BRCA positive control pathways using gene-set (“gs”)
and phenotype (“ph”) permutation approaches for different enrichment
statistics.

   [121]Fig 3
   [122]Open in a new tab

   The significance criterion was p—value < 0.05. (a) Classic
   (unweighted). (b) Weight parameter p = 1. (c) Weight parameter p = 1.5.
   (d) Weight parameter p = 2.

   In order to compare the sensitivity and specificity of GSEA under
   different modalities, we generated 1000 negative-control pathways by
   random selection of genes (see Sect. 2.6 for details). Ordering
   positive and negative controls by significance, [123]Fig 4 shows
   receiver operating characteristic (ROC) curves for (a) gene-set and (b)
   phenotype permutation approaches, using different enrichment
   statistics. For comparison, we also show results from (c) signed and
   (d) unsigned implementations of over-representation analysis (ORA)
   using gene lists selected by q—value < 0.05 adjusted by two different
   methods (see Sect. 2.5 for details). This analysis shows that the
   classic gene-set permutation approach of GSEA yields the best
   sensitivity vs specificity tradeoff with Area Under the Curve (AUC)
   estimated as 0.99 and DeLong’s 95% confidence interval (CI) in the 0.98
   − 1 range. The introduction of weighted enrichment statistics appears
   to be deleterious, negatively impacting the performance of gene-set
   approaches ([124]Fig 4(a)). Moreover, in agreement with previous
   results, phenotype permutation models display poorer sensitivity
   ([125]Fig 4(b)). The straightforward signed ORA ([126]Fig 4(c))
   achieves a performance similar to that of gene-set GSEA with p = 1 and
   better than that of other weighted approaches, but below gene-set
   classic GSEA. It should be noticed that our definition of
   positive-control pathways relies on a pathway significance filter,
   which removes target pathways that appeared not significant in any of
   the GSEA runs (Sect. 2.3.1 and [127]Fig 1). By removing this filtering
   step and, thereby, expanding the target list from 72 to 109 pathways,
   our main conclusions remain the same: the classic gene-set permutation
   approach of GSEA is the best-performing model with AUC = 0.87 (95% CI =
   0.83 − 0.91), as shown in [128]S9 Fig.

Fig 4. ROC curves for different GSEA and ORA approaches using 72 TCGA-BRCA
positive control pathways and 1000 randomized negative controls.

   [129]Fig 4
   [130]Open in a new tab

   (a) Gene-set permutation GSEA. (b) Phenotype permutation GSEA. (c)
   Signed ORA. (d) Unsigned ORA. GSEA approaches used different enrichment
   statistics, as indicated. ORA approaches used Bonferroni and
   Benjamini-Hochberg (B-H) adjusted q-values as different inclusion
   criteria to select differentially expressed genes, as indicated.

   [131]Fig 5 expands the analysis by showing the Area Under the Curve
   across all cancer types included in our study, where error bars
   indicate DeLong’s 95% confidence intervals, for different GSEA and ORA
   approaches using positive control pathways and randomized negative
   controls as previously described. Despite the fact that the number of
   available samples, subjects, and pathways identified as positive
   controls varies significantly across projects (recall [132]Table 1),
   the classic gene-set permutation approach of GSEA remains, quite
   remarkably, the top performing method. As shown in [133]S10 Fig, this
   finding is confirmed by expanding the analysis to include all target
   pathways (i.e. by removing the last pathway filtering step in the
   procedure described in Sect. 2.3.1 and [134]Fig 1).

Fig 5. AUC across TCGA projects for different GSEA and ORA approaches using
cancer-type-specific positive control pathways and 1000 randomized negative
controls.

   [135]Fig 5
   [136]Open in a new tab

   (a) Gene-set permutation GSEA. (b) Phenotype permutation GSEA. (c)
   Signed ORA. (d) Unsigned ORA. GSEA approaches used different enrichment
   statistics, as indicated. ORA approaches used Bonferroni and
   Benjamini-Hochberg (B-H) adjusted q-values as different inclusion
   criteria to select differentially expressed genes, as indicated.

   As discussed earlier (see Sect. 2.2), alternative methods for
   differential gene expression analysis exist, which rely on a variety of
   approaches for count normalization and statistical models for
   differential expression quantification. For the sake of comparison,
   [137]S11 Fig shows ROC curves for GSEA gene-set permutations under
   different enrichment statistics obtained from TCGA-BRCA’s raw counts
   analyzed using the edgeR-voom-limma pipeline previously described.
   Although the differences among different enrichment statistics appear
   to be less pronounced compared with [138]Fig 4(a), the unweighted
   Kolmogorov statistic approach still yields the largest AUC value.
   Similarly, [139]S12 Fig extends the comparison to all TCGA projects
   analyzed in this study.

   Finally, we explored similarities and differences between target
   pathways identified in TCGA versus other cancer-type-specific cohorts.
   Firstly, we analyzed REBC-THYR, a thyroid cancer study consisting of
   780 paired tumor/nontumor RNA-seq tissue samples from 390 subjects
   (Sect. 2.1.2). Using gene-set permutations in GSEA with threshold
   p—value < 0.05, we calculated the Enrichment Evidence Score (EES) for
   each target pathway (see Sect. 2.4 for details). [140]Fig 6(a) shows
   the contingency table of EES across target pathways for REBC-THYR vs
   TCGA-THCA, which displays a remarkable agreement between pathways
   detected in both thyroid cancer studies (see [141]S4 Table for
   details). By grouping EES intervals as: (i) [−4, −3]
   (non-tumor-enriched), (ii) [−2, 2] (undetermined), and (iii) [[142]3,
   [143]4] (tumor-enriched), the association between enriched pathways is
   further emphasized, as shown in [144]Fig 6(b), leading to Fisher’s
   exact test p—value = 2.5 × 10^−6. Secondly, we analyzed MO-HCC, a study
   on hepatocellular carcinoma (the most common form of liver cancer)
   consisting of 140 paired tumor/nontumor RNA-seq tissue samples from 70
   subjects (Sect. 2.3.1). Following a similar procedure, we obtained the
   contingency table of EES across target pathways for MO-HCC vs
   TCGA-LIHC, shown in [145]Fig 6(c) and [146]S5 Table, as well as the
   contingency table by grouped intervals, displayed in [147]Fig 6(d).
   These results appear to capture the larger heterogeneity of TCGA-LIHC
   compared with MO-HCC, which is to be expected since the Mongolia study
   is focused on a cohort of rather uniform genetic background from a
   relatively isolated region exposed to weaker external influences
   [[148]26]. Despite the spread attributable to differences in etiology,
   the association between EES groups appears highly significant, with
   Fisher’s exact test p—value = 4.4 × 10^−15.

Fig 6. Pathway-level enrichment evidence scores for thyroid cancer and
hepatocellular carcinoma cohorts.

   [149]Fig 6
   [150]Open in a new tab

   (a) Comparison between significant target pathways in REBC-THYR vs
   TCGA-THCA. (b) Contingency table of grouped EES intervals in REBC-THYR
   vs TCGA-THCA (Fisher’s exact test p—value = 2.5 × 10^−6). (c)
   Comparison between significant target pathways in MO-HCC vs TCGA-LIHC.
   (d) Contingency table of grouped EES intervals in MO-HCC vs TCGA-LIHC
   (Fisher’s exact test p—value = 4.4 × 10^−15). Distance to the diagonal
   is represented with increasingly darker shades of blue.

   A similar approach can also be implemented to investigate the role of
   leading-edge genes within the previously identified core pathways.
   Leading-edge genes are those identified by the GSEA algorithm to make
   the strongest contribution to the enrichment signal of a pathway (see
   Sect. 2.4 for details). For the sake of illustration, [151]Fig 7(a)
   shows gene-level EES scores for LUI_THYROID_CANCER_CLUSTER_1, a
   previously identified high-consensus tumor-enriched pathway in thyroid
   cancer (recall [152]S4 Table). The association between EES groups,
   shown in [153]Fig 7(b), is highly significant, with Fisher’s exact test
   p—value = 4.6 × 10^−8. Out of 52 constituent genes, we observe that 17
   of them are robustly identified as leading-edge genes in this thyroid
   cancer pathway in both cohorts (see [154]S6 Table for details).
   Interestingly, three genes (EPOR, GAS6, CDC25B) are found to be
   leading-edge only in TCGA-THCA, while other three genes (SPRY1,
   ARHGAP1, RUBCN) are identified as leading-edge only in REBC-THYR, which
   may point to mechanistic differences across etiologies. [155]Fig 7(c)
   illustrates how we can integrate high-consensus leading-edge genes
   (defined by requiring |EES| ≥ 3 in at least one of the cohorts) and
   core pathway information. The seven core thyroid cancer pathways
   reported earlier here appear depicted in connection with high-consensus
   leading-edge genes; for simplicity, only genes connected to two or more
   pathways are shown.

Fig 7. Gene-level enrichment evidence scores for thyroid cancer cohorts.

   [156]Fig 7
   [157]Open in a new tab

   (a) Comparison between EES leading-edge genes in REBC-THYR vs TCGA-THCA
   for genes in LUI_THYROID_CANCER_CLUSTER_1, previously identified as a
   high-consensus tumor-enriched pathway in thyroid cancer. (b)
   Contingency table of grouped EES intervals for the same case as in
   panel (a) (Fisher’s exact test p—value = 4.6 × 10^−8). (c) Network
   representation showing seven core thyroid cancer pathways and
   high-consensus leading-edge genes (|EES| ≥ 3 in at least one of the
   cohorts). Only genes connected to two or more pathways are shown.

4 Discussion

   Gene Set Enrichment Analysis (GSEA) pioneered the broad class of
   functional class scoring approaches to pathway analysis and remains, to
   this day, one of the leading methods of choice. For this reason, it is
   of paramount importance for users to be aware of its caveats and
   limitations. In this Section, we discuss some important aspects of the
   current implementation of GSEA.

   The GSEA algorithm provides a number of alternative statistics that can
   be used for feature ranking. For categorical phenotypes, options are
   signal-to-noise (default), t-test, ratio of classes, difference of
   classes, and log[2] of ratio of classes. For numerical phenotypes,
   options are cosine, Euclidean, Manhattan, Pearson and Spearman
   correlation ranking metrics. These ranking statistics aim to capture
   some measure of differential expression between categorical phenotypes,
   or some expression trend associated with quantitative phenotypes. Zyla
   et al [[158]14] have analyzed these and other ranking metrics, showing
   that they have a critical impact on GSEA results based on a large
   number of microarray benchmark datasets. While these metrics are also
   widely used for RNA-seq datasets, it is important to be aware that
   these ranking statistics, originally selected for their effectiveness
   when used with microarray-based expression data, have not been
   evaluated for their use with data derived from RNA-seq experiments. As
   noted by Reimand et al [[159]37], rather than using these built-in
   options, RNA-seq-derived differential gene expression values should be
   computed outside of GSEA, using methods that include variance
   stabilization (such as edgeR [[160]48], DESeq [[161]49], limma
   [[162]33], and voom [[163]50]) and then imported into GSEA software for
   the actual pathway analysis computation. The current implementation of
   GSEA, moreover, is not able to handle slightly more complex design
   matrices, even common ones such as paired data analysis. In this work,
   we illustrate an approach using paired tumor/non-tumor samples to
   control for inter-subject heterogeneity, which also makes use of
   limma’s linear modeling and empirical Bayes moderation to assess
   differential expression (see Sect. 2.2).

   GSEA has two methods for determining the statistical significance of
   pathway enrichment: gene-set permutation and phenotype permutation (see
   Sect. 2.4 for details). Gene-set permutation is recommended for studies
   with limited variability and biological replicates, while phenotype
   permutation is recommended when a larger number of replicates is
   available (as a rule of thumb, at least ten per condition) [[164]37].
   The main advantage of the latter is that it preserves the structure of
   gene sets with biologically important gene-gene correlations, in
   contrast to the gene-set permutation approach [[165]51]. Despite these
   recommendations, performing phenotype permutation with the appropriate
   methods that RNA-seq data require for differential expression modeling
   and ranking is a daunting task for the typical GSEA user, since it
   demands custom programming to compute pathway enrichment scores and
   differential expression statistics separately for thousands of
   phenotype randomizations [[166]37]. Besides the issues of custom
   programming and computational complexity, phenotype permutation jobs
   typically take several hours to complete, even in a parallelized
   multi-CPU environment. By implementing and executing the phenotype
   permutation approach, however, our work provided evidence that the
   gene-set permutation method showed superior performance to identify
   RNA-seq-based, cancer-type-specific pathways.

   GSEA offers the option of a weighted Kolmogorov-Smirnov global
   statistic to assess gene set enrichment; this global statistic is
   weighed using a power of the local statistic, p > 0 (see Sect. 2.4).
   Besides the so-called classic, or unweighted, original version of GSEA
   (akin to setting p = 0), built-in options for this weight parameter are
   p = 1 (default), p = 1.5, and p = 2. By increasing p, the suppression
   effects of genes with smaller absolute values of the local statistic
   become more pronounced, thereby emphasizing the impact of the genes at
   the top and bottom of the ranked gene list. No clear guidance is
   offered to users as to how to choose this so-called enrichment
   statistic with RNA-seq data, since the main evidence in favor of
   weighted statistics was produced when the method was first introduced
   for the analysis of microarray datasets [[167]17]. In this work, we
   explored all available options for both permutation types and concluded
   that the classic, unweighted gene-set permutation procedure offers
   comparable or better sensitivity-specificity tradeoffs across cancer
   types compared with other, more complex and computationally intensive
   permutation methods. It should be noticed that only performing
   unweighted GSEA and permuting the genes, analytical solutions for the
   expected average enrichment score are available [[168]51]; all other
   GSEA modalities require the calculation of non-parametric,
   computationally intensive enrichment score distributions.

   In fact, by running all GSEA enrichment statistics under gene-set
   permutation, which can be easily accomplished by uploading a ranked
   list of genes to GSEA’s Preranked mode, no custom programming is
   required (except for RNA-seq data processing, QC, normalization,
   filtering, and differential gene expression analysis, which may be
   accomplished with standard software outside of GSEA). As demonstrated
   by our analysis of additional cohorts from non-TCGA sources, we see
   value in a strategy that uses the proposed Enrichment Evidence Score
   (EES) as a consensus metric to identify a core set of pathways,
   complemented by an expanded set of pathways for exploratory analysis,
   which may be more specifically tailored to the study of interest.
   Similarly, EES can be used to identify high-consensus leading-edge
   genes, which allows to integrate pathway- and gene-level information.
   Hence, the standard, unidirectional sense of information flow through
   pathway analysis (from genes to pathways) acquires a feedback loop to
   enrich the input, gene-phenotype associations (e.g. differential gene
   expression) with functional, gene-pathway associations. The information
   thus flows from genes to pathways and back.

5 Conclusion

   The main goal of this work was to quantitatively assess the performance
   of a variety of GSEA modalities and provide guidance in the practical
   use of GSEA for the analysis of RNA-seq data. We leveraged harmonized
   RNA-seq datasets available from TCGA in combination with large, curated
   pathway collections from MSigDB to obtain cancer-type-specific target
   pathway lists across multiple cancer types. We performed a detailed
   analysis of GSEA performance using both gene-set and phenotype
   permutations combined with four different choices for the
   Kolmogorov-Smirnov enrichment statistic, either unweighted (also known
   as classic) or weighted (with three different options for the weight
   parameter). Based on our benchmarks, we concluded that the
   classic/unweighted gene-set permutation approach offered comparable or
   better sensitivity-specificity tradeoffs across cancer types compared
   with other, more complex and computationally intensive permutation
   methods, and also appeared superior to the performance of standard
   over-representation analysis procedures. Finally, we analyzed other
   large cohorts for thyroid cancer and hepatocellular carcinoma. We
   utilized a new consensus metric, the Enrichment Evidence Score (EES),
   which showed a remarkable agreement between pathways identified in TCGA
   and those from other sources, despite differences in cancer etiology.
   This procedure suggests an analysis strategy in which, by running
   multiple GSEA modalities in palallel, one may identify a core set of
   pathways (namely, those showing a high degree of consensus across
   enrichment metrics), which may be complemented by an expanded set of
   pathways for exploratory analysis. The EES metric can also be utilized
   to identify high-consensus leading-edge genes and integrate pathway-
   and gene-level information. In accordance to FAIR Principles for
   Research Software (FAIR4RS) [[169]52], we made all source code openly
   and publicly available along with detailed step-by-step documentation
   [[170]27], with the two-pronged aims (i) to ensure the transparency and
   reproducibility of our results, as well as (ii) to enable further
   RNA-seq-based benchmarking work by setting up a suitable and flexible
   framework. We believe that our work fills the existing gap in current
   guidelines and benchmarks for the use of GSEA with RNA-seq data and
   that, hopefully, it will stimulate further efforts to advance pathway
   analysis methods.

Supporting information

   S1 Table. Genes measured and differentially expressed in each TCGA
   project.

   Number of measured protein-coding genes after all filters were applied
   and how many of them were found differentially expressed in either
   direction, i.e. over-expressed in primary-tumor (FC > 1) or non-tumor
   (FC < 1) tissue, using Bonferroni or Benjamini-Hochberg (B-H) q—value <
   0.05 significance thresholds adjusted for multiple testing.

   (XLSX)
   [171]pone.0302696.s001.xlsx^ (11.3KB, xlsx)
   S2 Table. Cancer-type specific pathways.

   For each TCGA study, we provide name, size, collection, and GMT file
   name for all cancer-type specific preselected pathways. Additional
   columns indicate whether preselected pathways were included in the
   target and positive-control pathway lists.

   (XLSX)
   [172]pone.0302696.s002.xlsx^ (34.7KB, xlsx)
   S3 Table. Gene ranking scores for all TCGA projects analyzed in our
   study.

   (XLSX)
   [173]pone.0302696.s003.xlsx^ (3.9MB, xlsx)
   S4 Table. Enrichment evidence scores of thyroid cancer pathways in
   TCGA-THCA and REBC-THYR.

   (XLSX)
   [174]pone.0302696.s004.xlsx^ (11KB, xlsx)
   S5 Table. Enrichment evidence scores of liver cancer pathways in
   TCGA-LIHC and MO-HCC.

   (XLSX)
   [175]pone.0302696.s005.xlsx^ (12.8KB, xlsx)
   S6 Table. Enrichment evidence scores for genes in the
   LUI_THYROID_CANCER_CLUSTER_1 pathway in TCGA-THCA and REBC-THYR.

   (XLSX)
   [176]pone.0302696.s006.xlsx^ (11.5KB, xlsx)
   S1 Fig. Differentially-expressed genes in TCGA-BRCA.

   Volcano plot showing significant genes over-expressed in primary-tumor
   (red) or non-tumor (blue) tissue based on Bonferroni-adjusted q—value <
   0.05. Labels for the top ten genes on either side are also shown.

   (TIF)
   [177]pone.0302696.s007.tif^ (2.8MB, tif)
   S2 Fig. Comparison of gene ranking scores obtained from differential
   gene expression (DGE) assessed via voom quantile normalization vs
   log-transformed transcript per million (TPM) normalized counts.

   DGE was assessed from 228 paired primary-tumor vs non-tumor breast
   cancer samples from TCGA-BRCA. The linear fit is shown by a dashed red
   line. Spearman’s correlation estimate and p-value are shown in the
   legend.

   (TIF)
   [178]pone.0302696.s008.tif^ (6.9MB, tif)
   S3 Fig. Comparison of pathway ranking scores obtained from differential
   gene expression (DGE) assessed via voom quantile normalization vs
   log-transformed transcript per million (TPM) normalized counts.

   DGE was assessed from 228 paired primary-tumor vs non-tumor breast
   cancer samples from TCGA-BRCA. Target pathways were analyzed using GSEA
   with different enrichment statistics: (a) classic (unweighted); (b)
   weight parameter p = 1; (c) weight parameter p = 1.5; (d) weight
   parameter p = 2. Pathway ranking scores were defined as
   −log_10(p—value)*sign(ES), where p-values were empirically determined
   via gene-set permutations and ES represented gene set enrichment
   scores. Because we used 1000 permutations for GSEA’s null model, we
   adopted p—value = 0.001 as lower threshold, which implies that pathway
   ranking scores are constrained to the [−3, 3] range. Linear fits are
   shown by dashed red lines. Spearman’s correlation estimates and
   p-values are shown in the legends.

   (TIF)
   [179]pone.0302696.s009.tif^ (919KB, tif)
   S4 Fig. Jaccard index matrix to assess overlaps among 72 TCGA-BRCA
   positive control pathways.

   (TIF)
   [180]pone.0302696.s010.tif^ (19MB, tif)
   S5 Fig. Fractional pairwise overlap matrix to assess redundancies among
   72 TCGA-BRCA positive control pathways.

   (TIF)
   [181]pone.0302696.s011.tif^ (18.7MB, tif)
   S6 Fig. Significant TCGA-BRCA positive control pathways across
   different weight parameter choices.

   Gene-set permutation with p—value < 0.001.

   (TIF)
   [182]pone.0302696.s012.tif^ (1.3MB, tif)
   S7 Fig. Comparison of significant TCGA-BRCA target pathways from paired
   vs unpaired analyses using gene-set permutation approaches for
   different enrichment statistics.

   The significance criterion was p—value < 0.05. (a) Classic
   (unweighted). (b) Weight parameter p = 1. (c) Weight parameter p = 1.5.
   (d) Weight parameter p = 2.

   (TIF)
   [183]pone.0302696.s013.tif^ (1.5MB, tif)
   S8 Fig. Comparison of significant TCGA-BRCA target pathways from paired
   vs unpaired analyses using phenotype permutation approaches for
   different enrichment statistics.

   The significance criterion was p—value < 0.05. (a) Classic
   (unweighted). (b) Weight parameter p = 1. (c) Weight parameter p = 1.5.
   (d) Weight parameter p = 2.

   (TIF)
   [184]pone.0302696.s014.tif^ (1.2MB, tif)
   S9 Fig. ROC curves for different GSEA and ORA approaches using 109
   TCGA-BRCA target pathways and 1000 randomized negative controls.

   (a) Gene-set permutation GSEA. (b) Phenotype permutation GSEA. (c)
   Signed ORA. (d) Unsigned ORA. GSEA approaches used different enrichment
   statistics, as indicated. ORA approaches used Bonferroni and
   Benjamini-Hochberg (B-H) adjusted p-values as different inclusion
   criteria to select differentially expressed genes, as indicated.

   (TIF)
   [185]pone.0302696.s015.tif^ (2.3MB, tif)
   S10 Fig. AUC across TCGA projects for different GSEA and ORA approaches
   using cancer-type-specific target pathways and 1000 randomized negative
   controls.

   (a) Gene-set permutation GSEA. (b) Phenotype permutation GSEA. (c)
   Signed ORA. (d) Unsigned ORA. GSEA approaches used different enrichment
   statistics, as indicated. ORA approaches used Bonferroni and
   Benjamini-Hochberg (B-H) adjusted p-values as different inclusion
   criteria to select differentially expressed genes, as indicated.

   (TIF)
   [186]pone.0302696.s016.tif^ (3.4MB, tif)
   S11 Fig. ROC curves for gene-set permutation GSEA from TCGA-BRCA
   samples.

   Results obtained by an alternative gene expression analysis derived via
   the edgeR-voom-limma pipeline described in Sect. 2.2.

   (TIF)
   [187]pone.0302696.s017.tif^ (7.7MB, tif)
   S12 Fig. AUC for gene-set permutation GSEA across TCGA projects.

   Results obtained by an alternative gene expression analysis derived via
   the edgeR-voom-limma pipeline described in Sect. 2.2.

   (TIF)
   [188]pone.0302696.s018.tif^ (7.6MB, tif)

Acknowledgments