Abstract Introduction: Gene set enrichment analysis (GSEA) subsequent to differential expression analysis is a standard step in transcriptomics and proteomics data analysis. Although many tools for this step are available, the results are often difficult to reproduce because set annotations can change in the databases, that is, new features can be added or existing features can be removed. Finally, such changes in set compositions can have an impact on biological interpretation. Methods: We present bootGSEA, a novel computational pipeline, to study the robustness of GSEA. By repeating GSEA based on bootstrap samples, the variability and robustness of results can be studied. In our pipeline, not all genes or proteins are involved in the different bootstrap replicates of the analyses. Finally, we aggregate the ranks from the bootstrap replicates to obtain a score per gene set that shows whether it gains or loses evidence compared to the ranking of the standard GSEA. Rank aggregation is also used to combine GSEA results from different omics levels or from multiple independent studies at the same omics level. Results: By applying our approach to six independent cancer transcriptomics datasets, we showed that bootstrap GSEA can aid in the selection of more robust enriched gene sets. Additionally, we applied our approach to paired transcriptomics and proteomics data obtained from a mouse model of spinal muscular atrophy (SMA), a neurodegenerative and neurodevelopmental disease associated with multi-system involvement. After obtaining a robust ranking at both omics levels, both ranking lists were combined to aggregate the findings from the transcriptomics and proteomics results. Furthermore, we constructed the new R-package “bootGSEA,” which implements the proposed methods and provides graphical views of the findings. Bootstrap-based GSEA was able in the example datasets to identify gene or protein sets that were less robust when the set composition changed during bootstrap analysis. Discussion: The rank aggregation step was useful for combining bootstrap results and making them comparable to the original findings on the single-omics level or for combining findings from multiple different omics levels. Keywords: bootstrap analysis, gene set enrichment analysis, multi-omics analysis, proteomics, rank aggregation, transcriptomics 1 Introduction Set-based enrichment methods are an integral part of the analysis of high-throughput expression data, such as those originating from transcriptomics or proteomics experiments. Enrichment methods allow the identification of molecular pathways, Gene Ontology (GO) terms, and other gene sets that might play a role in the disease of interest. Most enrichment methods are subsequently conducted for differential expression analysis; that is, they rely on the ranking of genes after comparing two groups of samples. Statistical tests are used to determine whether the genes of a particular set are disproportionately highly enriched among the differentially expressed genes (DEGs) ([34]Beissbarth and Speed, 2004; [35]Subramanian et al., 2005; [36]Ackermann and Strimmer, 2009). This contrasts with self-contained gene set tests, which are based on subsets of expression data related to a particular gene set ([37]Goeman et al., 2004; [38]Hummel et al., 2008; [39]Jung et al., 2011; [40]Bayerlová et al., 2015). Set information for enrichment analyses is usually obtained from public databases, for example, on molecular pathways or GO terms. The most commonly used databases are the “Reactome pathway knowledgebase” ([41]Fabregat et al., 2016) ([42]https://reactome.org/), “Kyoto Encyclopedia of Genes and Genomes” ([43]Kanehisa and Goto, 2000) (KEGG; [44]https://www.genome.jp/kegg/), “WikiPathways” database ([45]Kelder et al., 2012) ([46]https://www.wikipathways.org), “GO” knowledgebase ([47]Consortium, 2004) ([48]http://geneontology.org/), and the “Molecular Signatures Database” ([49]Liberzon et al., 2011) (MSigDB; [50]https://www.gsea-msigdb.org/gsea/msigdb). In the GO database, a particular GO term comprises a set of genes that can be assigned to a biological process (BP), molecular function (MF), or cellular component (CC). The contents of the databases are curated either automatically by computer algorithms or manually by experts ([51]Consortium, 2004). Specifically, WikiPathways provides community-based curation by registered contributors ([52]Kelder et al., 2012). An example of a database where curation is done both ways, manually and computationally, is the MSigDB. Furthermore, pathway membership can be experimentally validated or predicted computationally. However, none of the modes of curation can prevent the uncertainties remaining regarding the membership of individual genes to a particular pathway ([53]Gillis and Pavlidis, 2013). This is important because all enrichment analyses rely on the correctness of the database information, and the results of enrichment analyses would change if features of a set are removed or added. This can especially happen when the database information is retrieved at different times. For example, the GO database contained 42,442 terms classified as valid and 5,287 classified as obsolete in January 2024. Two months before, only 4,889 terms were classified as obsolete, meaning that nearly 400 terms would have to be reconsidered when a Gene-set enrichment analysis (GSEA) is performed after January 2024. In addition, the WikiPathways database reports roughly between 100 and 700 edits per month. Furthermore, in the KEGG database, complete pathways can be merged, leading to a large number of changes. For example, the KEGG pathway map00471 has been deleted and then added to the KEGG pathway map00470 (“D-amino acid metabolism”). In this work, we present bootGSEA as a novel bootstrap approach to repeatedly sample subsets of pathways or other gene sets to study whether a result remains significant when the set composition is changed. The ranking lists of the gene sets obtained from each bootstrap replicate were aggregated using a score that can be used for a new ranking list. The analyst can then compare the original ranking with the bootstrap-based ranking list to study whether the association of a pathway or GO term with the disease gains or loses evidence. A similar approach was proposed by [54]Schmid et al. (2016), who generated a robustness score for each gene set using random subsets of gene sets. In contrast to their approach, our method results in a new ranking of gene sets that can be helpful in aggregating findings from different independent studies or different omics levels. Thus, our approach for multi-omics follows the idea of aggregating the different omics levels after performing primary analysis on the individual levels first. This way of multi-omics analysis has also been implemented in other studies. For example, [55]Wang et al. (2014) fused networks that were first derived on individual omics levels, [56]Xiong et al. (2012) integrated genetic and transcriptomics results in a joint score, and [57]Kang et al. (2022) fitted neural networks from individual omics levels and merged them into a joint model. We also demonstrated that this method is useful for aggregating the results of enrichment analysis from different omics domains in the same experiment. We applied our new bootstrap pipeline to a single-omics scenario (transcriptomics only) comprising six independent renal cancer datasets. This example was used to show how our bootstrap pipeline can help study the robustness of GSEA when comparing results from multiple independent datasets from studies on a similar research question. In addition, we analyzed our multi-omics kidney data (transcriptomics and proteomics) from our research consortium on spinal muscular atrophy (SMA). The data were obtained from a SMA mouse model to demonstrate the usefulness of our approach when comparing GSEA results between different omics levels. SMA is a monogenic disease caused by the mutation or deletion of the survival motor neuron 1 (SMN1) gene ([58]Lefebvre et al., 1995). The disease is characterized by the degeneration of motoneurons, with the subsequent atrophy of skeletal muscles to muscular atrophy since the SMN affects all tissues, which also include non-skeletal muscles. Moreover, SMA is a multi-system disorder that also affects peripheral organs, such as the kidney ([59]Allardyce et al., 2020). Three treatment methods are available, all increasing SMN expression. The SMN is expressed ubiquitously and has several important cellular functions, including snRNP assembly, R-loop resolution, and regulation of the actin cytoskeleton and translation ([60]Hensel et al., 2020). Therefore, SMA is a highly complex disease with expected dysregulations in pathways in several cell types and on several molecular levels. 2 Methods In this section, we describe the analysis pipeline, including the approach for the bootstrap step used to repeatedly analyze different random subsets of the data. Furthermore, the rank aggregation step and examples of transcriptomics and proteomics data are presented. The complete workflow is shown in [61]Figure 1. All the analyses were implemented in the R programming environment [[62]www.r-project.org, version 4.2 ([63]R Core Team, 2022)]. FIGURE 1. [64]FIGURE 1 [65]Open in a new tab Workflow of the analysis pipeline for bootstrap enrichment analysis at the single-omics level and the steps for single- and multi-omics rank aggregation. For the data obtained from each omics domain (transcriptomics and proteomics data), Gene-set enrichment analysis (GSEA) is performed after differential expression analysis using all data (original standard analysis). Next, gene set enrichment analysis is performed on subsets of data based on bootstrap samples. Finally, the different rankings of the GO terms or pathways can be integrated not only within each omics level but also across all omics levels. The robustness of gene set enrichment analysis can be studied by comparing the original results with either single- or multi-omics results. 2.1 Differential expression analysis and bootstrap method for Gene-set enrichment analysis Prior to enrichment analysis, differential expression analysis was performed on normalized expression data from different groups of interest (for example, disease vs. control) to obtain the ranks of genes and proteins. For the microarray and proteomics data, normalization was performed using the quantile method ([66]Bolstad et al., 2003; [67]Välikangas et al., 2018; [68]Zhao et al., 2020), and for the RNA-seq data, the internal normalization of the DESeq method of the R package “DESeq2” was used ([69]Love et al., 2014). Differential analysis of the proteomics data was performed using the functionality of the R package “limma” ([70]Ritchie et al., 2015), and differential analysis of the RNA-seq data was done using the R package “DESeq2” (spinal muscular atrophy data) and microarray data using “limma” ([71]Ritchie et al., 2015) (renal cell carcinoma datasets).Next, enrichment analysis based on the results of the differential expression analysis was initially performed using the complete dataset, that is, using all genes or proteins that were assigned to a particular gene set according to the database information. We denote this as the original analysis. Gene sets were defined based on pathway data from the KEGG, Reactome, and WikiPathways databases, as well as GO terms. KEGG, Reactome, and WikiPathways enrichment analyses were performed using the R package “clusterProfiler” ([72]Wu et al., 2021), and GO term enrichment analyses were performed using the R package “topGO” ([73]Alexa and Rahnenführer, 2009). The enrichment analyses in “clusterProfiler” implement the methods described by [74]Subramanian et al. (2005), which are independent of thresholds for differentially expressed features. In contrast, the “topGO” method uses a threshold but has the advantage of incorporating information about the hierarchy of GO terms. To study the variability and robustness of the outcome of the enrichment analysis, B bootstrap samples were drawn using only 95% of all the genes in each run. The genes were randomly drawn without replacement. GSEA was repeated for these randomly selected subsets of genes B times, where B is the number of times the whole set was resampled. The composition of the defined gene sets changed when bootstrapping from the gene sets was performed. Consequently, the composition of the gene sets was different for each bootstrap run. Thus, the effect of individual genes was also reflected in this approach. From this bootstrap procedure, B ranking lists of the gene sets were obtained. 2.2 Rank aggregation for single- and multi-omics analyses The resulting enrichment analysis from the B bootstrap runs with B lists of GO terms was aggregated using the R package “RobustRankAggreg” ([75]Kolde et al., 2012). The aggregation score for each pathway was obtained based on the number of occurrences and the ranks from each bootstrap run. The aggregation score was further transformed into a rank for each pathway. To study the robustness of the original findings, the rank obtained from the aggregated score can be compared with the actual analysis, that is, the analysis without a bootstrap step. For multi-omics data, original and bootstrap enrichment analyses were first performed for each omics domain, resulting in one list of aggregated ranks per domain. The aggregated scores from each omics domain were further aggregated. In one of our data examples, enrichment analysis was performed separately for the transcriptomics and proteomics data, and both ranking lists were aggregated into one final ranking list. Thus, the final multi-omics score for each pathway or GO term was obtained. 2.3 R package: bootGSEA The workflow shown in [76]Figure 1 has been compiled and implemented in the new R package “bootGSEA” available at the GitHub repository ([77]https://github.com/klausjung-hannover/bootGSEA). The input requires the results of differential expression analysis. The package currently has eight functions. The functions boot.GO and boot.pathway are used for GO and pathway enrichment analyses, respectively, of the complete data (original analysis) and of bootstrapped data samples, and aggr.boot.GO and aggr.boot.pathway are used for the rank aggregation of pathways obtained from the former functions. In the functions boot.GO and boot.pathway, the user can specify which percentage of features should be drawn during the bootstrap runs. Furthermore, to understand the robustness of pathways at a broader level, we used a multi-omics approach by aggregating ranks from individual omics levels using the aggr.multiomics function. In addition, three functions are provided to visualize these results and study the robustness of the findings. Examples of these visualizations are presented in [78]Section 3. The function compareRank was implemented to compare the original and bootstrapped results at a single-omics level, the function plotRank, for both single- and multi-omics levels, and the function histDiff to understand the rank difference between original and bootstrap analyses. 2.4 Example data 1: transcriptomics data from a renal cancer study The gene expression profiles of renal cell carcinoma (RCC) datasets ([79]GSE6344 ([80]Copland, 2008), [81]GSE14762 ([82]Dykema and Furge, 2009), [83]GSE11024 ([84]Kort, 2008), [85]GSE14994 ([86]Signoretti and Beroukhim, 2010), [87]GSE53757 ([88]John Copland et al., 2014), and [89]GSE15641 ([90]Jones et al., 2005)) were downloaded from the Gene Expression Omnibus (GEO) database using the GEOquery ([91]Davis and Meltzer, 2007) package in the R platform. Detailed information about the datasets, including platform and sample size, is given in [92]Table 1. Differential expression analysis was performed using the limma ([93]Ritchie et al., 2015) package in R. DEGs were screened based on FDR-adjusted p-values [MATH: < :MATH] 0.05 as the cut-off value for all datasets. The results from the differential analysis were further analyzed following the pipeline shown in [94]Figure 1 at the single-omics level. TABLE 1. Renal cell carcinoma (RCC) datasets with accession numbers from the GEO database, sample sizes in the normal and tumor groups, and references