Abstract

   The majority of genome-wide association studies (GWAS) loci are not
   annotated to known genes in the human genome, which renders biological
   interpretations difficult. Transcriptome-wide association studies
   (TWAS) associate complex traits with genotype-based prediction of gene
   expression deriving from expression quantitative loci(eQTL) studies,
   thus improving the interpretability of GWAS findings. However, these
   results can sometimes suffer from a high false positive rate, because
   predicted expression of different genes may be highly correlated due to
   linkage disequilibrium between eQTL. We propose a novel statistical
   method, Gene Score Regression (GSR), to detect causal gene sets for
   complex traits while accounting for gene-to-gene correlations. We
   consider non-causal genes that are highly correlated with the causal
   genes will also exhibit a high marginal association with the complex
   trait. Consequently, by regressing on the marginal associations of
   complex traits with the sum of the gene-to-gene correlations in each
   gene set, we can assess the amount of variance of the complex traits
   explained by the predicted expression of the genes in each gene set and
   identify plausible causal gene sets. GSR can operate either on GWAS
   summary statistics or observed gene expression. Therefore, it may be
   widely applied to annotate GWAS results and identify the underlying
   biological pathways. We demonstrate the high accuracy and computational
   efficiency of GSR compared to state-of-the-art methods through
   simulations and real data applications. GSR is openly available at
   [30]https://github.com/li-lab-mcgill/GSR.

Introduction

   Genome-wide association studies (GWAS) have been broadly successful in
   associating genetic variants with complex traits and estimating trait
   heritabilities in large populations [[31]1–[32]4]. Over the past
   decade, GWAS have quantified the effects of individual genetic variants
   on hundreds of polygenic phenotypes [[33]5, [34]6]. GWAS summary
   statistics have enabled various downstream analyses, including
   partitioning heritability [[35]7], inferring causal single nucleotide
   polymorphisms (SNPs) using epigenomic annotations [[36]8], and gene
   sets enrichment analysis for complex traits [[37]9]. However, it
   remains challenging to link these genetic associations with known
   biological mechanisms. One main reason is that the majority of the GWAS
   loci are not located in known genic regions of the human genome.

   Transcriptome-wide association studies (TWAS) [[38]10–[39]12] offer a
   systematic way to integrate GWAS and the reference genotype-gene
   expression datasets, such as the Genotype-Tissue Expression project
   (GTEx) [[40]13], via expression quantitative loci (eQTL). In TWAS, we
   could first quantify the impact of each genetic variant on expression
   variability in a population and obtain predicted gene expression levels
   based on new genotypes; Then, we could correlate the predicted gene
   expression with the phenotype of interest in order to identify pivotal
   genes [[41]10]. Moreover, when individual-level genotypes and gene
   expression levels are not available, we could still quantify
   gene-to-phenotype association (i.e. TWAS statistics) using only the
   marginal effect sizes of SNPs on the phenotype and on gene expression
   respectively [[42]11]. These concepts and implementations have largely
   facilitated explanation of genetic association findings at the gene or
   the pathway level.

   However, as depicted in [43]Fig 1, TWAS are often confounded by the
   gene-to-gene correlation of the genetically predicted gene expression
   due to the SNP-to-SNP correlation i.e., linkage disequilibrium (LD)
   [[44]12]. Consequently, relying on the TWAS statistics may lead to
   false positive discoveries of causal genes and pathways. One approach
   to address this problem is to fine-map causal genes by inferring the
   posterior probabilities of configurations of each gene being causal in
   a defined GWAS loci and then test gene set enrichment using the
   credible gene sets of prioritized genes [[45]14]. However, this
   approach is computationally expensive, restricted to GWAS loci, and
   sensitive to the arbitrary thresholds used for determining the credible
   gene set and the maximum number of causal genes per locus.

Fig 1. Overview of confounding effects on pathway analysis.

   [46]Fig 1
   [47]Open in a new tab

   (a) A hypothetical example illustrates the confounding issue when using
   the genetically predicted transcriptome to assess the pathway
   enrichments for a target phenotype. The causal gene set includes a
   causal gene 1, which is linked to non-causal gene 2 via their
   respective causal SNPs 1 and 3, which are in strong linkage
   disequilibrium. (b) A GWAS locus. SNP associations with the target
   phenotype are summarized. The causal SNP for the causal gene is in red.
   The SNPs that drive non-causal genes are in blue. The rest of the SNPs
   are in green. SNPs exhibit correlated signals due to the linkage
   disequilibrium (LD) as displayed by the upper triangle of the SNP-SNP
   Pearson correlation matrix; (c) A TWAS locus. The gene-to-gene
   correlation is partly induced by the SNP-to-SNP correlation and partly
   due to intrinsic co-regulatory expression program. (d) Pathway
   associations based on averaged gene associations.

   Another method called PASCAL [[48]9] projects SNP signals onto genes
   while correcting for LD, and then performs pathway enrichments as the
   aggregated transformed gene scores, which asymptotically follows a
   chi-square distribution. However, PASCAL does not leverage the eQTL
   information for each SNP thereby assuming that a priori all SNPs have
   the same effect on the gene. Stratified LD score regression (LDSC)
   offers a principle way to partition the SNP heritability into
   functional categories, defined based on tissue or cell-type specific
   epigenomic regions [[49]7] or eQTL regions of the genes exhibiting a
   strong tissue specificity [[50]15]. Although LDSC is able to obtain
   biologically meaningful tissue-specific enrichments, it operates at the
   SNP level, rendering it difficult to assess enrichment of gene sets.
   Moreover, neither PASCAL nor LDSC is able to integrate the observed
   gene expression data measured in a disease cohort (rather than the
   reference cohort) that are broadly available across diverse studies of
   diseases including cancers such as The Cancer Genome Atlas (TCGA)
   [[51]16].

   Although expression-based methods, such as gene set enrichment analysis
   (GSEA), are often adopted in combination with the observed gene
   expression and phenotypes [[52]17], they generally do not account for
   the gene-to-gene correlation. While this type of correlation is usually
   caused by shared transcriptional regulatory mechanisms across genes,
   GSEA still likely produces false positives in identifying causal
   pathways.

   In this study, we present a novel and powerful gene-based heritability
   partitioning method that jointly accounts for gene-to-gene correlation
   and integrates information captured at either the SNP-to-phenotype or
   the SNP-to-gene level. We utilize this method to identify plausible
   causal gene sets or pathways for complex traits. We showcase its high
   accuracy and computational efficiency in various simulated and real
   scenarios.

Methods

Partitioning gene-based variance of complex traits

   We assume gene expression has linearly additive effects on a continuous
   polygenic trait y:
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><msub><mi>y</mi><mi>i</mi></msub><mo>=</mo><m
   under><mo>∑</mo><mi>j</mi></munder><msub><mi>A</mi><mrow><mi>i</mi><mi>
   j</mi></mrow></msub><msub><mi>α</mi><mi>j</mi></msub><mo>+</mo><msub><m
   i>ϵ</mi><mi>i</mi></msub></mrow></mtd></mtr></mtable> :MATH]
   (1)

   where A[ij] denotes the expression of the j-th gene in the i-th
   individual for i ∈ {1, …, N} individuals and j ∈ {1, …, G} genes; α[j]
   denotes the true effect size of the j-th gene on the trait and
   [MATH:
   <mrow><msub><mi>α</mi><mi>j</mi></msub><mo>∼</mo><mi>N</mi><mrow><mo>(<
   /mo><mn>0</mn><mo>,</mo><msubsup><mi>σ</mi><mi>j</mi><mn>2</mn></msubsu
   p><mo>)</mo></mrow></mrow> :MATH]
   ; ϵ[i] denotes the residual for the i-th individual in this linear
   model and
   [MATH:
   <mrow><msub><mi>ϵ</mi><mi>i</mi></msub><mo>∼</mo><mi>N</mi><mrow><mo>(<
   /mo><mn>0</mn><mo>,</mo><msubsup><mi>σ</mi><mi>ϵ</mi><mn>2</mn></msubsu
   p><mo>)</mo></mrow></mrow> :MATH]
   .

   Here we further assume that both y and A are standardized such that
   [MATH:
   <mrow><mfrac><mn>1</mn><mi>N</mi></mfrac><mrow><msub><mo>∑</mo><mi>i</m
   i></msub><msub><mi>y</mi><mi>i</mi></msub></mrow><mo>=</mo><mn>0</mn></
   mrow> :MATH]
   ,
   [MATH:
   <mrow><mfrac><mn>1</mn><mi>N</mi></mfrac><mrow><msup><mi>y</mi><mi>⊤</m
   i></msup><mi>y</mi></mrow><mo>=</mo><mn>1</mn></mrow> :MATH]
   ,
   [MATH:
   <mrow><mfrac><mn>1</mn><mi>N</mi></mfrac><mrow><msub><mo>∑</mo><mi>i</m
   i></msub><msub><mi>A</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow
   ><mo>=</mo><mn>0</mn></mrow> :MATH]
   and
   [MATH:
   <mrow><mfrac><mn>1</mn><mi>N</mi></mfrac><msubsup><mi>A</mi><mi>j</mi><
   mi>⊤</mi></msubsup><msub><mi>A</mi><mi>j</mi></msub><mo>=</mo><mn>1</mn
   ></mrow> :MATH]
   , for j ∈ {1, …, G}.

   We define the estimated marginal effect size of the j-th gene on the
   trait as
   [MATH: <msub><mover
   accent="true"><mi>α</mi><mo>^</mo></mover><mi>j</mi></msub> :MATH]
   :
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><msub><mover
   accent="true"><mi>α</mi><mo>^</mo></mover><mi>j</mi></msub></mtd><mtd><
   mrow><mo>=</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><msubsup><mi>A</mi><m
   i>j</mi><mi>⊤</mi></msubsup><mi>y</mi></mrow></mtd></mtr></mtable>
   :MATH]
   (2)
   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><mfrac><mn>1</
   mn><mi>N</mi></mfrac><msubsup><mi>A</mi><mi>j</mi><mi>⊤</mi></msubsup><
   mrow><mo>(</mo><munder><mo>∑</mo><mi>k</mi></munder><msub><mi>A</mi><mi
   >k</mi></msub><msub><mi>α</mi><mi>k</mi></msub><mo>+</mo><mi>ϵ</mi><mo>
   )</mo></mrow></mrow></mtd></mtr></mtable> :MATH]
   (3)
   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><munder><mo>∑<
   /mo><mi>k</mi></munder><mfrac><mn>1</mn><mi>N</mi></mfrac><msubsup><mi>
   A</mi><mi>j</mi><mi>⊤</mi></msubsup><msub><mi>A</mi><mi>k</mi></msub><m
   sub><mi>α</mi><mi>k</mi></msub><mo>+</mo><mfrac><mn>1</mn><mi>N</mi></m
   frac><msubsup><mi>A</mi><mi>j</mi><mi>⊤</mi></msubsup><mi>ϵ</mi></mrow>
   </mtd></mtr></mtable> :MATH]
   (4)
   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><munder><mo>∑<
   /mo><mi>k</mi></munder><msub><mover
   accent="true"><mi>r</mi><mo>^</mo></mover><mrow><mi>j</mi><mi>k</mi></m
   row></msub><msub><mi>α</mi><mi>k</mi></msub><mo>+</mo><msup><mi>ϵ</mi><
   mo>′</mo></msup></mrow></mtd></mtr></mtable> :MATH]
   (5)

   where
   [MATH:
   <mrow><msup><mi>ϵ</mi><mo>′</mo></msup><mo>=</mo><mfrac><mn>1</mn><mi>N
   </mi></mfrac><msubsup><mi>A</mi><mi>j</mi><mi>⊤</mi></msubsup><mi>ϵ</mi
   ></mrow> :MATH]
   with
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mi>V</mi><mi>a</mi><mi>r</mi><mrow><mo>(</mo
   ><msup><mi>ϵ</mi><mo>′</mo></msup><mo>)</mo></mrow><mo>=</mo><mfrac><mn
   >1</mn><msup><mi>N</mi><mn>2</mn></msup></mfrac><msubsup><mi>A</mi><mi>
   j</mi><mi>⊤</mi></msubsup><mi>V</mi><mi>a</mi><mi>r</mi><mrow><mo>(</mo
   ><mi>ϵ</mi><mo>)</mo></mrow><msub><mi>A</mi><mi>j</mi></msub><mo>=</mo>
   <mfrac><mn>1</mn><mi>N</mi></mfrac><msubsup><mi>σ</mi><mi>ϵ</mi><mn>2</
   mn></msubsup></mrow></mtd></mtr></mtable> :MATH]

   and
   [MATH: <mrow><msub><mover
   accent="true"><mi>r</mi><mo>^</mo></mover><mrow><mi>j</mi><mi>k</mi></m
   row></msub><mo>=</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><msubsup><mi>A<
   /mi><mi>j</mi><mi>⊤</mi></msubsup><msub><mi>A</mi><mi>k</mi></msub></mr
   ow> :MATH]
   is the estimated Pearson correlation in gene expression between the
   j-th gene and the k-th gene.

   We define
   [MATH:
   <mrow><msubsup><mi>χ</mi><mi>j</mi><mn>2</mn></msubsup><mo>=</mo><mi>N<
   /mi><msubsup><mover
   accent="true"><mi>α</mi><mo>^</mo></mover><mi>j</mi><mn>2</mn></msubsup
   ></mrow> :MATH]
   . Then, if we further assume α, r and ϵ′ are independent, we have
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mi>E</mi><mo>[</mo><msubsup><mi>χ</mi><mi>j<
   /mi><mn>2</mn></msubsup><mo>]</mo></mrow></mtd><mtd><mrow><mo>=</mo><mi
   >E</mi><mo>[</mo><mi>N</mi><msubsup><mover
   accent="true"><mi>α</mi><mo>^</mo></mover><mi>j</mi><mn>2</mn></msubsup
   ><mo>]</mo></mrow></mtd></mtr></mtable> :MATH]
   (6)
   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><mi>N</mi><mi>
   E</mi><mo>[</mo><msup><mrow><mo>(</mo><munder><mo>∑</mo><mi>k</mi></mun
   der><msub><mover
   accent="true"><mi>r</mi><mo>^</mo></mover><mrow><mi>j</mi><mi>k</mi></m
   row></msub><msub><mi>α</mi><mi>k</mi></msub><mo>+</mo><msup><mi>ϵ</mi><
   mo>′</mo></msup><mo>)</mo></mrow><mn>2</mn></msup><mo>]</mo></mrow></mt
   d></mtr></mtable> :MATH]
   (7)
   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><mi>N</mi><mun
   der><mo>∑</mo><mi>k</mi></munder><mi>E</mi><mrow><mo>[</mo><msubsup><mo
   ver
   accent="true"><mi>r</mi><mo>^</mo></mover><mrow><mi>j</mi><mi>k</mi></m
   row><mn>2</mn></msubsup><mo>]</mo></mrow><mi>E</mi><mrow><mo>[</mo><msu
   bsup><mi>α</mi><mi>k</mi><mn>2</mn></msubsup><mo>]</mo></mrow><mo>+</mo
   ><msubsup><mi>σ</mi><mi>ϵ</mi><mn>2</mn></msubsup></mrow></mtd></mtr></
   mtable> :MATH]
   (8)

   Now, consider C gene sets C[c], where c ∈ {1, …, C} and denote the
   proportion of total trait variance explained by the c-th gene set as
   τ[c] with
   [MATH:
   <mrow><msub><mi>τ</mi><mi>c</mi></msub><mo>=</mo><mfrac><mrow><msub><mo
   >∑</mo><mrow><mi>j</mi><mo>∈</mo><msub><mi>C</mi><mi>c</mi></msub></mro
   w></msub><mi>V</mi><mi>a</mi><mi>r</mi><mrow><mo>(</mo><msub><mi>α</mi>
   <mi>j</mi></msub><mo>)</mo></mrow></mrow><mrow><mrow><mo>|</mo></mrow><
   msub><mi>C</mi><mi>c</mi></msub><mrow><mo>|</mo></mrow></mrow></mfrac><
   /mrow> :MATH]
   . Here, |C[c]| denotes the number of genes in the c-th gene set.

   Consequently,
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mi>E</mi><mrow><mo>[</mo><msubsup><mi>α</mi>
   <mi>k</mi><mn>2</mn></msubsup><mo>]</mo></mrow><mo>=</mo><mi>V</mi><mi>
   a</mi><mi>r</mi><mrow><mo>(</mo><msub><mi>α</mi><mi>k</mi></msub><mo>)<
   /mo></mrow><mo>=</mo><munder><mo>∑</mo><mrow><mi>c</mi><mo>:</mo><mi>k<
   /mi><mo>∈</mo><msub><mi>C</mi><mi>c</mi></msub></mrow></munder><msub><m
   i>τ</mi><mi>c</mi></msub></mrow></mtd></mtr></mtable> :MATH]

   By approximating
   [MATH: <mrow><mi>E</mi><mo>[</mo><msubsup><mover
   accent="true"><mi>r</mi><mo>^</mo></mover><mrow><mi>j</mi><mi>k</mi></m
   row><mn>2</mn></msubsup><mo>]</mo></mrow> :MATH]
   with
   [MATH: <mrow><msubsup><mover
   accent="true"><mi>r</mi><mo>^</mo></mover><mrow><mi>j</mi><mi>k</mi></m
   row><mn>2</mn></msubsup><mo>+</mo><mfrac><mn>1</mn><mi>N</mi></mfrac></
   mrow> :MATH]
   , we have that
   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mi>E</mi><mo>[</mo><msubsup><mi>χ</mi><mi>j<
   /mi><mn>2</mn></msubsup><mo>]</mo></mrow></mtd><mtd><mrow><mo>=</mo><mi
   >N</mi><munder><mo>∑</mo><mi>k</mi></munder><mi>E</mi><mrow><mo>[</mo><
   msubsup><mover
   accent="true"><mi>r</mi><mo>^</mo></mover><mrow><mi>j</mi><mi>k</mi></m
   row><mn>2</mn></msubsup><mo>]</mo></mrow><mi>E</mi><mrow><mo>[</mo><msu
   bsup><mi>α</mi><mi>k</mi><mn>2</mn></msubsup><mo>]</mo></mrow><mo>+</mo
   ><msubsup><mi>σ</mi><mi>ϵ</mi><mn>2</mn></msubsup></mrow></mtd></mtr></
   mtable> :MATH]
   (9)
   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><mi>N</mi><mun
   der><mo>∑</mo><mi>c</mi></munder><msub><mi>τ</mi><mi>c</mi></msub><mund
   er><mo>∑</mo><mrow><mi>k</mi><mo>∈</mo><msub><mi>C</mi><mi>c</mi></msub
   ></mrow></munder><msubsup><mover
   accent="true"><mi>r</mi><mo>^</mo></mover><mrow><mi>j</mi><mi>k</mi></m
   row><mn>2</mn></msubsup><mo>+</mo><munder><mo>∑</mo><mi>c</mi></munder>
   <msub><mi>τ</mi><mi>c</mi></msub><mo>+</mo><msubsup><mi>σ</mi><mi>ϵ</mi
   ><mn>2</mn></msubsup></mrow></mtd></mtr></mtable> :MATH]
   (10)
   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><mi>N</mi><mun
   der><mo>∑</mo><mi>c</mi></munder><msub><mi>τ</mi><mi>c</mi></msub><mi>l
   </mi><mrow><mo>(</mo><mi>j</mi><mo>,</mo><mi>c</mi><mo>)</mo></mrow><mo
   >+</mo><mn>1</mn></mrow></mtd></mtr></mtable> :MATH]
   (11)

   where we define gene score as
   [MATH:
   <mrow><mi>l</mi><mrow><mo>(</mo><mi>j</mi><mo>,</mo><mi>c</mi><mo>)</mo
   ></mrow><mo>=</mo><msub><mo>∑</mo><mrow><mi>k</mi><mo>∈</mo><msub><mi>C
   </mi><mi>c</mi></msub></mrow></msub><msubsup><mover
   accent="true"><mi>r</mi><mo>^</mo></mover><mrow><mi>j</mi><mi>k</mi></m
   row><mn>2</mn></msubsup></mrow> :MATH]
   and
   [MATH:
   <mrow><mi>V</mi><mi>a</mi><mi>r</mi><mrow><mo>(</mo><mi>y</mi><mo>)</mo
   ></mrow><mo>=</mo><msub><mo>∑</mo><mi>c</mi></msub><msub><mi>τ</mi><mi>
   c</mi></msub><mo>+</mo><msubsup><mi>σ</mi><mi>ϵ</mi><mn>2</mn></msubsup
   ><mo>=</mo><mn>1</mn></mrow> :MATH]
   since the continuous trait is normalized.

   Therefore, if we are able to obtain estimates for
   [MATH: <msubsup><mi>χ</mi><mi>j</mi><mn>2</mn></msubsup> :MATH]
   and C gene score l(j, c) for j ∈ {1, …, G} and c ∈ {1, …, C}, we will
   be able to perform linear regression of the estimated
   [MATH: <msubsup><mi>χ</mi><mi>j</mi><mn>2</mn></msubsup> :MATH]
   on l(j, c), and derive regression coefficient that is an estimate for
   each τ[c] (c ∈ {1, …, C}), respectively.

   These are available from GWAS summary statistics of SNP-to-trait effect
   sizes, eQTL summary statistics of SNP-to-gene expression effect sizes,
   and a reference LD panel. Specifically,
    1. Suppose we have estimated effect sizes (β[p×1]) of p SNPs based on
       a GWAS including N[gwas] samples, i.e.

   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mi>β</mi><mo>=</mo><mfrac><mn>1</mn><msub><m
   i>N</mi><mtext>gwas</mtext></msub></mfrac><msup><mi>X</mi><mi>⊤</mi></m
   sup><mi>y</mi></mrow></mtd></mtr></mtable> :MATH]
       where X[N[gwas]×p] is the standardized genotype. Meanwhile, we have
       the eQTL summary statistics W estimated using

   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><msub><mi>A</mi><mrow><mi>e</mi><mi>Q</mi><mi
   >T</mi><mi>L</mi></mrow></msub><mo>=</mo><msub><mi>X</mi><mrow><mi>e</m
   i><mi>Q</mi><mi>T</mi><mi>L</mi></mrow></msub><mi>W</mi></mrow></mtd></
   mtr></mtable> :MATH]
       Therefore, the predicted gene expression in GWAS is given by

   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><mrow><mi>A</mi><mo>=</mo><mi>X</mi><mi>W</mi></mro
   w></mtd></mtr></mtable> :MATH]
       Since

   [MATH: <mtable displaystyle="true"><mtr><mtd
   columnalign="right"><msubsup><mi>χ</mi><mi>j</mi><mn>2</mn></msubsup></
   mtd><mtd><mrow><mo>=</mo><mi>N</mi><msubsup><mover
   accent="true"><mi>α</mi><mo>^</mo></mover><mi>j</mi><mn>2</mn></msubsup
   ></mrow></mtd></mtr></mtable> :MATH]
       (12)

   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><mi>N</mi><msu
   p><mrow><mo>(</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><msubsup><mi>A</mi
   ><mi>j</mi><mi>⊤</mi></msubsup><mi>y</mi><mo>)</mo></mrow><mn>2</mn></m
   sup></mrow></mtd></mtr></mtable> :MATH]
       (13)

   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><mi>N</mi><msu
   p><mrow><mo>(</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><msubsup><mi>W</mi
   ><mi>j</mi><mi>⊤</mi></msubsup><msup><mi>X</mi><mi>⊤</mi></msup><mi>y</
   mi><mo>)</mo></mrow><mn>2</mn></msup></mrow></mtd></mtr></mtable>
   :MATH]
       (14)

   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><mi>N</mi><msu
   p><mrow><mo>(</mo><msubsup><mi>W</mi><mi>j</mi><mi>⊤</mi></msubsup><mi>
   β</mi><mo>)</mo></mrow><mn>2</mn></msup></mrow></mtd></mtr></mtable>
   :MATH]
       (15)
       the required
       [MATH: <msubsup><mi>χ</mi><mi>j</mi><mn>2</mn></msubsup> :MATH]
       can be estimated without accessing any individual-level data.
    2. Furthermore, a reference LD panel Σ[p×p] summarizing SNP-to-SNP
       correlation in the matched population with the GWAS study can
       provide estimates for r[jk] as

   [MATH:
   <mi>R</mi><mo>=</mo><mo>[</mo><msub><mi>r</mi><mrow><mi>j</mi><mi>k</mi
   ></mrow></msub><mo>]</mo> :MATH]
       (16)

   [MATH:
   <mrow><mo>=</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><msup><mi>A</mi><mi>
   ⊤</mi></msup><mi>A</mi></mrow> :MATH]
       (17)

   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><mfrac><mn>1</
   mn><mi>N</mi></mfrac><msup><mi>W</mi><mi>⊤</mi></msup><msup><mi>X</mi><
   mi>⊤</mi></msup><mi>X</mi><mi>W</mi></mrow></mtd></mtr></mtable> :MATH]
       (18)

   [MATH: <mtable
   displaystyle="true"><mtr><mtd></mtd><mtd><mrow><mo>=</mo><mfrac><mn>1</
   mn><mi>N</mi></mfrac><msup><mi>W</mi><mi>⊤</mi></msup><mo>Σ</mo><mi>W</
   mi></mrow></mtd></mtr></mtable> :MATH]
       (19)

   It is noteworthy that with individual-level gene expression data, we
   can also easily obtain the required
   [MATH: <msubsup><mi>χ</mi><mi>j</mi><mn>2</mn></msubsup> :MATH]
   and R = [r[jk]] by definition.

   In practice, many gene sets are not disjoint and share common genes
   with each other. Therefore, we regress one gene set at a time along
   with a “dummy” gene set that include the union of all of the other
   genes. The dummy gene set is used to account for unbalanced gene sets
   and to stabilize estimates of τ[c]. We also include an intercept in the
   regression model to alleviate non-gene-set biases, for example,
   positive correlation between gene scores and true gene effect sizes
   that could lead to intercept greater than 1 and negative correlation
   between gene scores and true gene effect sizes could lead to intercept
   smaller than 1.

Simulation design

   To assess the accuracy of our GSR approach, we simulated causal SNPs
   for gene expression as well as causal gene sets for a continuous trait
   based on real genotypes and known gene sets from existing databases.
   Our simulation included two stages: At stage 1, we first simulated gene
   expression based on reference genotype panel. We then estimated
   SNP-gene effects
   [MATH: <msub><mover
   accent="true"><mi>W</mi><mo>^</mo></mover><mi>g</mi></msub> :MATH]
   for each gene g based on the simulated gene expression and genotype,
   which were then used to predict gene expression; At stage 2,
   separately, we simulated the a continuous trait using simulated gene
   expression based on genotype, and estimated the marginal SNP-phenotype
   effects.

   Simulation step 1: simulating gene expression:
    1. To simulate individual genotype, we first partitioned genotype data
       for 489 individuals of European ancestry obtained from the 1000
       Genomes Project [[53]18] into independent 1703 LD blocks as defined
       by LDetect [[54]19];
    2. We then randomly sampled 100 LD blocks and used only those 100 LD
       blocks for the subsequent simulation; We used 100 LD blocks as
       opposed to whole genome to reduce computational burden required for
       multiple simulation runs;
    3. For LD block j (j ∈ {1, …, 100}) of an individual i (i ∈ {1, …,
       500}), we randomly sampled from the 489 available samples for block
       j, and concatenated these sampled LD blocks 1, …, 100 for this
       individual. We repeated this procedure to simulate genotype X[ref]
       for N[ref] = 500 individuals as a reference population;
    4. We standardized the simulated genotype X[ref];
    5. We randomly sampled k in-cis causal SNPs per gene within ± 500 kb
       around the gene, where k = 1 (default). We also experimented
       different number of causal SNPs k ∈ {2, 3, all in- cis SNPs};
    6. We sampled SNP-gene weights
       [MATH: <mrow><msub><mi
       mathvariant="bold">W</mi><mi>g</mi></msub><mo>∼</mo><mi
       mathvariant="script">N</mi><mrow><mo>(</mo><mn>0</mn><mo>,</mo><msu
       bsup><mi>h</mi><mi>g</mi><mn>2</mn></msubsup><mo>/</mo><mi>k</mi><m
       o>)</mo></mrow></mrow> :MATH]
       where gene expression heritability
       [MATH:
       <mrow><msubsup><mi>h</mi><mi>g</mi><mn>2</mn></msubsup><mo>=</mo><m
       n>0</mn><mo>.</mo><mn>1</mn></mrow> :MATH]
       (default), which is the variance of gene expression explained by
       genotype. We also experimented different gene heritability
       [MATH:
       <mrow><msubsup><mi>h</mi><mi>g</mi><mn>2</mn></msubsup><mo>=</mo><m
       row><mo>{</mo><mn>0</mn><mo>.</mo><mn>2</mn><mo>,</mo><mn>0</mn><mo
       >.</mo><mn>3</mn><mo>,</mo><mn>0</mn><mo>.</mo><mn>4</mn><mo>,</mo>
       <mn>0</mn><mo>.</mo><mn>5</mn><mo>}</mo></mrow></mrow> :MATH]
       ;
    7. We then simulated gene expression A[g,ref] = X[ref] W[g] + ϵ, where
       [MATH: <mrow><mi>ϵ</mi><mo>∼</mo><mi
       mathvariant="script">N</mi><mo>(</mo><mn>0</mn><mo>,</mo><msubsup><
       mi>σ</mi><mi>ϵ</mi><mn>2</mn></msubsup><mo>)</mo></mrow> :MATH]
       and
       [MATH:
       <mrow><msubsup><mi>σ</mi><mi>ϵ</mi><mn>2</mn></msubsup><mo>=</mo><m
       frac><mn>1</mn><msub><mi>N</mi><mtext>ref</mtext></msub></mfrac><mr
       ow><mo>∥</mo></mrow><msub><mi
       mathvariant="bold">X</mi><mtext>ref</mtext></msub><msub><mi
       mathvariant="bold">W</mi><mi>g</mi></msub><msup><mrow><mo>∥</mo></m
       row><mn>2</mn></msup><mrow><mo>(</mo><mfrac><mn>1</mn><msubsup><mi>
       h</mi><mi>g</mi><mn>2</mn></msubsup></mfrac><mo>-</mo><mn>1</mn><mo
       >)</mo></mrow><msub><mi
       mathvariant="bold">I</mi><msub><mi>N</mi><mtext>ref</mtext></msub><
       /msub></mrow> :MATH]
       to match the desired heritability:
       [MATH:
       <mrow><mfrac><mrow><mn>1</mn><mo>-</mo><msubsup><mi>h</mi><mi>g</mi
       ><mn>2</mn></msubsup></mrow><msubsup><mi>h</mi><mi>g</mi><mn>2</mn>
       </msubsup></mfrac><mo>=</mo><mfrac><msubsup><mi>σ</mi><mi>ϵ</mi><mn
       >2</mn></msubsup><mrow><mrow><mo>∥</mo></mrow><msub><mi
       mathvariant="bold">X</mi><mtext>ref</mtext></msub><msub><mi
       mathvariant="bold">W</mi><mi>g</mi></msub><msup><mrow><mo>∥</mo></m
       row><mn>2</mn></msup><mo>/</mo><msub><mi>N</mi><mtext>ref</mtext></
       msub></mrow></mfrac></mrow> :MATH]
    8. Finally, we applied LASSO regression
       [MATH: <mrow><msub><mi
       mathvariant="bold">A</mi><mrow><mi>g</mi><mo>,</mo><mtext>ref</mtex
       t></mrow></msub><mo>∼</mo><mover accent="true"><mi
       mathvariant="bold">X</mi><mo>¯</mo></mover><msub><mi
       mathvariant="bold">W</mi><mi>g</mi></msub></mrow> :MATH]
       to get
       [MATH: <msub><mover accent="true"><mi
       mathvariant="bold">W</mi><mo>^</mo></mover><mi>g</mi></msub> :MATH]
       for each gene.

   Simulation step 2: simulating phenotype:
    1. We simulated another N[gwas] = 50,000 GWAS individuals by the 100
       predefined LD blocks among the 489 Europeans in 1000 Genome data,
       following the same procedures as decribed above;
    2. We then standardized the simulated genotype X[gwas];
    3. We then sampled a causal pathway
       [MATH: <msub><mi mathvariant="script">C</mi><mi>c</mi></msub>
       :MATH]
       from MSigDB such that all of the
       [MATH:
       <mrow><msub><mi>G</mi><mi>c</mi></msub><mo>≡</mo><mrow><mo>|</mo><m
       sub><mi
       mathvariant="script">C</mi><mi>c</mi></msub><mo>|</mo></mrow></mrow
       > :MATH]
       genes in
       [MATH: <msub><mi mathvariant="script">C</mi><mi>c</mi></msub>
       :MATH]
       were causal genes for the phenotype;
    4. For each non-causal pathway, we removed genes that were also
       present in the causal pathway. We removed non-causal pathways
       containing fewer than five genes afterwards (default);
       Alternatively, in more realistic scenarios, we allowed for sharing
       genes with causal pathways by non-causal pathways;
    5. We sampled gene-phenotype effect
       [MATH: <mrow><mi>α</mi><mo>∼</mo><mi
       mathvariant="script">N</mi><mo>(</mo><mn>0</mn><mo>,</mo><msubsup><
       mi>σ</mi><mi>α</mi><mn>2</mn></msubsup><mo>/</mo><msub><mi>G</mi><m
       i>c</mi></msub><msub><mi
       mathvariant="bold">I</mi><msub><mi>G</mi><mi>c</mi></msub></msub><m
       o>)</mo></mrow> :MATH]
       , where the phenotypic variance explained by gene expression
       [MATH:
       <mrow><msubsup><mi>σ</mi><mi>α</mi><mn>2</mn></msubsup><mo>=</mo><m
       n>0</mn><mo>.</mo><mn>1</mn></mrow> :MATH]
       (default). We also experimented different
       [MATH:
       <mrow><msubsup><mi>σ</mi><mi>α</mi><mn>2</mn></msubsup><mo>∈</mo><m
       row><mo>{</mo><mn>0</mn><mo>.</mo><mn>1</mn><mo>,</mo><mn>0</mn><mo
       >.</mo><mn>2</mn><mo>,</mo><mn>0</mn><mo>.</mo><mn>3</mn><mo>,</mo>
       <mn>0</mn><mo>.</mo><mn>4</mn><mo>,</mo><mn>0</mn><mo>.</mo><mn>5</
       mn><mo>}</mo></mrow></mrow> :MATH]
       ;
    6. We simulated gene expression A[c] as in step 1 for the N[gwas]
       individuals, and standardized it to obtain
       [MATH: <msub><mover accent="true"><mi
       mathvariant="bold">A</mi><mo>¯</mo></mover><mi>c</mi></msub> :MATH]
    7. We simulated a continuous trait using causal gene expression:
       [MATH: <mrow><mi mathvariant="bold">y</mi><mo>=</mo><msub><mover
       accent="true"><mi
       mathvariant="bold">A</mi><mo>¯</mo></mover><mi>c</mi></msub><mi
       mathvariant="bold-italic">α</mi><mo>+</mo><msub><mi>ϵ</mi><mi>y</mi
       ></msub></mrow> :MATH]
       where
       [MATH: <mrow><msub><mi>ϵ</mi><mi>y</mi></msub><mo>∼</mo><mi
       mathvariant="script">N</mi><mrow><mo>(</mo><mn>0</mn><mo>,</mo><msu
       bsup><mi>σ</mi><msub><mi>ϵ</mi><mi>y</mi></msub><mn>2</mn></msubsup
       ><mo>)</mo></mrow></mrow> :MATH]
       . Here,
       [MATH:
       <mrow><msubsup><mi>σ</mi><msub><mi>ϵ</mi><mi>y</mi></msub><mn>2</mn
       ></msubsup><mo>=</mo><mfrac><mn>1</mn><msub><mi>N</mi><mtext>gwas</
       mtext></msub></mfrac><mrow><mo>∥</mo></mrow><msub><mover
       accent="true"><mi
       mathvariant="bold">A</mi><mo>¯</mo></mover><mi>c</mi></msub><mi
       mathvariant="bold-italic">α</mi><msup><mrow><mo>∥</mo></mrow><mn>2<
       /mn></msup><mrow><mo>(</mo><mfrac><mn>1</mn><msubsup><mi>σ</mi><mi>
       α</mi><mn>2</mn></msubsup></mfrac><mo>-</mo><mn>1</mn><mo>)</mo></m
       row><msub><mi
       mathvariant="bold">I</mi><msub><mi>N</mi><mrow><mi>g</mi><mi>w</mi>
       <mi>a</mi><mi>s</mi></mrow></msub></msub></mrow> :MATH]
       to match the predefined proportion of variance explained:
       [MATH:
       <mrow><mfrac><mrow><mn>1</mn><mo>-</mo><msubsup><mi>σ</mi><mi>α</mi
       ><mn>2</mn></msubsup></mrow><msubsup><mi>σ</mi><mi>α</mi><mn>2</mn>
       </msubsup></mfrac><mo>=</mo><mfrac><msubsup><mi>σ</mi><msub><mi>ϵ</
       mi><mi>y</mi></msub><mn>2</mn></msubsup><mrow><mrow><mo>∥</mo></mro
       w><msub><mover accent="true"><mi
       mathvariant="bold">A</mi><mo>¯</mo></mover><mi>c</mi></msub><mi
       mathvariant="bold-italic">α</mi><msup><mrow><mo>∥</mo></mrow><mn>2<
       /mn></msup><mo>/</mo><msub><mi>N</mi><mtext>gwas</mtext></msub></mr
       ow></mfrac></mrow> :MATH]
    8. Lastly, we computed GWAS summary SNP-to-trait effect size:
       [MATH:
       <mrow><mi>β</mi><mo>=</mo><mfrac><mn>1</mn><mi>N</mi></mfrac><msubs
       up><mi
       mathvariant="bold">X</mi><mtext>gwas</mtext><mi>⊤</mi></msubsup><mi
       mathvariant="bold">y</mi></mrow> :MATH]

   We repeated these simulation procedures 100 times. Unless otherwise
   stated, while we were experimenting various settings, we kept the other
   settings at their default values: k = 1 causal SNP per gene; gene
   expression variance explained per causal SNP
   [MATH:
   <mrow><msubsup><mi>h</mi><mi>g</mi><mn>2</mn></msubsup><mo>=</mo><mn>0<
   /mn><mo>.</mo><mn>1</mn><mo>/</mo><mi>k</mi></mrow> :MATH]
   ; phenotypic variance explained per gene
   [MATH:
   <mrow><msubsup><mi>σ</mi><mi>α</mi><mn>2</mn></msubsup><mo>=</mo><mn>0<
   /mn><mo>.</mo><mn>1</mn></mrow> :MATH]
   ; one causal pathway. Using these obtained summary statistics, we were
   able to perform GSR, PASCAL, LDSC and FOCUS in each simulated scenario.

Applying existing methods

   PASCAL: PASCAL was downloaded from
   [55]https://www2.unil.ch/cbg/index.php?title=Pascal [[56]9]. We
   executed the software using default settings. LDSC: Stratified LD score
   regression software was downloaded from
   [57]https://github.com/bulik/ldsc [[58]15]. Because LDSC operates on
   SNP level, we considered SNPs located within ± 500 kb around genes in
   each pathway to be involved in the corresponding pathway. Then, for
   each pathway, we computed the LD scores over all chromosomes. We
   experimented the options of running LDSC with and without the 53
   baseline annotations using our simulated data. We found that LDSC
   running without the 53 baseline worked better in our case. One possible
   reason is that the baseline annotations cover genome-wide SNPs whereas
   there are much fewer SNPs in the simulated pathways. FOCUS: We
   downloaded FOCUS [[59]14] from [60]https://github.com/bogdanlab/focus.
   We used FOCUS to infer the posterior probability of each gene being
   causal for the phenotype across all of the LD blocks. We then took the
   90% credible gene set as follows. We first summed all of the posteriors
   over all of the genes. We then sorted the genes by the decreasing order
   of their FOCUS-posteriors. We kept adding the top ranked the gene into
   the 90% credible gene until the sum of their posteriors was greater
   than or equal to the 90% of the total sum of posteriors. We used the
   90% credible gene set for hypergeometric test for each pathway to
   compute the p-values. We also tried other thresholds for credible sets
   ranging from 75% (including the fewest genes) to 99% (including the
   most genes). GSEA: GSEA software was obtained from
   [61]http://software.broadinstitute.org/gsea [[62]17]. We used the
   command-line version of GSEA to test for gene set enrichments using the
   observed gene expression and phenotype data.

Real data application

   We applied our approach to investigate pathway enrichment for 27
   complex traits ([63]Fig 2b) using publicly available summary statistics
   and genotype-expression weights based on 1,264 GTEx whole blood
   samples. The GWAS summary statistics were downloaded from public
   database
   [64]https://data.broadinstitute.org/alkesgroup/sumstats_formatted/
   [[65]7]. We downloaded expression weights and reference LD structure
   estimated in 1000 Genomes using 489 European individuals, from the
   TWAS/FUSION website ([66]http://gusevlab.org/projects/fusion/) [[67]11,
   [68]18]. Franke lab cell-type-specific gene expression dataset were
   obtained from
   [69]https://data.broadinstitute.org/mpg/depict/depict_download/tissue_e
   xpression.

Fig 2. Gene scores correlated with marginal TWAS summary statistics.

   [70]Fig 2
   [71]Open in a new tab

   (a) Gene scores were correlated with marginal TWAS chi-square
   statistics of Schizophrenia. We grouped genes into 100 bins by their
   gene scores to reduce noise. For each bin, we calculated the average
   gene scores and chi-square statistics (χ^2). An unbinned version is
   supplied in S1 Fig in [72]S1 File. (b) Summary of Pearson correlation
   between marginal TWAS summary statistics and gene scores for 27 traits.
   The negative correlation for T2D indicates that this trait, possibly
   due to complicated genetic architecture and confounding
   gene-to-environment interaction and drug effects, is not suitable for
   using our approach.

   In addition, we applied GSR to test for gene set enrichment in three
   well-powered types of cancer: breast invasive carcinoma (BRCA, 982
   cases and 199 controls), thyroid carcinoma (THCA, 441 cases and 371
   controls) and prostate adenocarcinoma (PRAD, 426 cases and 154
   controls), using gene expression datasets from The Cancer Genome Atlas
   (TCGA). Uniformly processed (normalized and batch-effect corrected)
   gene expression datasets from TCGA and GTEx were obtained from
   [73]https://figshare.com/articles/Data_record_3/5330593 [[74]20]. Gene
   expression and phenotype were standardized before supplying to the GSR
   software. Standard GSEA was also performed for comparison.

   Gene sets were downloaded from the MSigDb website
   [75]http://software.broadinstitute.org/gsea/msigdb/index.jsp. Here we
   combined BIOCARTA, KEGG and REACTOME to create a total of 1,050 gene
   sets. We also downloaded the 4,436 GO biological process terms as
   additional gene sets as well as the 189 gene sets pertaining to
   oncogenic signatures for the TCGA data analysis.

Results

Gene scores were correlated with TWAS statistics in polygenic complex traits

   Our method GSR is built on the hypothesis that the marginal gene effect
   sizes on the phenotype should be positively correlated with the sum of
   correlation with other genes, which include causal genes. To validate
   this hypothesis, we defined gene score for each gene as the sum of its
   squared Pearson correlation with all of the other genes, derived from
   gene expression levels. We calculated TWAS marginal statistics as the
   product of GWAS summary statistics (β) and eQTL weights (W) derived
   from the GTEx whole blood samples ([76]Eq 15). To assess the impact of
   gene-to-gene correlation on TWAS statistic, we correlated the gene
   scores with the TWAS marginal statistics for 27 complex traits.
   Overall, most traits had Pearson correlation between the gene score and
   the marginal TWAS statistic above 0.4. For instance, the correlation in
   schizophrenia was 0.76 (Inter-Quartile Range: 0.66—0.81 based on 1,000
   permutations; [77]Fig 2). This implies a pervasive confounding impact
   on the downstream analysis, including gene set or pathway enrichment
   analysis, causal gene identification, etc., using the TWAS summary
   statistics while assuming independence of genes ([78]Fig 1).

GSR improved pathway enrichment power

   In simulated scenarios with default settings ([79]Methods), compared to
   PASCAL and LDSC, GSR demonstrated hugely improved computational
   efficiency ([80]Table 1), superior sensitivity in detecting causal
   pathways with an improved statistical power as well as competitive
   specificity in controlling for false positives ([81]Fig 3).
   Specifically, in 100 simulations, GSR achieved an overall area under
   the precision-recall curve (AUPRC) of 0.925, and identified the true
   causal pathway as the most significant one 93 times, compared to 56
   times by PASCAL, which only achieved an overall AUPRC of 0.260.
   Notably, the FOCUS-predicted 75%, 90%, 99% credible gene sets were also
   significantly enriched for causal pathways ([82]Fig 3).

Table 1. Comparison of existing methods with GSR.

Method         GWAS              TWAS       Measured expression   Running time
PASCAL [[83]9] sum. stat. [84]^*                                           10 m
LDSC [[85]15]          sum. stat.                                  >24 h [86]^†
FOCUS [[87]14]         sum. stat. sum. stat.                               >24 h
GSEA [[88]17]                               individual expression          10 m
GSR                    sum. stat.            individual expression         3 min
   [89]Open in a new tab

   * Summary statistics

   ^† For custom gene sets, the main computation time for LDSC is
   calculating the LD score for all of the 1000 Genome SNPs.

Fig 3. Evaluation of power and robustness of GSR in detecting causal
pathways.

   [90]Fig 3
   [91]Open in a new tab

   (a) Precision-recall curves for GSR and PASCAL summarizing results from
   100 simulations. (b) Summary of p-values obtained by running GSR along
   with PASCAL, LDSC and FOCUS 10 times. For each method, the enrichment
   significance for causal pathways and non-causal pathways are displayed.
   We experimented FOCUS with 75%, 90%, and 99% credible sets for the
   pathway enrichments. For the ease of comparison, we plotted the y-axis
   on a square-root negative logarithmic scale. Red line denotes p-value
   threshold of 0.001; Blue line denotes p-value threshold of 0.1.

   We then varied four different settings: (a) the number of causal SNP
   per gene; (2) SNP-gene heritabilities; (3) gene-phenotype variance
   explained; (4) overlapping causal pathway. We focused our comparison
   with PASCAL because it directly tested for pathway enrichment and has
   been demonstrated to outperform other relevant enrichment methods
   [[92]9]. In all simulation settings, GSR demonstrated an improved power
   in detecting the causal pathways (S2 Fig in [93]S1 File), as it was
   able to detect causal pathways when multiple SNPs influenced gene
   expression, when the proportion of variance explained by the gene
   expression was low, or when the causal and non-causal pathways were
   allowed to overlap. In contrast, a lot of causal pathways were not
   deemed significant by PASCAL based on a p-value threshold of 0.001,
   which was equivalent to a Bonferroni-corrected p-value threshold of 0.1
   after correcting for multiple testing on approximately 100 pathways
   tested per simulation.

Improved power in pathway enrichment leveraging observed gene expression

   One unique feature of GSR is the ability to run not on only the summary
   statistics but also on observed gene expression, where the gene-gene
   expression correlation is directly estimated from the in-sample gene
   expression. To evaluate the accuracy of this application, we simulated
   gene expression and phenotype for 1,000 individuals, which were
   provided as input to GSR for pathway enrichment analysis. As a
   comparison, we applied GSR to the summary statistics generated from the
   same dataset.

   As in the simulation above, the SNP-expression weights were estimated
   from a separate set of 500 reference individuals whereas the
   SNP-phenotype associations were estimated from only 1,000 individuals.
   Notably, the sample size for the GWAS cohort is much smaller than the
   previous application to mimic the real data where usually fewer than
   1000 individuals have both the RNA-seq and phenotype available (e.g.,
   TCGA). Additionally, we applied standard GSEA [[94]17] to the same
   dataset with the observed gene expression. We observed an improved
   power of GSR when using the observed gene expression over GSR using the
   summary statistics ([95]Fig 4), whereas GSEA had a comparable
   performance as the latter. Specifically, all causal pathways in the
   simulated replicates had a p-value below 0.001, with the largest
   p-value being 7.5 × 10^−6, as determined by GSR using observed gene
   expression, while no causal pathway reached this level of significance
   (with the smallest p-value being 1.4 × 10^−2) determined by GSEA. We
   also compared the performances of GSR using observed gene expression to
   GSEA in various simulation settings and obtained consistent conclusions
   (S3 Fig in [96]S1 File).

Fig 4. Comparison of pathway enrichment determined by GSR using or not using
observed gene expression information, and by GSEA.

   [97]Fig 4
   [98]Open in a new tab

   Nominal (NOM) p-values yielded by GSEA were summarized. Red line
   denotes p-value threshold of 0.001; Blue line denotes p-value threshold
   of 0.1.

Gene set enrichments in complex traits

   Applying GSR to 27 complex traits, we revealed various pathways where
   the enriched gene sets were biologically meaningful. For example, the
   enriched gene sets for high density lipoprotein (HDL) predominantly
   involve lipid metabolism; In contrast, for Lupus, gene sets were
   enriched in interferon signalling pathways, a known immunological
   hallmark. We listed the top 10 enrichments over gene sets from MSigDB
   and Gene Ontology terms for HDL and the autoimmune trait Lupus in S1
   Table in [99]S1 File.

   Additionally, we applied GSR to test cell-type-specific enrichments
   using 205 cell types, 48 of which were derived from GTEx and 157 cell
   types were derived from Franke lab datasets [[100]15]. We observed
   biologically meaningful cell type-specific enrichment for the 27
   complex traits ([101]Fig 5). In particular, central neural system
   cell-specific gene sets were enriched for schizophrenia, immune
   cell-specific gene sets for lupus, immune cell-specific and digestive
   tract cell-specific gene sets for Crohn’s disease and cardiac
   cell-specific gene sets for coronary artery disease. Lastly, we
   correlated traits based on their gene set enrichments and observed
   meaningful phenotypic clusters, suggesting shared biological mechanisms
   by the related phenotypes (S4 Fig in [102]S1 File). For example,
   Crohn’s disease and ulcerative colitis, two subtypes of inflammatory
   bowel disease formed a cluster; Neurological diseases, schizophrenia
   and bipolar disorder formed a cluster; Moreover, lipid traits including
   LDL, HDL, and Triglycerides formed their own cluster.

Fig 5. Cell-type-specific enrichment of gene sets for representative complex
traits.

   [103]Fig 5
   [104]Open in a new tab

   GSR was applied to each complex trait in order to identify
   significantly enriched gene sets among 205 pre-defined
   cell-type-specific gene sets, represented by nine different colors.
   Gene sets were indicated by dots and were aligned in the same order on
   the x-axis. Red lines indecate Bonferroni-corrected p-value threshold
   (0.05).

Application on observed gene expression

   Lastly, using expression profiles of BRCA, THCA and PRAD from TCGA and
   GTEx [[105]20], we tested the enrichments of 186 oncogenic gene sets as
   well as 1,050 gene sets from BIOCARTA, KEGG, and REACTOME in each type
   of tumor. Overall, we observed a significantly stronger enrichments for
   the oncogenic signatures with higher p values compared to the more
   general gene sets across all three tumour types (t-test p-value = 6.4 ×
   10^−25, 9.0 × 10^−29 and 1.1 × 10^−23 for BRCA, PRAD and THCA
   respectively; S5 Fig in [106]S1 File). As a comparison, we also ran
   standard GSEA and observed qualitatively similar enrichments (S5 Fig in
   [107]S1 File).

Discussion

   In this work, we describe GSR, an efficient method to test for gene set
   or pathway enrichments using either GWAS summary statistics or observed
   gene expression and phenotype information. We demonstrate robust and
   powerful detection of causal pathways in extensive simulation using our
   proposed method compared to several state-of-the-art methods. When
   applying to the real data, we also obtained biologically meaningful
   enrichments of relevant gene sets and pathways. These features warrant
   GSR a widely applicable method in various study settings with an aim to
   interpret association test results and capture the underlying
   biological mechanisms.

   Our approach has superior computational efficiency. In particular, GSR
   took only 3-5 minutes running on the full summary statistics and less
   than 5 minutes on the full gene expression data with one million SNPs
   and 20,000 genes to test for enrichments of over 4,000 gene sets. In
   our simulations, it is not surprising that FOCUS can accurately
   fine-map causal genes as the simulation designs followed similar
   assumptions adopted by FOCUS [[108]14]. However, FOCUS is at least 20
   times slower than GSR. For the simulated data, FOCUS took 30 minutes to
   fine-map all of the genes in GWAS loci whereas GSR took under three
   minutes to test for pathway enrichments on the same machine.
   Additionally, the computational cost of FOCUS is exponential to the
   number of causal genes considered within each locus whereas GSR is not
   affected by the number of causal genes. Also, because GSR operates at
   genome-wide level, no threshold is needed to decide which genes to be
   included whereas FOCUS needs user-defined threshold for constructing
   the credible gene set for the subsequent hypergeometric enrichment
   test. Given these advantages, we envision that GSR will be a valuable
   tool for the bioinformatic community and statistical genetic community
   as a fast way to investigate the functional implications of complex
   polygenic traits.

   In different simulation settings, GSR exhibits improved pathway
   enrichment power over PASCAL and LDSC, two popular methods for
   partitioning heritability and identifying causal gene sets. Since GSR
   leverages SNP-to-gene association summaried by eQTL weights while
   either PASCAL or LDSC operates on the SNP level, without considering
   this intermediate association, such improvements are expected and
   beneficial. Given that existing eQTL studies have yielded reliable
   estimates of SNP-to-gene effects and are easily accessible, we consider
   GSR more promising in bridging the gap between large GWAS and
   multi-faceted functional annotations on the genome.

   One unique feature of our approach is that it could leverage the
   observed individual-level gene expression that are broadly available to
   calculate more accurate in-sample gene-gene correlation. Indeed, we
   observed more accurate detection of causal pathway for modest sample
   size (1000 individuals) where the phenotype and gene expression are
   available compared to GSR operating only on summary statistics. In real
   data analysis, we demonstrate that GSR can achieve similar biologically
   meaningful enrichments as GSEA when applied to the observed gene
   expression. On the other hand, GSR has the advantage of working with
   summary statistics when the individual gene expression and phenotype
   are not available where GSEA could hardly be performed.

   It is noteworthy that p values generated by different methods in this
   study are not directly comparable due to different model assumptions,
   statistical tests being used, sampling methods, etc. However, we posit
   that the p-values themselves are informative in reality. When gene set
   enrichment analysis is performed in related studies, p-values are
   usually directly adopted to identify specific signals as a common
   practice. Therefore, GSR may be promising to refine interpretation and
   reveal under-identified biological mechanisms in existing studies, as
   it is able to yield smaller p-values for the true underlying pathways.

   Our method has important limitations. First of all, our method relies
   on pre-computed eQTL weights, which might absorb measurement
   uncertainty, confounding effects as well as stochastic errors. Besides,
   it is usually unknown how these weights vary across different
   populations, i.e. whether the effect of each SNP on the corresponding
   gene expression is conserved, particularly when investigation is
   carried out on a diseased population while using a non-diseased
   reference population. Furthermore, our method is built on an important
   assumption that the effect sizes of genes on the trait and the derived
   gene scores are independent. In practice, if this assumption is
   violated, our method might suffer from the bias introduced. While no
   method exists to examine the validity of these properties to our
   knowledge, since we obtained consistent results in our real data
   analyses, we posit that our method should be robust in identifying
   causal pathways. We propose our method could be widely utilized various
   studies where further calibration of the exact estimates of effect
   sizes should continuously improve its performance.

Supporting information

   S1 File

   (PDF)
   [109]Click here for additional data file.^ (483.7KB, pdf)

Acknowledgments