Abstract

Background

   As high-throughput genomic technologies become accurate and affordable,
   an increasing number of data sets have been accumulated in the public
   domain and genomic information integration and meta-analysis have
   become routine in biomedical research. In this paper, we focus on
   microarray meta-analysis, where multiple microarray studies with
   relevant biological hypotheses are combined in order to improve
   candidate marker detection. Many methods have been developed and
   applied in the literature, but their performance and properties have
   only been minimally investigated. There is currently no clear
   conclusion or guideline as to the proper choice of a meta-analysis
   method given an application; the decision essentially requires both
   statistical and biological considerations.

Results

   We performed 12 microarray meta-analysis methods for combining multiple
   simulated expression profiles, and such methods can be categorized for
   different hypothesis setting purposes: (1) HS[ A ]: DE genes with
   non-zero effect sizes in all studies, (2) HS[ B ]: DE genes with
   non-zero effect sizes in one or more studies and (3) HS[ r ]: DE gene
   with non-zero effect in "majority" of studies. We then performed a
   comprehensive comparative analysis through six large-scale real
   applications using four quantitative statistical evaluation criteria:
   detection capability, biological association, stability and robustness.
   We elucidated hypothesis settings behind the methods and further apply
   multi-dimensional scaling (MDS) and an entropy measure to characterize
   the meta-analysis methods and data structure, respectively.

Conclusions

   The aggregated results from the simulation study categorized the 12
   methods into three hypothesis settings (HS[ A ], HS[ B ], and HS[ r ]).
   Evaluation in real data and results from MDS and entropy analyses
   provided an insightful and practical guideline to the choice of the
   most suitable method in a given application. All source files for
   simulation and real data are available on the author’s publication
   website.

Background

   Microarray technology has been widely used to identify differential
   expressed (DE) genes in biomedical research in the past decade. Many
   transcriptomic microarray studies have been generated and made
   available in public domains such as the Gene Expression Omnibus (GEO)
   from NCBI ( [30]http://www.ncbi.nlm.nih.gov/geo/) and ArrayExpress from
   EBI ( [31]http://www.ebi.ac.uk/arrayexpress/). From the databases, one
   can easily obtain multiple studies of a relevant biological or disease
   hypothesis. Since a single study often has small sample size and
   limited statistical power, combining information across multiple
   studies is an intuitive way to increase sensitivity. Ramasamy, et al.
   proposed a seven-step practical guidelines for conducting microarray
   meta-analysis [[32]1]: "(i) identify suitable microarray studies; (ii)
   extract the data from studies; (iii) prepare the individual datasets;
   (iv) annotate the individual datasets; (v) resolve the many-to-many
   relationship between probes and genes; (vi) combine the study-specific
   estimates; (vii) analyze, present, and interpret results". In the first
   step although theoretically meta-analysis increases the statistical
   power to detect DE genes, the performance can be deteriorated if
   problematic or heterogeneous studies are combined. In many
   applications, the data inclusion/exclusion criteria are based on ad-hoc
   expert opinions, a naïve sample size threshold or selection of
   platforms without an objective quality control procedure. Kang et al.
   proposed six quantitative quality control measures (MetaQC) for
   decision of study inclusion [[33]2]. Step (ii)-(v) are related to data
   preprocessing. Finally, Step (vi) and (vii) involve the selection of
   meta-analysis method and interpretation of the result and are the foci
   of this paper.

   Many microarray meta-analysis methods have been developed and applied
   in the literature. According to a recent review paper by Tseng et al.
   [[34]3], popular methods mainly combine three different types of
   statistics: combine p-values, combine effect sizes and combine ranks.
   In this paper, we include 12 popular as well as state-of-the-art
   methods in the evaluation and comparison. Six methods (Fisher,
   Stouffer, adaptively weighted Fisher, minimum p-value, maximum p-value
   and rth ordered p-value) belonged to the p-value combination category,
   two methods (fixed effects model and random effects model) belonged to
   the effect size combination category and four methods (RankProd,
   RankSum, product of ranks and sum of ranks) belonged to the rank
   combination category. Details of these methods and citations will be
   provided in the Method section. Despite the availability of many
   methods, pros and cons of these methods and a comprehensive evaluation
   remain largely missing in the literature. To our knowledge, Hong and
   Breitling [[35]4], Campain and Yang [[36]5] are the only two
   comparative studies that have systematically compared multiple
   meta-analysis methods. The number of methods compared (three and five
   methods, respectively) and the number of real examples examined (two
   and three examples respectively with each example covering 2–5
   microarray studies) were, however, limited. The conclusions of the two
   papers were suggestive with limited insights to guide practitioners. In
   addition, as we will discuss in the Method section, different
   meta-analysis methods have different underlying hypothesis setting
   targets. As a result, the selection of an adequate (or optimal)
   meta-analysis method depends heavily on the data structure and the
   hypothesis setting to achieve the underlying biological goal.

   In this paper, we compare 12 popular microarray meta-analysis methods
   using simulation and six real applications to benchmark their
   performance by four statistical criteria (detection capability,
   biological association, stability and robustness). Using simulation, we
   will characterize the strength of each method under three different
   hypothesis settings (i.e. detect DE genes in "all studies", "majority
   of studies" or "one or more studies"; see Method section for more
   details). We will compare the similarity and grouping of the
   meta-analysis methods based on their DE gene detection results (by
   using a similarity measure and multi-dimension scaling plot) and use an
   entropy measure to characterize the data structure to determine which
   hypothesis setting may be more adequate in a given application.
   Finally, we give a guideline to help practitioners select the best
   meta-analysis method under the choice of hypothesis setting in their
   applications.

Methods

Real data sets

   Six example data sets for microarray meta-analysis were collected for
   evaluations in this paper. Each example contained 4–8 microarray
   studies. Five of the six examples were of the commonly seen two-group
   comparison and the last breast cancer example contained relapse-free
   survival outcome. We applied the MetaQC package [[37]2] to assess
   quality of the studies for meta-analysis and determined the final
   inclusion/exclusion criteria. The principal component analysis (PCA)
   bi-plots and the six QC measures are summarized in Additional file
   [38]1: Figure S1, Tables S2 and S3. Details of the data sets are
   available in Additional file [39]1: Table S1.

Underlying hypothesis settings

   Following the classical convention of Brinbaum [[40]6] and Li and Tseng
   [[41]7] (see also Tseng et al. [[42]3]), meta-analysis methods can be
   classified into two complementary hypothesis settings. In the first
   hypothesis setting (denoted as HS[ A ]), the goal is to detect DE genes
   that have non-zero effect sizes in all studies:
   [MATH:
   <mrow><msub><mrow><mi>H</mi></mrow><mrow><mn>0</mn></mrow></msub><mo>:<
   /mo><msubsup><mrow><mo>∩</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn
   ></mrow><mrow><mi>K</mi></mrow></msubsup><mfenced open="{"
   close="}"><mrow><msub><mi>θ</mi><mi>k</mi></msub><mo>=</mo><mn>0</mn></
   mrow></mfenced><mspace
   width="0.5em"></mspace><mtext>versus</mtext><mspace
   width="0.75em"></mspace><msub><mrow><mi>H</mi></mrow><mrow><mi>a</mi></
   mrow></msub><mo>:</mo><msubsup><mrow><mo>∩</mo></mrow><mrow><mi>k</mi><
   mo>=</mo><mn>1</mn></mrow><mrow><mi>K</mi></mrow></msubsup><mfenced
   open="{"
   close="}"><mrow><msub><mrow><mi>θ</mi></mrow><mrow><mi>k</mi></mrow></m
   sub><mo>≠</mo><mn>0</mn></mrow></mfenced><mspace
   width=".5em"></mspace><mfenced open="("
   close=")"><mrow><mi>H</mi><msub><mrow><mi>S</mi></mrow><mrow><mi>A</mi>
   </mrow></msub></mrow></mfenced></mrow> :MATH]

   where θ[ k ]is the effect size of study k. The second hypothesis
   setting (denoted as HS[ B ]), however, aims to detect a DE gene if it
   has non-zero effect size in "one or more" studies:
   [MATH:
   <mrow><msub><mrow><mi>H</mi></mrow><mrow><mn>0</mn></mrow></msub><mo>:<
   /mo><msubsup><mrow><mo>∩</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn
   ></mrow><mrow><mi>K</mi></mrow></msubsup><mfenced open="{"
   close="}"><mrow><msub><mrow><mi>θ</mi></mrow><mrow><mi>k</mi></mrow></m
   sub><mo>=</mo><mn>0</mn></mrow></mfenced><mspace
   width="0.12em"></mspace><mi>versus</mi><mspace
   width="0.3em"></mspace><msub><mrow><mi>H</mi></mrow><mrow><mi>a</mi></m
   row></msub><mo>:</mo><mspace
   width="0.49em"></mspace><msubsup><mrow><mo>∩</mo></mrow><mrow><mi>k</mi
   ><mo>=</mo><mn>1</mn></mrow><mrow><mi>K</mi></mrow></msubsup><mfenced
   open="{"
   close="}"><mrow><msub><mrow><mi>θ</mi></mrow><mrow><mi>k</mi></mrow></m
   sub><mo>≠</mo><mn>0</mn></mrow></mfenced><mspace
   width="1em"></mspace><mfenced open="("
   close=")"><mrow><mi>H</mi><msub><mrow><mi>S</mi></mrow><mrow><mi>B</mi>
   </mrow></msub></mrow></mfenced></mrow> :MATH]

   In most applications, HS[ A ]is more appropriate to detect conserved
   and consistent candidate markers across all studies. However, different
   degrees of heterogeneity can exist in the studies and HS[ B ]can be
   useful to detect study-specific markers (e.g. studies from different
   tissues are combined and tissue specific markers are expected and of
   interest). Since HS[ A ]is often too conservative when many studies are
   combined, Song and Tseng (2012) proposed a more practical and robust
   hypothesis setting (namely HS[ r ]) that targets on DE genes with
   non-zero effect sizes in "majority" of studies, where majority of
   studies is defined as, for example, more than 50% of combined studies
   (i.e. r ≥ 0.5⋅K). The robust hypothesis setting considered was:
   [MATH:
   <mrow><msub><mrow><mi>H</mi></mrow><mrow><mn>0</mn></mrow></msub><mo>:<
   /mo><msubsup><mrow><mo>∩</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn
   ></mrow><mrow><mi>K</mi></mrow></msubsup><mfenced open="{"
   close="}"><mrow><msub><mrow><mi>θ</mi></mrow><mrow><mi>k</mi></mrow></m
   sub><mo>=</mo><mn>0</mn></mrow></mfenced><mspace
   width="0.25em"></mspace><mtext>versus</mtext><mspace
   width="0.25em"></mspace><msub><mrow><mi>H</mi></mrow><mrow><mi>a</mi></
   mrow></msub><mo>:</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>K</mi></mrow></msubsup><mrow><mi>I</mi><mfenced open="{"
   close="}"><mrow><msub><mrow><mi>θ</mi></mrow><mrow><mi>k</mi></mrow></m
   sub><mo>≠</mo><mn>0</mn></mrow></mfenced><mo>≥</mo><mi>r</mi><mspace
   width="1em"></mspace><mfenced open="("
   close=")"><mrow><mi>H</mi><msub><mrow><mi>S</mi></mrow><mrow><mi>r</mi>
   </mrow></msub></mrow></mfenced></mrow></mrow> :MATH]

   A major contribution of this paper is to characterize meta-analysis
   methods suitable for different hypothesis settings (HS[ A ], HS[ B ]and
   HS[ r ]) using simulation and real applications and to compare their
   performance with four benchmarks to provide a practical guideline.

Microarray meta-analysis data pre-processing

   Assume that we have K microarray studies to combine. For study k
   (1 ≤ k ≤ K), denote by x[ gsk ]the gene expression intensity of gene g
   (1 ≤ g ≤ G) and sample s (1 ≤ s ≤ S[ k ]; S[ k ]the number of samples
   in study k), and y[ sk ]the disease/outcome variable of sample s. The
   disease/outcome variable can be of binary, multi-class, continuous or
   censored data, representing the disease state, severity or prognosis
   outcome (e.g. tumor versus normal or recurrence survival time). The
   goal of microarray meta-analysis is to combine information of K studies
   to detect differentially expressed (DE) genes associated with the
   disease/outcome variable. Such DE genes serve as candidate markers for
   disease classification, diagnosis or prognosis prediction and help
   understand the genetic mechanisms underlying a disease. In this paper,
   before meta-analysis we first applied penalized t-statistic to each
   individual study to generate p-values or DE ranks [[43]8] for a binary
   outcome. In contrast to traditional t-statistic, penalized t-statistic
   adds a fudge parameter s[0] to stabilize the denominator
   [MATH: <mfenced open="(" close=")"><mrow><mi>T</mi><mo>=</mo><mfenced
   open="(" close=")"><mrow><mover accent="true"><mi>X</mi><mo
   stretchy="true">¯</mo></mover><mo>-</mo><mover
   accent="true"><mi>Y</mi><mo
   stretchy="true">¯</mo></mover></mrow></mfenced><mo
   stretchy="true">/</mo><mo stretchy="true">(</mo><mover
   accent="true"><mrow><mi>s</mi></mrow><mrow><mo
   stretchy="true">^</mo></mrow></mover><mo>+</mo><msub><mrow><mi>s</mi></
   mrow><mrow><mn>0</mn></mrow></msub></mrow></mfenced> :MATH]
   ;
   [MATH: <mover accent="true"><mrow><mi>X</mi></mrow><mrow><mo
   stretchy="true">¯</mo></mrow></mover> :MATH]
   and
   [MATH: <mover accent="true"><mrow><mi>Y</mi></mrow><mrow><mo
   stretchy="true">¯</mo></mrow></mover> :MATH]
   are means of case and control groups) and to avoid a large t-statistic
   due to small estimated variance
   [MATH: <mover accent="true"><mrow><mi>s</mi></mrow><mrow><mo
   stretchy="true">^</mo></mrow></mover> :MATH]
   . The p-values were calculated using the null distributions derived
   from conventional non-parametric permutation analysis by randomly
   permuting the case and control labels for 10,000 times [[44]9]. For
   censored outcome variables, Cox proportion hazard model and log-rank
   test were used [[45]10]. Meta-analysis methods (described in the next
   subsection) were then used to combine information across studies and
   generate meta-analyzed p-values. To account for multiple comparison,
   Benjamini and Hochberg procedure was used to control false discovery
   rate (FDR) [[46]11]. All methods were implemented using the "MetaDE"
   package in R [[47]12]. Data sets and all programming codes are
   available at [48]http://www.biostat.pitt.edu/bioinfo/publication.htm.

Microarray meta-analysis methods

   According to a recent review paper [[49]3], microarray meta-analysis
   methods can be categorized into three types: combine p-values, combine
   effect sizes and combine ranks. Below, we briefly describe 12 methods
   that were selected for comparison.

Combine p-values

   Fisher The Fisher’s method [[50]13] sums up the log-transformed
   p-values obtained from individual studies. The combined Fisher’s
   statistic
   [MATH:
   <mrow><msubsup><mrow><mi>χ</mi></mrow><mrow><mtext>Fisher</mtext></mrow
   ><mrow><mn>2</mn></mrow></msubsup><mo>=</mo><mo>-</mo><mn>2</mn><msubsu
   p><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>k</mi></mrow></msubsup><mrow><mtext>log</mtext><mfenced
   open="("
   close=")"><msub><mrow><mi>P</mi></mrow><mrow><mi>i</mi></mrow></msub></
   mfenced></mrow></mrow> :MATH]
   follows a χ^2 distribution with 2 k degrees of freedom under the null
   hypothesis (assuming null p-values are un;iformly distributed). Note
   that we perform permutation analysis instead of such parametric
   evaluation for Fisher and other methods in this paper. Smaller p-values
   contribute larger scores to the Fisher’s statistic.

   Stouffer Stouffer’s method [[51]14] sums the inverse normal transformed
   p-values. Stouffer’s statistics
   [MATH:
   <mrow><msub><mrow><mi>T</mi></mrow><mrow><mtext>Stouffer</mtext></mrow>
   </msub><mo>=</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>k</mi></mrow></msubsup><mrow><msub><mrow><mi>z</mi></mrow><mr
   ow><mi>i</mi></mrow></msub><mo
   stretchy="true">/</mo><msqrt><mrow><mi>k</mi></mrow></msqrt><mo
   stretchy="true">(</mo><msub><mrow><mi>z</mi></mrow><mrow><mi>i</mi></mr
   ow></msub><msup><mrow><mi
   mathvariant="normal">Φ</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></ms
   up><mfenced open="("
   close=")"><msub><mrow><mi>p</mi></mrow><mrow><mi>i</mi></mrow></msub></
   mfenced></mrow><mo>,</mo></mrow> :MATH]
   where Φ is standard normal c.c.f) follows a standard normal
   distribution under the null hypothesis. Similar to Fisher’s method,
   smaller p-values contribute more to the Stouffer’s score, but in a
   smaller magnitude.

   Adaptively weighted (AW) Fisher The AW Fisher’s method [[52]7] assigns
   different weights to each individual study
   [MATH:
   <mrow><msub><mrow><mi>T</mi></mrow><mrow><mtext>AW</mtext></mrow></msub
   ><mo>=</mo><mo>-</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>K</mi></mrow></msubsup><mrow><msub><mrow><mi>w</mi></mrow><mr
   ow><mi>k</mi></mrow></msub><mo>⋅</mo><mo>log</mo><mfenced open="("
   close=")"><msub><mrow><mi>P</mi></mrow><mrow><mi>i</mi></mrow></msub></
   mfenced><mo>,</mo><msub><mrow><mi>w</mi></mrow><mrow><mi>k</mi></mrow><
   /msub><mo>=</mo><mn>0</mn><mspace
   width="0.24em"></mspace><mtext>or</mtext><mspace
   width="0.12em"></mspace><mn>1</mn></mrow></mrow> :MATH]
   and it searches through all possible weights to find the best adaptive
   weight with the smallest derived p-value. One significant advantage of
   this method is its ability to indicate which studies contribute to the
   evidence aggregation and elucidates heterogeneity in the meta-analysis.
   Details can be referred to the Additional file [53]1.

   Minimum p -value (minP) The minP method takes the minimum p-value among
   the K studies as the test statistic [[54]15]. It follows a beta
   distribution with degrees of freedom α = 1 and β = k under the null
   hypothesis. This method detects a DE gene whenever a small p-value
   exists in any one of the K studies.

   Maximum p -value (maxP) The maxP method takes maximum p-value as the
   test statistic [[55]16]. It follows a beta distribution with degrees of
   freedom α = K and β = 1 under the null hypothesis. This method targets
   on DE genes that have small p-values in "all" studies.

   r-th ordered p -value (rOP) The rOP method takes the r-th order
   statistic among sorted p-values of K combined studies. Under the null
   hypothesis, the statistic follows a beta distribution with degrees of
   freedom α = r and β = K – r + 1. The minP and maxP methods are special
   cases of rOP. In Song and Tseng [[56]17], rOP is considered a robust
   form of maxP (where r is set as greater than 0.5∙K) to identify
   candidate markers differentially expressed in "majority" of studies.

Combine effect size

   Fixed effects model (FEM) FEM combines the effect size across K studies
   by assuming a simple linear model with an underlying true effect size
   plus a random error in each study.

   Random effects model (REM) REM [[57]18] extends FEM by allowing random
   effects for the inter-study heterogeneity in the model. Detailed
   formulation and inference of FEM and REM are available in the
   Additional file [58]1.

Combine rank statistics

   RankProd (RP) and RankSum (RS) RankProd and RankSum are based on the
   common biological belief that if a gene is repeatedly at the top of the
   lists ordered by up- or down-regulation fold change in replicate
   experiments, the gene is more likely a DE gene [[59]19]. Detailed
   formulation and algorithms are available in the Additional file [60]1.

   Product of ranks (PR) and Sum of ranks (SR) These two methods apply a
   naïve product or sum of the DE evidence ranks across studies [[61]20].
   Suppose R[ gk ]represents the rank of p-value of gene g among all genes
   in study k. The test statistics of PR and SR methods are calculated as
   [MATH:
   <mrow><mi>P</mi><msub><mrow><mi>R</mi></mrow><mrow><mi>g</mi></mrow></m
   sub><mo>=</mo><msubsup><mrow><mo
   mathsize="big">∏</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>K</mi></mrow></msubsup><mrow><msub><mrow><mi>R</mi></mrow><mr
   ow><mtext mathvariant="italic">gk</mtext></mrow></msub></mrow></mrow>
   :MATH]
   and
   [MATH:
   <mrow><mi>S</mi><msub><mrow><mi>R</mi></mrow><mrow><mi>g</mi></mrow></m
   sub><mo>=</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>K</mi></mrow></msubsup><mrow><msub><mrow><mi>R</mi></mrow><mr
   ow><mtext
   mathvariant="italic">gk</mtext></mrow></msub></mrow><mo>,</mo></mrow>
   :MATH]
   respectively. P-values of the test statistics can be calculated
   analytically or obtained from a permutation analysis. Note that the
   ranks taken from the smallest to largest (the choice in the method) are
   more sensitive than ranking from largest to smallest in the PR method,
   while it makes no difference to SR.

Characterization of meta-analysis methods

MDS plots to characterize the methods

   The multi-dimensional scaling (MDS) plot is a useful visualization tool
   for exploring high-dimensional data in a low-dimensional space
   [[62]21]. In the evaluation of 12 meta-analysis methods, we calculated
   the adjusted DE similarity measure for every pair of methods to
   quantify the similarity of their DE analysis results in a given
   example. A dissimilarity measure is then defined as one minus the
   adjusted DE similarity measure and the dissimilarity measure is used to
   generate an MDS plot of the 12 methods. In the MDS plot, methods that
   are clustered in a neighborhood indicate that they produce similar DE
   analysis results.

Entropy measure to characterize data sets

   As indicated in the Section of "Underlying hypothesis settings",
   selection of the most suitable meta-analysis method(s) largely depends
   on their underlying hypothesis setting (HS[ A ], HS[ B ]and HS[ r ]).
   The selection of a hypothesis setting for a given application should be
   based on the experimental design, biological knowledge and the
   associated analytical objectives. There are, however, occasions that
   little prior knowledge or preference is available and an objective
   characterization of the data structure is desired in a given
   application. For this purpose, we developed a data-driven entropy
   measure to characterize whether a given meta-analysis data set contains
   more HS[ A ]-type markers or HS[ B ]-type markers [[63]22]. The
   algorithm is described below:

   1. Apply Fisher’s meta-analysis method to combine p-values across
   studies to identify the top H candidate markers. Here we used
   H = 1,000, H represents the rough number of DE genes (in our belief)
   that are contained in the data.

   2. For each selected marker, the standardized minus p-value score for
   gene g in the k-th study is defined as
   [MATH: <mrow><msub><mrow><mi>l</mi></mrow><mrow><mtext
   mathvariant="italic">gk</mtext></mrow></msub><mo>=</mo><mo>-</mo><mtext
   >log</mtext><mfenced open="("
   close=")"><msub><mrow><mi>p</mi></mrow><mrow><mtext
   mathvariant="italic">gk</mtext></mrow></msub></mfenced><mo
   stretchy="true">/</mo><mo>-</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>K</mi></mrow></msubsup><mrow><mtext>log</mtext><mfenced
   open="(" close=")"><msub><mrow><mi>p</mi></mrow><mrow><mtext
   mathvariant="italic">gk</mtext></mrow></msub></mfenced></mrow><mo>.</mo
   ></mrow> :MATH]
   Note that 0 ≤ l[ gk ] ≤ 1, large l[ gk ]corresponds to more significant
   p-value p[ gk ], and
   [MATH: <mrow><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>K</mi></mrow></msubsup><mrow><msub><mrow><mi>l</mi></mrow><mr
   ow><mtext
   mathvariant="italic">gk</mtext></mrow></msub></mrow><mo>=</mo><mn>1</mn
   ><mo>.</mo></mrow> :MATH]

   3. The entropy of gene g is defined as
   [MATH:
   <mrow><msub><mrow><mi>e</mi></mrow><mrow><mi>g</mi></mrow></msub><mo>=<
   /mo><mo>-</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>K</mi></mrow></msubsup><mrow><msub><mrow><mi>l</mi></mrow><mr
   ow><mtext mathvariant="italic">gk</mtext></mrow></msub></mrow><mspace
   width="0.12em"></mspace><mtext>log</mtext><mfenced open="("
   close=")"><msub><mrow><mi>l</mi></mrow><mrow><mtext
   mathvariant="italic">gk</mtext></mrow></msub></mfenced></mrow> :MATH]
   . Box-plots of entropies of the top H genes are generated for each
   meta-analysis application (Figure  [64]1(b)).

Figure 1.

   Figure 1
   [65]Open in a new tab

   Characterization of methods and datasets. (a) Multi-dimensional scaling
   (MDS) plot of all 12 methods based on the average dissimilarity matrix
   of six examples and (b) The box-plots of entropies in six data sets.
   Colors (red, green and blue) indicate clusters of methods with similar
   DE detection ordering. High entropies indicate that high consistency of
   DE gene detection across studies (e.g. MDD). Low entropies show greater
   heterogeneity in DE gene detection (e.g. prostate cancer).

   Intuitively, a high entropy value indicates that the gene has small
   p-values in all or most studies and is of HS[ A ]or HS[ r ]-type.
   Conversely, genes with small entropy have small p-values in one or only
   few studies where HS[ B ]-type methods are more adequate. When
   calculating l[ gk ]in step 2, we capped –log(p[ gk ]) at 10 to avoid
   contributions of close-to-zero p-values that can generate near-infinite
   scores. The entropy box-plot helps determine an appropriate
   meta-analysis hypothesis setting if no pre-set biological objective
   exists.

Evaluation criteria

   For objective quantitative evaluation, we developed the following four
   statistical criteria to benchmark performance of the methods.

Detection capability

   The first criterion considers the number of DE genes detected by each
   meta-analysis method under the same pre-set FDR threshold (e.g.
   FDR = 1%). Although detecting more DE genes does not guarantee better
   "statistical power", this criterion has served as a surrogate of
   statistical power in previous comparative studies [[66]23]. Since we do
   not know the underlying true DE genes, we refer to this evaluation as
   "detection capability" in this paper. An implicit assumption underlying
   this criterion is that the statistical procedure to detect DE genes in
   each study and the FDR control in the meta-analysis are accurate (or
   roughly accurate). To account for data variability in the evaluation,
   we bootstrapped (i.e. sampled with replacement to obtain the same
   number of samples in each bootstrapped dataset) the samples in each
   study for B = 50 times and show the plots of ean with standard error
   bars. In the bootstrapping, the entire sample is either selected or not
   so the gene dependence structure is maintained. Denote by r[ meb ]the
   rank of detection capability performance (the smaller the better) of
   method m (1 ≤ m ≤ 12) in example e (1 ≤ e ≤ 6) and in the
   b^th (1 ≤ b ≤ 12) bootstrap simulation. The mean standardized rank
   (MSR) for method m and example e is calculated as
   [MATH: <mrow><msub><mrow><mtext
   mathvariant="italic">MSR</mtext></mrow><mrow><mtext
   mathvariant="italic">me</mtext></mrow></msub><mo>=</mo><msubsup><mrow><
   mo
   mathsize="big">∑</mo></mrow><mrow><mi>b</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>B</mi></mrow></msubsup><mfenced open="("
   close=")"><mrow><msub><mrow><mi>r</mi></mrow><mrow><mtext
   mathvariant="italic">meb</mtext></mrow></msub><mo
   stretchy="true">/</mo><mo>#</mo><mtext>of</mtext><mspace
   width="0.25em"></mspace><mtext>methods</mtext><mspace
   width="0.25em"></mspace><mtext>compared</mtext></mrow></mfenced><mo
   stretchy="true">/</mo><mi>B</mi></mrow> :MATH]
   and the aggregated standardized rank (ASR) is calculated as
   [MATH: <mrow><msub><mrow><mtext
   mathvariant="italic">ASR</mtext></mrow><mrow><mi>m</mi></mrow></msub><m
   o>=</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>e</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mn>6</mn></mrow></msubsup><mrow><msub><mrow><msub><mrow><mtext
   mathvariant="italic">MSR</mtext></mrow><mrow><mi>m</mi></mrow></msub></
   mrow><mrow><mi>e</mi></mrow></msub><mo
   stretchy="true">/</mo><mn>6</mn><mo>,</mo></mrow></mrow> :MATH]
   representing the overall performance of method m across all six
   examples. Additional file [67]1: Table S4 shows the MSR and ASR of all
   12 methods and Figure  [68]2 (in the Result section) shows plot of mean
   with standard error bars for each method ordered by ASR. We note that
   MSR and ASR are both standardized between 0 and 1. The standardization
   in MSR is necessary because in the breast cancer survival example we
   cannot apply FEM, REM, RankSum and RankProd as they are developed only
   for a two group comparison.

Figure 2.

   Figure 2
   [69]Open in a new tab

   The plot of mean numbers of detected DE genes with error bars of
   standard error from 50 bootstrapped data sets for the 12 meta-analysis
   methods. Note that FEM, REM, RankProd and RankSum cannot be applied to
   survival examples.

Biological association

   The second criterion requires that a good meta-analysis method should
   detect a DE gene list that has better association with pre-defined
   "gold standard" pathways related to the targeted disease. Such a "gold
   standard" pathway set should be obtained from biological knowledge for
   a given disease or biological mechanism under investigation. However,
   since most disease or biological mechanisms are not well-studied,
   obtaining such "gold standard" pathways is either difficult or
   questionable. To facilitate this evaluation without bias, we develop a
   computational and data-driven approach to determine a set of surrogate
   disease-related pathways out of a large collection of pathways by
   combining pathway enrichment analysis results from each single study.
   Specifically, we first collected 2,287 pathways (gene sets) from MSigDB
   ( [70]http://www.broadinstitute.org/gsea/msigdb/): 1,454 pathways from
   "GO", 186 pathways from "KEGG", 217 pathways from "BIOCARTA" and 430
   pathways from "REACTOME", respectively. We filtered out pathways with
   less than 5 genes or more than 200 genes and 2,113 pathways were left
   for the analysis. DE analysis was performed in each single study
   separately and pathway enrichment analysis was performed for all the
   2,113 pathways by the Kolmogorov-Smirnov (KS) association test. Denote
   by p[ uk ]the resulting pathway enrichment p-value from KS test for
   pathway u (1 ≤ u ≤ 2,113) and study k (1 ≤ k ≤ K). For a given study k,
   enrichment ranks over pathways were calculated as r[ uk ] = rank[ u
   ](p[uk ]). A rank-sum score for a given pathway u was then derived as
   [MATH:
   <mrow><msub><mrow><mi>S</mi></mrow><mrow><mi>u</mi></mrow></msub><mo>=<
   /mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>k</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>K</mi></mrow></msubsup><mrow><msub><mrow><mi>r</mi></mrow><mr
   ow><mtext
   mathvariant="italic">uk</mtext></mrow></msub><mo>.</mo></mrow></mrow>
   :MATH]
   Intuitively, pathways with small rank-sum scores indicate that they are
   likely associated with the disease outcome by aggregated evidence of
   the K individual study analyses. We choose the top |D| pathways that
   had the smallest rank-sum scores as the surrogate disease-related
   pathways and used these to proceed with the biological association
   evaluation of meta-analysis methods in the following.

   Given the selected surrogate pathways D, the following procedure was
   used to evaluate performance of the 12 meta-analysis methods for a
   given example e (1 ≤ e ≤ 6). For each meta-analysis method m
   (1 ≤ m ≤ M = 12), the DE analysis result was associated with pathway u
   and the resulting enrichment p-value by KS-test was denoted by
   [MATH: <mrow><msub><mrow><mover
   accent="true"><mrow><mi>P</mi></mrow><mrow><mo
   stretchy="true">˜</mo></mrow></mover></mrow><mrow><mtext
   mathvariant="italic">med</mtext></mrow></msub><mspace
   width="0.5em"></mspace><mfenced open="("
   close=")"><mrow><mn>1</mn><mo>≤</mo><mi>d</mi><mo>≤</mo><mo
   stretchy="true">|</mo><mi>D</mi><mo
   stretchy="true">|</mo></mrow></mfenced><mo>.</mo></mrow> :MATH]
   The rank of
   [MATH: <msub><mrow><mover accent="true"><mi>P</mi><mo
   stretchy="true">˜</mo></mover></mrow><mrow><mtext
   mathvariant="italic">med</mtext></mrow></msub> :MATH]
   for method m among 12 methods was denoted by
   [MATH: <mrow><msub><mrow><mi>v</mi></mrow><mrow><mtext
   mathvariant="italic">med</mtext></mrow></msub><mo>=</mo><msub><mrow><mt
   ext>rank</mtext></mrow><mrow><mi>m</mi></mrow></msub><mspace
   width="0.12em"></mspace><mfenced open="(" close=")"><msub><mrow><mover
   accent="true"><mi>P</mi><mo
   stretchy="true">˜</mo></mover></mrow><mrow><mtext
   mathvariant="italic">med</mtext></mrow></msub></mfenced><mo>.</mo></mro
   w> :MATH]
   Similar to the detection capability evaluation, we calculated the mean
   standardized rank (MSR) for method m and example e as
   [MATH: <mrow><msub><mrow><mtext>MSR</mtext></mrow><mrow><mtext
   mathvariant="italic">me</mtext></mrow></msub><mo>=</mo><msubsup><mrow><
   mo
   mathsize="big">∑</mo></mrow><mrow><mi>d</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>D</mi></mrow></msubsup><mrow><mfenced open="("
   close=")"><mrow><msub><mrow><mi>v</mi></mrow><mrow><mtext
   mathvariant="italic">med</mtext></mrow></msub><mo
   stretchy="true">/</mo><mo>#</mo><mtext>of the methods
   compared</mtext></mrow></mfenced><mo
   stretchy="true">/</mo><mi>D</mi></mrow></mrow> :MATH]
   and the aggregated standardized rank (ASR) as
   [MATH:
   <mrow><msub><mrow><mtext>ASR</mtext></mrow><mrow><mi>m</mi></mrow></msu
   b><mo>=</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>e</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mn>6</mn></mrow></msubsup><mrow><msub><mrow><mtext>MSR</mtext></
   mrow><mrow><mtext mathvariant="italic">me</mtext></mrow></msub><mo
   stretchy="true">/</mo><mn>6</mn><mo>,</mo></mrow></mrow> :MATH]
   representing the overall performance of method m. To select the
   parameter |D| for surrogate disease-related pathways, Additional file
   [71]1: Figure S4 shows the trend of MSR[ me ](on the y-axis) versus |D|
   (on the x-axis) as |D| increases. The result indicated that the
   performance evaluation using different D only minimally impacted the
   conclusion when D > 30. We choose D = 100 throughout this paper.

   Note that we used KS test, instead of the popular Fisher’s exact test
   because each single study detected variable number of DE genes under a
   given FDR cutoff and the Fisher’s exact test is usually not powerful
   unless a few hundred DE genes are detected. On the other hand, the KS
   test does not require an arbitrary p-value cutoff to determine the DE
   gene list for enrichment analysis.

Stability

   The third criterion examines whether a meta-analysis method generates
   stable DE analysis result. To achieve this goal, we randomly split
   samples into half in each study (so that cases and controls are as
   equally split as possible). The first half of each study was taken to
   perform the first meta-analysis and generate a DE analysis result.
   Similarly, the second half of each study was taken to perform a second
   meta-analysis. The generated DE analysis results from two separate
   meta-analyses were compared by the adjusted DE similarity measure (to
   be described in the next section). The procedure is repeated for B = 50
   times. Denote by S[ meb ]the adjusted DE similarity measure of method m
   of the b^th simulation in example e. Similar to the first two criteria,
   MSR and ASR were calculated based on S[ meb ]to evaluate the methods.

Robustness

   The final criterion investigates the robustness of a meta-analysis
   method when an outlying irrelevant study is mistakenly added to the
   meta-analysis. For each of the six real examples, we randomly picked
   one irrelevant study from the other five examples, added it to the
   specific example for meta-analysis and evaluated the change from the
   original meta-analysis. The adjusted DE similarity measure was
   calculated between the original meta-analysis and the new meta-analysis
   with an added outlier. A high adjusted DE similarity measure shows
   better robustness against inclusion of the outlying study. This
   procedure was repeated until all irrelevant studies were used. The MSR
   and ASR are then calculated based on the adjusted DE similarity
   measures to evaluate the methods.

Similarity measure between two ordered DE gene lists

   To compare results of two DE detection methods (from single study
   analysis or meta-analysis), a commonly used method in the literature is
   to take the DE genes under certain p-value or FDR threshold, plot the
   Venn diagram and compute the ratio of overlap. This method, however,
   greatly depends on the selection of FDR threshold and is unstable.
   Another approach is to take the generated DE ordered gene lists from
   two methods and compute the non-parametric Spearman rank correlation
   [[72]24]. This method avoids the arbitrary FDR cutoff but gives, say,
   the top 100 important DE genes and the bottom 100 non-DE genes equal
   contribution. To circumvent this pitfall, Li et al. proposed a
   parametric reproducibility measure for ChIP-seq data in the ENCODE
   project [[73]25]. Yang et al. introduced an OrderedList measure to
   quantify similarity of two ordered DE gene lists [[74]26]. For
   simplicity, we extended the OrderedList measure into a standardized
   similarity score for the evaluation purpose in this paper.
   Specifically, suppose G[ 1 ]and G[ 2 ]are two ordered DE gene lists
   (e.g. ordered by p-values) and small ranks represent more significant
   DE genes. We denote by O[ n ](G[1], G[2]) the number of overlapped
   genes in the top n genes of G[ 1 ]and G[ 2 ]. As a result, 0 ≤ O[ n
   ](G[1], G[2]) ≤ n and a large O[ n ](G[1], G[2]) value indicates high
   similarity of the two ordered lists in the top n genes. A weighted
   average similarity score is calculated as
   [MATH: <mrow><mi>S</mi><mfenced open="("
   close=")"><mrow><msub><mrow><mi>G</mi></mrow><mrow><mn>1</mn></mrow></m
   sub><mo>,</mo><msub><mrow><mi>G</mi></mrow><mrow><mn>2</mn></mrow></msu
   b></mrow></mfenced><mo>=</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>G</mi></mrow></msubsup><mrow><msup><mrow><mi>e</mi></mrow><mr
   ow><mo>-</mo><mtext
   mathvariant="italic">an</mtext></mrow></msup><mo>·</mo></mrow><mspace
   width="0.5em"></mspace><msub><mrow><mi>O</mi></mrow><mrow><mi>n</mi></m
   row></msub><mfenced open="("
   close=")"><mrow><msub><mrow><mi>G</mi></mrow><mrow><mn>1</mn></mrow></m
   sub><mo>,</mo><msub><mrow><mi>G</mi></mrow><mrow><mn>2</mn></mrow></msu
   b></mrow></mfenced><mo>,</mo></mrow> :MATH]
   where G is the total number of matched genes and the power α controls
   the magnitude of weights emphasized on the top ranked genes. When α is
   large, top ranked genes are weighted higher in the similarity measure.
   The expected value (under the null hypothesis that the two gene
   rankings are randomly generated) and maximum value of S can be easily
   calculated:
   [MATH: <mrow><msub><mrow><mi>E</mi></mrow><mrow><mtext
   mathvariant="italic">null</mtext></mrow></msub><mfenced open="("
   close=")"><mrow><mi>S</mi><mfenced open="("
   close=")"><mrow><msub><mrow><mi>G</mi></mrow><mrow><mn>1</mn></mrow></m
   sub><mo>,</mo><msub><mrow><mi>G</mi></mrow><mrow><mn>2</mn></mrow></msu
   b></mrow></mfenced></mrow></mfenced><mo>=</mo><msubsup><mrow><mo>∑</mo>
   </mrow><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>G</mi></mro
   w></msubsup><mrow><msup><mrow><mi>e</mi></mrow><mrow><mo>-</mo><mi>α</m
   i><mi>n</mi></mrow></msup><mo>·</mo><msup><mrow><mi>n</mi></mrow><mrow>
   <mn>2</mn></mrow></msup><mo
   stretchy="true">/</mo><mi>G</mi></mrow></mrow> :MATH]
   and
   [MATH: <mrow><mtext mathvariant="italic">max</mtext><mspace
   width="0.5em"></mspace><mfenced open="("
   close=")"><mrow><mi>S</mi><mfenced open="("
   close=")"><mrow><msub><mrow><mi>G</mi></mrow><mrow><mn>1</mn></mrow></m
   sub><mo>,</mo><msub><mrow><mi>G</mi></mrow><mrow><mn>2</mn></mrow></msu
   b></mrow></mfenced></mrow></mfenced><mo>=</mo><msubsup><mrow><mo
   mathsize="big">∑</mo></mrow><mrow><mi>n</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>G</mi></mrow></msubsup><mrow><msup><mrow><mi>e</mi></mrow><mr
   ow><mo>-</mo><mi
   mathvariant="italic">an</mi></mrow></msup><mo>·</mo><mi>n</mi><mo>.</mo
   ></mrow></mrow> :MATH]
   We apply an idea similar to adjusted Rand index [[75]27] used to
   measure similarity of two clustering results and define the adjusted DE
   similarity measure as
   [MATH:
   <mrow><msup><mrow><mi>S</mi></mrow><mrow><mo>*</mo></mrow></msup><mfenc
   ed open="("
   close=")"><mrow><msub><mrow><mi>G</mi></mrow><mrow><mn>1</mn></mrow></m
   sub><mo>,</mo><msub><mrow><mi>G</mi></mrow><mrow><mn>2</mn></mrow></msu
   b></mrow></mfenced><mo>=</mo><mfrac><mrow><mi>S</mi><mfenced open="("
   close=")"><mrow><msub><mrow><mi>G</mi></mrow><mrow><mn>1</mn></mrow></m
   sub><mo>,</mo><msub><mrow><mi>G</mi></mrow><mrow><mn>2</mn></mrow></msu
   b></mrow></mfenced><mo>-</mo><msub><mrow><mi>E</mi></mrow><mrow><mtext
   mathvariant="italic">null</mtext></mrow></msub><mfenced open="("
   close=")"><mrow><mi>S</mi><mfenced open="("
   close=")"><mrow><msub><mrow><mi>G</mi></mrow><mrow><mn>1</mn></mrow></m
   sub><mo>,</mo><msub><mrow><mi>G</mi></mrow><mrow><mn>2</mn></mrow></msu
   b></mrow></mfenced></mrow></mfenced></mrow><mrow><mtext>Max</mtext><mfe
   nced open="(" close=")"><mrow><mi>S</mi><mfenced open="("
   close=")"><mrow><msub><mrow><mi>G</mi></mrow><mrow><mn>1</mn></mrow></m
   sub><mo>,</mo><msub><mrow><mi>G</mi></mrow><mrow><mn>2</mn></mrow></msu
   b></mrow></mfenced></mrow></mfenced><mo>-</mo><msub><mrow><mi>E</mi></m
   row><mrow><mtext
   mathvariant="italic">null</mtext></mrow></msub><mfenced open="("
   close=")"><mrow><mi>S</mi><mfenced open="("
   close=")"><mrow><msub><mrow><mi>G</mi></mrow><mrow><mn>1</mn></mrow></m
   sub><mo>,</mo><msub><mrow><mi>G</mi></mrow><mrow><mn>2</mn></mrow></msu
   b></mrow></mfenced></mrow></mfenced></mrow></mfrac></mrow> :MATH]

   This measure ranges between -1 to 1 and gives an expected value of 0 if
   two ordered gene lists are obtained by random chance. Yang et al.
   proposed a resampling-based and ROC methods to estimate the best
   selection of α. Since the number of DE genes in our examples are
   generally high, we choose a relatively small α = 0.001 throughout this
   paper. We have tested different α and found that the results were
   similar (Additional file [76]1: Figure S7).

Results

Simulation setting

   We conducted simulation studies to evaluate and characterize the 12
   meta-analysis methods for detecting biomarkers in the underlying
   hypothesis settings of HS[ A ], HS[ B ]or HS[ r ]. The simulation
   algorithm is described below:

   1. We simulated 800 genes with 40 gene clusters (20 genes in each
   cluster) and other 1,200 genes do not belong to any cluster. The
   cluster indexes C[ g ]for gene g (1 ≤ g ≤ 2, 000) were randomly
   sampled, such that ∑ I{C[ g ] = 0} = 1, 200 and ∑ I{C[ g
   ] = c} = 20, 1 ≤ c ≤ 40.

   2. For genes in cluster c (1 ≤ c ≤ 40) and in study k (1 ≤ k ≤ 5), we
   sampled
   [MATH: <mrow><msubsup><mrow><mi
   mathvariant="normal">∑</mi></mrow><mrow><mtext
   mathvariant="italic">ck</mtext></mrow><mrow><mo
   mathvariant="italic">'</mo></mrow></msubsup><mo>~</mo><msup><mrow><mi>W
   </mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></msup><mfenced open="("
   close=")"><mrow><mi
   mathvariant="normal">Ψ</mi><mo>,</mo><mn>60</mn></mrow></mfenced><mo>,<
   /mo></mrow> :MATH]
   where Ψ = 0.5I[20 × 20] + 0.5J[20 × 20], W^- 1 denotes the inverse
   Wishart distribution, I is the identity matrix and J is the matrix with
   all elements equal 1. We then standardized
   [MATH: <msubsup><mrow><mi
   mathvariant="normal">Σ</mi></mrow><mrow><mtext
   mathvariant="italic">ck</mtext></mrow><mrow><mo
   mathvariant="italic">'</mo></mrow></msubsup> :MATH]
   into Σ[ ck ]where the diagonal elements are all 1’s.

   3. 20 genes in cluster c was denoted by the index of g[ c1], …, g[
   c20], i.e.
   [MATH:
   <mrow><msub><mrow><mi>C</mi></mrow><mrow><msub><mrow><mi>g</mi></mrow><
   mrow><mtext
   mathvariant="italic">cj</mtext></mrow></msub></mrow></msub><mo>=</mo><m
   i>c</mi><mo>,</mo><mtext>where</mtext><mspace
   width="0.25em"></mspace><mn>1</mn><mo>≤</mo><mi>c</mi><mo>≤</mo><mn>40<
   /mn><mspace width="0.25em"></mspace><mtext>and</mtext><mspace
   width="0.37em"></mspace><mn>1</mn><mo>≤</mo><mi>j</mi><mo>≤</mo><mn>20<
   /mn><mo>.</mo></mrow> :MATH]
   We sampled gene expression levels of genes in cluster c for sample n as
   [MATH: <mrow><msup><mrow><mfenced open="("
   close=")"><mrow><msubsup><mrow><mi>X</mi></mrow><mrow><msub><mrow><mi>g
   </mi></mrow><mrow><mi>c</mi><mn>1</mn></mrow></msub><mrow><mtext
   mathvariant="italic">nk</mtext></mrow></mrow><mrow><mo
   mathvariant="italic">'</mo></mrow></msubsup><mo>,</mo><mo>…</mo><mo>,</
   mo><msubsup><mrow><mi>X</mi></mrow><mrow><msub><mrow><mi>g</mi></mrow><
   mrow><mi>c</mi><mn>20</mn></mrow></msub><mtext
   mathvariant="italic">nk</mtext></mrow><mrow><mo
   mathvariant="italic">'</mo></mrow></msubsup></mrow></mfenced></mrow><mr
   ow><mi>T</mi></mrow></msup><mo>~</mo><mtext
   mathvariant="italic">MVN</mtext><mfenced open="("
   close=")"><mrow><mn>0</mn><mo>,</mo><msub><mrow><mi
   mathvariant="normal">∑</mi></mrow><mrow><mtext
   mathvariant="italic">ck</mtext></mrow></msub></mrow></mfenced></mrow>
   :MATH]
   where 1 ≤ n ≤ 100 and 1 ≤ k ≤ 5, and sample expression level for the
   gene
   [MATH: <mrow><mi>g</mi><mo>~</mo><mi>N</mi><mfenced open="("
   close=")"><mrow><mn>0</mn><mo>,</mo><msubsup><mrow><mi>σ</mi></mrow><mr
   ow><mi>k</mi></mrow><mrow><mn>2</mn></mrow></msubsup></mrow></mfenced><
   /mrow> :MATH]
   which is not in any cluster for sample n, where 1 ≤ n ≤ 100, 1 ≤ k ≤ 5
   and
   [MATH:
   <msubsup><mrow><mi>σ</mi></mrow><mrow><mi>k</mi></mrow><mrow><mn>2</mn>
   </mrow></msubsup> :MATH]
   was uniformly distributed from [0.8, 1.2], which indicates different
   variance for study k.

   4. For the first 1,000 genes (1 ≤ g ≤ 1, 000), k[ g ](the number of
   studies that are differentially expressed for gene g) was generated by
   sampling k[ g ] = 1, 2, 3, 4 and 5, respectively. For the next 1,000
   genes (1, 001 ≤ g ≤ 2, 000), k[ g ] = 0 represents non-DE genes in all
   five studies.

   5. To simulate expression intensities for cases, we randomly sampled δ[
   gk ] ∈ {0, 1}, such that ∑ [ k ]δ[ gk ] = k[ g ]. If δ[ gk ] = 1, gene
   g in study k was a DE gene, otherwise it was a non-DE gene. When δ[ gk
   ] = 1, we sampled expression intensities μ[ gk ]from a uniform
   distribution in the range of [0.5, 3], which means we considered the
   concordance effect (up-regulated) among all simulated studies. Hence,
   the expression for control samples are
   [MATH: <mrow><msub><mrow><mi>X</mi></mrow><mrow><mtext
   mathvariant="italic">gnk</mtext></mrow></msub><mo>=</mo><msubsup><mrow>
   <mi>X</mi></mrow><mrow><mtext
   mathvariant="italic">gnk</mtext></mrow><mrow><mo
   mathvariant="italic">'</mo></mrow></msubsup><mo>,</mo></mrow> :MATH]
   and case samples are
   [MATH: <mrow><msub><mrow><mi>Y</mi></mrow><mrow><mtext
   mathvariant="italic">gnk</mtext></mrow></msub><mo>=</mo><msubsup><mrow>
   <mi>X</mi></mrow><mrow><mi>g</mi><mfenced open="("
   close=")"><mrow><mi>n</mi><mo>+</mo><mn>50</mn></mrow></mfenced><mi>k</
   mi></mrow><mrow><mo>'</mo></mrow></msubsup><mo>+</mo><msub><mrow><mi>μ<
   /mi></mrow><mrow><mtext
   mathvariant="italic">gk</mtext></mrow></msub><mo>·</mo><msub><mrow><mi>
   δ</mi></mrow><mrow><mtext
   mathvariant="italic">gk</mtext></mrow></msub><mo>,</mo></mrow> :MATH]
   for 1 ≤ g ≤ 2, 000, 1 ≤ n ≤ 50 and 1 ≤ k ≤ 5.

   In the simulation study, we had 1,000 non-DE genes in all five studies
   (k[ g ] = 0), and 1,000 genes were differentially expressed in 1 ~ 5
   studies (k[ g ] = 1, 2, 3, 4, 5). On average, we had roughly the same
   number (~200) of genes in each group of k[ g ] = 1, 2, 3, 4, 5. See
   Additional file [77]1: Figure S2 for the heatmap of a simulated example
   (red colour represents up-regulated genes). We applied the 12
   meta-analysis method under FDR control at 5%. With the knowledge of
   true k[ g ], we were able to derive the sensitivity and specificity for
   HS[ A ]and HS[ B ], respectively. In HS[ A ], genes with k[ g ] = 5
   were the underlying true positives and genes with k[ g ] = 0 ~ 4 were
   the underlying true negatives; in HS[ B ], gene with k[ g ] = 1 ~ 5
   were the underlying true positives and genes with k[ g ] = 0 were the
   true negatives. By adjusting the decision cut-off, the receiver
   operating characteristic (ROC) curves and the resulting area under the
   curve (AUC) were used to evaluate the performance. We simulated 50 data
   sets and reported the means and standard errors of the AUC values. AUC
   values range between 0 and 1. AUC = 50% represents a random guess and
   AUC = 1 reaches the perfect prediction. The above simulation scheme
   only considered the concordance effect sizes (i.e. all with
   up-regulation when a gene is DE in a study) among five simulated
   studies. In many applications, some genes may have p-value statistical
   significance in the meta-analysis but the effect sizes are discordant
   (i.e. a gene is up-regulation in one study but down-regulation in
   another study). To investigate that effect, we performed a second
   simulation that considers random discordant cases. In step 5, the μ[ gk
   ]became a mixture of two uniform distributions: π[ gk ]Unif ⋅[-3,
   -0.5]+ (1 - π[ gk ])⋅ Unif[0.5, 3], where π[ gk ]is the probability of
   gene g (1 ≤ g ≤ 2, 000) in study k(1 ≤ k ≤ 5) to have a discordant
   effect size (down-regulated). We set π[ gk ] = 0.2 for the discordant
   simulation setting.

Simulation results to characterize the methods

   The simulation study provided the underlying truth to characterize the
   meta-analysis methods according to their strengths and weaknesses for
   detecting DE genes of different hypothesis settings. The performances
   of 12 methods were evaluated by receiver operating characteristic (ROC)
   curves, which is a visualization tool that illustrates the sensitivity
   and specificity trade-off, and the resulting area under the ROC curve
   (AUC) under two different hypothesis settings of HS[ A ]and HS[ B ].
   Table  [78]1 shows the detected number of DE genes under nominal FDR at
   5%, the true FDR and AUC values under HS[ A ]and HS[ B ]for all 12
   methods. The values were averaged over 50 simulations and the standard
   errors are shown in the parentheses.

Table 1.

   The detected number of DE genes (at FDR = 5%), the true FDR, AUC values
   under HS [A ]and HS [B ]and the concluding characterization of targeted
   hypothesis setting of each method
                      maxP    rOP   minP  Fisher   AW   Stouffer
   Detected #
     __________________________________________________________________

   321
     __________________________________________________________________

   522
     __________________________________________________________________

   1005
     __________________________________________________________________

   1000
     __________________________________________________________________

   1000
     __________________________________________________________________

   974
     __________________________________________________________________

   (se)
     __________________________________________________________________

   (2.2)
     __________________________________________________________________

   (2.35)
     __________________________________________________________________

   (0.85)
     __________________________________________________________________

   (1.06)
     __________________________________________________________________

   (1.05)
     __________________________________________________________________

   (1.5)
     __________________________________________________________________

   True FDR (HS[ A ])
     __________________________________________________________________

   .068
     __________________________________________________________________

   .018
     __________________________________________________________________

   .447
     __________________________________________________________________

   .444
     __________________________________________________________________

   .444
     __________________________________________________________________

   .43
     __________________________________________________________________

   (se)
     __________________________________________________________________

   (.0008)
     __________________________________________________________________

   (.0012)
     __________________________________________________________________

   (.0006)
     __________________________________________________________________

   (.0007)
     __________________________________________________________________

   (.0008)
     __________________________________________________________________

   (.0009)
     __________________________________________________________________

   True FDR (HS[ B ])
     __________________________________________________________________

   .007
     __________________________________________________________________

   .011
     __________________________________________________________________

   .016
     __________________________________________________________________

   .017
     __________________________________________________________________

   .016
     __________________________________________________________________

   .022
     __________________________________________________________________

   (se)
     __________________________________________________________________

   (.0005)
     __________________________________________________________________

   (.0004)
     __________________________________________________________________

   (.0006)
     __________________________________________________________________

   (.0006)
     __________________________________________________________________

   (.0007)
     __________________________________________________________________

   (.0006)
     __________________________________________________________________

   AUC (HS[ A ])
     __________________________________________________________________

   .996
     __________________________________________________________________

   .964
     __________________________________________________________________

   .8
     __________________________________________________________________

   .82
     __________________________________________________________________

   .79
     __________________________________________________________________

   .89
     __________________________________________________________________

   (se)
     __________________________________________________________________

   (.0003)
     __________________________________________________________________

   (.0014)
     __________________________________________________________________

   (.0005)
     __________________________________________________________________

   (.0005)
     __________________________________________________________________

   (.0005)
     __________________________________________________________________

   (.0006)
     __________________________________________________________________

   AUC (HS[ B ])
     __________________________________________________________________

   .75
     __________________________________________________________________

   .833
     __________________________________________________________________

   .99
     __________________________________________________________________

   .99
     __________________________________________________________________

   .99
     __________________________________________________________________

   .99
     __________________________________________________________________

   (se)
     __________________________________________________________________

   (.0013)
     __________________________________________________________________

   (.01)
     __________________________________________________________________

   (.0001)
     __________________________________________________________________

   (.0001)
     __________________________________________________________________

   (.0001)
     __________________________________________________________________

   (.0005)
     __________________________________________________________________

   Characterization
     __________________________________________________________________

   HS[ A ]
     __________________________________________________________________

   HS[ r ]
     __________________________________________________________________

   HS[ B ]
     __________________________________________________________________

   HS[ B ]
     __________________________________________________________________

   HS[ B ]
     __________________________________________________________________

   HS[ B ]
     __________________________________________________________________


     __________________________________________________________________

   PR
     __________________________________________________________________

   SR
     __________________________________________________________________

   FEM
     __________________________________________________________________

   REM
     __________________________________________________________________

   RankProd
     __________________________________________________________________

   RankSum
     __________________________________________________________________

   Detected #
     __________________________________________________________________

   136
     __________________________________________________________________

   186
     __________________________________________________________________

   948
     __________________________________________________________________

   411
     __________________________________________________________________

   391
     __________________________________________________________________

   105
     __________________________________________________________________

   (se)
     __________________________________________________________________

   (2.51)
     __________________________________________________________________

   (2.3)
     __________________________________________________________________

   (1.75)
     __________________________________________________________________

   (2.86)
     __________________________________________________________________

   (3.31)
     __________________________________________________________________

   (1.514)
     __________________________________________________________________

   True FDR (HS[ A ])
     __________________________________________________________________

   .008
     __________________________________________________________________

   .01
     __________________________________________________________________

   .415
     __________________________________________________________________

   .117
     __________________________________________________________________

   .13
     __________________________________________________________________

   .389
     __________________________________________________________________

   (se)
     __________________________________________________________________

   (.0003)
     __________________________________________________________________

   (.0004)
     __________________________________________________________________

   (.0009)
     __________________________________________________________________

   (.0015)
     __________________________________________________________________

   (.0014)
     __________________________________________________________________

   (.0008)
     __________________________________________________________________

   True FDR (HS[ B ])
     __________________________________________________________________

   0
     __________________________________________________________________

   0
     __________________________________________________________________

   .022
     __________________________________________________________________

   .007
     __________________________________________________________________

   0
     __________________________________________________________________

   0
     __________________________________________________________________

   (se)
     __________________________________________________________________

   (0)
     __________________________________________________________________

   (0)
     __________________________________________________________________

   (.0007)
     __________________________________________________________________

   (.0004)
     __________________________________________________________________

   (0)
     __________________________________________________________________

   (0)
     __________________________________________________________________

   AUC (HS[ A ])
     __________________________________________________________________

   .986
     __________________________________________________________________

   .99
     __________________________________________________________________

   .917
     __________________________________________________________________

   .99
     __________________________________________________________________

   .916
     __________________________________________________________________

   .504
     __________________________________________________________________

   (se)
     __________________________________________________________________

   (.0003)
     __________________________________________________________________

   (.0002)
     __________________________________________________________________

   (.0009)
     __________________________________________________________________

   (.0002)
     __________________________________________________________________

   (.0011)
     __________________________________________________________________

   (.0046)
     __________________________________________________________________

   AUC (HS[ B ])
     __________________________________________________________________

   .981
     __________________________________________________________________

   .95
     __________________________________________________________________

   .984
     __________________________________________________________________

   .92
     __________________________________________________________________

   .934
     __________________________________________________________________

   .496
     __________________________________________________________________

   (se)
     __________________________________________________________________

   (.0004)
     __________________________________________________________________

   (.0008)
     __________________________________________________________________

   (.0004)
     __________________________________________________________________

   (.0011)
     __________________________________________________________________

   (.0012)
     __________________________________________________________________

   (.0025)
     __________________________________________________________________

   Characterization  HS[ A ]HS[ A ]HS[ B ]HS[ r ]HS[ B ]HS[ B ]
   [79]Open in a new tab

   Figure  [80]3 shows the histogram of the true number of DE studies
   (i.e. k[ g ]) among the detected DE genes under FDR = 5% for each
   method. It is clearly seen that minP, Fisher, AW, Stouffer and FEM
   detected HS[ B ]-type DE genes and had high AUC values under HS[ B
   ]criterion (0.98-0.99), compared to lower AUC values under HS[ A
   ]criterion (0.79-0.9). For these methods, the true FDR for HS[ A
   ]generally lost control (0.41- 0.44). On the other hand, maxP, rOP and
   REM had high AUC under HS[ A ]criterion (0.96-0.99) (true
   FDR = 0.068-0.117) compared to HS[ B ](0.75-0.92). maxP detected mostly
   HS[ A ]-type of markers and rOP and REM detected mostly HS[ r ]-type DE
   genes. PR and SR detected mostly HS[ A ]-type DE genes but they
   surprisingly had very high AUC under both HS[ A ]and HS[ B ]criteria.
   The RankProd method detected DE genes between HS[ r ]and HS[ B ]types
   and had a good AUC value under HS[ B ]. The RankSum detected HS[ B
   ]-type DE genes but had poor AUC values (0.5) for both HS[ A ]and HS[ B
   ]. Table  [81]1 includes our concluding characterization of the
   targeted hypothesis settings for each meta-analysis method (see also
   Additional file [82]1: Figure S5 of the ROC curve and AUC of HS[ A
   ]-type and HS[ B ]-type in 12 meta-analysis methods). Additional file
   [83]1: Figure S3 shows the result for the second discordant simulation
   setting. The numbers of studies with opposite effect size are
   represented by different colours in histogram plot (green: all studies
   with concordance effect size; blue: one study has opposite effect size
   with the remaining; red: two studies have opposite effect size with the
   remaining). In summary, almost all meta-analysis methods could not
   avoid inclusion of genes with opposite effect sizes. Particularly,
   methods utilizing p-values from two-sided tests (e.g. Fisher, AW, minP,
   maxP and rOP) could not distinguish direction of effect sizes. Stouffer
   was the only method that accommodated the effect size direction in its
   z-transformation formulation but its ability to avoid DE genes with
   discordant effect sizes seemed still limited. Owen (2009) proposed a
   one-sided correction procedure for Fisher’s method to avoid detection
   of discordant effect sizes in meta-analysis [[84]28]. The null
   distribution of the new statistic, however, became difficult to derive.
   The approach can potentially be extended to other methods and more
   future research will be needed for this issue.

Figure 3.

   [85]Figure 3
   [86]Open in a new tab

   The histograms of the true number of DE studies among detected DE genes
   under FDR = 5% in each method.

Results of the four evaluation criteria

Detection capability

   Figure  [87]2 shows the number of DE genes identified by each of the 12
   meta-analysis methods (FDR = 10% for MDD and breast cancer due to their
   weak signals and FDR = 1% for all the others). Each plot shows mean
   with standard error bars for 50 bootstrapped data sets. Additional file
   [88]1: Table S4 shows the MSR and ASR for each method in the six
   examples. The methods in Figure  [89]2 are ordered according to their
   ASR values. The top six methods with the strongest detection capability
   were those that detected HS[ B ]-type DE genes from the conclusion of
   Table  [90]1: Fisher, AW, Stouffer, minP, FEM and RankSum. The order of
   performance of these six methods was pretty consistent across all six
   examples. The next four methods were rOP, RankProd, maxP and REM and
   they targeted on either HS[ r ]or HS[ A ]. PR and SR had the weakest
   detection capability, which was consistent with the simulation result
   in Table  [91]1.

Biological association

   Figure  [92]4 shows plots of mean with standard error bars from the
   pathway association p-values (minus log-transformed) of the top 100
   surrogate disease-related pathways for the 12 methods. Additional file
   [93]1: Table S5 shows the corresponding MSR and ASR. We found that
   Stouffer, Fisher and AW had the best performance among the 12 methods.
   Surprisingly we found that although PR and SR had low detection
   capability in simulation and real data, they consistently had
   relatively high biological association results. This may be due to the
   better DE gene ordering results these two methods provide, as was also
   shown by the high AUC values under both hypothesis settings in the
   simulation.

Figure 4.

   [94]Figure 4
   [95]Open in a new tab

   Plots of mean values of –log[10](p) with error bars of standard error
   from KS-test based on the top 100 surrogate pathways. Note that FEM,
   REM, RankProd and RankSum cannot be applied to survival examples.

Stability

   Figure  [96]5 shows the plots of mean with standard error bars of
   stability calculated by adjusted DE similarity measure. Additional file
   [97]1: Table S6 contains the corresponding MSR and ASR. In summary,
   RankProd and RankSum methods were the most stable meta-analysis methods
   probably because these two nonparametric approaches take into account
   all possible fold change calculations between cases and controls. They
   do not need any distributional assumptions, which provided stability
   even when sample sizes were small [[98]29]. The maximum p-value method
   consistently had the lowest stability in all data sets, which is
   somewhat expected. For a given candidate marker with a small maximum
   p-value, the chance that at least one study has significantly inflated
   p-values is high when sample size is reduced by half. The stability
   measures in the breast cancer example were generally lower than other
   examples. This is mainly due to the weak signals for survival outcome
   association, which might be improved if larger sample size is
   available.

Figure 5.

   [99]Figure 5
   [100]Open in a new tab

   Plots of mean with error bars of standard error of stability in six
   examples based on the adjusted similarity between DE results of two
   randomly split data sets. Note that FEM, REM, RankProd and RankSum
   cannot be applied to survival examples.

Robustness

   Figure  [101]6 shows the plots of mean with standard error bars of
   robustness calculated by adjusted DE similarity measure between the
   original meta-analysis and the new meta-analysis with an added outlier.
   Additional file [102]1: Table S7 shows the corresponding MSR and ASR
   values. In general, methods suitable for HS[ B ](minP, AW, Fisher and
   Stouffer) have better robustness than methods for HS[ A ]or HS[ r
   ](e.g. maxP and rOP). The trend is consistent in the prostate cancer,
   brain cancer and IPF examples but is more variable in the weak-signal
   MDD and breast cancer examples. RankSum was surprisingly the most
   sensitive method to outliers, while RankProd performs not bad.

Figure 6.

   [103]Figure 6
   [104]Open in a new tab

   Plots of mean with error bars of standard error of robustness in six
   examples based on the adjusted similarity between DE results
   with/without adding one irrelevant noise study. Note that FEM, REM,
   RankProd and RankSum cannot be applied to survival examples.

Characterization of methods by MDS plots

   We applied the adjusted DE similarity measure to quantify the
   similarity of the DE gene orders from any two meta-analysis methods.
   The resulting dissimilarity measure (i.e. one minus adjusted similarity
   measure) was used to construct the multidimensional scaling (MDS) plot,
   showing the similarity/dissimilarity structure between the 12 methods
   in a two-dimensional space. When two methods were close to each other,
   they generated similar DE gene ordering. The patterns of MDS plots from
   six examples generated quite consistent results (Additional file
   [105]1: Figure S6). Figure  [106]1(a) shows an aggregated MDS plot
   where the input dissimilarity matrix is averaged from the six examples.
   We clearly observed that Fisher, AW, Stouffer, minP, PR and SR were
   consistently clustered together in all six individual and the
   aggregated MDS plot (labeled in red). This is not surprising given that
   these methods all sum transformed p-value evidence across studies
   (except for minP). Two methods to combine effect sizes and two methods
   to combine ranks (FEM, REM, RankProd and RankSum labeled in blue) are
   consistently clustered together. Finally, the maxP and rOP methods seem
   to form a third loose cluster (labeled in green).

Characterization of data sets by entropy measure

   From the simulation study, selection of a most suitable meta-analysis
   method depends on the hypothesis setting behind the methods. The choice
   of a hypothesis setting mostly depends on the biological purpose of the
   analysis; that is, whether one aims to detect candidate markers
   differentially expressed in "all" (HS[ A ]), "most" (HS[ r ]) or "one
   or more" (HS[ B ]) studies. However, when no biological prior
   information or preference exists, the entropy measure can be
   objectively used to determine the choice of hypothesis setting. The
   analysis identifies the top 1,000 genes from Fisher’s meta-analysis
   method and the gene-specific entropy of each gene is calculated. When
   the entropy is small, the p-values are small in only one or very few
   studies. Conversely, when the entropy is large, most or all of the
   studies have small p-values. Figure  [107]1(b) shows the box-plots of
   entropy of the top 1,000 candidate genes identified by Fisher’s method
   in the six data sets. The result shows that prostate cancer comparing
   primary and metastatic tumor samples had the smallest entropy values,
   which indicated high heterogeneity across the three studies and that
   HS[B ]should be considered in the meta-analysis. On the other hand, MDD
   had the highest entropy values. Although the signals of each MDD study
   were very weak, they were rather consistent across studies and
   application of HS[ A ]or HS[ r ]was adequate. For the other examples,
   we suggest using the robust HS[ r ]unless other prior biological
   purpose is indicated.

Conclusions and discussions

An application guideline for practitioners

   From the simulation study, the 12 meta-analysis methods were
   categorized into three hypothesis settings (HS[ A ], HS[ B ]and HS[ r
   ]), showing their strengths for detecting different types of DE genes
   in the meta-analysis (Figure  [108]3 and the second column of Table
   [109]2). For example, maxP is categorized to HS[ A ]since it tends to
   detect only genes that are differentially expressed in all studies.
   From the results using four evaluation criteria, we summarized the rank
   of ASR values (i.e. the order used in Figures  [110]2 and [111]6) and
   calculated the rank sum of each method in Table  [112]2. The methods
   were then sorted first by the hypothesis setting categories and then by
   the rank sum. The clusters of methods from the MDS plot were also
   displayed. For methods in the HS[ A ]category, we surprisingly see that
   the maxP method performed among the worst in all four evaluation
   criteria and should be avoided. PR was a better choice in this
   hypothesis setting although it provides a rather weak detection
   capability. For HS[ B ], Fisher, AW and Stouffer performed very well in
   general. Among these three methods, we note that AW has an additional
   advantage to provide an adaptive weight index that indicates the subset
   of studies contributing to the meta-analysis and characterizes the
   heterogeneity (e.g. adaptive weight (1,0,…) indicates that the marker
   is DE in study 1 but not in study 2, etc.). As a result, we recommend
   AW over Fisher and Stouffer in the HS[ B ]category. For HS[ r ], the
   result was less conclusive. REM provided better stability and
   robustness but sacrificed detection capability and biological
   association. On the other hand, rOP obtained better detection
   capability and biological association but was neither stable nor
   robust. In general, since detection capability and biological
   association are of more importance in the meta-analysis and rOP has the
   advantage to link the choice of r in HS[ r ]with the rOP method (e.g.
   when r = 0.7∙K, we identify genes that are DE in more than 70% of
   studies), we recommend rOP over REM.

Table 2.

   Ranks of method performance in the four evaluation criteria
     Targeted HS Power detection Biological association Stability
   Robustness Rank Sum MDS^ *1
   PR
     __________________________________________________________________

   HS[ A ]
     __________________________________________________________________

   12
     __________________________________________________________________

   4
     __________________________________________________________________

   4
     __________________________________________________________________

   6
     __________________________________________________________________

   26
     __________________________________________________________________

   1
     __________________________________________________________________

   SR
     __________________________________________________________________

   HS[ A ]
     __________________________________________________________________

   11
     __________________________________________________________________

   6
     __________________________________________________________________

   9
     __________________________________________________________________

   7
     __________________________________________________________________

   33
     __________________________________________________________________

   1
     __________________________________________________________________

   maxP
     __________________________________________________________________

   HS[ A ]
     __________________________________________________________________

   9
     __________________________________________________________________

   10
     __________________________________________________________________

   12
     __________________________________________________________________

   11
     __________________________________________________________________

   42
     __________________________________________________________________

   2
     __________________________________________________________________

   rOP
     __________________________________________________________________

   HS[ r ]
     __________________________________________________________________

   7
     __________________________________________________________________

   5
     __________________________________________________________________

   10
     __________________________________________________________________

   10
     __________________________________________________________________

   32
     __________________________________________________________________

   2
     __________________________________________________________________

   REM
     __________________________________________________________________

   HS[ r ]
     __________________________________________________________________

   10
     __________________________________________________________________

   11
     __________________________________________________________________

   5
     __________________________________________________________________

   8
     __________________________________________________________________

   34
     __________________________________________________________________

   3
     __________________________________________________________________

   Fisher
     __________________________________________________________________

   HS[ B ]
     __________________________________________________________________

   1
     __________________________________________________________________

   2
     __________________________________________________________________

   3
     __________________________________________________________________

   3
     __________________________________________________________________

   9
     __________________________________________________________________

   1
     __________________________________________________________________

   AW
     __________________________________________________________________

   HS[ B ]
     __________________________________________________________________

   2
     __________________________________________________________________

   3
     __________________________________________________________________

   6
     __________________________________________________________________

   2
     __________________________________________________________________

   13
     __________________________________________________________________

   1
     __________________________________________________________________

   Stouffer
     __________________________________________________________________

   HS[ B ]
     __________________________________________________________________

   3
     __________________________________________________________________

   1
     __________________________________________________________________

   8
     __________________________________________________________________

   4
     __________________________________________________________________

   16
     __________________________________________________________________

   1
     __________________________________________________________________

   minP
     __________________________________________________________________

   HS[ B ]
     __________________________________________________________________

   4
     __________________________________________________________________

   7
     __________________________________________________________________

   7
     __________________________________________________________________

   1
     __________________________________________________________________

   19
     __________________________________________________________________

   1
     __________________________________________________________________

   RankProd
     __________________________________________________________________

   HS[ B ]
     __________________________________________________________________

   8
     __________________________________________________________________

   8
     __________________________________________________________________

   1
     __________________________________________________________________

   5
     __________________________________________________________________

   22
     __________________________________________________________________

   3
     __________________________________________________________________

   RankSum
     __________________________________________________________________

   HS[ B ]
     __________________________________________________________________

   6
     __________________________________________________________________

   12
     __________________________________________________________________

   2
     __________________________________________________________________

   12
     __________________________________________________________________

   32
     __________________________________________________________________

   3
     __________________________________________________________________

   FEM HS[ B ]5 9 11 9 34 3
   [113]Open in a new tab

   *^1Cluster number of methods in the MDS plot (Figure  [114]1(a)).

   Below, we provide a general guideline for a practitioner when applying
   microarray meta-analysis. Data sets of a relevant biological or disease
   hypothesis are firstly identified, preprocessed and annotated according
   to Step (i) - (v) in Ramasamy et al. Proper quality assessment should
   be performed to exclude studies with problematic quality (e.g. with the
   aid of MetaQC as we did in the six examples). Based on the experimental
   design and biological objectives of collected data, one should
   determine whether the meta-analysis aims to identify biomarkers
   differentially expressed in all studies (HS[ A ]), in one or more
   studies (HS[ B ]) or in majority of studies (HS[ r ]). In general, if
   higher heterogeneity is expected from, say, heterogeneous experimental
   protocol, cohort or tissues, HS[ B ]should be considered. For example,
   if the combined studies come from different tissues (e.g. the first
   study uses peripheral blood, the second study uses muscle tissue and so
   on), tissue-specific markers may be expected and HS[ B ]should be
   applied. On the contrary, if the collected studies are relatively
   homogeneous (e.g. use the same array platform or from the same lab),
   HS[r ]is generally recommended, as it provides robustness and detects
   consistent signals across the majority of studies. In the situation
   that no prior knowledge is available to choose a desired hypothesis
   setting or if the researcher is interested in a data-driven decision,
   the entropy measure in Figure  [115]1(b) can be applied and the
   resulting box-plot can be compared to the six examples in this paper to
   guide the decision. Once the hypothesis setting is determined, the
   choice of a meta-analysis method can be selected from the discussion
   above and Table  [116]2.

Conclusions

   In this paper, we performed a comprehensive comparative study to
   evaluate 12 microarray meta-analysis methods using simulation and six
   real examples with four evaluation criteria. We clarified three
   hypothesis settings that were implicitly assumed behind the methods.
   The evaluation results produced a practical guideline to inform
   biologists the best choice of method(s) in real applications.

   With the reduced cost of high-throughput experiments, data from
   microarray, new sequencing techniques and mass spectrometry accumulate
   rapidly in the public domain. Integration of multiple data sets has
   become a routine approach to increase statistical power, reduce false
   positives and provide more robust and validated conclusions. The
   evaluation in this paper focuses on microarray meta-analysis but the
   principles and messages apply to other types of genomic meta-analysis
   (e.g. GWAS, methylation, miRNA and eQTL). When the next-generation
   sequencing technology becomes more affordable in the future, sequencing
   data will become more prevalent as well and similar meta-analysis
   techniques will apply. For these different types of genomic
   meta-analysis, similar comprehensive evaluation could be performed and
   application guidelines should be established as well.

Competing interests

   The authors declare that they have no competing interests.

Authors’ contributions

   GCT supervised the whole project. LCC developed all statistical
   analysis and HML developed partial statistical analysis. GCT and LCC
   drafted the manuscript. All authors read and approved the final
   manuscript.

Supplementary Material

   Additional file 1

   Supplementary methods of (1) Adaptive weighted (AW) Fisher, (2)
   Combined statistical estimates (effect size) methods of FEM and REM,
   (3) Combined rank statistics methods: Rank Product (RankProd) and Rank
   Sum (RankSum). Figure S1. MetaQC. Figure S2. Heatmap of simulated
   example (red color represents up-regulated genes). Figure S3. The
   histograms of the true number of DE studies among detected DE genes
   under FDR = 5% in each method for discordance case. Figure S4.
   Cumulative moving average to determine D = 100. Figure S5. The ROC
   curves and AUC for the hypothesis settings of HS[ A ]-type and (red
   line) HS[ B ]-type (black line) in each meta-analysis method. Figure
   S6. Multidimensional scaling (MDS) plots of individual data sets.
   Figure S7. Stability and Robustness plot for α = 0.0001, 0.005 and
   0.01. Table S1. Detailed data sets description. Table S2. MetaQC
   results. Table S3. Data sets and number of matched genes. Table S4.
   Mean standardized rank (MSR) and aggregated standardized rank (ASR) for
   detection capability. Table S5. Mean standardized rank (MSR) and
   aggregated standardized rank (ASR) for biological association. Table
   S6. Mean standardized rank (MSR) and aggregated standardized rank (ASR)
   for stability. Table S7. Mean standardized rank (MSR) and aggregated
   standardized rank (ASR) for robustness.
   [117]Click here for file^ (1MB, pdf)

Contributor Information

   Lun-Ching Chang, Email: luc15@pitt.edu.

   Hui-Min Lin, Email: hul27@pitt.edu.

   Etienne Sibille, Email: sibilleel@upmc.edu.

   George C Tseng, Email: ctseng@pitt.edu.

Acknowledgements