Abstract
Background
Many bioinformatics studies aim to identify markers, or features, that
can be used to discriminate between distinct groups. In problems where
strong individual markers are not available, or where interactions
between gene products are of primary interest, it may be necessary to
consider combinations of features as a marker family. To this end,
recent work proposes a hierarchical Bayesian framework for feature
selection that places a prior on the set of features we wish to select
and on the label-conditioned feature distribution. While an analytical
posterior under Gaussian models with block covariance structures is
available, the optimal feature selection algorithm for this model
remains intractable since it requires evaluating the posterior over the
space of all possible covariance block structures and feature-block
assignments. To address this computational barrier, in prior work we
proposed a simple suboptimal algorithm, 2MNC-Robust, with robust
performance across the space of block structures. Here, we present
three new heuristic feature selection algorithms.
Results
The proposed algorithms outperform 2MNC-Robust and many other popular
feature selection algorithms on synthetic data. In addition, enrichment
analysis on real breast cancer, colon cancer, and leukemia data
indicates that they also identify many of the genes and pathways linked
to the cancers under study.
Conclusions
Bayesian feature selection is a promising framework for small-sample
high-dimensional data, in particular biomarker discovery applications.
When applied to cancer data, these algorithms output many genes already
shown to be involved in cancer, as well as potentially new biomarkers.
Furthermore, one of the proposed algorithms, SPM, outputs blocks of
highly correlated genes, which is particularly useful for studying gene
interactions and gene networks.
Electronic supplementary material
The online version of this article (10.1186/s12859-018-2059-8) contains
supplementary material, which is available to authorized users.
Keywords: Feature selection, Bayesian learning, Biomarker discovery,
Heuristic search algorithms
Background
Many bioinformatics studies aim to identify predictive biomarkers that
can be used to establish diagnosis or prognosis, or to predict a drug
response [1–3]. This problem can often be framed as a feature
selection task, where the goal is to identify a list of features
(molecular biomarkers) that can discriminate between groups of interest
based on high-dimensional data from microarray, RNA-seq, or other
high-throughput technologies.
Initially, exploratory studies are often conducted on small samples to
generate a shortlist of biomarker candidates before a large-sample
validation study is performed [4]. However, such studies have too
often been unsuccessful at producing reliable and reproducible
biomarkers [5]. Biomarker discovery is inherently difficult, given
the large number of features, highly complex interactions between genes
and gene products, enormous variety of dysfunctions that can occur, and
many sources of error in the data. As a result, feature selection
algorithms are often implemented without much consideration of the
particular demands of the problem. For instance, variants of the t-test
are perhaps the most widely used selection strategies in bioinformatics,
but they can only detect strong individual features and fail to take
correlations into account.
Given that molecular signaling is often inherently multivariate, there
is a need for methods that can account for correlations and extract
combinations of features as a marker family. Wrapper methods do this by
ranking sets of features according to some objective function, usually
the error of a classifier. However, methods based on classifier error
are computationally expensive, and may not necessarily produce the best
markers; indeed, strong features can be excluded if they are correlated
with other strong features. Furthermore, analysis downstream from
feature selection may include gene set enrichment analysis, where the
hope is to identify known pathways or other biological mechanisms that
contain a statistically significant number of genes in the reported
gene set, or may involve the development of new pathways and gene
networks. We are thus motivated to develop methods that not only select
markers useful for discrimination, but select all relevant markers,
even individually weak ones.
To address this, in prior work we proposed a hierarchical Bayesian
framework for feature selection, labeling features as “good” or “bad”,
where good features are those we wish to select, i.e., biomarkers. This
framework places a prior on the set of good features and the underlying
distribution parameters. Three Gaussian models have been considered.
Under independent features, Optimal Bayesian Filtering reports a
feature set of a given size with a maximal expected number of truly
good features (CMNC-OBF) [6]. Assuming fully dependent good
features and independent bad features, 2MNC-DGIB is a fast suboptimal
method that ranks features by evaluating all sets of size 2 [7].
Finally, assuming good and bad features are separately dependent,
2MNC-Robust proposes an approximation of the posterior on good features
and uses a ranking strategy similar to 2MNC-DGIB to select features
[8].
While 2MNC-DGIB has outstanding performance when its assumptions are
satisfied [7], it performs poorly when bad features are dependent
[9]. On the other hand, CMNC-OBF and 2MNC-Robust have been shown to
have robust performance across Bayesian models with block-diagonal
covariances [9]. CMNC-OBF is extremely fast and enjoys particularly
excellent performance when markers are individually strong with low
correlations, but, like all filter methods, may miss weak features that
are of interest due to high correlations with strong features [6, 9].
2MNC-Robust is computationally very manageable and generally
improves upon CMNC-OBF in the presence of correlations.
Although CMNC-OBF and 2MNC-Robust are robust to different
block-diagonal covariance structures, they do not attempt to detect
these underlying structures, and their assumptions and approximations
constrain performance. Thus, in this work we propose three new feature
selection algorithms that: (1) use an iterative strategy to update the
approximate posterior used in 2MNC-Robust, (2) use a novel scoring
function inspired by Bayes factors to improve overall rankings, and (3)
attempt to actually detect the underlying block structure of the data.
We show that these algorithms have comparable computation time to
2MNC-Robust, while outperforming 2MNC-Robust and many other popular
feature selection algorithms on a synthetic Bayesian model assuming
block-diagonal covariance matrices, and a synthetic microarray data
model. Finally, we apply the proposed algorithms and CMNC-OBF to breast
cancer, colon cancer, and AML datasets, and perform enrichment analysis
on each for validation.
Feature selection model
We review a hierarchical Bayesian model that serves as a reference for
the approximate posterior developed in 2MNC-Robust [8, 9] and
will be used in the algorithms we present in the next section.
Consider a binary feature selection problem with class labels y=0,1.
Let F be the set of feature indices. Assume features are partitioned
into blocks, where features in each block are dependent, but features
in different blocks are independent. Assume each block is either good
or bad. A good block has different class-conditioned distributions
between the two classes, while a bad block has the same distribution in
both classes. We denote a partitioning of F into good and bad blocks by
$P=(P_G,P_B)$, and hereafter call it a feature partition, where
$P_G=\{G_1,\cdots,G_u\}$ is the set of u good blocks and
$P_B=\{B_1,\cdots,B_v\}$ is the set of v bad blocks. Furthermore, denote
the set of all features in good blocks as good features,
$G=\cup_{i=1}^{u}G_i$, and denote all features in bad blocks as bad
features, $B=\cup_{j=1}^{v}B_j$. Denote the random feature partition by
$\bar{P}=(\bar{P}_G,\bar{P}_B)$, the random set of good features by
$\bar{G}$, and the random set of bad features by $\bar{B}$.
We define $\pi(P)=P(\bar{P}=P)$ to be the prior distribution on
$\bar{P}$. Let P be fixed. Let $\theta^P$ be the parameter describing
the joint feature distribution of P. Since blocks are independent of
each other we can write
$\theta^P=\left(\theta_0^{G_1},\cdots,\theta_0^{G_u},\theta_1^{G_1},\cdots,\theta_1^{G_u},\theta^{B_1},\cdots,\theta^{B_v}\right)$,
where $\theta_y^{G_i}$ parametrizes class-y features in $G_i$, and
$\theta^{B_j}$ parametrizes features in $B_j$. Assume the
$\theta_y^{G_i}$ and $\theta^{B_j}$ are independent given P, i.e.,
$\pi\left(\theta^P\right)=\prod_{i=1}^{u}\pi\left(\theta_0^{G_i}\right)\pi\left(\theta_1^{G_i}\right)\prod_{j=1}^{v}\pi\left(\theta^{B_j}\right)$.
Given a training set of n independent and identically distributed
(i.i.d.) points, with $n_y$ points in class y, the parameters of each
block are updated independently:
$\pi^*\left(\theta_y^{G_i}\right)\propto\pi\left(\theta_y^{G_i}\right)f\left(x_y^{G_i}\mid\theta_y^{G_i}\right)$
and
$\pi^*\left(\theta^{B_j}\right)\propto\pi\left(\theta^{B_j}\right)f\left(x^{B_j}\mid\theta^{B_j}\right)$,
where $\pi^*(\cdot)$ denotes a posterior, $x_y^{G_i}$ and $x^{B_j}$ are
the class-y points in $G_i$ and the points in $B_j$, respectively, and
$f(\cdot\mid\theta)$ denotes the corresponding likelihoods. Following
steps in [7, 10], we have

$$\pi^*(P)\propto\pi(P)\prod_{i=1}^{u}\int f\left(x_0^{G_i}\mid\theta_0^{G_i}\right)\pi\left(\theta_0^{G_i}\right)d\theta_0^{G_i}\int f\left(x_1^{G_i}\mid\theta_1^{G_i}\right)\pi\left(\theta_1^{G_i}\right)d\theta_1^{G_i}\prod_{j=1}^{v}\int f\left(x^{B_j}\mid\theta^{B_j}\right)\pi\left(\theta^{B_j}\right)d\theta^{B_j}. \qquad (1)$$

In addition, the marginal posterior of a feature set G is
$\pi^*(G)=\sum_{P:\,\cup P_G=G}\pi^*(P)$, and the marginal posterior of a
feature f is $\pi^*(f)=\sum_{G:\,f\in G}\pi^*(G)$. Note $\pi^*(f)$ is
different from $\pi^*(\{f\})$.
Gaussian model
Here we solve Eq. (1) for jointly Gaussian features. We assume for
a block A, $\theta_y^A=\left[\mu_y^A,\Sigma_y^A\right]$ and
$\theta^A=\left[\mu^A,\Sigma^A\right]$, where $\mu_y^A$ and $\mu^A$ are
the mean vectors, and $\Sigma_y^A$ and $\Sigma^A$ are the covariance
matrices.
Let P be a feature partition. Suppose A is a good block of P. Assume
$\pi(\theta_y^A)$ is Normal-Inverse-Wishart (NIW). Hence,
$\pi\left(\theta_y^A\right)=\pi\left(\Sigma_y^A\right)\pi\left(\mu_y^A\mid\Sigma_y^A\right)$,
where

$$\pi\left(\Sigma_y^A\right)=K_y^A\left|\Sigma_y^A\right|^{-\frac{\kappa_y^A+|A|+1}{2}}\exp\left(-0.5\,\mathrm{Tr}\left(S_y^A\left(\Sigma_y^A\right)^{-1}\right)\right),$$
$$\pi\left(\mu_y^A\mid\Sigma_y^A\right)=L_y^A\left|\Sigma_y^A\right|^{-0.5}\exp\left(-0.5\,\nu_y^A\left(\mu_y^A-m_y^A\right)^T\left(\Sigma_y^A\right)^{-1}\left(\mu_y^A-m_y^A\right)\right),$$

where for a matrix, $|\cdot|$ denotes the determinant.
$S_y^A$, $\kappa_y^A$, $m_y^A$, and $\nu_y^A$ are hyperparameters, which
are assumed given and fixed. $S_y^A$ is an $|A|\times|A|$ matrix, where
for a set, $|\cdot|$ denotes cardinality. For a proper prior, $S_y^A$ is
symmetric and positive-definite, and $\kappa_y^A > |A|-1$. If
$\kappa_y^A > |A|+1$, then
$E\left[\Sigma_y^A\right]=S_y^A/\left(\kappa_y^A-|A|-1\right)$.
Furthermore, $m_y^A$ is an $|A|\times 1$ vector describing the average
mean of the features, and for a proper prior we need $\nu_y^A > 0$.
$K_y^A$ and $L_y^A$ represent the relative weights of each distribution.
For a proper distribution we have
$$K_y^A=\left|S_y^A\right|^{0.5\kappa_y^A}2^{-0.5\kappa_y^A|A|}\big/\Gamma_{|A|}\left(0.5\kappa_y^A\right), \qquad L_y^A=\left(2\pi/\nu_y^A\right)^{-0.5|A|},$$
where $\Gamma_d$ denotes the multivariate gamma function.
Since the NIW is a conjugate prior for the Gaussian distribution, given
the sample, $\pi^*(\theta_y^A)$ is again NIW with updated
hyperparameters: $\kappa_y^{A*}=\kappa_y^A+n_y$,
$\nu_y^{A*}=\nu_y^A+n_y$,
$m_y^{A*}=\left(\nu_y^A m_y^A+n_y\hat{\mu}_y^A\right)/\nu_y^{A*}$, and
$$S_y^{A*}=S_y^A+(n_y-1)\hat{\Sigma}_y^A+\frac{\nu_y^A n_y}{\nu_y^A+n_y}\left(\hat{\mu}_y^A-m_y^A\right)\left(\hat{\mu}_y^A-m_y^A\right)^T,$$
where $\hat{\mu}_y^A$ and $\hat{\Sigma}_y^A$ are the class-y sample mean
and covariance of the points in A, respectively [11]. Now suppose A is
a bad block. We assume the prior on $\theta^A$ is NIW with
hyperparameters $S^A$, $\kappa^A$, $m^A$, and $\nu^A$, and relative
weights $K^A$ and $L^A$. Given the sample, $\pi^*(\theta^A)$ is NIW with
$\kappa^{A*}=\kappa^A+n$, $\nu^{A*}=\nu^A+n$,
$m^{A*}=\left(\nu^A m^A+n\hat{\mu}^A\right)/\nu^{A*}$, and
$$S^{A*}=S^A+(n-1)\hat{\Sigma}^A+\frac{\nu^A n}{\nu^A+n}\left(\hat{\mu}^A-m^A\right)\left(\hat{\mu}^A-m^A\right)^T,$$
where $\hat{\mu}^A$ and $\hat{\Sigma}^A$ are the sample mean and
covariance of the points in A, pooled across both classes, respectively
[11]. As long as $\pi^*(P)$ is proper, using the normalization constant
of the NIW distribution to compute the integrals in Eq. (1) we have
$$\pi^*(P)\propto\pi(P)\prod_{i=1}^{u}Q_0^{G_i}Q_1^{G_i}\left|S_0^{G_i*}\right|^{-0.5\kappa_0^{G_i*}}\left|S_1^{G_i*}\right|^{-0.5\kappa_1^{G_i*}}\times\prod_{j=1}^{v}Q^{B_j}\left|S^{B_j*}\right|^{-0.5\kappa^{B_j*}},$$
where
$$Q_y^A=K_y^A L_y^A\,2^{0.5\kappa_y^{A*}|A|}\,\Gamma_{|A|}\left(0.5\kappa_y^{A*}\right)\left(2\pi/\nu_y^{A*}\right)^{0.5|A|}, \qquad Q^A=K^A L^A\,2^{0.5\kappa^{A*}|A|}\,\Gamma_{|A|}\left(0.5\kappa^{A*}\right)\left(2\pi/\nu^{A*}\right)^{0.5|A|}.$$
Assuming: (1) $\pi(P)$ is such that the block structure, i.e., the number
and size of good and bad blocks, is fixed, (2) for each good block A,
$K_y^A$, $L_y^A$, $\kappa_y^A$, and $\nu_y^A$ do not depend on the
feature indices in A, and (3) for each bad block A, $K^A$, $L^A$,
$\kappa^A$, and $\nu^A$ do not depend on the feature indices in A,
$$\pi^*(P)\propto\pi(P)\left(\prod_{i=1}^{u}\left|S_0^{G_i*}\right|^{\kappa_0^{G_i*}}\left|S_1^{G_i*}\right|^{\kappa_1^{G_i*}}\prod_{j=1}^{v}\left|S^{B_j*}\right|^{\kappa^{B_j*}}\right)^{-0.5}.$$
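Under assumptions (1)–(3), evaluating a candidate partition therefore reduces to computing the updated scatter matrices $S^*$ and degrees of freedom $\kappa^*$ of every block and accumulating $\kappa^*\log|S^*|$ terms. The following Python sketch illustrates this computation; it is not the authors' implementation, and the function names, the use of NumPy, and the Jeffreys-style zero hyperparameter defaults are our own choices for illustration.

import numpy as np

def niw_update(X, S=None, kappa=0.0, nu=0.0, m=None):
    """NIW hyperparameter update for one block given data X (n points by |A| features).

    Defaults correspond to Jeffreys-style zero hyperparameters; with nu = 0 the prior
    mean m is not needed."""
    n, d = X.shape
    S = np.zeros((d, d)) if S is None else S
    m = np.zeros(d) if m is None else m
    mu_hat = X.mean(axis=0)
    scatter = (X - mu_hat).T @ (X - mu_hat)        # equals (n - 1) times the sample covariance
    kappa_star, nu_star = kappa + n, nu + n
    S_star = S + scatter
    if nu > 0:                                     # the mean-shift term vanishes when nu = 0
        diff = (mu_hat - m).reshape(-1, 1)
        S_star = S_star + (nu * n / (nu + n)) * diff @ diff.T
    return S_star, kappa_star, nu_star

def log_block_term(X0, X1, good):
    """Sum of kappa* log|S*| terms contributed by one block (two class-conditioned
    updates if the block is good, one pooled update if it is bad)."""
    data = (X0, X1) if good else (np.vstack([X0, X1]),)
    out = 0.0
    for X in data:
        S_star, kappa_star, _ = niw_update(X)
        out += kappa_star * np.linalg.slogdet(S_star)[1]
    return out

def log_posterior_partition(X0, X1, good_blocks, bad_blocks, log_prior=0.0):
    """log pi*(P) up to an additive constant, valid for comparing partitions that share
    the same block structure (assumption (1) above). Blocks are lists of feature indices."""
    total = sum(log_block_term(X0[:, A], X1[:, A], True) for A in good_blocks)
    total += sum(log_block_term(X0[:, A], X1[:, A], False) for A in bad_blocks)
    return log_prior - 0.5 * total

For example, with the two classes stored as arrays X0 and X1 of shape (n_y, |F|), log_posterior_partition(X0, X1, [[0, 1]], [[2], [3]]) scores the partition with one good block {0, 1} and two bad blocks {2} and {3}.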
Methods
Here we describe the set selection methods used. Note we aim to find
the set of true good features, rather than the true underlying feature
partition. The Maximum Number Correct (MNC) criterion [7] outputs
the set maximizing the expected number of correctly labeled features,
and the Constrained MNC (CMNC) criterion outputs the set with the maximum
expected number of correctly labeled features constrained to having
exactly D selected features, where D is a parameter of the optimization
problem [9]. The solution of MNC is $\{f\in F:\pi^*(f)>0.5\}$ [7], and
the solution of CMNC is to pick the top D features with the largest
$\pi^*(f)$ [9]. Therefore, both MNC and CMNC require computing
$\pi^*(f)$ for all $f\in F$, which is not computationally feasible for an
arbitrary block structure unless |F| is very small. We review two
previously proposed algorithms, OBF and 2MNC-Robust, and then present
three new algorithms.
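Given (exact or approximate) marginal posteriors $\pi^*(f)$, both selection rules are straightforward to apply. A minimal sketch, assuming the posteriors are stored in a NumPy array indexed by feature:

import numpy as np

def mnc_select(post):
    """MNC: select every feature whose marginal posterior exceeds 0.5."""
    return np.flatnonzero(post > 0.5)

def cmnc_select(post, D):
    """CMNC: select the D features with the largest marginal posteriors."""
    return np.argsort(post)[::-1][:D]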
Optimal Bayesian filter
Optimal Bayesian Filter (OBF) assumes all blocks have size one, i.e.,
all features are independent, and assumes the events $\{f\in\bar{G}\}$
are independent a priori. In this case $\pi^*(f)$ can be found in closed
form with little computation cost [6, 9]. OBF is optimal under
its modeling assumptions. As argued in [9], in the presence of
correlation OBF is a robust suboptimal algorithm that can detect
individually strong good features, i.e., those whose mean and/or
variance is very different between the two classes, but cannot take
advantage of correlations to correctly label individually weak good
features, those whose mean and variance are similar in both classes.
2MNC-Robust
The 2MNC algorithm [7] suggests approximating $\pi^*(f)$ using $\pi^*(G)$
for all sets G such that |G|=2, and picking the top D features. Since
finding $\pi^*(G)$ for all feature partitions where $|\cup P_G|=2$ is
typically infeasible, an approximate posterior, $\tilde{\pi}^*(G)$, is
proposed [8] for all $G\subseteq F$ of size 2; under the Gaussian model
it takes the form given in Eq. (2) below. The normalization constant is
found such that $\sum_{G\subseteq F:\,|G|=2}\tilde{\pi}^*(G)=1$.
$\tilde{\pi}(G)$ mimics the role of
$\pi(G)=P(\bar{G}=G)=\sum_{P:\,\cup P_G=G}\pi(P)$. Using a suboptimal
method, i.e., replacing $\pi^*(G)$ with $\tilde{\pi}^*(G)$, might affect
one's choice of the value used as the prior of a feature set. For
example, knowing $|\bar{G}|>2$ implies $\pi(G)=0$ for all sets of size
2; however, 2MNC-Robust only evaluates such sets. In this case,
$\tilde{\pi}(G)=P(G\subseteq\bar{G})$ might be a suitable choice to
replace $\pi(G)$. $\tilde{\pi}^*(f)=\sum_{G:\,f\in G}\tilde{\pi}^*(G)$
is the approximate marginal posterior of $f\in F$. For the Gaussian
model, if the number of good features is fixed and the hyperparameters
do not depend on the feature indices,
$$\tilde{\pi}^*(G)\propto\tilde{\pi}(G)\left(\frac{\left|S_0^{G*}\right|^{\kappa_0^{G*}}\left|S_1^{G*}\right|^{\kappa_1^{G*}}}{\left|S^{G*}\right|^{\kappa^{G*}}}\right)^{-0.5}. \qquad (2)$$
2MNC-Robust implements 2MNC with $\tilde{\pi}^*(f)$. As mentioned
before, 2MNC-Robust does not tune itself to the underlying block
structure of the data.
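To make the ranking concrete, the sketch below scores every pair of features with the logarithm of the right-hand side of Eq. (2) under a uniform $\tilde{\pi}(G)$ and Jeffreys-style zero hyperparameters, normalizes over pairs, accumulates the pair posteriors into approximate marginals $\tilde{\pi}^*(f)$, and picks the top D features. It is an illustrative reimplementation, not the authors' code; the quadratic loop over all pairs is written for clarity rather than speed.

import itertools
import numpy as np
from scipy.special import logsumexp

def _logdet_stats(X):
    """kappa* = n and log|S*| for one block under Jeffreys-style zero hyperparameters."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return n, np.linalg.slogdet(Xc.T @ Xc)[1]

def log_H(X0, X1, A):
    """log of the un-normalized score of Eq. (2) for a candidate good set A (size 1 or 2)."""
    A = list(A)
    k0, ld0 = _logdet_stats(X0[:, A])
    k1, ld1 = _logdet_stats(X1[:, A])
    kb, ldb = _logdet_stats(np.vstack([X0[:, A], X1[:, A]]))
    return -0.5 * (k0 * ld0 + k1 * ld1 - kb * ldb)

def two_mnc_robust(X0, X1, D):
    """Approximate marginals built from all pairs (uniform prior over pairs); return the top D."""
    F = X0.shape[1]
    pairs = list(itertools.combinations(range(F), 2))
    logs = np.array([log_H(X0, X1, p) for p in pairs])
    post = np.exp(logs - logsumexp(logs))      # normalized approximate pair posteriors
    marg = np.zeros(F)
    for (i, j), w in zip(pairs, post):
        marg[i] += w
        marg[j] += w
    return np.argsort(marg)[::-1][:D]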
Recursive marginal posterior inflation
It is easy to show that $\sum_{f\in F}\tilde{\pi}^*(f)=2$ when only sets
of size 2 are used to find $\tilde{\pi}^*(f)$. Hence, under the MNC
criterion one would pick at most 4 good features, implying we
underestimate $\pi^*(f)$ by only using sets of size 2 when
$|\bar{G}|\gg 2$. REcursive MArginal posterior INflation (REMAIN) aims
to sequentially detect good features by rescaling
$\tilde{\pi}^*(f)=\sum_{G:\,f\in G,|G|=2}\tilde{\pi}^*(G)$. We initialize
REMAIN with the set of all features, $F_r=F$. Then, REMAIN uses the
MArginal posterior INflation (MAIN) algorithm to identify several
features as good, removes them from $F_r$, and feeds MAIN with the
truncated $F_r$ to select additional features. This process iterates
until MAIN does not output any features. REMAIN is nothing but repeated
calls to MAIN with shrinking feature sets, making MAIN the heart of this
algorithm.
Pseudo-code of MAIN is provided in Algorithm 1, where H(G) is the right-hand
side of Eq. (2). Given an input feature set $F_t$, MAIN finds
$\tilde{\pi}^*(f)$ using sets of size 2, and finds the set
$G_s=\{f\in F_t:\tilde{\pi}^*(f)>T_1\}$. MAIN adds $G_s$ to $\tilde{G}$,
the set of features in $F_t$ already labeled as good. It then updates
$F_t$ to $F_t\setminus G_s$, and rescales $\tilde{\pi}^*(f)$ of features
$f\in F_t$ so that $\sum_{f\in F_t}\tilde{\pi}^*(f)=2$. Note features in
$\tilde{G}$ are used to compute $\tilde{\pi}^*(f)$ for features
$f\in F_t$, but the $\tilde{\pi}^*(f)$ of features $f\in\tilde{G}$ are
not used in the rescaling of $\sum_{f\in F_t}\tilde{\pi}^*(f)=2$. MAIN
iterates until $G_s=\emptyset$, or $H(G)\le T_2$ for all
$G\subseteq F_t$ with |G|=2.
Not finding new features in MAIN might be due to the remaining good
features being weaker and independent of $\tilde{G}$. Hence, REMAIN
removes $\tilde{G}$ from $F_r$, and feeds MAIN with the updated $F_r$.
This way, features in $\tilde{G}$ are no longer used to compute
$\tilde{\pi}^*(f)$ for any feature $f\in F_r$, thus making it easier to
detect weaker good features that are independent of features already
selected by REMAIN. Pseudo-code of REMAIN is provided in Algorithm 2.
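A compact sketch of the MAIN/REMAIN loop described above follows. It takes the pair score as a callable (for the Gaussian model, the log of H(G), e.g., the log_H function sketched earlier for 2MNC-Robust) together with the thresholds $T_1$ and $T_2$; the data structures and tie handling are our own simplifications, and interpreting the stopping rule as "no new features labeled good" is an assumption.

import itertools
import numpy as np

def main_pass(F_t, log_H, T1, T2):
    """One call to MAIN: return the features of F_t labeled good, in the order they are added.

    log_H : callable mapping a size-2 list of feature indices to log H(G),
            the log of the right-hand side of Eq. (2)."""
    F_t, good = list(F_t), []
    while F_t:
        in_Ft = set(F_t)
        pairs = list(itertools.combinations(sorted(in_Ft | set(good)), 2))
        logs = np.array([log_H(list(p)) for p in pairs])
        inner = [k for k, (i, j) in enumerate(pairs) if i in in_Ft and j in in_Ft]
        if not inner or np.all(logs[inner] <= np.log(T2)):
            break                                  # H(G) <= T2 for every pair inside F_t
        w = np.exp(logs - logs.max())              # un-normalized pair posteriors
        marg = {f: 0.0 for f in F_t}
        for (i, j), wij in zip(pairs, w):
            for f in (i, j):
                if f in marg:                      # marginals are kept only for f in F_t
                    marg[f] += wij
        scale = 2.0 / sum(marg.values())           # rescale so the marginals over F_t sum to 2
        G_s = [f for f in F_t if scale * marg[f] > T1]
        if not G_s:
            break                                  # no new features labeled good
        good += sorted(G_s, key=lambda f: -marg[f])
        F_t = [f for f in F_t if f not in G_s]
    return good

def remain(F, log_H, T1, T2):
    """REMAIN: repeated calls to MAIN, removing already-selected features between calls."""
    F_r, selected = list(F), []
    while True:
        G = main_pass(F_r, log_H, T1, T2)
        if not G:
            return selected
        selected += G
        F_r = [f for f in F_r if f not in G]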
$T_1$ mimics the role of the threshold used in the MNC criterion. Hence,
$T_1\in[0,1]$. Recall that by evaluating sets of size 2 we underestimate
$\tilde{\pi}^*(f)$ when $|\bar{G}|\gg 2$. Therefore, when confident that
$|\bar{G}|\gg 2$, one might opt for smaller values of $T_1$ rather than
values close to 1. As $T_2$ is a threshold over un-normalized posteriors,
H(G), extra care must be taken when setting $T_2$. We suggest $T_2=n$
for high-dimensional feature selection applications, which is a good
rule of thumb based on our simulation results and asymptotic analysis of
H(·).
Note the number of features reported by REMAIN is variable; however,
one can easily obtain close to a desired number of selected features by
tuning T[1]. To illustrate, we provide an example based on the data
generation model used in the “Synthetic microarray simulations”
section, where we assume there are 100 markers, i.e., good features,
and 4900 non-markers, i.e., bad features. We use the synergetic model
with block size k=5 and correlation coefficients ρ[0]=ρ[1]=0.9. Fixing
T[2]=n, Table 1 lists the average number of markers and non-markers
selected over 1000 iterations for n=20 and 100 across different values
of T[1]. REMAIN outputs very few features when T[1] is large, and too
many features when T[1] is extremely low. The best choice for T[1] can
vary greatly from case to case, but one strategy is to choose T[1] so
that REMAIN selects close to a given number of features. For example,
T[1]=0.05 is a good choice in this simulation if one desires
approximately 100 selected features. Another strategy is based on the
number of features selected by REMAIN across various values of T[1].
When T[1] is large, reducing T[1] only slightly increases the output
feature set size, for instance when T[1]>0.05 in this simulation.
However, one might observe a rapid increase in the output size when
slightly reducing T[1], for instance T[1] changing from 0.05 to 0.01 in
this simulation. For such observed patterns, the value at which this
phenomenon occurs might be a desirable choice.
Table 1. Performance of REMAIN for various values of T[1]

Objective          n    T[1]=0.3  T[1]=0.2  T[1]=0.1  T[1]=0.05  T[1]=0.01  T[1]=0.005
Marker found       20   0.9       1.4       2.5       4.6        15.3       26.2
Non-marker found   20   2.0       3.7       9.4       23.3       164.5      403.0
Marker found       100  52.5      56.2      58.2      60.0       68.9       76.1
Non-marker found   100  1.6       3.3       8.8       20.2       124.7      288.5
Posterior factor
Feature selection can be construed as a model selection problem where
each model is a set of good features. Let f be a feature. If f is a
good feature, we expect that adding f to any model G, i.e., a set of
good features, gives
$\tilde{\pi}^*(G\cup\{f\})/\tilde{\pi}^*(G)\gg 1$. If f is a bad
feature, we expect $\tilde{\pi}^*(G\cup\{f\})/\tilde{\pi}^*(G)$ to be
much smaller. Hence, if we average this ratio over a family of models
that do not contain f, and denote the average by β(f), we expect
$\beta(f)\gg 1$ if f is a good feature, and comparable to or smaller
than 1 if f is a bad feature.
$\tilde{\pi}^*(G\cup\{f\})/\tilde{\pi}^*(G)$ is similar, but not
identical, to the Bayes factor encountered in model selection [12],
where here we compare a model with good feature set $G\cup\{f\}$ versus
a model with good feature set G excluding f. The posterior factor,
β(f), averages the approximate posterior ratio across all feature sets
$G\subseteq F\setminus\{f\}$, i.e.,
$$\beta(f)=\frac{1}{|\Omega_f|}\sum_{G\in\Omega_f}\frac{\tilde{\pi}^*(G\cup\{f\})}{\tilde{\pi}^*(G)}, \qquad (3)$$
where $\Omega_f=\{G\subseteq F:f\notin G\}$. As the summation over
$\Omega_f$ is computationally infeasible, we propose the approximate
posterior factor, hereafter denoted by $\tilde{\beta}$:
$$\tilde{\beta}(f)=\frac{1}{|F|-1}\sum_{f'\in F\setminus\{f\}}\frac{\tilde{\pi}^*(\{f,f'\})}{\tilde{\pi}^*(\{f'\})}. \qquad (4)$$
We propose the approximate POsterior FActor-Constrained (POFAC)
algorithm as follows: use $\tilde{\beta}(f)$ to rank features, and pick
the top D features. Note that D is a parameter of the algorithm.
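A sketch of POFAC follows. It takes the un-normalized score of Eq. (2) as a callable over index sets of size 1 or 2 (for instance, a wrapper around the log_H function sketched for 2MNC-Robust above); ranking by ratios of un-normalized scores is equivalent to ranking by Eq. (4) because the common normalization constants cancel. The function name and signature are our own choices.

import numpy as np
from scipy.special import logsumexp

def pofac(log_H, F, D):
    """Rank features by the approximate posterior factor of Eq. (4); return the top D.

    log_H : callable mapping a list of feature indices (size 1 or 2) to the log of the
            un-normalized score on the right-hand side of Eq. (2).
    F     : total number of features."""
    single = np.array([log_H([f]) for f in range(F)])
    log_beta = np.empty(F)
    for f in range(F):
        # log of the average of pi~*({f, f'}) / pi~*({f'}) over all f' != f
        diffs = [log_H([f, fp]) - single[fp] for fp in range(F) if fp != f]
        log_beta[f] = logsumexp(diffs) - np.log(len(diffs))
    return np.argsort(log_beta)[::-1][:D]

For example, with data arrays X0 and X1 one could call pofac(lambda A: log_H(X0, X1, A), X0.shape[1], D), reusing the log_H sketched earlier.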
Sequential partition mustering
Sequential Partition Mustering (SPM) aims to improve feature selection
performance by sequentially detecting good blocks and adding them to
the set of previously selected features. To find a good block, we start
with the most significant feature, i.e., the feature with the largest
$\tilde{\beta}$, and find the block containing it. Note we do not aim to
find the structure of the bad blocks, and the algorithm terminates as
soon as we declare that no more good features remain.
Suppose $u_0$ is the current most significant feature. In order to find
the block containing $u_0$ we propose the Good Seed Grower (GSG)
algorithm, which can be construed as a seed growing algorithm with $u_0$
as the seed. Pseudo-code of GSG is presented in Algorithm 3, where for
any two non-empty disjoint sets $G_1,G_2\subset F$,
$$C_1(G_1,G_2)=\frac{\tilde{\pi}(G_1,G_2)}{1-\tilde{\pi}(G_1,G_2)}\,\frac{Q_0^{G_{12}}Q_1^{G_{12}}}{Q_0^{G_1}Q_1^{G_1}Q_0^{G_2}Q_1^{G_2}}\times\left(\frac{\left|S_0^{G_{12}*}\right|^{\kappa_0^{G_{12}*}}\left|S_1^{G_{12}*}\right|^{\kappa_1^{G_{12}*}}}{\left|S_0^{G_1*}\right|^{\kappa_0^{G_1*}}\left|S_1^{G_1*}\right|^{\kappa_1^{G_1*}}\left|S_0^{G_2*}\right|^{\kappa_0^{G_2*}}\left|S_1^{G_2*}\right|^{\kappa_1^{G_2*}}}\right)^{-0.5},$$
$G_{12}=G_1\cup G_2$, and $\tilde{\pi}(G_1,G_2)$ approximates
$\pi(G_1,G_2)$, the prior probability that at least one of the features
in $G_2$ is not independent of $G_1$. Note
$\pi(G_1,G_2)=\sum_{P\in\mathcal{P}_{G_1,G_2}}\pi(P)$, where
$\mathcal{P}_{G_1,G_2}$ is the family of feature partitions that contain
a block U such that $U\cap G_1\neq\emptyset$ and
$U\cap G_2\neq\emptyset$. At each iteration, GSG finds the feature $u^*$
that maximizes $C_1(U,\{u^*\})$, where U is the currently detected
sub-block of the block containing $u_0$. GSG declares that $u^*$ and U
belong to the same block if $C_1(U,\{u^*\})>T_3$, and adjoins $u^*$ to
U; otherwise, it terminates and declares U as the block containing
$u_0$. Here we assume $T_3=t_1 n^{t_2|U|}$, where $t_1,t_2>0$ are
parameters of GSG. While we have only considered one possible family of
thresholds, we expect this family to be large enough for most practical
purposes.
Pseudo-code of SPM is given in Algorithm 4. Let $F_t$ be the feature
set used by SPM, initialized to $F_t=F$. We start with the most
significant feature $u_0$ and find the block containing it, U. We then
update $F_t$ to $F_t\setminus U$. If $\tilde{\beta}(f)<T_4$ for all
$f\in F_t$, then SPM declares that $F_t$ does not contain any good
features and terminates; otherwise, it picks the most significant
feature of $F_t$ and iterates. Similar to REMAIN, SPM cannot be forced
to output a fixed number of features, but $T_4$ can be used to tune SPM
to output close to a desired number of features. In addition, $t_1$ and
$t_2$ can be used to avoid picking large blocks, which also affects the
output feature set size.
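The sketch below puts GSG and SPM together under the Jeffreys prior with $\tilde{\pi}(G_1,G_2)=0.5$, so that the prior-odds term in $C_1$ vanishes. The Q terms use the multivariate gamma function from SciPy; the helper names, the tie handling, and passing the approximate posterior factor in as a precomputed array are our own illustrative choices, not the authors' implementation.

import numpy as np
from scipy.special import multigammaln

def _block_stats(X):
    """kappa* = n and log|S*| for one block under Jeffreys-style zero hyperparameters."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return n, np.linalg.slogdet(Xc.T @ Xc)[1]

def _log_q(n_y, d):
    """log Q_y^A with Jeffreys hyperparameters (K = L = 1, kappa* = nu* = n_y)."""
    return 0.5 * n_y * d * np.log(2) + multigammaln(0.5 * n_y, d) + 0.5 * d * np.log(2 * np.pi / n_y)

def log_C1(X0, X1, G1, G2):
    """log C_1(G1, G2) with prior odds 1 (pi~(G1, G2) = 0.5)."""
    out = 0.0
    for A, sign in [(G1 + G2, +1), (G1, -1), (G2, -1)]:
        d = len(A)
        for X in (X0[:, A], X1[:, A]):
            k, ld = _block_stats(X)
            out += sign * (_log_q(X.shape[0], d) - 0.5 * k * ld)
    return out

def gsg(X0, X1, seed, candidates, t1, t2):
    """Good Seed Grower: grow the block containing `seed` from `candidates`."""
    n = X0.shape[0] + X1.shape[0]
    U, rest = [seed], [f for f in candidates if f != seed]
    while rest:
        scores = [log_C1(X0, X1, U, [u]) for u in rest]
        best = int(np.argmax(scores))
        log_T3 = np.log(t1) + t2 * len(U) * np.log(n)   # log of T3 = t1 * n^(t2*|U|)
        if scores[best] <= log_T3:
            break
        U.append(rest.pop(best))
    return U

def spm(X0, X1, beta, t1, t2, T4):
    """SPM: repeatedly seed on the most significant feature and grow its block.

    beta and T4 must be on the same scale (e.g., both the log posterior factor)."""
    F_t = list(np.argsort(beta)[::-1])                  # features ordered by posterior factor
    blocks = []
    while F_t and beta[F_t[0]] >= T4:                   # stop when no remaining feature looks good
        U = gsg(X0, X1, F_t[0], F_t, t1, t2)
        blocks.append(U)
        F_t = [f for f in F_t if f not in U]
    return blocks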
We again provide an example based on the simulation we did for REMAIN.
We let t[1]=10^1,10^2,⋯,10^5,t[2]=1,2,3,4,5, and T[4]=10^2,10^4,10^6,
and consider all combinations to construct a very wide range of
parameters. Figures 1 and 2 illustrate how the parameters of SPM
affect its performance for n=20 and 100, respectively. While low
thresholds mislabel more non-markers as good features, they correctly
label more markers compared with large thresholds. When n=20, in order
to correctly label at least 10 markers on average, at least 50
non-markers are mislabeled, and to mislabel at most 5 non-markers on
average, one cannot correctly detect more than 5 markers. On the other
hand, when n=100, one can simultaneously correctly label at least 80
markers and mislabel at most 10 non-markers for almost all parameters.
Moving from the lowest parameter values to the highest, we observe that
the average output feature set size varies from approximately 400 to
less than 5 when n=20, while it only varies from 120 to 70 when n=100.
Hence, SPM is less sensitive to the choice of parameters when n=100
than when n=20. Suppose
n=100 and one aims to find most of the markers and few non-markers.
Many parameter choices achieve this goal. In addition, within the range
of such parameters, the number of features labeled as good does not
change much under slight variation of the parameters. Hence, for a
fixed sample, one can implement SPM over a wide range of parameters,
find the range where the output feature set size does not vary much
when the parameters are varied slightly, and pick a value in that
region that outputs a feature set with close to the desired number of
features.
Fig. 1 Performance of SPM for various values of t[1], t[2], T[4], and n=20.
Average number of markers labeled as good for a T[4]=10^2, b T[4]=10^4,
and c T[4]=10^6. Average number of non-markers labeled as good for d
T[4]=10^2, e T[4]=10^4, and f T[4]=10^6
Fig. 2 Performance of SPM for various values of t[1], t[2], T[4], and n=100.
Average number of markers labeled as good for a T[4]=10^2, b T[4]=10^4,
and c T[4]=10^6. Average number of non-markers labeled as good for d
T[4]=10^2, e T[4]=10^4, and f T[4]=10^6
Simulations
We compare the performance of the proposed algorithms with many popular
feature selection algorithms over a Bayesian setting, and a synthetic
microarray model introduced in [13] and extended in [8, 9].
Bayesian simulation
In this simulation we assume |F|=4100 and $|\bar{G}|=100$. We assume
there is 1 good block for each of the following sizes: 10, 20, 30, and
40. We also assume there are 20 bad blocks for each of the following
sizes: 5, 10, 15, 20, 50, and 100. We first randomly assign
each feature to a block such that the assumed block structure is
satisfied, effectively constructing $\bar{P}$. Afterwards, distribution
parameters are randomly drawn from the following NIW prior. For each
good block A, we have
$S_0^A=S_1^A=0.5\times I_{|A|\times|A|}$,
$\kappa_0^A=\kappa_1^A=|A|+2$, $m_0^A=m_1^A=0$, and
$\nu_0^A=\nu_1^A=4$, where I is the identity matrix. Also, for a bad
block A, we have $S^A=0.5\times I_{|A|\times|A|}$, $\kappa^A=|A|+2$,
$m^A=0$, and $\nu^A=4$. Given distribution
parameters, a stratified sample of size n with equal points in each
class is drawn. The following feature selection methods declare the set
of good features: t-test, Bhattacharyya Distance (BD), Mutual
Information (MI) using the non-parametric method of [14] with
spacing parameter m=1, Sequential Forward Search using the bolstered
error estimate [15] of Regularized Linear Discriminant Analysis
applied to the top 300 features of BD (SFS-RLDA), FOrward selection
using the Hilbert-Schmidt Independence Criterion (FOHSIC) [16] applied
to the top 300 features of BD, CMNC-OBF, 2MNC-Robust, REMAIN, POFAC,
and SPM. Note t-test, MI, BD, and CMNC-OBF are filter methods. All
methods except REMAIN and SPM output $|\bar{G}|$ features. CMNC-OBF
assumes the events $\{f\in\bar{G}\}$ are independent and
$P(f\in\bar{G})$ is constant for all f∈F. 2MNC-Robust and REMAIN assume
$\tilde{\pi}(G)$ is uniform over all sets of size 2, and zero otherwise.
POFAC assumes $\tilde{\pi}(G)$ is uniform over all sets of size 1 and 2.
Finally, SPM assumes $\tilde{\pi}(G_1,G_2)=0.5$ for all sets
$G_1,G_2\subseteq F$, and uses the same $\tilde{\pi}(G)$ as POFAC to
compute $\tilde{\beta}(f)$. Bayesian algorithms use proper priors with hyperparameters of the
same form given previously (PP), and Jeffreys non-informative prior
(JP), where for each set A, $S_y^A$ and $S^A$ are zero matrices,
$K_y^A=K^A=L_y^A=L^A=1$, and $\kappa_y^A=\kappa^A=\nu_y^A=\nu^A=0$. With
$\nu_y^A=\nu^A=0$ we do not need to specify $m_y^A$ and $m^A$. We use
T[1]=0.3 and T[2]=n for REMAIN using both PP and JP.
For SPM-PP we set t[1]=100, t[2]=0.5, and T[4]=100n^2, which resulted
in adequate performance among all sample sizes. When using SPM-JP we
use the same t[1] and T[4], but set t[2]=1 to avoid picking large
blocks. This process iterates 1000 times.
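As an illustration of this sampling scheme, the snippet below draws the mean and covariance of one block from the NIW prior just described (S = 0.5 I, κ = |A| + 2, m = 0, ν = 4) and then draws a stratified sample for that block. It relies on scipy.stats.invwishart and is a sketch of the data-generation step only, with function names of our own choosing.

import numpy as np
from scipy.stats import invwishart

def draw_block_params(block_size, rng):
    """Draw (mu, Sigma) from the NIW prior used here: S = 0.5*I, kappa = |A|+2, m = 0, nu = 4."""
    d = block_size
    Sigma = np.atleast_2d(invwishart.rvs(df=d + 2, scale=0.5 * np.eye(d), random_state=rng))
    mu = rng.multivariate_normal(np.zeros(d), Sigma / 4.0)   # nu = 4
    return mu, Sigma

def sample_block(n_per_class, block_size, good, rng):
    """Stratified sample for one block: class-specific parameters if good, shared if bad."""
    mu0, S0 = draw_block_params(block_size, rng)
    mu1, S1 = draw_block_params(block_size, rng) if good else (mu0, S0)
    X0 = rng.multivariate_normal(mu0, S0, size=n_per_class)
    X1 = rng.multivariate_normal(mu1, S1, size=n_per_class)
    return X0, X1

rng = np.random.default_rng(0)
X0, X1 = sample_block(n_per_class=10, block_size=5, good=True, rng=rng)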
Figure 3 plots the average number of correctly labeled features as
sample size increases from 10 to 100 in steps of 10. SPM-PP has the
best performance; however, SPM-JP experiences a sharp drop under small
sample sizes. For larger sample sizes, POFAC-PP performs second only to
SPM-PP. However, POFAC-JP outperforms SPM-JP. REMAIN adequately
balances performance across all sample sizes. All proposed algorithms,
except SPM-JP, outperform 2MNC-Robust, CMNC-OBF, and other feature
selection algorithms. SPM-JP outperforms previous algorithms if sample
size is not very small.
Fig. 3 Performance of various feature selection algorithms under Gaussian
data. Average number of correctly labeled features versus sample size
for randomly generated parameters for (a) 2MNC-Robust-PP,
2MNC-Robust-JP, POFAC-PP, POFAC-JP, REMAIN-PP, REMAIN-JP, SPM-PP,
SPM-JP, (b) CMNC-OBF-PP, CMNC-OBF-JP, t-test, FOHSIC, MI, BD, and
SFS-RLDA
In this simulation filter methods were the fastest, with comparable
computation times, and FOHSIC was the most computationally intensive
method. A comparison of run-times for this specific simulation is
provided in Table 2, taking the run-time of 2MNC-Robust as the unit of
time. Parallel processing can be used to speed up these algorithms, for
instance, in the 4th step of GSG, and to compute $\tilde{\pi}^*(G)$
in 2MNC-Robust and POFAC. Although SPM is a sequential algorithm, its
bottleneck is step 4 of GSG, making parallel processing a good strategy
to substantially speed up SPM.
Table 2. Run-time comparison of the Bayesian simulation

Algorithm  Filter   2MNC-Robust  REMAIN  POFAC  SPM  SFS-RLDA  FOHSIC
Time       < 10^−3  1            2       1.05   1.5  10        15
Synthetic microarray simulations
Here an extended version of a synthetic model developed to mimic
microarrays is used to generate data. The original model was introduced
in [13], and has been extended in [8, 9]. In these models
features are markers or non-markers. Markers are either global or
heterogeneous. Global markers (GM) are homogeneous within each class.
Heterogeneous Markers (HM) comprise c subclasses, where for each
specific set of heterogeneous markers, a specific subset of the
training sample has a different distribution than markers in class 0,
and the remaining sample points have the same distribution as class 0.
Markers comprise blocks of size k, where each block in class y is
Gaussian with mean μ[y] and covariance σ[y]Σ[y]. Diagonal elements of
Σ[y] are 1 and off-diagonal elements are ρ[y]. The original model of
[13] forced ρ[0]=ρ[1]. We also have μ[0] = [ 0,⋯,0]. There are
three types of markers according to their mean in class 1: redundant,
synergetic, and marginal, with μ[1] being [ 1,⋯,1], [ 1,1/2,⋯,1/k], and
[ 1,0,⋯,0], respectively. Non-markers are either Low Variance (LV) or
High Variance (HV). In the original model LV non-markers are
independent, each with a Gaussian distribution, N(0,σ[0]). However, in
the extended model of [8, 9], similar to markers in class 0, LV
non-markers comprise blocks of size k, where in each block features are
jointly Gaussian with mean μ[0] and covariance σ[0]Σ[0]. HV non-markers
are independent with marginal distribution pN(0,σ[0])+(1−p)N(1,σ[1]),
where p is drawn from the uniform distribution over [ 0,1].
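To make the data model concrete, the sketch below generates one (global) marker block and a set of HV non-markers; heterogeneous markers and LV non-marker blocks follow the same pattern and are omitted for brevity. We treat σ[y] as a variance, and the function names and the use of NumPy's Generator API are our own choices.

import numpy as np

def equicorr(k, rho):
    """k x k correlation matrix with unit diagonal and off-diagonal rho."""
    return (1.0 - rho) * np.eye(k) + rho * np.ones((k, k))

def sample_marker_block(n_per_class, k, rho0, rho1, sigma0, sigma1, rng, mean_type="synergetic"):
    """One block of global markers: class 0 ~ N(0, sigma0*Sigma0), class 1 ~ N(mu1, sigma1*Sigma1)."""
    mu1 = {"redundant": np.ones(k),
           "synergetic": 1.0 / np.arange(1, k + 1),          # [1, 1/2, ..., 1/k]
           "marginal": np.eye(1, k).ravel()}[mean_type]      # [1, 0, ..., 0]
    X0 = rng.multivariate_normal(np.zeros(k), sigma0 * equicorr(k, rho0), size=n_per_class)
    X1 = rng.multivariate_normal(mu1, sigma1 * equicorr(k, rho1), size=n_per_class)
    return X0, X1

def sample_hv_nonmarkers(n, n_features, sigma0, sigma1, rng):
    """Independent HV non-markers: p*N(0, sigma0) + (1-p)*N(1, sigma1), p ~ U[0, 1] per feature."""
    p = rng.uniform(size=n_features)
    from_first = rng.uniform(size=(n, n_features)) < p       # component indicator per point
    return np.where(from_first,
                    rng.normal(0.0, np.sqrt(sigma0), size=(n, n_features)),
                    rng.normal(1.0, np.sqrt(sigma1), size=(n, n_features)))

rng = np.random.default_rng(0)
X0, X1 = sample_marker_block(n_per_class=10, k=5, rho0=0.9, rho1=0.9,
                             sigma0=0.25, sigma1=0.64, rng=rng)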
We assume |F|=5000,|GM|=20,|HM|=80,|HV|=2000, and c=2. We consider all
possible combinations of the following parameters: all 3 mean types,
k=5,10,20, and ρ[0],ρ[1]=0.1,0.5,0.9. We also consider the “large and
unequal variance” setting of Table 1 in [13], which sets σ[0]=0.25
and σ[1]=0.64. Given each set of distribution parameters, we randomly
assign features to blocks of global markers, heterogeneous markers, and
LV non-markers. The remaining features comprise the independent HV
non-markers. We generate a stratified sample of size n with equal
points in each class. The following algorithms are used to declare the
set of good features: t-test, BD, MI, SFS-RLDA, CMNC-OBF, 2MNC-Robust,
REMAIN, POFAC, and SPM. We removed FOHSIC due to its computation cost.
All Bayesian algorithms use JP. We use the thresholds of the Bayesian
simulation, except we set T[1]=0.05. One can tune T[3] and T[4] for one
of the 81 possible settings, or for a specific sample size, but this can
affect the performance in other settings. We picked the thresholds of
the Bayesian simulation as they provided satisfactory performance for
large sample sizes. This process iterates 500 times. For each set of
distribution parameters we define performance as the average number of
markers identified as good plus the average number of non-markers
identified as bad. Figure 4 plots the average and worst case
performance for each fixed mean type across other distribution
parameters as sample size increases from 10 to 100 in steps of 10.
Bayesian methods tend to outperform non-Bayesian methods. While simpler
methods such as CMNC-OBF outperform more complicated methods when
sample size is small, complicated methods such as SPM have superior
performance when sample size is large.
Fig. 4 Average and worst case performance of feature selection algorithms.
Average and worst case performance are obtained using 27 combinations
of the synthetic microarray model parameters k, ρ[0] and ρ[1] with
fixed mean type, where performance is defined to be the average number
of markers identified as good plus the average number of non-markers
identified as bad over 500 iterations: a average performance for
redundant means, b worst case performance for redundant means, c
average performance for synergetic means, d worst case performance for
synergetic means, e average performance for marginal means, and f worst
case performance for marginal means
For small sample sizes, or cases where correctly labeling good features
is more difficult, REMAIN tends to output very few features, resulting
in very good performance, in contrast to methods that are forced to
output $|\bar{G}|=100$
features. OBF can be implemented with the MNC objective instead of CMNC
to enjoy this characteristic of REMAIN. SPM seems to have the most
diverse behavior. While it performs worse than all other feature
selection algorithms when sample size is very small, it tends to
outperform all other methods for larger sample sizes. In order for the quantities used
in SPM to be well-defined under JP, sample size must be larger than the
block size. Hence, under small samples SPM with JP tends to break good
blocks into smaller blocks, thereby losing some of its ability to
identify weak good features with strong dependencies, and making it
more prone to detecting blocks incorrectly. Also note that we have used
the same parameters for SPM across all data models and sample sizes,
and performance is expected to improve if t[1],t[2] and T[4] are
calibrated each time it is run.
POFAC is an interesting option, enjoying competitive performance across
all sample sizes. It outperforms 2MNC-Robust while its computation cost
is only slightly larger. CMNC-OBF tends to select individually strong
markers, i.e., markers with class 1 mean far from 0. CMNC-OBF performs
very similarly to BD in this simulation, with their performance graphs
almost overlapping.
Figure 5 plots average performance for fixed class-conditioned
correlation coefficients across other distribution parameters. Simpler
methods outperform more complicated algorithms when sample size is
small, and REMAIN enjoys outstanding performance for small sample sizes
by reporting very few features. REMAIN has difficulty detecting weak
markers, i.e., heterogeneous markers with class 1 mean close to 0, as
its performance improves very little with each 10-point increase in
sample size at larger sample sizes. Average performance with respect to
sample size for each of the 81 possible data generation settings is
provided in the supplementary [see Additional file 1].
Fig. 5 Average performance of feature selection algorithms. Average
performance is obtained using 9 combinations of the synthetic
microarray model parameters k and mean type with fixed ρ[0] and ρ[1],
where performance is defined to be the average number of markers
identified as good plus the average number of non-markers identified as
bad over 500 iterations: a ρ[0]=0.1,ρ[1]=0.1, b ρ[0]=0.5,ρ[1]=0.1, c
ρ[0]=0.9,ρ[1]=0.1, d ρ[0]=0.1,ρ[1]=0.5, e ρ[0]=0.5,ρ[1]=0.5, f
ρ[0]=0.9,ρ[1]=0.5, g ρ[0]=0.1,ρ[1]=0.9, h ρ[0]=0.5,ρ[1]=0.9, and i
ρ[0]=0.9,ρ[1]=0.9
While correctly labeling more features tends to result in lower
classification error, maximizing the average number of correctly
labeled features does not necessarily minimize classification error
[see Additional file 1]. An example can be seen in the supplementary,
where we examine the prediction error of several popular classifiers
with feature selection on the synthetic microarray model
[see Additional file 1].
Results
We apply CMNC-OBF, POFAC, REMAIN, and SPM with the same priors used for
synthetic microarray simulations to cancer microarray datasets, select
the top genes, and perform enrichment analysis. We list the top 5 genes
selected by CMNC-OBF, POFAC, and REMAIN. The top 100 genes are provided
in the supplementary [see Additional file 1]. REMAIN ranks genes as
follows. In each call to MAIN, we rank the genes of $\tilde{G}$ by the
order in which they are added to $\tilde{G}$, and if several genes are
added at once in step 5 of MAIN, they are ranked based on
$\tilde{\pi}^*(f)$. In addition, the $\tilde{G}$'s are ranked by the
order in which they are obtained using consecutive calls to
the MAIN subroutine. Note SPM outputs a set of feature blocks, not a
feature ranking. Studying blocks of SPM might provide invaluable
information about the underlying biological mechanisms of the disease
under study, but we leave this for future work.
We perform enrichment analysis using PANTHER [17, 18]. The top
20 enriched pathways are reported in the supplementary [see Additional
file 1]. We list their names, the number of known genes in each
pathway, the number of selected genes that belong to the pathway, and
the corresponding p-value. Here we only list the top 3 pathways and
their p-values. We examine whether any of the top genes and pathways are already
suggested to be involved in the cancer under study. The complete
analysis, with references that suggest involvement of the top reported