Abstract
Background
Many bioinformatics studies aim to identify markers, or features, that
can be used to discriminate between distinct groups. In problems where
strong individual markers are not available, or where interactions
between gene products are of primary interest, it may be necessary to
consider combinations of features as a marker family. To this end,
recent work proposes a hierarchical Bayesian framework for feature
selection that places a prior on the set of features we wish to select
and on the label-conditioned feature distribution. While an analytical
posterior under Gaussian models with block covariance structures is
available, the optimal feature selection algorithm for this model
remains intractable since it requires evaluating the posterior over the
space of all possible covariance block structures and feature-block
assignments. To address this computational barrier, in prior work we
proposed a simple suboptimal algorithm, 2MNC-Robust, with robust
performance across the space of block structures. Here, we present
three new heuristic feature selection algorithms.
Results
The proposed algorithms outperform 2MNC-Robust and many other popular
feature selection algorithms on synthetic data. In addition, enrichment
analysis on real breast cancer, colon cancer, and leukemia data
indicates that they also identify many of the genes and pathways linked
to the cancers under study.
Conclusions
Bayesian feature selection is a promising framework for small-sample
high-dimensional data, in particular biomarker discovery applications.
When applied to cancer data, these algorithms output many genes already
shown to be involved in cancer, as well as potentially new biomarkers.
Furthermore, one of the proposed algorithms, SPM, outputs blocks of
highly correlated genes, which is particularly useful for studying gene
interactions and gene networks.
Electronic supplementary material
The online version of this article (10.1186/s12859-018-2059-8) contains
supplementary material, which is available to authorized users.
Keywords: Feature selection, Bayesian learning, Biomarker discovery,
Heuristic search algorithms
Background
Many bioinformatics studies aim to identify predictive biomarkers that
can be used to establish diagnosis or prognosis, or to predict a drug
response [1–3]. This problem can often be framed as a feature
selection task, where the goal is to identify a list of features
(molecular biomarkers) that can discriminate between groups of interest
based on high-dimensional data from microarray, RNA-seq, or other
high-throughput technologies.
Initially, exploratory studies are often conducted on small samples to
generate a shortlist of biomarker candidates before a large-sample
validation study is performed [4]. However, such studies have too
often been unsuccessful at producing reliable and reproducible
biomarkers [5]. Biomarker discovery is inherently difficult, given
the large number of features, highly complex interactions between genes
and gene products, enormous variety of dysfunctions that can occur, and
many sources of error in the data. As a result, feature selection
algorithms are often implemented without much consideration of the
particular demands of the problem. For instance, variants of the t-test
are perhaps the most widely used selection strategies in bioinformatics,
but they can only detect strong individual features and fail to take
correlations into account.
Given that molecular signaling is often inherently multivariate, there
is a need for methods that can account for correlations and extract
combinations of features as a marker family. Wrapper methods do this by
ranking sets of features according to some objective function, usually
the error of a classifier. However, methods based on classifier error
are computationally expensive, and may not necessarily produce the best
markers; indeed, strong features can be excluded if they are correlated
with other strong features. Furthermore, analysis downstream from
feature selection may include gene set enrichment analysis, where the
hope is to identify known pathways or other biological mechanisms that
contain a statistically significant number of genes in the reported
gene set, or may involve the development of new pathways and gene
networks. We are thus motivated to develop methods that not only select
markers useful for discrimination, but select all relevant markers,
even individually weak ones.
To address this, in prior work we proposed a hierarchical Bayesian
framework for feature selection, labeling features as “good” or “bad”,
where good features are those we wish to select, i.e., biomarkers. This
framework places a prior on the set of good features and the underlying
distribution parameters. Three Gaussian models have been considered.
Under independent features, Optimal Bayesian Filtering reports a
feature set of a given size with a maximal expected number of truly
good features (CMNC-OBF) [6]. Assuming fully dependent good
features and independent bad features, 2MNC-DGIB is a fast suboptimal
method that ranks features by evaluating all sets of size 2 [7].
Finally, assuming good and bad features are separately dependent,
2MNC-Robust proposes an approximation of the posterior on good features
and uses a ranking strategy similar to 2MNC-DGIB to select features
[8].
While 2MNC-DGIB has outstanding performance when its assumptions are
satisfied [7], it performs poorly when bad features are dependent
[9]. On the other hand, CMNC-OBF and 2MNC-Robust have been shown to
have robust performance across Bayesian models with block-diagonal
covariances [9]. CMNC-OBF is extremely fast and enjoys particularly
excellent performance when markers are individually strong with low
correlations, but, like all filter methods, may miss weak features that
are of interest due to high correlations with strong features [6, 9].
2MNC-Robust is computationally very manageable and generally
improves upon CMNC-OBF in the presence of correlations.
Although CMNC-OBF and 2MNC-Robust are robust to different
block-diagonal covariance structures, they do not attempt to detect
these underlying structures, and their assumptions and approximations
constrain performance. Thus, in this work we propose three new feature
selection algorithms that: (1) use an iterative strategy to update the
approximate posterior used in 2MNC-Robust, (2) use a novel scoring
function inspired by Bayes factors to improve overall rankings, and (3)
attempt to actually detect the underlying block structure of the data.
We show that these algorithms have comparable computation time to
2MNC-Robust, while outperforming 2MNC-Robust and many other popular
feature selection algorithms on a synthetic Bayesian model assuming
block-diagonal covariance matrices, and a synthetic microarray data
model. Finally, we apply the proposed algorithms and CMNC-OBF to breast
cancer, colon cancer, and AML datasets, and perform enrichment analysis
on each for validation.
Feature selection model
We review a hierarchical Bayesian model that serves as a reference for
the approximate posterior developed in 2MNC-Robust [8, 9] and
will be used in the algorithms we present in the next section.
Consider a binary feature selection problem with class labels y=0,1.
Let F be the set of feature indices. Assume features are partitioned
into blocks, where features in each block are dependent, but features
in different blocks are independent. Assume each block is either good
or bad. A good block has different class-conditioned distributions
between the two classes, while a bad block has the same distribution in
both classes. We denote a partitioning of F into good and bad blocks by
$P=(P_G,P_B)$, and hereafter call it a feature partition, where
$P_G=\{G_1,\cdots,G_u\}$ is the set of u good blocks and
$P_B=\{B_1,\cdots,B_v\}$ is the set of v bad blocks. Furthermore, denote
the set of all features in good blocks as good features,
$G=\cup_{i=1}^{u}G_i$, and denote all features in bad blocks as bad
features, $B=\cup_{j=1}^{v}B_j$. Denote the random feature partition by
$\bar{P}=(\bar{P}_G,\bar{P}_B)$, the random set of good features by
$\bar{G}$, and the random set of bad features by $\bar{B}$.
We define $\pi(P)=P(\bar{P}=P)$ to be the prior distribution on
$\bar{P}$. Let P be fixed. Let $\theta^P$ be the parameter describing
the joint feature distribution of P. Since blocks are independent of
each other we can write
$\theta^P=\left(\theta_0^{G_1},\cdots,\theta_0^{G_u},\theta_1^{G_1},\cdots,\theta_1^{G_u},\theta^{B_1},\cdots,\theta^{B_v}\right)$,
where $\theta_y^{G_i}$ parametrizes class-y features in $G_i$, and
$\theta^{B_j}$ parametrizes features in $B_j$. Assume the
$\theta_y^{G_i}$ and $\theta^{B_j}$ are independent given P, i.e.,
$\pi\left(\theta^P\right)=\prod_{i=1}^{u}\pi\left(\theta_0^{G_i}\right)\pi\left(\theta_1^{G_i}\right)\prod_{j=1}^{v}\pi\left(\theta^{B_j}\right)$.
Given a training set of n independent and identically distributed
(i.i.d.) points, with $n_y$ points in class y, the parameters of each
block are updated independently:
$\pi^*\left(\theta_y^{G_i}\right)\propto\pi\left(\theta_y^{G_i}\right)f\left(x_y^{G_i}\mid\theta_y^{G_i}\right)$
and
$\pi^*\left(\theta^{B_j}\right)\propto\pi\left(\theta^{B_j}\right)f\left(x^{B_j}\mid\theta^{B_j}\right)$,
where $\pi^*(\cdot)$ denotes a posterior, $x_y^{G_i}$ and $x^{B_j}$ are
the class-y points in $G_i$ and the points in $B_j$, respectively, and
$f(\cdot\mid\theta)$ denotes the corresponding likelihoods. Following
steps in [7, 10], we have

$$\pi^*(P)\propto\pi(P)\prod_{i=1}^{u}\int f\left(x_0^{G_i}\mid\theta_0^{G_i}\right)\pi\left(\theta_0^{G_i}\right)d\theta_0^{G_i}\int f\left(x_1^{G_i}\mid\theta_1^{G_i}\right)\pi\left(\theta_1^{G_i}\right)d\theta_1^{G_i}\prod_{j=1}^{v}\int f\left(x^{B_j}\mid\theta^{B_j}\right)\pi\left(\theta^{B_j}\right)d\theta^{B_j}. \qquad (1)$$

In addition, the marginal posterior of a feature set G is
$\pi^*(G)=\sum_{P:\,\cup P_G=G}\pi^*(P)$, and the marginal posterior of a
feature f is $\pi^*(f)=\sum_{G:\,f\in G}\pi^*(G)$. Note $\pi^*(f)$ is
different from $\pi^*(\{f\})$.
Gaussian model
Here we solve Eq. (1) for jointly Gaussian features. We assume for
a block A, $\theta_y^A=\left[\mu_y^A,\Sigma_y^A\right]$ and
$\theta^A=\left[\mu^A,\Sigma^A\right]$, where $\mu_y^A$ and $\mu^A$ are
the mean vectors, and $\Sigma_y^A$ and $\Sigma^A$ are the covariance
matrices.
Let P be a feature partition. Suppose A is a good block of P. Assume
$\pi(\theta_y^A)$ is Normal-Inverse-Wishart (NIW). Hence,
$\pi\left(\theta_y^A\right)=\pi\left(\Sigma_y^A\right)\pi\left(\mu_y^A\mid\Sigma_y^A\right)$,
where

$$\pi\left(\Sigma_y^A\right)=K_y^A\left|\Sigma_y^A\right|^{-\frac{\kappa_y^A+|A|+1}{2}}\exp\left(-0.5\,\mathrm{Tr}\left(S_y^A\left(\Sigma_y^A\right)^{-1}\right)\right),$$
$$\pi\left(\mu_y^A\mid\Sigma_y^A\right)=L_y^A\left|\Sigma_y^A\right|^{-0.5}\exp\left(-0.5\,\nu_y^A\left(\mu_y^A-m_y^A\right)^T\left(\Sigma_y^A\right)^{-1}\left(\mu_y^A-m_y^A\right)\right),$$

where for a matrix, $|\cdot|$ denotes the determinant.
$S_y^A$, $\kappa_y^A$, $m_y^A$, and $\nu_y^A$ are hyperparameters, which
are assumed given and fixed. $S_y^A$ is an $|A|\times|A|$ matrix, where
for a set, $|\cdot|$ denotes cardinality. For a proper prior, $S_y^A$ is
symmetric and positive-definite, and $\kappa_y^A > |A|-1$. If
$\kappa_y^A > |A|+1$, then
$E\left[\Sigma_y^A\right]=S_y^A/\left(\kappa_y^A-|A|-1\right)$.
Furthermore, $m_y^A$ is an $|A|\times 1$ vector describing the average
mean of the features, and for a proper prior we need $\nu_y^A > 0$.
$K_y^A$ and $L_y^A$ represent the relative weights of each distribution.
For a proper distribution we have
$$K_y^A=\left|S_y^A\right|^{0.5\kappa_y^A}2^{-0.5\kappa_y^A|A|}\big/\Gamma_{|A|}\left(0.5\kappa_y^A\right), \qquad L_y^A=\left(2\pi/\nu_y^A\right)^{-0.5|A|},$$
where $\Gamma_d$ denotes the multivariate gamma function.
Since the NIW is a conjugate prior for the Gaussian distribution, given
the sample, $\pi^*(\theta_y^A)$ is again NIW with updated
hyperparameters: $\kappa_y^{A*}=\kappa_y^A+n_y$,
$\nu_y^{A*}=\nu_y^A+n_y$,
$m_y^{A*}=\left(\nu_y^A m_y^A+n_y\hat{\mu}_y^A\right)/\nu_y^{A*}$, and
$$S_y^{A*}=S_y^A+(n_y-1)\hat{\Sigma}_y^A+\frac{\nu_y^A n_y}{\nu_y^A+n_y}\left(\hat{\mu}_y^A-m_y^A\right)\left(\hat{\mu}_y^A-m_y^A\right)^T,$$
where $\hat{\mu}_y^A$ and $\hat{\Sigma}_y^A$ are the class-y sample mean
and covariance of the points in A, respectively [11]. Now suppose A is
a bad block. We assume the prior on $\theta^A$ is NIW with
hyperparameters $S^A$, $\kappa^A$, $m^A$, and $\nu^A$, and relative
weights $K^A$ and $L^A$. Given the sample, $\pi^*(\theta^A)$ is NIW with
$\kappa^{A*}=\kappa^A+n$, $\nu^{A*}=\nu^A+n$,
$m^{A*}=\left(\nu^A m^A+n\hat{\mu}^A\right)/\nu^{A*}$, and
$$S^{A*}=S^A+(n-1)\hat{\Sigma}^A+\frac{\nu^A n}{\nu^A+n}\left(\hat{\mu}^A-m^A\right)\left(\hat{\mu}^A-m^A\right)^T,$$
where $\hat{\mu}^A$ and $\hat{\Sigma}^A$ are the sample mean and
covariance of the points in A, pooled across both classes, respectively
[11]. As long as $\pi^*(P)$ is proper, using the normalization constant
of the NIW distribution to compute the integrals in Eq. (1) we have
$$\pi^*(P)\propto\pi(P)\prod_{i=1}^{u}Q_0^{G_i}Q_1^{G_i}\left|S_0^{G_i*}\right|^{-0.5\kappa_0^{G_i*}}\left|S_1^{G_i*}\right|^{-0.5\kappa_1^{G_i*}}\times\prod_{j=1}^{v}Q^{B_j}\left|S^{B_j*}\right|^{-0.5\kappa^{B_j*}},$$
where
$$Q_y^A=K_y^A L_y^A\,2^{0.5\kappa_y^{A*}|A|}\,\Gamma_{|A|}\left(0.5\kappa_y^{A*}\right)\left(2\pi/\nu_y^{A*}\right)^{0.5|A|}, \qquad Q^A=K^A L^A\,2^{0.5\kappa^{A*}|A|}\,\Gamma_{|A|}\left(0.5\kappa^{A*}\right)\left(2\pi/\nu^{A*}\right)^{0.5|A|}.$$
Assuming: (1) $\pi(P)$ is such that the block structure, i.e., the number
and size of good and bad blocks, is fixed, (2) for each good block A,
$K_y^A$, $L_y^A$, $\kappa_y^A$, and $\nu_y^A$ do not depend on the
feature indices in A, and (3) for each bad block A, $K^A$, $L^A$,
$\kappa^A$, and $\nu^A$ do not depend on the feature indices in A,
$$\pi^*(P)\propto\pi(P)\left(\prod_{i=1}^{u}\left|S_0^{G_i*}\right|^{\kappa_0^{G_i*}}\left|S_1^{G_i*}\right|^{\kappa_1^{G_i*}}\prod_{j=1}^{v}\left|S^{B_j*}\right|^{\kappa^{B_j*}}\right)^{-0.5}.$$
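Under assumptions (1)–(3), evaluating a candidate partition therefore reduces to computing the updated scatter matrices $S^*$ and degrees of freedom $\kappa^*$ of every block and accumulating $\kappa^*\log|S^*|$ terms. The following Python sketch illustrates this computation; it is not the authors' implementation, and the function names, the use of NumPy, and the Jeffreys-style zero hyperparameter defaults are our own choices for illustration.

import numpy as np

def niw_update(X, S=None, kappa=0.0, nu=0.0, m=None):
    """NIW hyperparameter update for one block given data X (n points by |A| features).

    Defaults correspond to Jeffreys-style zero hyperparameters; with nu = 0 the prior
    mean m is not needed."""
    n, d = X.shape
    S = np.zeros((d, d)) if S is None else S
    m = np.zeros(d) if m is None else m
    mu_hat = X.mean(axis=0)
    scatter = (X - mu_hat).T @ (X - mu_hat)        # equals (n - 1) times the sample covariance
    kappa_star, nu_star = kappa + n, nu + n
    S_star = S + scatter
    if nu > 0:                                     # the mean-shift term vanishes when nu = 0
        diff = (mu_hat - m).reshape(-1, 1)
        S_star = S_star + (nu * n / (nu + n)) * diff @ diff.T
    return S_star, kappa_star, nu_star

def log_block_term(X0, X1, good):
    """Sum of kappa* log|S*| terms contributed by one block (two class-conditioned
    updates if the block is good, one pooled update if it is bad)."""
    data = (X0, X1) if good else (np.vstack([X0, X1]),)
    out = 0.0
    for X in data:
        S_star, kappa_star, _ = niw_update(X)
        out += kappa_star * np.linalg.slogdet(S_star)[1]
    return out

def log_posterior_partition(X0, X1, good_blocks, bad_blocks, log_prior=0.0):
    """log pi*(P) up to an additive constant, valid for comparing partitions that share
    the same block structure (assumption (1) above). Blocks are lists of feature indices."""
    total = sum(log_block_term(X0[:, A], X1[:, A], True) for A in good_blocks)
    total += sum(log_block_term(X0[:, A], X1[:, A], False) for A in bad_blocks)
    return log_prior - 0.5 * total

For example, with the two classes stored as arrays X0 and X1 of shape (n_y, |F|), log_posterior_partition(X0, X1, [[0, 1]], [[2], [3]]) scores the partition with one good block {0, 1} and two bad blocks {2} and {3}.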
Methods
Here we describe the set selection methods used. Note we aim to find
the set of true good features, rather than the true underlying feature
partition. The Maximum Number Correct (MNC) criterion [7] outputs
the set maximizing the expected number of correctly labeled features,
and the Constrained MNC (CMNC) criterion outputs the set with the maximum
expected number of correctly labeled features constrained to having
exactly D selected features, where D is a parameter of the optimization
problem [9]. The solution of MNC is $\{f\in F:\pi^*(f)>0.5\}$ [7], and
the solution of CMNC is to pick the top D features with the largest
$\pi^*(f)$ [9]. Therefore, both MNC and CMNC require computing
$\pi^*(f)$ for all $f\in F$, which is not computationally feasible for an
arbitrary block structure unless |F| is very small. We review two
previously proposed algorithms, OBF and 2MNC-Robust, and then present
three new algorithms.
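Given (exact or approximate) marginal posteriors $\pi^*(f)$, both selection rules are straightforward to apply. A minimal sketch, assuming the posteriors are stored in a NumPy array indexed by feature:

import numpy as np

def mnc_select(post):
    """MNC: select every feature whose marginal posterior exceeds 0.5."""
    return np.flatnonzero(post > 0.5)

def cmnc_select(post, D):
    """CMNC: select the D features with the largest marginal posteriors."""
    return np.argsort(post)[::-1][:D]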
Optimal Bayesian filter
Optimal Bayesian Filter (OBF) assumes all blocks have size one, i.e.,
all features are independent, and assumes the events $\{f\in\bar{G}\}$
are independent a priori. In this case $\pi^*(f)$ can be found in closed
form with little computation cost [6, 9]. OBF is optimal under
its modeling assumptions. As argued in [9], in the presence of
correlation OBF is a robust suboptimal algorithm that can detect
individually strong good features, i.e., those whose mean and/or
variance is very different between the two classes, but cannot take
advantage of correlations to correctly label individually weak good
features, those whose mean and variance are similar in both classes.
2MNC-Robust
The 2MNC algorithm [7] suggests approximating $\pi^*(f)$ using $\pi^*(G)$
for all sets G such that |G|=2, and picking the top D features. Since
finding $\pi^*(G)$ for all feature partitions where $|\cup P_G|=2$ is
typically infeasible, an approximate posterior, $\tilde{\pi}^*(G)$, is
proposed [8] for all $G\subseteq F$ of size 2; under the Gaussian model
it takes the form given in Eq. (2) below. The normalization constant is
found such that $\sum_{G\subseteq F:\,|G|=2}\tilde{\pi}^*(G)=1$.
$\tilde{\pi}(G)$ mimics the role of
$\pi(G)=P(\bar{G}=G)=\sum_{P:\,\cup P_G=G}\pi(P)$. Using a suboptimal
method, i.e., replacing $\pi^*(G)$ with $\tilde{\pi}^*(G)$, might affect
one's choice of the value used as the prior of a feature set. For
example, knowing $|\bar{G}|>2$ implies $\pi(G)=0$ for all sets of size
2; however, 2MNC-Robust only evaluates such sets. In this case,
$\tilde{\pi}(G)=P(G\subseteq\bar{G})$ might be a suitable choice to
replace $\pi(G)$. $\tilde{\pi}^*(f)=\sum_{G:\,f\in G}\tilde{\pi}^*(G)$
is the approximate marginal posterior of $f\in F$. For the Gaussian
model, if the number of good features is fixed and the hyperparameters
do not depend on the feature indices,
$$\tilde{\pi}^*(G)\propto\tilde{\pi}(G)\left(\frac{\left|S_0^{G*}\right|^{\kappa_0^{G*}}\left|S_1^{G*}\right|^{\kappa_1^{G*}}}{\left|S^{G*}\right|^{\kappa^{G*}}}\right)^{-0.5}. \qquad (2)$$
2MNC-Robust implements 2MNC with $\tilde{\pi}^*(f)$. As mentioned
before, 2MNC-Robust does not tune itself to the underlying block
structure of the data.
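To make the ranking concrete, the sketch below scores every pair of features with the logarithm of the right-hand side of Eq. (2) under a uniform $\tilde{\pi}(G)$ and Jeffreys-style zero hyperparameters, normalizes over pairs, accumulates the pair posteriors into approximate marginals $\tilde{\pi}^*(f)$, and picks the top D features. It is an illustrative reimplementation, not the authors' code; the quadratic loop over all pairs is written for clarity rather than speed.

import itertools
import numpy as np
from scipy.special import logsumexp

def _logdet_stats(X):
    """kappa* = n and log|S*| for one block under Jeffreys-style zero hyperparameters."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return n, np.linalg.slogdet(Xc.T @ Xc)[1]

def log_H(X0, X1, A):
    """log of the un-normalized score of Eq. (2) for a candidate good set A (size 1 or 2)."""
    A = list(A)
    k0, ld0 = _logdet_stats(X0[:, A])
    k1, ld1 = _logdet_stats(X1[:, A])
    kb, ldb = _logdet_stats(np.vstack([X0[:, A], X1[:, A]]))
    return -0.5 * (k0 * ld0 + k1 * ld1 - kb * ldb)

def two_mnc_robust(X0, X1, D):
    """Approximate marginals built from all pairs (uniform prior over pairs); return the top D."""
    F = X0.shape[1]
    pairs = list(itertools.combinations(range(F), 2))
    logs = np.array([log_H(X0, X1, p) for p in pairs])
    post = np.exp(logs - logsumexp(logs))      # normalized approximate pair posteriors
    marg = np.zeros(F)
    for (i, j), w in zip(pairs, post):
        marg[i] += w
        marg[j] += w
    return np.argsort(marg)[::-1][:D]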
Recursive marginal posterior inflation
It is easy to show that $\sum_{f\in F}\tilde{\pi}^*(f)=2$ when only sets
of size 2 are used to find $\tilde{\pi}^*(f)$. Hence, under the MNC
criterion one would pick at most 4 good features, implying we
underestimate $\pi^*(f)$ by only using sets of size 2 when
$|\bar{G}|\gg 2$. REcursive MArginal posterior INflation (REMAIN) aims
to sequentially detect good features by rescaling
$\tilde{\pi}^*(f)=\sum_{G:\,f\in G,|G|=2}\tilde{\pi}^*(G)$. We initialize
REMAIN with the set of all features, $F_r=F$. Then, REMAIN uses the
MArginal posterior INflation (MAIN) algorithm to identify several
features as good, removes them from $F_r$, and feeds MAIN with the
truncated $F_r$ to select additional features. This process iterates
until MAIN does not output any features. REMAIN is nothing but repeated
calls to MAIN with shrinking feature sets, making MAIN the heart of this
algorithm.
Pseudo-code of MAIN is provided in Algorithm 1, where H(G) is the right-hand
side of Eq. (2). Given an input feature set $F_t$, MAIN finds
$\tilde{\pi}^*(f)$ using sets of size 2, and finds the set
$G_s=\{f\in F_t:\tilde{\pi}^*(f)>T_1\}$. MAIN adds $G_s$ to $\tilde{G}$,
the set of features in $F_t$ already labeled as good. It then updates
$F_t$ to $F_t\setminus G_s$, and rescales $\tilde{\pi}^*(f)$ of features
$f\in F_t$ so that $\sum_{f\in F_t}\tilde{\pi}^*(f)=2$. Note features in
$\tilde{G}$ are used to compute $\tilde{\pi}^*(f)$ for features
$f\in F_t$, but the $\tilde{\pi}^*(f)$ of features $f\in\tilde{G}$ are
not used in the rescaling of $\sum_{f\in F_t}\tilde{\pi}^*(f)=2$. MAIN
iterates until $G_s=\emptyset$, or $H(G)\le T_2$ for all
$G\subseteq F_t$ with |G|=2.
Not finding new features in MAIN might be due to the remaining good
features being weaker and independent of $\tilde{G}$. Hence, REMAIN
removes $\tilde{G}$ from $F_r$, and feeds MAIN with the updated $F_r$.
This way, features in $\tilde{G}$ are no longer used to compute
$\tilde{\pi}^*(f)$ for any feature $f\in F_r$, thus making it easier to
detect weaker good features that are independent of features already
selected by REMAIN. Pseudo-code of REMAIN is provided in Algorithm 2.
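A compact sketch of the MAIN/REMAIN loop described above follows. It takes the pair score as a callable (for the Gaussian model, the log of H(G), e.g., the log_H function sketched earlier for 2MNC-Robust) together with the thresholds $T_1$ and $T_2$; the data structures and tie handling are our own simplifications, and interpreting the stopping rule as "no new features labeled good" is an assumption.

import itertools
import numpy as np

def main_pass(F_t, log_H, T1, T2):
    """One call to MAIN: return the features of F_t labeled good, in the order they are added.

    log_H : callable mapping a size-2 list of feature indices to log H(G),
            the log of the right-hand side of Eq. (2)."""
    F_t, good = list(F_t), []
    while F_t:
        in_Ft = set(F_t)
        pairs = list(itertools.combinations(sorted(in_Ft | set(good)), 2))
        logs = np.array([log_H(list(p)) for p in pairs])
        inner = [k for k, (i, j) in enumerate(pairs) if i in in_Ft and j in in_Ft]
        if not inner or np.all(logs[inner] <= np.log(T2)):
            break                                  # H(G) <= T2 for every pair inside F_t
        w = np.exp(logs - logs.max())              # un-normalized pair posteriors
        marg = {f: 0.0 for f in F_t}
        for (i, j), wij in zip(pairs, w):
            for f in (i, j):
                if f in marg:                      # marginals are kept only for f in F_t
                    marg[f] += wij
        scale = 2.0 / sum(marg.values())           # rescale so the marginals over F_t sum to 2
        G_s = [f for f in F_t if scale * marg[f] > T1]
        if not G_s:
            break                                  # no new features labeled good
        good += sorted(G_s, key=lambda f: -marg[f])
        F_t = [f for f in F_t if f not in G_s]
    return good

def remain(F, log_H, T1, T2):
    """REMAIN: repeated calls to MAIN, removing already-selected features between calls."""
    F_r, selected = list(F), []
    while True:
        G = main_pass(F_r, log_H, T1, T2)
        if not G:
            return selected
        selected += G
        F_r = [f for f in F_r if f not in G]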
$T_1$ mimics the role of the threshold used in the MNC criterion. Hence,
$T_1\in[0,1]$. Recall that by evaluating sets of size 2 we underestimate
$\tilde{\pi}^*(f)$ when $|\bar{G}|\gg 2$. Therefore, when confident that
$|\bar{G}|\gg 2$, one might opt for smaller values of $T_1$ rather than
values close to 1. As $T_2$ is a threshold over un-normalized posteriors,
H(G), extra care must be taken when setting $T_2$. We suggest $T_2=n$
for high-dimensional feature selection applications, which is a good
rule of thumb based on our simulation results and asymptotic analysis of
H(·).
Note the number of features reported by REMAIN is variable; however,
one can easily obtain close to a desired number of selected features by
tuning T[1]. To illustrate, we provide an example based on the data
generation model used in the “Synthetic microarray simulations”
section, where we assume there are 100 markers, i.e., good features,
and 4900 non-markers, i.e., bad features. We use the synergetic model
with block size k=5 and correlation coefficients ρ[0]=ρ[1]=0.9. Fixing
T[2]=n, Table 1 lists the average number of markers and non-markers
selected over 1000 iterations for n=20 and 100 across different values
of T[1]. REMAIN outputs very few features when T[1] is large, and too
many features when T[1] is extremely low. The best choice for T[1] can
vary greatly from case to case, but one strategy is to choose T[1] so
that REMAIN selects close to a given number of features. For example,
T[1]=0.05 is a good choice in this simulation if one desires
approximately 100 selected features. Another strategy is based on the
number of features selected by REMAIN across various values of T[1].
When T[1] is large, reducing T[1] only slightly increases the output
feature set size, for instance when T[1]>0.05 in this simulation.
However, one might observe a rapid increase in the output size when
slightly reducing T[1], for instance T[1] changing from 0.05 to 0.01 in
this simulation. For such observed patterns, the value at which this
phenomenon occurs might be a desirable choice.
Table 1. Performance of REMAIN for various values of T[1]

Objective          n    T[1]=0.3  T[1]=0.2  T[1]=0.1  T[1]=0.05  T[1]=0.01  T[1]=0.005
Marker found       20   0.9       1.4       2.5       4.6        15.3       26.2
Non-marker found   20   2.0       3.7       9.4       23.3       164.5      403.0
Marker found       100  52.5      56.2      58.2      60.0       68.9       76.1
Non-marker found   100  1.6       3.3       8.8       20.2       124.7      288.5
Posterior factor
Feature selection can be construed as a model selection problem where
each model is a set of good features. Let f be a feature. If f is a
good feature, we expect that adding f to any model G, i.e., a set of
good features, gives
$\tilde{\pi}^*(G\cup\{f\})/\tilde{\pi}^*(G)\gg 1$. If f is a bad
feature, we expect $\tilde{\pi}^*(G\cup\{f\})/\tilde{\pi}^*(G)$ to be
much smaller. Hence, if we average this ratio over a family of models
that do not contain f, and denote the average by β(f), we expect
$\beta(f)\gg 1$ if f is a good feature, and comparable to or smaller
than 1 if f is a bad feature.
$\tilde{\pi}^*(G\cup\{f\})/\tilde{\pi}^*(G)$ is similar, but not
identical, to the Bayes factor encountered in model selection [12],
where here we compare a model with good feature set $G\cup\{f\}$ versus
a model with good feature set G excluding f. The posterior factor,
β(f), averages the approximate posterior ratio across all feature sets
$G\subseteq F\setminus\{f\}$, i.e.,
$$\beta(f)=\frac{1}{|\Omega_f|}\sum_{G\in\Omega_f}\frac{\tilde{\pi}^*(G\cup\{f\})}{\tilde{\pi}^*(G)}, \qquad (3)$$
where $\Omega_f=\{G\subseteq F:f\notin G\}$. As the summation over
$\Omega_f$ is computationally infeasible, we propose the approximate
posterior factor, hereafter denoted by $\tilde{\beta}$:
$$\tilde{\beta}(f)=\frac{1}{|F|-1}\sum_{f'\in F\setminus\{f\}}\frac{\tilde{\pi}^*(\{f,f'\})}{\tilde{\pi}^*(\{f'\})}. \qquad (4)$$
We propose the approximate POsterior FActor-Constrained (POFAC)
algorithm as follows: use $\tilde{\beta}(f)$ to rank features, and pick
the top D features. Note that D is a parameter of the algorithm.
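A sketch of POFAC follows. It takes the un-normalized score of Eq. (2) as a callable over index sets of size 1 or 2 (for instance, a wrapper around the log_H function sketched for 2MNC-Robust above); ranking by ratios of un-normalized scores is equivalent to ranking by Eq. (4) because the common normalization constants cancel. The function name and signature are our own choices.

import numpy as np
from scipy.special import logsumexp

def pofac(log_H, F, D):
    """Rank features by the approximate posterior factor of Eq. (4); return the top D.

    log_H : callable mapping a list of feature indices (size 1 or 2) to the log of the
            un-normalized score on the right-hand side of Eq. (2).
    F     : total number of features."""
    single = np.array([log_H([f]) for f in range(F)])
    log_beta = np.empty(F)
    for f in range(F):
        # log of the average of pi~*({f, f'}) / pi~*({f'}) over all f' != f
        diffs = [log_H([f, fp]) - single[fp] for fp in range(F) if fp != f]
        log_beta[f] = logsumexp(diffs) - np.log(len(diffs))
    return np.argsort(log_beta)[::-1][:D]

For example, with data arrays X0 and X1 one could call pofac(lambda A: log_H(X0, X1, A), X0.shape[1], D), reusing the log_H sketched earlier.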
Sequential partition mustering
Sequential Partition Mustering (SPM) aims to improve feature selection
performance by sequentially detecting good blocks and adding them to
the set of previously selected features. To find a good block, we start
with the most significant feature, i.e., the feature with the largest
$\tilde{\beta}$, and find the block containing it. Note we do not aim to
find the structure of the bad blocks, and the algorithm terminates as
soon as we declare that no more good features remain.
Suppose $u_0$ is the current most significant feature. In order to find
the block containing $u_0$ we propose the Good Seed Grower (GSG)
algorithm, which can be construed as a seed growing algorithm with $u_0$
as the seed. Pseudo-code of GSG is presented in Algorithm 3, where for
any two non-empty disjoint sets $G_1,G_2\subset F$,
$$C_1(G_1,G_2)=\frac{\tilde{\pi}(G_1,G_2)}{1-\tilde{\pi}(G_1,G_2)}\,\frac{Q_0^{G_{12}}Q_1^{G_{12}}}{Q_0^{G_1}Q_1^{G_1}Q_0^{G_2}Q_1^{G_2}}\times\left(\frac{\left|S_0^{G_{12}*}\right|^{\kappa_0^{G_{12}*}}\left|S_1^{G_{12}*}\right|^{\kappa_1^{G_{12}*}}}{\left|S_0^{G_1*}\right|^{\kappa_0^{G_1*}}\left|S_1^{G_1*}\right|^{\kappa_1^{G_1*}}\left|S_0^{G_2*}\right|^{\kappa_0^{G_2*}}\left|S_1^{G_2*}\right|^{\kappa_1^{G_2*}}}\right)^{-0.5},$$
$G_{12}=G_1\cup G_2$, and $\tilde{\pi}(G_1,G_2)$ approximates
$\pi(G_1,G_2)$, the prior probability that at least one of the features
in $G_2$ is not independent of $G_1$. Note
$\pi(G_1,G_2)=\sum_{P\in\mathcal{P}_{G_1,G_2}}\pi(P)$, where
$\mathcal{P}_{G_1,G_2}$ is the family of feature partitions that contain
a block U such that $U\cap G_1\neq\emptyset$ and
$U\cap G_2\neq\emptyset$. At each iteration, GSG finds the feature $u^*$
that maximizes $C_1(U,\{u^*\})$, where U is the currently detected
sub-block of the block containing $u_0$. GSG declares that $u^*$ and U
belong to the same block if $C_1(U,\{u^*\})>T_3$, and adjoins $u^*$ to
U; otherwise, it terminates and declares U as the block containing
$u_0$. Here we assume $T_3=t_1 n^{t_2|U|}$, where $t_1,t_2>0$ are
parameters of GSG. While we have only considered one possible family of
thresholds, we expect this family to be large enough for most practical
purposes.
Pseudo-code of SPM is given in Algorithm 4. Let $F_t$ be the feature
set used by SPM, initialized to $F_t=F$. We start with the most
significant feature $u_0$ and find the block containing it, U. We then
update $F_t$ to $F_t\setminus U$. If $\tilde{\beta}(f)<T_4$ for all
$f\in F_t$, then SPM declares that $F_t$ does not contain any good
features and terminates; otherwise, it picks the most significant
feature of $F_t$ and iterates. Similar to REMAIN, SPM cannot be forced
to output a fixed number of features, but $T_4$ can be used to tune SPM
to output close to a desired number of features. In addition, $t_1$ and
$t_2$ can be used to avoid picking large blocks, which also affects the
output feature set size.
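The sketch below puts GSG and SPM together under the Jeffreys prior with $\tilde{\pi}(G_1,G_2)=0.5$, so that the prior-odds term in $C_1$ vanishes. The Q terms use the multivariate gamma function from SciPy; the helper names, the tie handling, and passing the approximate posterior factor in as a precomputed array are our own illustrative choices, not the authors' implementation.

import numpy as np
from scipy.special import multigammaln

def _block_stats(X):
    """kappa* = n and log|S*| for one block under Jeffreys-style zero hyperparameters."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return n, np.linalg.slogdet(Xc.T @ Xc)[1]

def _log_q(n_y, d):
    """log Q_y^A with Jeffreys hyperparameters (K = L = 1, kappa* = nu* = n_y)."""
    return 0.5 * n_y * d * np.log(2) + multigammaln(0.5 * n_y, d) + 0.5 * d * np.log(2 * np.pi / n_y)

def log_C1(X0, X1, G1, G2):
    """log C_1(G1, G2) with prior odds 1 (pi~(G1, G2) = 0.5)."""
    out = 0.0
    for A, sign in [(G1 + G2, +1), (G1, -1), (G2, -1)]:
        d = len(A)
        for X in (X0[:, A], X1[:, A]):
            k, ld = _block_stats(X)
            out += sign * (_log_q(X.shape[0], d) - 0.5 * k * ld)
    return out

def gsg(X0, X1, seed, candidates, t1, t2):
    """Good Seed Grower: grow the block containing `seed` from `candidates`."""
    n = X0.shape[0] + X1.shape[0]
    U, rest = [seed], [f for f in candidates if f != seed]
    while rest:
        scores = [log_C1(X0, X1, U, [u]) for u in rest]
        best = int(np.argmax(scores))
        log_T3 = np.log(t1) + t2 * len(U) * np.log(n)   # log of T3 = t1 * n^(t2*|U|)
        if scores[best] <= log_T3:
            break
        U.append(rest.pop(best))
    return U

def spm(X0, X1, beta, t1, t2, T4):
    """SPM: repeatedly seed on the most significant feature and grow its block.

    beta and T4 must be on the same scale (e.g., both the log posterior factor)."""
    F_t = list(np.argsort(beta)[::-1])                  # features ordered by posterior factor
    blocks = []
    while F_t and beta[F_t[0]] >= T4:                   # stop when no remaining feature looks good
        U = gsg(X0, X1, F_t[0], F_t, t1, t2)
        blocks.append(U)
        F_t = [f for f in F_t if f not in U]
    return blocks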
We again provide an example based on the simulation we did for REMAIN.
We let t[1]=10^1,10^2,⋯,10^5,t[2]=1,2,3,4,5, and T[4]=10^2,10^4,10^6,
and consider all combinations to construct a very wide range of
parameters. Figures 1 and 2 illustrate how the parameters of SPM
affect its performance for n=20 and 100, respectively. While low
thresholds mislabel more non-markers as good features, they correctly
label more markers compared with large thresholds. When n=20, in order
to correctly label at least 10 markers on average, at least 50
non-markers are mislabeled, and to mislabel at most 5 non-markers on
average, one cannot correctly detect more than 5 markers. On the other
hand, when n=100, one can simultaneously correctly label at least 80
markers and mislabel at most 10 non-markers for almost all parameters.
Moving from the lowest parameter values to the highest, we observe that
the average output feature set size varies from approximately 400 to
less than 5 when n=20, while it only varies from 120 to 70 when n=100.
Hence, SPM is less sensitive to the choice of parameters when n=100
than when n=20. Suppose
n=100 and one aims to find most of the markers and few non-markers.
Many parameter choices achieve this goal. In addition, within the range
of such parameters, the number of features labeled as good does not
change much under slight variation of the parameters. Hence, for a
fixed sample, one can implement SPM over a wide range of parameters,
find the range where the output feature set size does not vary much
when the parameters are varied slightly, and pick a value in that
region that outputs a feature set with close to the desired number of
features.
Fig. 1 Performance of SPM for various values of t[1], t[2], T[4], and n=20.
Average number of markers labeled as good for a T[4]=10^2, b T[4]=10^4,
and c T[4]=10^6. Average number of non-markers labeled as good for d
T[4]=10^2, e T[4]=10^4, and f T[4]=10^6
Fig. 2 Performance of SPM for various values of t[1], t[2], T[4], and n=100.
Average number of markers labeled as good for a T[4]=10^2, b T[4]=10^4,
and c T[4]=10^6. Average number of non-markers labeled as good for d
T[4]=10^2, e T[4]=10^4, and f T[4]=10^6
Simulations
We compare the performance of the proposed algorithms with many popular
feature selection algorithms over a Bayesian setting, and a synthetic
microarray model introduced in [13] and extended in [8, 9].
Bayesian simulation
In this simulation we assume |F|=4100 and $|\bar{G}|=100$. We assume
there is 1 good block for each of the following sizes: 10, 20, 30, and
40. We also assume there are 20 bad blocks for each of the following
sizes: 5, 10, 15, 20, 50, and 100. We first randomly assign
each feature to a block such that the assumed block structure is
satisfied, effectively constructing $\bar{P}$. Afterwards, distribution
parameters are randomly drawn from the following NIW prior. For each
good block A, we have
$S_0^A=S_1^A=0.5\times I_{|A|\times|A|}$,
$\kappa_0^A=\kappa_1^A=|A|+2$, $m_0^A=m_1^A=0$, and
$\nu_0^A=\nu_1^A=4$, where I is the identity matrix. Also, for a bad
block A, we have $S^A=0.5\times I_{|A|\times|A|}$, $\kappa^A=|A|+2$,
$m^A=0$, and $\nu^A=4$. Given distribution
parameters, a stratified sample of size n with equal points in each
class is drawn. The following feature selection methods declare the set
of good features: t-test, Bhattacharyya Distance (BD), Mutual
Information (MI) using the non-parametric method of [14] with
spacing parameter m=1, Sequential Forward Search using the bolstered
error estimate [15] of Regularized Linear Discriminant Analysis
applied to the top 300 features of BD (SFS-RLDA), FOrward selection
using the Hilbert-Schmidt Independence Criterion (FOHSIC) [16] applied
to the top 300 features of BD, CMNC-OBF, 2MNC-Robust, REMAIN, POFAC,
and SPM. Note t-test, MI, BD, and CMNC-OBF are filter methods. All
methods except REMAIN and SPM output $|\bar{G}|$ features. CMNC-OBF
assumes the events $\{f\in\bar{G}\}$ are independent and
$P(f\in\bar{G})$ is constant for all f∈F. 2MNC-Robust and REMAIN assume
$\tilde{\pi}(G)$ is uniform over all sets of size 2, and zero otherwise.
POFAC assumes $\tilde{\pi}(G)$ is uniform over all sets of size 1 and 2.
Finally, SPM assumes $\tilde{\pi}(G_1,G_2)=0.5$ for all sets
$G_1,G_2\subseteq F$, and uses the same $\tilde{\pi}(G)$ as POFAC to
compute $\tilde{\beta}(f)$. Bayesian algorithms use proper priors with hyperparameters of the
same form given previously (PP), and Jeffreys non-informative prior
(JP), where for each set A, $S_y^A$ and $S^A$ are zero matrices,
$K_y^A=K^A=L_y^A=L^A=1$, and $\kappa_y^A=\kappa^A=\nu_y^A=\nu^A=0$. With
$\nu_y^A=\nu^A=0$ we do not need to specify $m_y^A$ and $m^A$. We use
T[1]=0.3 and T[2]=n for REMAIN using both PP and JP.
For SPM-PP we set t[1]=100, t[2]=0.5, and T[4]=100n^2, which resulted
in adequate performance among all sample sizes. When using SPM-JP we
use the same t[1] and T[4], but set t[2]=1 to avoid picking large
blocks. This process iterates 1000 times.
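As an illustration of this sampling scheme, the snippet below draws the mean and covariance of one block from the NIW prior just described (S = 0.5 I, κ = |A| + 2, m = 0, ν = 4) and then draws a stratified sample for that block. It relies on scipy.stats.invwishart and is a sketch of the data-generation step only, with function names of our own choosing.

import numpy as np
from scipy.stats import invwishart

def draw_block_params(block_size, rng):
    """Draw (mu, Sigma) from the NIW prior used here: S = 0.5*I, kappa = |A|+2, m = 0, nu = 4."""
    d = block_size
    Sigma = np.atleast_2d(invwishart.rvs(df=d + 2, scale=0.5 * np.eye(d), random_state=rng))
    mu = rng.multivariate_normal(np.zeros(d), Sigma / 4.0)   # nu = 4
    return mu, Sigma

def sample_block(n_per_class, block_size, good, rng):
    """Stratified sample for one block: class-specific parameters if good, shared if bad."""
    mu0, S0 = draw_block_params(block_size, rng)
    mu1, S1 = draw_block_params(block_size, rng) if good else (mu0, S0)
    X0 = rng.multivariate_normal(mu0, S0, size=n_per_class)
    X1 = rng.multivariate_normal(mu1, S1, size=n_per_class)
    return X0, X1

rng = np.random.default_rng(0)
X0, X1 = sample_block(n_per_class=10, block_size=5, good=True, rng=rng)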
Figure 3 plots the average number of correctly labeled features as
sample size increases from 10 to 100 in steps of 10. SPM-PP has the
best performance; however, SPM-JP experiences a sharp drop under small
sample sizes. For larger sample sizes, POFAC-PP performs second only to
SPM-PP. However, POFAC-JP outperforms SPM-JP. REMAIN adequately
balances performance across all sample sizes. All proposed algorithms,
except SPM-JP, outperform 2MNC-Robust, CMNC-OBF, and other feature
selection algorithms. SPM-JP outperforms previous algorithms if sample
size is not very small.
Fig. 3 Performance of various feature selection algorithms under Gaussian
data. Average number of correctly labeled features versus sample size
for randomly generated parameters for (a) 2MNC-Robust-PP,
2MNC-Robust-JP, POFAC-PP, POFAC-JP, REMAIN-PP, REMAIN-JP, SPM-PP,
SPM-JP, (b) CMNC-OBF-PP, CMNC-OBF-JP, t-test, FOHSIC, MI, BD, and
SFS-RLDA
In this simulation filter methods were the fastest, with comparable
computation times, and FOHSIC was the most computationally intensive
method. A comparison of run-times for this specific simulation is
provided in Table 2, taking the run-time of 2MNC-Robust as the unit of
time. Parallel processing can be used to speed up these algorithms, for
instance, in the 4th step of GSG, and to compute $\tilde{\pi}^*(G)$
in 2MNC-Robust and POFAC. Although SPM is a sequential algorithm, its
bottleneck is step 4 of GSG, making parallel processing a good strategy
to substantially speed up SPM.
Table 2. Run-time comparison of the Bayesian simulation

Algorithm  Filter   2MNC-Robust  REMAIN  POFAC  SPM  SFS-RLDA  FOHSIC
Time       < 10^−3  1            2       1.05   1.5  10        15
Synthetic microarray simulations
Here an extended version of a synthetic model developed to mimic
microarrays is used to generate data. The original model was introduced
in [13], and has been extended in [8, 9]. In these models
features are markers or non-markers. Markers are either global or
heterogeneous. Global markers (GM) are homogeneous within each class.
Heterogeneous Markers (HM) comprise c subclasses, where for each
specific set of heterogeneous markers, a specific subset of the
training sample has a different distribution than markers in class 0,
and the remaining sample points have the same distribution as class 0.
Markers comprise blocks of size k, where each block in class y is
Gaussian with mean μ[y] and covariance σ[y]Σ[y]. Diagonal elements of
Σ[y] are 1 and off-diagonal elements are ρ[y]. The original model of
[13] forced ρ[0]=ρ[1]. We also have μ[0] = [ 0,⋯,0]. There are
three types of markers according to their mean in class 1: redundant,
synergetic, and marginal, with μ[1] being [ 1,⋯,1], [ 1,1/2,⋯,1/k], and
[ 1,0,⋯,0], respectively. Non-markers are either Low Variance (LV) or
High Variance (HV). In the original model LV non-markers are
independent, each with a Gaussian distribution, N(0,σ[0]). However, in
the extended model of [8, 9], similar to markers in class 0, LV
non-markers comprise blocks of size k, where in each block features are
jointly Gaussian with mean μ[0] and covariance σ[0]Σ[0]. HV non-markers
are independent with marginal distribution pN(0,σ[0])+(1−p)N(1,σ[1]),
where p is drawn from the uniform distribution over [ 0,1].
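To make the data model concrete, the sketch below generates one (global) marker block and a set of HV non-markers; heterogeneous markers and LV non-marker blocks follow the same pattern and are omitted for brevity. We treat σ[y] as a variance, and the function names and the use of NumPy's Generator API are our own choices.

import numpy as np

def equicorr(k, rho):
    """k x k correlation matrix with unit diagonal and off-diagonal rho."""
    return (1.0 - rho) * np.eye(k) + rho * np.ones((k, k))

def sample_marker_block(n_per_class, k, rho0, rho1, sigma0, sigma1, rng, mean_type="synergetic"):
    """One block of global markers: class 0 ~ N(0, sigma0*Sigma0), class 1 ~ N(mu1, sigma1*Sigma1)."""
    mu1 = {"redundant": np.ones(k),
           "synergetic": 1.0 / np.arange(1, k + 1),          # [1, 1/2, ..., 1/k]
           "marginal": np.eye(1, k).ravel()}[mean_type]      # [1, 0, ..., 0]
    X0 = rng.multivariate_normal(np.zeros(k), sigma0 * equicorr(k, rho0), size=n_per_class)
    X1 = rng.multivariate_normal(mu1, sigma1 * equicorr(k, rho1), size=n_per_class)
    return X0, X1

def sample_hv_nonmarkers(n, n_features, sigma0, sigma1, rng):
    """Independent HV non-markers: p*N(0, sigma0) + (1-p)*N(1, sigma1), p ~ U[0, 1] per feature."""
    p = rng.uniform(size=n_features)
    from_first = rng.uniform(size=(n, n_features)) < p       # component indicator per point
    return np.where(from_first,
                    rng.normal(0.0, np.sqrt(sigma0), size=(n, n_features)),
                    rng.normal(1.0, np.sqrt(sigma1), size=(n, n_features)))

rng = np.random.default_rng(0)
X0, X1 = sample_marker_block(n_per_class=10, k=5, rho0=0.9, rho1=0.9,
                             sigma0=0.25, sigma1=0.64, rng=rng)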
We assume |F|=5000,|GM|=20,|HM|=80,|HV|=2000, and c=2. We consider all
possible combinations of the following parameters: all 3 mean types,
k=5,10,20, and ρ[0],ρ[1]=0.1,0.5,0.9. We also consider the “large and
unequal variance” setting of Table 1 in [13], which sets σ[0]=0.25
and σ[1]=0.64. Given each set of distribution parameters, we randomly
assign features to blocks of global markers, heterogeneous markers, and
LV non-markers. The remaining features comprise the independent HV
non-markers. We generate a stratified sample of size n with equal
points in each class. The following algorithms are used to declare the
set of good features: t-test, BD, MI, SFS-RLDA, CMNC-OBF, 2MNC-Robust,
REMAIN, POFAC, and SPM. We removed FOHSIC due to its computation cost.
All Bayesian algorithms use JP. We use the thresholds of the Bayesian
simulation, except we set T[1]=0.05. One can tune T[3] and T[4] for one
of the 81 possible settings, or for a specific sample size, but this can
affect the performance in other settings. We picked the thresholds of
the Bayesian simulation as they provided satisfactory performance for
large sample sizes. This process iterates 500 times. For each set of
distribution parameters we define performance as the average number of
markers identified as good plus the average number of non-markers
identified as bad. Figure 4 plots the average and worst case
performance for each fixed mean type across other distribution
parameters as sample size increases from 10 to 100 in steps of 10.
Bayesian methods tend to outperform non-Bayesian methods. While simpler
methods such as CMNC-OBF outperform more complicated methods when
sample size is small, complicated methods such as SPM have superior
performance when sample size is large.
Fig. 4 Average and worst case performance of feature selection algorithms.
Average and worst case performance are obtained using 27 combinations
of the synthetic microarray model parameters k, ρ[0] and ρ[1] with
fixed mean type, where performance is defined to be the average number
of markers identified as good plus the average number of non-markers
identified as bad over 500 iterations: a average performance for
redundant means, b worst case performance for redundant means, c
average performance for synergetic means, d worst case performance for
synergetic means, e average performance for marginal means, and f worst
case performance for marginal means
For small sample sizes, or cases where correctly labeling good features
is more difficult, REMAIN tends to output very few features, resulting
in very good performance, in contrast to methods that are forced to
output $|\bar{G}|=100$
features. OBF can be implemented with the MNC objective instead of CMNC
to enjoy this characteristic of REMAIN. SPM seems to have the most
diverse behavior. While it performs worse than all other feature
selection algorithms when sample size is very small, it tends to
outperform all other methods for larger sample sizes. In order for the quantities used
in SPM to be well-defined under JP, sample size must be larger than the
block size. Hence, under small samples SPM with JP tends to break good
blocks into smaller blocks, thereby losing some of its ability to
identify weak good features with strong dependencies, and making it
more prone to detecting blocks incorrectly. Also note that we have used
the same parameters for SPM across all data models and sample sizes,
and performance is expected to improve if t[1],t[2] and T[4] are
calibrated each time it is run.
POFAC is an interesting option, enjoying competitive performance across
all sample sizes. It outperforms 2MNC-Robust while its computation cost
is only slightly larger. CMNC-OBF tends to select individually strong
markers, i.e., markers with class 1 mean far from 0. CMNC-OBF performs
very similarly to BD in this simulation, with their performance graphs
almost overlapping.
Figure 5 plots average performance for fixed class-conditioned
correlation coefficients across other distribution parameters. Simpler
methods outperform more complicated algorithms when sample size is
small, and REMAIN enjoys outstanding performance for small sample sizes
by reporting very few features. REMAIN has difficulty detecting weak
markers, i.e., heterogeneous markers with class 1 mean close to 0, as
its performance improves very little with each 10-point increase in
sample size at larger sample sizes. Average performance with respect to
sample size for each of the 81 possible data generation settings is
provided in the supplementary [see Additional file 1].
Fig. 5 Average performance of feature selection algorithms. Average
performance is obtained using 9 combinations of the synthetic
microarray model parameters k and mean type with fixed ρ[0] and ρ[1],
where performance is defined to be the average number of markers
identified as good plus the average number of non-markers identified as
bad over 500 iterations: a ρ[0]=0.1,ρ[1]=0.1, b ρ[0]=0.5,ρ[1]=0.1, c
ρ[0]=0.9,ρ[1]=0.1, d ρ[0]=0.1,ρ[1]=0.5, e ρ[0]=0.5,ρ[1]=0.5, f
ρ[0]=0.9,ρ[1]=0.5, g ρ[0]=0.1,ρ[1]=0.9, h ρ[0]=0.5,ρ[1]=0.9, and i
ρ[0]=0.9,ρ[1]=0.9
While correctly labeling more features tends to result in lower
classification error, maximizing the average number of correctly
labeled features does not necessarily minimize classification error
[see Additional file 1]. An example can be seen in the supplementary,
where we examine the prediction error of several popular classifiers
with feature selection on the synthetic microarray model
[see Additional file 1].
Results
We apply CMNC-OBF, POFAC, REMAIN, and SPM with the same priors used for
synthetic microarray simulations to cancer microarray datasets, select
the top genes, and perform enrichment analysis. We list the top 5 genes
selected by CMNC-OBF, POFAC, and REMAIN. The top 100 genes are provided
in the supplementary [see Additional file 1]. REMAIN ranks genes as
follows. In each call to MAIN, we rank the genes of $\tilde{G}$ by the
order in which they are added to $\tilde{G}$, and if several genes are
added at once in step 5 of MAIN, they are ranked based on
$\tilde{\pi}^*(f)$. In addition, the $\tilde{G}$'s are ranked by the
order in which they are obtained using consecutive calls to
the MAIN subroutine. Note SPM outputs a set of feature blocks, not a
feature ranking. Studying blocks of SPM might provide invaluable
information about the underlying biological mechanisms of the disease
under study, but we leave this for future work.
We perform enrichment analysis using PANTHER [17, 18]. The top
20 enriched pathways are reported in the supplementary [see Additional
file 1]. We list their names, the number of known genes in each
pathway, the number of selected genes that belong to the pathway, and
the corresponding p-value. Here we only list the top 3 pathways and
their p-values. We examine whether any of the top genes and pathways are already
suggested to be involved in the cancer under study. The complete
analysis, with references that suggest involvement of the top reported