Abstract

Background
Many bioinformatics studies aim to identify markers, or features, that can be used to discriminate between distinct groups. In problems where strong individual markers are not available, or where interactions between gene products are of primary interest, it may be necessary to consider combinations of features as a marker family. To this end, recent work proposes a hierarchical Bayesian framework for feature selection that places a prior on the set of features we wish to select and on the label-conditioned feature distribution. While an analytical posterior under Gaussian models with block covariance structures is available, the optimal feature selection algorithm for this model remains intractable since it requires evaluating the posterior over the space of all possible covariance block structures and feature-block assignments. To address this computational barrier, in prior work we proposed a simple suboptimal algorithm, 2MNC-Robust, with robust performance across the space of block structures. Here, we present three new heuristic feature selection algorithms.

Results
The proposed algorithms outperform 2MNC-Robust and many other popular feature selection algorithms on synthetic data. In addition, enrichment analysis on real breast cancer, colon cancer, and leukemia data indicates that they also output many of the genes and pathways linked to the cancers under study.

Conclusions
Bayesian feature selection is a promising framework for small-sample, high-dimensional data, in particular biomarker discovery applications. When applied to cancer data, these algorithms output many genes already shown to be involved in cancer as well as potentially new biomarkers. Furthermore, one of the proposed algorithms, SPM, outputs blocks of heavily correlated genes, which is particularly useful for studying gene interactions and gene networks.

Electronic supplementary material
The online version of this article (10.1186/s12859-018-2059-8) contains supplementary material, which is available to authorized users.

Keywords: Feature selection, Bayesian learning, Biomarker discovery, Heuristic search algorithms

Background
Many bioinformatics studies aim to identify predictive biomarkers that can be used to establish diagnosis or prognosis, or to predict a drug response [1–3]. This problem can often be framed as a feature selection task, where the goal is to identify a list of features (molecular biomarkers) that can discriminate between groups of interest based on high-dimensional data from microarray, RNA-seq, or other high-throughput technologies. Initially, exploratory studies are often conducted on small samples to generate a shortlist of biomarker candidates before a large-sample validation study is performed [4]. However, such studies have too often been unsuccessful at producing reliable and reproducible biomarkers [5]. Biomarker discovery is inherently difficult, given the large number of features, the highly complex interactions between genes and gene products, the enormous variety of dysfunctions that can occur, and the many sources of error in the data. As a result, feature selection algorithms are often implemented without much consideration of the particular demands of the problem. For instance, variants of the t-test are perhaps the most widely implemented selection strategies in bioinformatics, but they can only detect strong individual features and fail to take correlations into account.
Given that molecular signaling is often inherently multivariate, there is a need for methods that can account for correlations and extract combinations of features as a marker family. Wrapper methods do this by ranking sets of features according to some objective function, usually the error of a classifier. However, methods based on classifier error are computationally expensive and may not necessarily produce the best markers; indeed, strong features can be excluded if they are correlated with other strong features. Furthermore, analysis downstream from feature selection may include gene set enrichment analysis, where the hope is to identify known pathways or other biological mechanisms that contain a statistically significant number of genes in the reported gene set, or may involve the development of new pathways and gene networks. We are thus motivated to develop methods that not only select markers useful for discrimination, but select all relevant markers, even individually weak ones.

To address this, in prior work we proposed a hierarchical Bayesian framework for feature selection, labeling features as “good” or “bad”, where good features are those we wish to select, i.e., biomarkers. This framework places a prior on the set of good features and the underlying distribution parameters. Three Gaussian models have been considered. Under independent features, Optimal Bayesian Filtering reports a feature set of a given size with a maximal expected number of truly good features (CMNC-OBF) [6]. Assuming fully dependent good features and independent bad features, 2MNC-DGIB is a fast suboptimal method that ranks features by evaluating all sets of size 2 [7]. Finally, assuming good and bad features are separately dependent, 2MNC-Robust proposes an approximation of the posterior on good features and uses a ranking strategy similar to 2MNC-DGIB to select features [8]. While 2MNC-DGIB has outstanding performance when its assumptions are satisfied [7], it performs poorly when bad features are dependent [9]. On the other hand, CMNC-OBF and 2MNC-Robust have been shown to have robust performance across Bayesian models with block-diagonal covariances [9]. CMNC-OBF is extremely fast and enjoys particularly excellent performance when markers are individually strong with low correlations, but, like all filter methods, it may miss weak features that are of interest due to high correlations with strong features [6, 9]. 2MNC-Robust is computationally very manageable and generally improves upon CMNC-OBF in the presence of correlations.

Although CMNC-OBF and 2MNC-Robust are robust to different block-diagonal covariance structures, they do not attempt to detect these underlying structures, and their assumptions and approximations constrain performance. Thus, in this work we propose three new feature selection algorithms that: (1) use an iterative strategy to update the approximate posterior used in 2MNC-Robust, (2) use a novel scoring function inspired by Bayes factors to improve overall rankings, and (3) attempt to detect the underlying block structure of the data. We show that these algorithms have computation time comparable to 2MNC-Robust, while outperforming 2MNC-Robust and many other popular feature selection algorithms on a synthetic Bayesian model assuming block-diagonal covariance matrices, and on a synthetic microarray data model.
Finally, we apply the proposed algorithms and CMNC-OBF to breast cancer, colon cancer, and AML datasets, and perform enrichment analysis on each to address validation.

Feature selection model
We review a hierarchical Bayesian model that serves as a reference for the approximate posterior developed in 2MNC-Robust [8, 9] and will be used in the algorithms we present in the next section. Consider a binary feature selection problem with class labels y = 0, 1. Let F be the set of feature indices. Assume features are partitioned into blocks, where features in each block are dependent, but features in different blocks are independent. Assume each block is either good or bad. A good block has different class-conditioned distributions between the two classes, while a bad block has the same distribution in both classes. We denote a partitioning of F into good and bad blocks by P = (P_G, P_B), and hereafter call it a feature partition, where P_G = {G_1, …, G_u} is the set of u good blocks and P_B = {B_1, …, B_v} is the set of v bad blocks. Furthermore, denote the set of all features in good blocks as good features, G = ∪_{i=1}^{u} G_i, and denote all features in bad blocks as bad features, B = ∪_{j=1}^{v} B_j. Denote the random feature partition by P̄ = (P̄_G, P̄_B), the random set of good features by Ḡ, and the random set of bad features by B̄. We define π(P) = P(P̄ = P) to be the prior distribution on P̄.

Let P be fixed. Let θ^P be the parameter describing the joint feature distribution of P. Since blocks are independent of each other we can write θ^P = (θ_0^{G_1}, …, θ_0^{G_u}, θ_1^{G_1}, …, θ_1^{G_u}, θ^{B_1}, …, θ^{B_v}), where θ_y^{G_i} parametrizes class-y features in G_i, and θ^{B_j} parametrizes features in B_j. Assume the θ_y^{G_i} and θ^{B_j} are independent given P, i.e., π(θ^P) = ∏_{i=1}^{u} π(θ_0^{G_i}) π(θ_1^{G_i}) ∏_{j=1}^{v} π(θ^{B_j}). Given a training set of n independent and identically distributed (i.i.d.) points, with n_y points in class y, we have π*(θ_y^{G_i}) ∝ π(θ_y^{G_i}) f(x̃_y^{G_i} | θ_y^{G_i}) and π*(θ^{B_j}) ∝ π(θ^{B_j}) f(x̃^{B_j} | θ^{B_j}), where π*(·) denotes posterior, x̃_y^{G_i} and x̃^{B_j} are the class-y points in G_i and the points in B_j, respectively, and f(x̃_y^{G_i} | θ_y^{G_i}) and f(x̃^{B_j} | θ^{B_j}) are the likelihoods. Following steps in [7, 10], we have

\[
\pi^*(P) \propto \pi(P) \prod_{i=1}^{u} \int f(\tilde{x}_0^{G_i} \mid \theta_0^{G_i})\, \pi(\theta_0^{G_i})\, d\theta_0^{G_i} \int f(\tilde{x}_1^{G_i} \mid \theta_1^{G_i})\, \pi(\theta_1^{G_i})\, d\theta_1^{G_i} \prod_{j=1}^{v} \int f(\tilde{x}^{B_j} \mid \theta^{B_j})\, \pi(\theta^{B_j})\, d\theta^{B_j}. \tag{1}
\]

In addition, the marginal posterior of a feature set G is π*(G) = Σ_{P : ∪P_G = G} π*(P), and the marginal posterior of a feature f is π*(f) = Σ_{G : f ∈ G} π*(G). Note π*(f) is different from π*({f}).

Gaussian model
Here we solve Eq. (1) for jointly Gaussian features. We assume for a block A, θ_y^A = [μ_y^A, Σ_y^A] and θ^A = [μ^A, Σ^A], where μ_y^A and μ^A are the mean vectors, and Σ_y^A and Σ^A are the covariance matrices. Let P be a feature partition. Suppose A is a good block of P. Assume π(θ_y^A) is Normal-Inverse-Wishart (NIW). Hence, π(θ_y^A) = π(Σ_y^A) π(μ_y^A | Σ_y^A), where

\[
\pi(\Sigma_y^A) = K_y^A \, |\Sigma_y^A|^{-\frac{\kappa_y^A + |A| + 1}{2}} \exp\!\left( -0.5\, \mathrm{Tr}\!\left( S_y^A (\Sigma_y^A)^{-1} \right) \right),
\]
\[
\pi(\mu_y^A \mid \Sigma_y^A) = L_y^A \, |\Sigma_y^A|^{-0.5} \exp\!\left( -0.5\, \nu_y^A \left( \mu_y^A - m_y^A \right)^T (\Sigma_y^A)^{-1} \left( \mu_y^A - m_y^A \right) \right),
\]

where for a matrix |·| denotes determinant. S_y^A, κ_y^A, m_y^A, and ν_y^A are hyperparameters, which are assumed given and fixed. S_y^A is an |A| × |A| matrix, where for a set |·| denotes cardinality. For a proper prior, S_y^A is symmetric and positive-definite, and κ_y^A > |A| − 1.
If κ_y^A > |A| + 1, then E[Σ_y^A] = S_y^A / (κ_y^A − |A| − 1). Furthermore, m_y^A is an |A| × 1 vector describing the average mean of the features, and for a proper prior we need ν_y^A > 0. K_y^A and L_y^A represent the relative weights of each distribution. For a proper distribution we have

\[
K_y^A = |S_y^A|^{0.5\kappa_y^A}\, 2^{-0.5\kappa_y^A |A|} \big/ \Gamma_{|A|}\!\left(0.5\kappa_y^A\right),
\qquad
L_y^A = \left( 2\pi / \nu_y^A \right)^{-0.5|A|},
\]

where Γ_d denotes the multivariate gamma function. Since the NIW is a conjugate prior of the Gaussian distribution, given the sample, π*(θ_y^A) is again NIW with updated hyperparameters: κ_y^{A*} = κ_y^A + n_y, ν_y^{A*} = ν_y^A + n_y, m_y^{A*} = (ν_y^A m_y^A + n_y μ̂_y^A) / ν_y^{A*}, and

\[
S_y^{A*} = S_y^A + (n_y - 1)\, \hat{\Sigma}_y^A + \frac{\nu_y^A n_y}{\nu_y^A + n_y} \left( \hat{\mu}_y^A - m_y^A \right) \left( \hat{\mu}_y^A - m_y^A \right)^T,
\]

where μ̂_y^A and Σ̂_y^A are the sample mean and covariance of the class-y points in A, respectively [11]. Now suppose A is a bad block. We assume the prior on θ^A is NIW with hyperparameters S^A, κ^A, m^A, and ν^A, and relative weights K^A and L^A. Given the sample, π*(θ^A) is NIW with κ^{A*} = κ^A + n, ν^{A*} = ν^A + n, m^{A*} = (ν^A m^A + n μ̂^A) / ν^{A*}, and

\[
S^{A*} = S^A + (n - 1)\, \hat{\Sigma}^A + \frac{\nu^A n}{\nu^A + n} \left( \hat{\mu}^A - m^A \right) \left( \hat{\mu}^A - m^A \right)^T,
\]

where μ̂^A and Σ̂^A are the sample mean and covariance of all points in A, respectively [11]. As long as π*(P) is proper, using the normalization constant of the NIW distribution to compute the integrals in Eq. (1) we have

\[
\pi^*(P) \propto \pi(P) \prod_{i=1}^{u} \frac{Q_0^{G_i} Q_1^{G_i}}{\left| S_0^{G_i *} \right|^{0.5\kappa_0^{G_i *}} \left| S_1^{G_i *} \right|^{0.5\kappa_1^{G_i *}}} \times \prod_{j=1}^{v} \frac{Q^{B_j}}{\left| S^{B_j *} \right|^{0.5\kappa^{B_j *}}},
\]

where

\[
Q_y^A = K_y^A L_y^A \, 2^{0.5\kappa_y^{A*}|A|}\, \Gamma_{|A|}\!\left(0.5\kappa_y^{A*}\right) \left( 2\pi / \nu_y^{A*} \right)^{0.5|A|},
\qquad
Q^A = K^A L^A \, 2^{0.5\kappa^{A*}|A|}\, \Gamma_{|A|}\!\left(0.5\kappa^{A*}\right) \left( 2\pi / \nu^{A*} \right)^{0.5|A|}.
\]

Assuming: (1) π(P) is such that the block structure, i.e., the number and size of good and bad blocks, is fixed, (2) for each good block A, K_y^A, L_y^A, κ_y^A, and ν_y^A do not depend on the feature indices in A, and (3) for each bad block A, K^A, L^A, κ^A, and ν^A do not depend on the feature indices in A,

\[
\pi^*(P) \propto \pi(P) \left[ \prod_{i=1}^{u} \left| S_0^{G_i *} \right|^{-\kappa_0^{G_i *}} \left| S_1^{G_i *} \right|^{-\kappa_1^{G_i *}} \prod_{j=1}^{v} \left| S^{B_j *} \right|^{-\kappa^{B_j *}} \right]^{0.5}.
\]

Methods
Here we describe the set selection methods used. Note we aim to find the set of true good features, rather than the true underlying feature partition. The Maximum Number Correct (MNC) criterion [7] outputs the set maximizing the expected number of correctly labeled features, and the Constrained MNC (CMNC) criterion outputs the set with the maximum expected number of correctly labeled features constrained to having exactly D selected features, where D is a parameter of the optimization problem [9]. The solution of MNC is {f ∈ F : π*(f) > 0.5} [7] and the solution of CMNC is to pick the top D features with the largest π*(f) [9]. Therefore, both MNC and CMNC require computing π*(f) for all f ∈ F, which is not computationally feasible for an arbitrary block structure unless |F| is very small. We review two previously proposed algorithms, OBF and 2MNC-Robust, and then present three new algorithms.
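The quantity |S_y^{A*}|^{κ_y^{A*}} above is the data-dependent ingredient that every algorithm below reuses. The following is a minimal sketch of the conjugate NIW update and the corresponding log block score; it is ours, not the authors' implementation, assumes NumPy, and the function names niw_update and log_block_score are our own.

```python
import numpy as np

def niw_update(X, S, kappa, m, nu):
    """Conjugate NIW update for one block.

    X is an (n, d) array of sample points restricted to the block; S, kappa,
    m, nu are the prior hyperparameters defined in the text. Returns the
    updated hyperparameters (S*, kappa*, m*, nu*)."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    scatter = (X - xbar).T @ (X - xbar)        # (n - 1) * sample covariance
    kappa_star = kappa + n
    nu_star = nu + n
    m_star = (nu * m + n * xbar) / nu_star
    diff = (xbar - m).reshape(-1, 1)
    S_star = S + scatter + (nu * n / (nu + n)) * (diff @ diff.T)
    return S_star, kappa_star, m_star, nu_star

def log_block_score(X, S, kappa, m, nu):
    """log |S*|^(-0.5 kappa*): the data-dependent part of a block's evidence
    when the Q terms do not depend on the feature indices."""
    S_star, kappa_star, _, _ = niw_update(X, S, kappa, m, nu)
    return -0.5 * kappa_star * np.linalg.slogdet(S_star)[1]
```

Under the simplifying assumptions above, log π*(P) is then, up to an additive constant, log π(P) plus the sum of log_block_score over the class-0 and class-1 data of each good block and the pooled data of each bad block.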
Optimal Bayesian filter
Optimal Bayesian Filter (OBF) assumes all blocks have size one, i.e., all features are independent, and assumes the events {f ∈ Ḡ} are independent a priori. In this case π*(f) can be found in closed form with little computation cost [6, 9]. OBF is optimal under its modeling assumptions. As argued in [9], in the presence of correlation OBF is a robust suboptimal algorithm that can detect individually strong good features, i.e., those whose mean and/or variance is very different between the two classes, but it cannot take advantage of correlations to correctly label individually weak good features, those whose mean and variance are similar in both classes.

2MNC-Robust
The 2MNC algorithm [7] suggests approximating π*(f) using π*(G) for all sets G such that |G| = 2, and picking the top D features. Since finding π*(G) for all feature partitions where |∪P_G| = 2 is typically infeasible, an approximate posterior, π̃*(G), is proposed [8], where for all G ⊆ F of size 2,

\[
\tilde{\pi}^*(G) \propto \tilde{\pi}(G) \, \frac{\displaystyle \int f(\tilde{x}_0^{G} \mid \theta_0^{G})\, \pi(\theta_0^{G})\, d\theta_0^{G} \int f(\tilde{x}_1^{G} \mid \theta_1^{G})\, \pi(\theta_1^{G})\, d\theta_1^{G}}{\displaystyle \int f(\tilde{x}^{G} \mid \theta^{G})\, \pi(\theta^{G})\, d\theta^{G}}.
\]

The normalization constant is found such that Σ_{G ⊆ F : |G| = 2} π̃*(G) = 1. Here π̃(G) mimics the role of π(G) = P(Ḡ = G) = Σ_{P : ∪P_G = G} π(P). Using a suboptimal method might affect one's decision of the value used as the prior of a feature set, replacing π(G) with π̃(G). For example, knowing |Ḡ| > 2 implies π(G) = 0 for all sets of size 2; however, 2MNC-Robust only evaluates such sets. In this case, π̃(G) = P(G ⊆ Ḡ) might be a suitable choice to replace π(G). π̃*(f) = Σ_{G : f ∈ G} π̃*(G) is the approximate marginal posterior of f ∈ F. For the Gaussian model, if the number of good features is fixed and hyperparameters do not depend on the feature indices,

\[
\tilde{\pi}^*(G) \propto \tilde{\pi}(G) \left[ \left| S_0^{G*} \right|^{-\kappa_0^{G*}} \left| S_1^{G*} \right|^{-\kappa_1^{G*}} \left| S^{G*} \right|^{\kappa^{G*}} \right]^{0.5}. \tag{2}
\]

2MNC-Robust is 2MNC implemented with π̃*(f). As mentioned before, 2MNC-Robust does not tune itself to the underlying block structure of the data.
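As an illustration (ours, not the authors' code), the ranking step of 2MNC-Robust can be sketched as follows, assuming class-conditioned data matrices X0 and X1, NIW hyperparameter tuples for a block of two features, and a uniform π̃(G) over pairs so that the prior cancels in the normalization; SciPy's logsumexp is used for numerical stability.

```python
import itertools
import numpy as np
from scipy.special import logsumexp

def block_score(X, S, kappa, m, nu):
    # -0.5 * kappa* * log|S*| from the conjugate NIW update (see the earlier sketch)
    n = X.shape[0]
    xbar = X.mean(axis=0)
    d = (xbar - m).reshape(-1, 1)
    S_star = S + (X - xbar).T @ (X - xbar) + (nu * n / (nu + n)) * (d @ d.T)
    return -0.5 * (kappa + n) * np.linalg.slogdet(S_star)[1]

def two_mnc_robust(X0, X1, D, good_hyper, bad_hyper):
    """Rank features by the approximate marginal posterior built from all pairs
    (Eq. 2) and return the indices of the top D features."""
    F = X0.shape[1]
    Xp = np.vstack([X0, X1])                    # pooled sample for the "G is bad" model
    pairs = list(itertools.combinations(range(F), 2))
    log_h = np.array([block_score(X0[:, p], *good_hyper)
                      + block_score(X1[:, p], *good_hyper)
                      - block_score(Xp[:, p], *bad_hyper) for p in map(list, pairs)])
    post = np.exp(log_h - logsumexp(log_h))     # normalized pair posteriors
    marg = np.zeros(F)
    for w, (i, j) in zip(post, pairs):
        marg[i] += w                            # approximate marginal posterior of f
        marg[j] += w
    return np.argsort(marg)[::-1][:D]
```

Here good_hyper and bad_hyper hold (S, κ, m, ν) for a two-feature block under the good- and bad-block priors, respectively; the direct loop over all F(F − 1)/2 pairs is written for clarity rather than speed.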
Recursive marginal posterior inflation
It is easy to show that Σ_{f∈F} π̃*(f) = 2 when only sets of size 2 are used to find π̃*(f). Hence, under the MNC criterion one would pick at most 4 good features, implying we underestimate π*(f) by only using sets of size 2 when |Ḡ| >> 2. REcursive MArginal posterior INflation (REMAIN) aims to sequentially detect good features by rescaling π̃*(f) = Σ_{G : f ∈ G, |G| = 2} π̃*(G). We initialize REMAIN with the set of all features, F_r = F. Then, REMAIN uses the MArginal posterior INflation (MAIN) algorithm to identify several features as good, removes them from F_r, and feeds MAIN with the truncated F_r to select additional features. This process iterates until MAIN does not output any features. REMAIN is nothing but repetitive calls to MAIN with shrinking feature sets, making MAIN the heart of this algorithm.

Pseudo-code of MAIN is provided in Algorithm 1, where H(G) is the right-hand side of Eq. (2). Given an input feature set F_t, MAIN finds π̃*(f) using sets of size 2, and finds the set G_s = {f ∈ F_t : π̃*(f) > T_1}. MAIN adds G_s to G̃, the set of features in F_t already labeled as good. It then updates F_t to F_t ∖ G_s, and rescales π̃*(f) of features f ∈ F_t so that Σ_{f∈F_t} π̃*(f) = 2. Note features in G̃ are used to compute π̃*(f) for features f ∈ F_t, but π̃*(f) of features f ∈ G̃ are not used in the scaling of Σ_{f∈F_t} π̃*(f) = 2. MAIN iterates until no new features are found (G_s = ∅), or H(G) ≤ T_2 for all G ⊆ F_t with |G| = 2.

Not finding new features in MAIN might be due to the remaining good features being weaker and independent of G̃. Hence, REMAIN removes G̃ from F_r, and feeds MAIN with the updated F_r. This way, features in G̃ are no longer used to compute π̃*(f) for any feature f ∈ F_r, thus making it easier to detect weaker good features that are independent of features already selected by REMAIN. Pseudo-code of REMAIN is provided in Algorithm 2.

T_1 mimics the role of the threshold used in the MNC criterion. Hence, T_1 ∈ [0, 1]. Recall that by evaluating sets of size 2 we underestimate π̃*(f) when |Ḡ| >> 2. Therefore, when confident that |Ḡ| >> 2, one might opt for smaller values of T_1 rather than values close to 1. As T_2 is a threshold over un-normalized posteriors, H(G), extra care must be taken when setting T_2. We suggest T_2 = n for high-dimensional feature selection applications, which is a good rule of thumb based on our simulation results and asymptotic analysis of H(·).

Note the number of features reported by REMAIN is variable; however, one can easily obtain close to a desired number of selected features by tuning T_1. To illustrate, we provide an example based on the data generation model used in the “Synthetic microarray simulations” section, where we assume there are 100 markers, i.e., good features, and 4900 non-markers, i.e., bad features. We use the synergetic model with block size k = 5 and correlation coefficients ρ_0 = ρ_1 = 0.9. Fixing T_2 = n, Table 1 lists the average number of markers and non-markers selected over 1000 iterations for n = 20 and 100 across different values of T_1. REMAIN outputs very few features when T_1 is large, and too many features when T_1 is extremely low. The best choice for T_1 can vary greatly from case to case, but one strategy is to choose T_1 so that REMAIN selects close to a given number of features. For example, T_1 = 0.05 is a good choice in this simulation if one desires approximately 100 selected features. Another strategy is based on the number of features selected by REMAIN across various values of T_1. When T_1 is large, reducing T_1 only slightly increases the output feature size, for instance when T_1 > 0.05 in this simulation. However, one might observe a rapid increase in the output size by slightly reducing T_1, for instance when T_1 changes from 0.05 to 0.01 in this simulation. For such observed patterns, the value for which this phenomenon occurs might be a desirable choice.

Table 1. Performance of REMAIN for various values of T_1

Objective          n    T_1=0.3  T_1=0.2  T_1=0.1  T_1=0.05  T_1=0.01  T_1=0.005
Marker found       20   0.9      1.4      2.5      4.6       15.3      26.2
Non-marker found   20   2.0      3.7      9.4      23.3      164.5     403.0
Marker found       100  52.5     56.2     58.2     60.0      68.9      76.1
Non-marker found   100  1.6      3.3      8.8      20.2      124.7     288.5
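A much-simplified sketch of the REMAIN/MAIN control flow is given below; it follows our reading of the description above (the exact rescaling bookkeeping in Algorithms 1 and 2 may differ), and assumes the pair scores log H(G) have already been computed, e.g., with the pair-scoring code sketched earlier, and are passed as a dictionary keyed by feature pairs.

```python
import numpy as np

def main_pass(Ft, pair_log_h, T1, T2):
    """One call to MAIN (simplified): threshold the rescaled approximate
    marginals of the features still in Ft, then repeat."""
    Ft, G_tilde = set(Ft), set()
    while True:
        # pairs touching at least one unlabeled feature; features already in
        # G_tilde keep contributing to the marginals of the remaining features
        pairs = [p for p in pair_log_h if p[0] in Ft or p[1] in Ft]
        if not pairs or all(pair_log_h[p] <= np.log(T2) for p in pairs):
            break
        ref = max(pair_log_h[p] for p in pairs)          # for numerical stability
        marg = dict.fromkeys(Ft, 0.0)
        for i, j in pairs:
            w = np.exp(pair_log_h[(i, j)] - ref)
            for f in (i, j):
                if f in Ft:
                    marg[f] += w
        scale = 2.0 / sum(marg.values())                 # rescale so marginals sum to 2
        Gs = {f for f in Ft if scale * marg[f] > T1}
        if not Gs:                                       # no new features found
            break
        G_tilde |= Gs
        Ft -= Gs
    return G_tilde

def remain(features, pair_log_h, T1, T2):
    """REMAIN: repeated calls to MAIN on a shrinking feature set F_r."""
    Fr, selected = set(features), []
    while True:
        sub = {p: v for p, v in pair_log_h.items() if p[0] in Fr and p[1] in Fr}
        G_tilde = main_pass(Fr, sub, T1, T2)
        if not G_tilde:
            return selected
        selected.extend(sorted(G_tilde))
        Fr -= G_tilde
```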
Posterior factor
Feature selection can be construed as a model selection problem where each model is a set of good features. Let f be a feature. If f is a good feature, we expect that if we add f to any model G, i.e., a set of good features, then π̃*(G ∪ {f}) / π̃*(G) >> 1. If f is a bad feature, we expect π̃*(G ∪ {f}) / π̃*(G) to be much smaller. Hence, if we average this ratio over a family of models that do not contain f, and denote the average by β(f), we expect β(f) >> 1 if f is a good feature, and comparable to or smaller than 1 if f is a bad feature. π̃*(G ∪ {f}) / π̃*(G) is similar, but not identical, to the Bayes factor encountered in model selection [12]; here we compare a model with good feature set G ∪ {f} versus a model with good feature set G, which excludes f. The posterior factor, β(f), averages the approximate posterior ratio across all feature sets G ⊆ F ∖ {f}, i.e.,

\[
\beta(f) = \frac{1}{|\Omega_f|} \sum_{G \in \Omega_f} \frac{\tilde{\pi}^*(G \cup \{f\})}{\tilde{\pi}^*(G)}, \tag{3}
\]

where Ω_f = {G ⊆ F : f ∉ G}. As the summation over Ω_f is computationally infeasible, we propose the approximate posterior factor, hereafter denoted by β̃:

\[
\tilde{\beta}(f) = \frac{1}{|F| - 1} \sum_{f' \in F \setminus \{f\}} \frac{\tilde{\pi}^*(\{f, f'\})}{\tilde{\pi}^*(\{f'\})}. \tag{4}
\]

We propose the approximate POsterior FActor-Constrained (POFAC) algorithm as follows: use β̃(f) to rank features, and pick the top D features. Note that D is a parameter of the algorithm.

Sequential partition mustering
Sequential Partition Mustering (SPM) aims to improve feature selection performance by sequentially detecting good blocks and adding them to the set of previously selected features. To find a good block, we start with the most significant feature, i.e., the feature with the largest β̃, and find the block containing it. Note we do not aim to find the structure of bad blocks, and as soon as we declare that no more good features remain the algorithm terminates. Suppose u_0 is the current most significant feature. In order to find the block containing u_0 we propose the Good Seed Grower (GSG) algorithm, which can be construed as a seed growing algorithm with u_0 as the seed. Pseudo-code of GSG is presented in Algorithm 3, where for any two non-empty disjoint sets G_1, G_2 ⊂ F,

\[
C_1(G_1, G_2) = \frac{\tilde{\pi}(G_1, G_2)}{1 - \tilde{\pi}(G_1, G_2)} \cdot \frac{Q_0^{G_{12}} Q_1^{G_{12}}}{Q_0^{G_1} Q_1^{G_1} Q_0^{G_2} Q_1^{G_2}} \times \left[ \left| S_0^{G_{12}*} \right|^{-\kappa_0^{G_{12}*}} \left| S_1^{G_{12}*} \right|^{-\kappa_1^{G_{12}*}} \left| S_0^{G_1*} \right|^{\kappa_0^{G_1*}} \left| S_1^{G_1*} \right|^{\kappa_1^{G_1*}} \left| S_0^{G_2*} \right|^{\kappa_0^{G_2*}} \left| S_1^{G_2*} \right|^{\kappa_1^{G_2*}} \right]^{0.5},
\]

G_{12} = G_1 ∪ G_2, and π̃(G_1, G_2) approximates π(G_1, G_2), the prior probability that at least one of the features in G_2 is not independent of G_1. Note that π(G_1, G_2) is the sum of π(P) over the family of feature partitions that contain a block U such that U ∩ G_1 ≠ ∅ and U ∩ G_2 ≠ ∅. At each iteration, GSG finds the feature u* that maximizes C_1(U, {u*}), where U is the currently detected sub-block of the block containing u_0. GSG declares that u* and U belong to the same block if C_1(U, {u*}) > T_3, and adjoins u* to U; otherwise, it terminates and declares U as the block containing u_0. Here we assume T_3 = t_1 n^{t_2 |U|}, where t_1, t_2 > 0 are parameters of GSG. While we have only considered one possible family of thresholds, we expect this family to be large enough for most practical purposes.

Pseudo-code of SPM is given in Algorithm 4. Let F_t be the feature set used by SPM, initialized to F_t = F. We start with the most significant feature u_0 and find the block containing it, U. We then update F_t to F_t ∖ U. If β̃(f) < T_4 for all f ∈ F_t, then SPM declares that F_t does not contain any good features and terminates; otherwise, it picks the most significant feature of F_t and iterates. Similar to REMAIN, SPM cannot be forced to output a fixed number of features, but T_4 can be used to tune SPM to output close to a desired number of features. In addition, t_1 and t_2 can be used to avoid picking large blocks, which also affects the output feature set size.
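Both POFAC and the seed selection in SPM are driven by β̃. As a minimal illustration (ours, not the authors' code), β̃ and the POFAC ranking can be computed as follows under the Jeffreys non-informative prior used later in the synthetic microarray simulations, for which the block evidence term reduces to a scatter-matrix determinant, and assuming a uniform π̃(G) over sets of size 1 and 2 so that normalization constants cancel in the ratio.

```python
import itertools
import numpy as np

def jp_block_score(X):
    # Under the Jeffreys prior, S* is the scatter matrix and kappa* = n, so the
    # block evidence term reduces to -0.5 * n * log|scatter|
    Xc = X - X.mean(axis=0)
    return -0.5 * X.shape[0] * np.linalg.slogdet(Xc.T @ Xc)[1]

def log_H(idx, X0, X1, Xp):
    # log of the right-hand side of Eq. (2) for the feature set idx (uniform prior dropped)
    return (jp_block_score(X0[:, idx]) + jp_block_score(X1[:, idx])
            - jp_block_score(Xp[:, idx]))

def pofac(X0, X1, D):
    """POFAC: rank features by the approximate posterior factor of Eq. (4)
    and return the top D. SPM reuses the same beta~ values to pick its seeds."""
    F = X0.shape[1]
    Xp = np.vstack([X0, X1])
    single = np.array([log_H([f], X0, X1, Xp) for f in range(F)])
    beta = np.zeros(F)
    for i, j in itertools.combinations(range(F), 2):
        lh = log_H([i, j], X0, X1, Xp)
        beta[i] += np.exp(lh - single[j])       # pi~*({i, j}) / pi~*({j})
        beta[j] += np.exp(lh - single[i])       # pi~*({i, j}) / pi~*({i})
    beta /= F - 1
    return np.argsort(beta)[::-1][:D]
```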
We again provide an example based on the simulation we did for REMAIN. We let t_1 = 10^1, 10^2, ⋯, 10^5, t_2 = 1, 2, 3, 4, 5, and T_4 = 10^2, 10^4, 10^6, and consider all combinations to construct a very wide range of parameters. Figures 1 and 2 illustrate how the parameters of SPM affect its performance for n = 20 and 100, respectively. While low thresholds mislabel more non-markers as good features, they correctly label more markers compared with large thresholds. When n = 20, in order to correctly label at least 10 markers on average, at least 50 non-markers are mislabeled, and to mislabel at most 5 non-markers on average, one cannot correctly detect more than 5 markers. On the other hand, when n = 100, one can simultaneously correctly label at least 80 markers and mislabel at most 10 non-markers for almost all parameters. Moving from the lowest parameter values to the highest, we observe that the average output feature size varies from approximately 400 to fewer than 5 when n = 20, while it only varies from 120 to 70 when n = 100. Thus, SPM is less sensitive to the choice of parameters when n = 100 than when n = 20. Suppose n = 100 and one aims to find most markers and few non-markers. Many parameters can achieve this goal. In addition, within the range of such parameters, the number of features labeled as good does not change very much when the parameters are varied slightly. Hence, for a fixed sample, one can implement SPM over a wide range of parameters, find the range where the output feature size does not vary much under slight changes of the parameters, and pick a value in that region that outputs a feature set with close to a reasonable number of features.

Fig. 1 Performance of SPM for various values of t_1, t_2, T_4, and n = 20. Average number of markers labeled as good for (a) T_4 = 10^2, (b) T_4 = 10^4, and (c) T_4 = 10^6. Average number of non-markers labeled as good for (d) T_4 = 10^2, (e) T_4 = 10^4, and (f) T_4 = 10^6.

Fig. 2 Performance of SPM for various values of t_1, t_2, T_4, and n = 100. Average number of markers labeled as good for (a) T_4 = 10^2, (b) T_4 = 10^4, and (c) T_4 = 10^6. Average number of non-markers labeled as good for (d) T_4 = 10^2, (e) T_4 = 10^4, and (f) T_4 = 10^6.

Simulations
We compare the performance of the proposed algorithms with many popular feature selection algorithms over a Bayesian setting, and over a synthetic microarray model introduced in [13] and extended in [8, 9].

Bayesian simulation
In this simulation we assume |F| = 4100 and |Ḡ| = 100. We assume there is 1 good block of each of the following sizes: 10, 20, 30, and 40. We also assume there are 20 bad blocks of each of the following sizes: 5, 10, 15, 20, 50, and 100. We first randomly assign each feature to a block such that the assumed block structure is satisfied, effectively constructing P̄. Afterwards, distribution parameters are randomly drawn from the following NIW prior. For each good block A, we have S_0^A = S_1^A = 0.5 × I_{|A|×|A|}, κ_0^A = κ_1^A = |A| + 2, m_0^A = m_1^A = 0, and ν_0^A = ν_1^A = 4, where I is the identity matrix. Also, for each bad block A, we have S^A = 0.5 × I_{|A|×|A|}, κ^A = |A| + 2, m^A = 0, and ν^A = 4. Given the distribution parameters, a stratified sample of size n with equal points in each class is drawn.
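A sketch of this data-generation step is given below; it is ours, assumes NumPy and SciPy, follows the block sizes and hyperparameters listed above, and uses helper names of our own.

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)

def sample_block_params(size, s_scale=0.5, nu=4.0):
    """Draw (mu, Sigma) for one block from the NIW prior used above:
    S = 0.5*I, kappa = |A| + 2, m = 0, nu = 4."""
    Sigma = invwishart(df=size + 2, scale=s_scale * np.eye(size)).rvs(random_state=rng)
    mu = rng.multivariate_normal(np.zeros(size), Sigma / nu)
    return mu, Sigma

def sample_dataset(n, good_sizes=(10, 20, 30, 40),
                   bad_sizes=20 * (5, 10, 15, 20, 50, 100)):
    """Stratified sample with n/2 points per class: good blocks get separate
    class-conditioned parameter draws, bad blocks share a single draw."""
    n0 = n1 = n // 2
    cols0, cols1, good_mask = [], [], []
    for size in good_sizes:
        for cols, ny in ((cols0, n0), (cols1, n1)):       # one draw per class
            mu, Sigma = sample_block_params(size)
            cols.append(rng.multivariate_normal(mu, Sigma, size=ny))
        good_mask += [True] * size
    for size in bad_sizes:
        mu, Sigma = sample_block_params(size)             # same draw for both classes
        cols0.append(rng.multivariate_normal(mu, Sigma, size=n0))
        cols1.append(rng.multivariate_normal(mu, Sigma, size=n1))
        good_mask += [False] * size
    X0, X1 = np.hstack(cols0), np.hstack(cols1)
    perm = rng.permutation(X0.shape[1])                   # random feature-to-block assignment
    return X0[:, perm], X1[:, perm], np.asarray(good_mask)[perm]
```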
The following feature selection methods declare the set of good features: t-test, Bhattacharyya Distance (BD), Mutual Information (MI) using the non-parametric method of [14] with spacing parameter m = 1, Sequential Forward Search using the bolstered error estimate [15] of Regularized Linear Discriminant Analysis applied to the top 300 features of BD (SFS-RLDA), FOrward selection using the Hilbert-Schmidt Independence Criterion (FOHSIC) [16] applied to the top 300 features of BD, CMNC-OBF, 2MNC-Robust, REMAIN, POFAC, and SPM. Note t-test, MI, BD, and CMNC-OBF are filter methods. All methods except REMAIN and SPM output |Ḡ| features. CMNC-OBF assumes the events {f ∈ Ḡ} are independent and P(f ∈ Ḡ) is constant for all f ∈ F. 2MNC-Robust and REMAIN assume π̃(G) is uniform over all sets of size 2, and zero otherwise. POFAC assumes π̃(G) is uniform over all sets of size 1 and 2. Finally, SPM assumes π̃(G_1, G_2) = 0.5 for all sets G_1, G_2 ⊆ F, and uses the same π̃(G) as POFAC to compute β̃(f).

Bayesian algorithms use proper priors with hyperparameters of the same form given previously (PP), and the Jeffreys non-informative prior (JP), where for each set A, S_y^A and S^A are zero matrices, K_y^A = K^A = L_y^A = L^A = 1, and κ_y^A = κ^A = ν_y^A = ν^A = 0. With ν_y^A = ν^A = 0 we do not need to specify m_y^A and m^A. We use T_1 = 0.3 and T_2 = n for REMAIN using both PP and JP. For SPM-PP we set t_1 = 100, t_2 = 0.5, and T_4 = 100n^2, which resulted in adequate performance among all sample sizes. When using SPM-JP we use the same t_1 and T_4, but set t_2 = 1 to avoid picking large blocks. This process iterates 1000 times.

Figure 3 plots the average number of correctly labeled features as sample size increases from 10 to 100 in steps of 10. SPM-PP has the best performance; however, SPM-JP experiences a sharp drop under small sample sizes. For larger sample sizes, POFAC-PP performs second only to SPM-PP. However, POFAC-JP outperforms SPM-JP. REMAIN adequately balances performance across all sample sizes. All proposed algorithms, except SPM-JP, outperform 2MNC-Robust, CMNC-OBF, and the other feature selection algorithms. SPM-JP outperforms previous algorithms if the sample size is not very small.

Fig. 3 Performance of various feature selection algorithms under Gaussian data. Average number of correctly labeled features versus sample size for randomly generated parameters for (a) 2MNC-Robust-PP, 2MNC-Robust-JP, POFAC-PP, POFAC-JP, REMAIN-PP, REMAIN-JP, SPM-PP, SPM-JP, and (b) CMNC-OBF-PP, CMNC-OBF-JP, t-test, FOHSIC, MI, BD, and SFS-RLDA.

In this simulation filter methods were the fastest, with comparable computation time, and FOHSIC was the most computationally intensive method. A comparison of run-times for this specific simulation is provided in Table 2, taking the run-time of 2MNC-Robust as the unit of time. Parallel processing can be used to speed up these algorithms, for instance in the 4th step of GSG, and to compute π̃*(G) in 2MNC-Robust and POFAC. Although SPM is a sequential algorithm, its bottleneck is step 4 of GSG, making parallel processing a good strategy to substantially speed up SPM.
Table 2. Run-time comparison of the Bayesian simulation

Alg.   Filter    2MNC-Robust   REMAIN   POFAC   SPM   SFS-RLDA   FOHSIC
Time   < 10^-3   1             2        1.05    1.5   10         15

Synthetic microarray simulations
Here an extended version of a synthetic model developed to mimic microarrays is used to generate data. The original model is introduced in [13], and has been extended in [8, 9]. In these models features are markers or non-markers. Markers are either global or heterogeneous. Global markers (GM) are homogeneous within each class. Heterogeneous Markers (HM) comprise c subclasses, where for each specific set of heterogeneous markers, a specific subset of the training sample has a different distribution than markers in class 0, and the remaining sample points have the same distribution as class 0. Markers comprise blocks of size k, where each block in class y is Gaussian with mean μ_y and covariance σ_y Σ_y. Diagonal elements of Σ_y are 1 and off-diagonal elements are ρ_y. The original model of [13] forced ρ_0 = ρ_1. We also have μ_0 = [0, ⋯, 0]. There are three types of markers according to their mean in class 1: redundant, synergetic, and marginal, with μ_1 being [1, ⋯, 1], [1, 1/2, ⋯, 1/k], and [1, 0, ⋯, 0], respectively. Non-markers are either Low Variance (LV) or High Variance (HV). In the original model LV non-markers are independent, each with a Gaussian distribution, N(0, σ_0). However, in the extended model of [8, 9], similar to markers in class 0, LV non-markers comprise blocks of size k, where in each block features are jointly Gaussian with mean μ_0 and covariance σ_0 Σ_0. HV non-markers are independent with marginal distribution pN(0, σ_0) + (1 − p)N(1, σ_1), where p is drawn from the uniform distribution over [0, 1].

We assume |F| = 5000, |GM| = 20, |HM| = 80, |HV| = 2000, and c = 2. We consider all possible combinations of the following parameters: all 3 mean types, k = 5, 10, 20, and ρ_0, ρ_1 = 0.1, 0.5, 0.9. We also consider the “large and unequal variance” setting of Table 1 in [13], which sets σ_0 = 0.25 and σ_1 = 0.64. Given each set of distribution parameters, we randomly assign features to blocks of global markers, heterogeneous markers, and LV non-markers. The remaining features comprise the independent HV non-markers. We generate a stratified sample of size n with equal points in each class. The following algorithms are used to declare the set of good features: t-test, BD, MI, SFS-RLDA, CMNC-OBF, 2MNC-Robust, REMAIN, POFAC, and SPM. We excluded FOHSIC due to its computation cost. All Bayesian algorithms use JP. We use the thresholds of the Bayesian simulation, except that we set T_1 = 0.05. One can tune T_3 and T_4 for one of the 81 possible settings, or for a specific sample size, but this can affect the performance of the other settings. We picked the thresholds of the Bayesian simulation as they provided satisfactory performance among large sample sizes. This process iterates 500 times.

For each set of distribution parameters we define performance as the average number of markers identified as good plus the average number of non-markers identified as bad. Figure 4 plots the average and worst case performance for each fixed mean type across the other distribution parameters as sample size increases from 10 to 100 in steps of 10. Bayesian methods tend to outperform non-Bayesian methods. While simpler methods such as CMNC-OBF outperform more complicated methods when sample size is small, complicated methods such as SPM have superior performance when sample size is large.
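To make the generator concrete, below is a simplified sketch (ours, not the authors' code) of the synthetic microarray model described above, assuming NumPy; heterogeneous markers are omitted for brevity, σ_y is read as a variance, and the function names and default arguments are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_cov(k, sigma, rho):
    """k x k covariance sigma * Sigma_y: unit diagonal and off-diagonal rho, scaled by sigma."""
    return sigma * ((1 - rho) * np.eye(k) + rho * np.ones((k, k)))

def class1_mean(k, mean_type):
    if mean_type == "redundant":
        return np.ones(k)
    if mean_type == "synergetic":
        return 1.0 / np.arange(1, k + 1)
    return np.r_[1.0, np.zeros(k - 1)]                  # "marginal"

def synthetic_microarray(n, k=5, rho0=0.9, rho1=0.9, mean_type="synergetic",
                         n_gm=20, n_lv=2900, n_hv=2000, sigma0=0.25, sigma1=0.64):
    """Global marker blocks, LV non-marker blocks, and HV non-markers;
    heterogeneous markers are omitted in this simplified sketch."""
    n0 = n1 = n // 2
    cols0, cols1, labels = [], [], []
    for _ in range(n_gm // k):                          # global marker blocks
        cols0.append(rng.multivariate_normal(np.zeros(k), block_cov(k, sigma0, rho0), n0))
        cols1.append(rng.multivariate_normal(class1_mean(k, mean_type),
                                             block_cov(k, sigma1, rho1), n1))
        labels += ["marker"] * k
    for _ in range(n_lv // k):                          # low-variance non-marker blocks
        cov = block_cov(k, sigma0, rho0)
        cols0.append(rng.multivariate_normal(np.zeros(k), cov, n0))
        cols1.append(rng.multivariate_normal(np.zeros(k), cov, n1))
        labels += ["non-marker"] * k
    # high-variance non-markers: independent two-component Gaussian mixtures
    p = rng.uniform(size=n_hv)
    pick0 = rng.uniform(size=(n0 + n1, n_hv)) < p
    hv = np.where(pick0, rng.normal(0.0, np.sqrt(sigma0), (n0 + n1, n_hv)),
                  rng.normal(1.0, np.sqrt(sigma1), (n0 + n1, n_hv)))
    labels += ["non-marker"] * n_hv
    X = np.hstack([np.vstack([np.hstack(cols0), np.hstack(cols1)]), hv])
    y = np.r_[np.zeros(n0, dtype=int), np.ones(n1, dtype=int)]
    return X, y, np.array(labels)
```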
Fig. 4 Average and worst case performance of feature selection algorithms. Average and worst case performance are obtained using 27 combinations of the synthetic microarray model parameters k, ρ_0, and ρ_1 with fixed mean type, where performance is defined to be the average number of markers identified as good plus the average number of non-markers identified as bad over 500 iterations: (a) average performance for redundant means, (b) worst case performance for redundant means, (c) average performance for synergetic means, (d) worst case performance for synergetic means, (e) average performance for marginal means, and (f) worst case performance for marginal means.

For small sample sizes, or cases where correctly labeling good features is more difficult, REMAIN tends to output very few features, resulting in very good performance, in contrast to methods that are forced to output |Ḡ| = 100 features. OBF can be implemented with the MNC objective instead of CMNC to enjoy this characteristic of REMAIN. SPM seems to have the most diverse behavior. While it performs worse than all other feature selection algorithms when sample size is very small, it tends to outperform all other methods for larger sample sizes. In order for the quantities used in SPM to be well-defined under JP, the sample size must be larger than the block size. Hence, under small samples SPM with JP tends to break good blocks into smaller blocks, thereby losing some of its ability to identify weak good features with strong dependencies, and making it more prone to detecting blocks incorrectly. Also note that we have used the same parameters for SPM across all data models and sample sizes, and performance is expected to improve if t_1, t_2, and T_4 are calibrated each time it is run. POFAC is an interesting option, enjoying competitive performance across all sample sizes. It outperforms 2MNC-Robust while its computation cost is only slightly larger. CMNC-OBF tends to select individually strong markers, i.e., markers with class 1 mean far from 0. CMNC-OBF performs very similarly to BD in this simulation, with their performance graphs almost overlapping.

Figure 5 plots average performance for fixed class-conditioned correlation coefficients across the other distribution parameters. Simpler methods outperform complicated algorithms when sample size is small, and REMAIN enjoys outstanding performance for small sample sizes by reporting very few features. REMAIN has difficulty detecting weak markers, i.e., heterogeneous markers with class 1 mean close to 0, as for larger sample sizes its performance increases very little with each 10-point increase in sample size. Average performance with respect to sample size for each of the 81 possible data generation settings is provided in the supplementary material [see Additional file 1].
Fig. 5 Average performance of feature selection algorithms. Average performance is obtained using 9 combinations of the synthetic microarray model parameters k and mean type with fixed ρ_0 and ρ_1, where performance is defined to be the average number of markers identified as good plus the average number of non-markers identified as bad over 500 iterations: (a) ρ_0 = 0.1, ρ_1 = 0.1, (b) ρ_0 = 0.5, ρ_1 = 0.1, (c) ρ_0 = 0.9, ρ_1 = 0.1, (d) ρ_0 = 0.1, ρ_1 = 0.5, (e) ρ_0 = 0.5, ρ_1 = 0.5, (f) ρ_0 = 0.9, ρ_1 = 0.5, (g) ρ_0 = 0.1, ρ_1 = 0.9, (h) ρ_0 = 0.5, ρ_1 = 0.9, and (i) ρ_0 = 0.9, ρ_1 = 0.9.

While correctly labeling more features tends to result in lower classification error, maximizing the average number of correctly labeled features does not necessarily minimize classification error [see Additional file 1]. An example can be seen in the supplementary material, where we examine the prediction error of several popular classifiers with feature selection on the synthetic microarray model [see Additional file 1].

Results
We apply CMNC-OBF, POFAC, REMAIN, and SPM with the same priors used for the synthetic microarray simulations to cancer microarray datasets, select the top genes, and perform enrichment analysis. We list the top 5 genes selected by CMNC-OBF, POFAC, and REMAIN. The top 100 genes are provided in the supplementary material [see Additional file 1]. REMAIN ranks genes as follows. In each call to MAIN, we rank the genes of G̃ by the order in which they are added to G̃, and if several genes are added at once in step 5 of MAIN, they are ranked based on π̃*(f). In addition, the G̃'s are ranked by the order in which they are obtained using consecutive calls to the MAIN subroutine. Note SPM outputs a set of feature blocks, not a feature ranking. Studying the blocks output by SPM might provide invaluable information about the underlying biological mechanisms of the disease under study, but we leave this for future work.

We perform enrichment analysis using PANTHER [17, 18]. The top 20 enriched pathways are reported in the supplementary material [see Additional file 1]. We list their names, the number of known genes in each pathway, the number of selected genes that belong to the pathway, and the corresponding p-value. Here we only list the top 3 pathways and their p-values. We study whether any of the top genes and pathways have already been suggested to be involved in the cancer under study. The complete analysis, with references that suggest involvement of the top reported