Abstract
An essential aspect of medical research is the prediction of a health
outcome and the scientific identification of important factors. As a
result, numerous methods for model selection have been developed in
recent years. In the era of big data, machine learning has been broadly
adopted for data analysis. In particular, the Support Vector Machine
(SVM) performs excellently in classification and prediction with
high-dimensional data. In this research, a novel model selection
strategy is carried out, named the Stepwise Support Vector Machine
(StepSVM). The new strategy conducts a modified stepwise selection
based on the SVM, where the tuning parameter can be determined by
10-fold cross-validation minimizing the mean squared error. Two popular
methods, the conventional stepwise logistic regression model and the
SVM Recursive Feature Elimination (SVM-RFE), were compared to the
StepSVM. The stability and accuracy of the three strategies were
evaluated by simulation studies with a complex hierarchical structure.
Up to five variables were selected to predict the dichotomous cancer
remission of a lung cancer patient. For the stepwise logistic
regression, the mean of the C-statistic was 69.19%. The overall
accuracy of the SVM-RFE was estimated at 70.62%. In contrast, the
StepSVM provided the highest prediction accuracy of 80.57%. Although
the StepSVM is more time-consuming, it is more consistent and
outperforms the other two methods.
Introduction
There are two types of machine learning: supervised machine learning,
with a specific outcome variable, and unsupervised machine learning,
which only examines the associations among a set of predictors [1].
Regression and classification are the two primary applications of
supervised learning, with methods such as the generalized linear model
(GLM) [2], the logistic regression model [3], and the Support Vector
Machine (SVM) [4]. For unsupervised learning, clustering is the leading
interest, and the most popular method is Principal Components Analysis
(PCA) [5].
The SVM is a machine learning tool for classification problems. With an
increasing number of variables collected, high-dimensional data draw
more attention in image processing, and the SVM is considered a
powerful classification method. Chang et al. concluded that the SVM is
useful in the imaging diagnosis of breast cancer and that its
classification ability is nearly equal to that of a neural network
model [6]. In particular, when a non-linear structure exists, the SVM
demonstrates its superior ability to find the optimal separating
hyperplane by using kernel tricks to map the data into a
higher-dimensional feature space [7].
One powerful application of the SVM is model selection. The
conventional logistic regression using the concordance statistic
(C-statistic or C-index) [8], based on the Mann-Whitney U statistic
[9], is capable of various types of model selection. Nevertheless, the
machine learning technique is desirable [10] since the kernel function
is powerful in classification problems [11]. In 2002, a new methodology
addressed the selection of a small subset of genes from broad patterns
of gene expression data recorded on DNA microarrays [12]. Using
available training examples from cancer patients and healthy subjects,
the authors built a classifier suitable for genetic diagnosis, as well
as drug discovery. In contrast to previous attempts that addressed this
problem by selecting genes with a correlated structure, they carried
out a new method of gene selection utilizing the Support Vector Machine
based on Recursive Feature Elimination (SVM-RFE), which re-weights the
genes through backward elimination.
The prediction performances of the SVM based on different kernel
functions were compared by Huang et al. [13]. Their study suggests that
linear kernel-based SVM ensembles with bagging and RBF kernel-based SVM
ensembles with boosting could be the better choices for a small-scale
dataset if feature selection is performed in the data pre-processing
stage. For a large-scale dataset, RBF kernel-based SVM ensembles with
boosting perform better than the other classifiers.
The SVM is also a popular winner in genetic studies. Zhi et al. [14]
selected candidate genes with an SVM classifier based on the
betweenness centrality (BC) algorithm. A colorectal cancer (CRC)
dataset from the Cancer Genome Atlas database was used to evaluate the
accuracy of the SVM classifier, and pathway enrichment analysis was
carried out for the SVM-classified gene signatures.
Recently, Battineni et al. [15] applied the SVM to MRI (Magnetic
Resonance Imaging) data to predict dementia patients. Through
deliberate statistical analyses, they pointed out the importance of
parameter tuning in the SVM. Their results showed substantial evidence
that better performance for dementia prediction could be accomplished
with a low gamma (1.0E-4) and a high regularization value (C = 100).
Undoubtedly, applications of the SVM are quite extensive across
research fields.
Although the SVM has been widely extended with a variety of concepts,
such as parameter tuning or kernel choices, this research focuses on
statistical methodologies for stepwise selection models. In the
backward elimination implemented by the SVM-RFE, the variables that are
top-ranked (eliminated last) are not necessarily the factors that are
individually most relevant; rather, these predictors are the most
relevant conditional on the specific ranked subset in the model. In
order to avoid incorrectly eliminating factors, we propose a novel
forward algorithm with stepwise considerations based on the SVM, named
the Stepwise Support Vector Machine (StepSVM).
The performance of each method can be evaluated by accuracy,

\mathrm{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{Total Number of Tests}}.

Cross-validation (CV) will be implemented to assess accuracy [16]. In
particular, the 10-fold CV is adopted according to a previous study
[17].
Materials and methods
The SVM classifies subjects according to the separating hyperplane,
which is defined as ω^Tx[i]+b = 0, where ω^Tx[i]+b≥1,∀y[i] = 1 and
ω^Tx[i]+b≤−1,∀y[i] = −1. These constraints can be rewritten as
y[i](ω^Tx[i]+b)≥1, i = 1⋯m, with the maximum margin

\max_{\omega,b} \frac{2}{\|\omega\|},

or equivalently

\min_{\omega,b} \frac{1}{2}\|\omega\|^2.
With a cost parameter C, the constraints relax to
y[i](ω^Tx[i]+b)≥1−ξ[i], ξ[i]≥0, i = 1⋯m, and the objective becomes

\min_{\omega,b} \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{m}\xi_i.

To solve this constrained optimization, the Lagrange multiplier method
is applied. The Lagrangian is

L(\omega,b,\xi,\alpha,\beta) = \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\alpha_i\left[y_i(\omega^T x_i + b) - 1 + \xi_i\right] - \sum_{i=1}^{m}\beta_i\xi_i,

where α[i] and β[i] are the Lagrange multipliers.
According to the Karush-Kuhn-Tucker (KKT) conditions, the partial
derivatives of the Lagrangian with respect to ω, b, and ξ must equal
zero:

1. \frac{\partial}{\partial\omega} L(\omega,b,\xi,\alpha,\beta) = 0 gives

\omega = \sum_{i=1}^{m}\alpha_i y_i x_i.

2. \frac{\partial}{\partial b} L(\omega,b,\xi,\alpha,\beta) = 0 gives

\sum_{i=1}^{m}\alpha_i y_i = 0.

3. \frac{\partial}{\partial\xi} L(\omega,b,\xi,\alpha,\beta) = 0 gives

C - \alpha_i - \beta_i = 0.

Substituting these conditions back into the Lagrangian yields the dual
objective

\Theta(\alpha,\beta) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j x_i^T x_j,

so the dual problem is

\max_{\alpha} \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i\alpha_j y_i y_j x_i^T x_j

subject to \sum_{i=1}^{m}\alpha_i y_i = 0 and 0 \le \alpha_i \le C.
If the raw data are not linearly separable, a kernel function solves
the classification problem in a higher-dimensional or
infinite-dimensional space, such as the RBF kernel [18]. The kernel
function is defined as K(x,x′) = ϕ(x)^Tϕ(x′). For the linear kernel,
K(x,x′) = 〈x,x′〉. For the non-linear case, the Radial Basis Function
(RBF) kernel K(x,x′) = exp(−γ‖x−x′‖^2) is commonly adopted, where γ is
greater than zero; its choice has been discussed previously [19].
Finally, the optimal separating hyperplane is given by

\omega^T x + b = \left(\sum_{i=1}^{m}\alpha_i y_i x_i\right)^T x + b = \sum_{i=1}^{m}\alpha_i y_i \langle x_i, x\rangle + b = \sum_{i=1}^{m}\alpha_i y_i \phi(x_i)^T\phi(x) + b = \sum_{i=1}^{m}\alpha_i y_i K(x_i, x) + b.
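The dual form above can be checked numerically: in scikit-learn (an assumption for illustration; the paper's analyses were done in R), a fitted `SVC` exposes the products α[i]y[i] as `dual_coef_`, so the decision function can be rebuilt as Σ α[i]y[i]K(x[i],x)+b:

```python
# Sketch: reconstructing the SVM decision function Σ α_i y_i K(x_i, x) + b
# from a fitted RBF-kernel SVC and comparing it with the library's output.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=1)
clf = SVC(kernel="rbf", C=10, gamma=1).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors x_i.
K = rbf_kernel(X[:5], clf.support_vectors_, gamma=1)
manual = K @ clf.dual_coef_.ravel() + clf.intercept_[0]

# The manual reconstruction matches the library's decision function.
print(np.allclose(manual, clf.decision_function(X[:5])))
```

Only the support vectors (those with α[i] > 0) contribute to the sum, which is why the trained classifier is compact.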
Two popular model selection strategies are compared to the StepSVM. The
first is the conventional logistic regression with stepwise selection,
since it is considered the gold standard for classification problems.
The other is the backward elimination method, the SVM-RFE. Note that
the logistic regression model uses the aggregate data, as in the
traditional statistical approach, whereas the two SVM methods split the
original data into 80% training and 20% testing data with a 10-fold CV.
The simulation study is based on a hierarchical dataset generated with
the R code from the University of California, Los Angeles Institute for
Digital Research and Education
(https://stats.idre.ucla.edu/r/codefragments/mesimulation/), where
patients are nested in doctors, and doctors are nested in hospitals.
The number of hospitals (HID) is 35, each hospital has 8 to 15 doctors
(DID), and 2 to 40 patients (ID) are randomly generated for each
doctor.
As a result, the expected sample size is 8525, with 26 predictors, and
the dichotomous remission variable is the primary outcome. The
hierarchical structure is listed in A1 Table in S1 Appendix, details of
all variables are displayed in A2 Table, descriptive statistics for
continuous variables are in A3 Table, and categorical variables are
described in A4 Table (except the I.D. variables, DID and HID). One
hundred repetitions were conducted to evaluate the performances of the
three methods. Note that the sample size varies slightly due to the
random assignment of the numbers of doctors and patients.
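The full data-generating process is the UCLA R script linked above; a minimal sketch of the nesting alone (hospital, doctor, and patient counts as described in the text, with all covariates omitted) might look like this in Python:

```python
# Minimal sketch of the nested structure only: 35 hospitals, 8-15 doctors
# per hospital, and 2-40 patients per doctor. The actual covariates and
# the remission outcome come from the UCLA R script referenced in the text.
import numpy as np

rng = np.random.default_rng(0)
rows = []
for hid in range(1, 36):                        # 35 hospitals (HID)
    for did in range(rng.integers(8, 16)):      # 8 to 15 doctors (DID)
        for pid in range(rng.integers(2, 41)):  # 2 to 40 patients (ID)
            rows.append((hid, did, pid))

print(len(rows))  # total sample size; the expected value is around 8525
```

Because the doctor and patient counts are drawn at random, the realized sample size varies slightly from repetition to repetition, exactly as noted above.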
In clinical research, a prediction model usually prefers a simple
scoring system with 5 to 10 significant factors. In our previous work,
we implemented the SVM as the primary statistical model with only six
predictors to examine the clinical prediction ability in an independent
replication study [20]. Therefore, in the computer simulations, up to 5
variables were selected from the 26 available variables to compare the
three strategies. Results are displayed as the order of selection by
each model, where the coding numbers are given in A5 Table in S1
Appendix.
The conventional stepwise selection was used for the logistic
regression model, which did not require parameter tuning. For the
SVM-RFE and StepSVM, both linear and RBF kernels with various
combinations of C and γ were evaluated by 10-fold CV, and only the
parameters with the best performance were chosen for comparison in the
simulation study. Hence, the SVM-RFE adopted the linear kernel with the
optimal C value of 0.1 according to the tune(.) function. The StepSVM
was also tuned by 10-fold CV, and the RBF kernel was selected with C =
10 and γ = 1. A sensitivity analysis for the SVM-RFE ensures a fair
comparison between the SVM-RFE and the StepSVM; therefore, an
additional setting with C = 10 was also examined for the SVM-RFE.
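The tuning step above was done with R's tune(.) function; an analogous hedged sketch in Python, where the candidate grid of C and γ values is illustrative rather than the paper's exact search space, would be:

```python
# Sketch: selecting the kernel, C, and gamma by 10-fold cross-validation,
# analogous to the tune(.) call described in the text. The grid values
# are illustrative assumptions, not the paper's exact search space.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=26, random_state=0)

grid = GridSearchCV(
    SVC(),
    {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 1]},
    cv=10,                 # 10-fold CV, as adopted in this study
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_)   # parameter setting with the best CV accuracy
```

Only the best-performing setting from the grid would then be carried forward into the simulation comparisons.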
The StepSVM is intuitive and consists of only a few simple steps. In A1
Fig in S1 Appendix, the flow chart of the StepSVM algorithm is
presented. The first procedure examines all possible combinations of
any two of the m = 26 variables, for a total of

C_2^m = \frac{m!}{2!(m-2)!}

possibilities. The combination with the highest accuracy is selected as
the first two components of the StepSVM. Next, among the remaining
(m−2) = 26−2 = 24 variables, one variable at a time is added to the
first two components; of the 24 resulting models, the one with the best
accuracy defines the third component. Each subsequent iteration repeats
this step, adding the next most influential factor given those already
selected. The procedure stops when the accuracy no longer increases, or
when the maximum number of components allowed is reached. In this
research, up to 5 variables are allowed, since only 26 predictors are
available. In summary, the first procedure evaluates 325 possibilities,
the second step (selecting the third variable) evaluates 24 models, the
third step 23, and the final step selects the fifth variable from 22
possibilities.
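The procedure above can be sketched in code. This is an illustrative Python implementation of the described forward search (the paper's implementation was in R; the synthetic data, small m, and helper names here are assumptions):

```python
# Illustrative sketch of the StepSVM forward search: start from the best
# pair of variables, then greedily add one variable at a time while the
# 10-fold CV accuracy keeps improving, up to max_vars components.
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def cv_acc(X, y, cols):
    """Mean 10-fold CV accuracy of an RBF SVM on the chosen columns."""
    clf = SVC(kernel="rbf", C=10, gamma=1)   # tuned values from the text
    return cross_val_score(clf, X[:, cols], y, cv=10).mean()

def step_svm(X, y, max_vars=5):
    m = X.shape[1]
    # Step 1: evaluate all C(m, 2) pairs and keep the most accurate one.
    best = max(combinations(range(m), 2), key=lambda c: cv_acc(X, y, list(c)))
    selected, best_acc = list(best), cv_acc(X, y, list(best))
    # Later steps: add the single best remaining variable each round.
    while len(selected) < max_vars:
        cands = [j for j in range(m) if j not in selected]
        nxt = max(cands, key=lambda j: cv_acc(X, y, selected + [j]))
        acc = cv_acc(X, y, selected + [nxt])
        if acc <= best_acc:          # stop once accuracy stops increasing
            break
        selected, best_acc = selected + [nxt], acc
    return selected, best_acc

# Small synthetic example (m = 8 to keep the search fast).
X, y = make_classification(n_samples=200, n_features=8, random_state=2)
cols, acc = step_svm(X, y)
print(cols, round(acc, 3))
```

Swapping `cv_acc` for a sensitivity- or specificity-based score gives the flexible variants discussed later in the paper.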
Results
Simulation results of the stepwise logistic regression model are
displayed in Fig 1, where colors denote the order in which the five
variables were selected across the 100 repetitions. The variables with
the highest selection frequencies are mobility, FamilyHx, CancerStage,
and Experience. The first variable selected with the highest frequency
is CancerStage, followed by mobility, FamilyHx, and then Experience.
However, the 5th variable varies and is not consistently chosen. The
cumulative frequency of the variables being selected is displayed in
Fig 2, and the C-statistic of the 100 repetitions is displayed in Fig
3.
Fig 1. Variables being selected by the stepwise logistic regression model.
Fig 2. Variables being selected by the stepwise logistic regression model
(cumulative frequency).
Fig 3. C-statistic of the stepwise logistic regression model.
The SVM-RFE was conducted with C = 0.1 or C = 10. When C = 0.1 (C =
10), the results of the variable selection are displayed in Fig 4 (Fig
5). Regardless of the C parameter, DID and HID are generally the first
two variables being selected, while the 3rd to the 5th variables are
not consistent. The cumulative frequency of the variables being
selected is displayed in Fig 6 (Fig 7). When C = 0.1 (C = 10), the
accuracies of the 100 repetitions are displayed in Fig 8 (Fig 9).
Overall, different values of the C parameter yielded similar results:
the set of 5 variables was consistent, and the accuracy was unchanged.
Fig 4. Variables being selected by the SVM-RFE (C = 0.1).
Fig 5. Variables being selected by the SVM-RFE (C = 10).
Fig 6. Variables being selected by the SVM-RFE (cumulative frequency) (C =
0.1).
Fig 7. Variables being selected by the SVM-RFE (cumulative frequency) (C =
10).
Fig 8. Accuracy of the SVM-RFE (C = 0.1).
Fig 9. Accuracy of the SVM-RFE (C = 10).
The simulation results of the StepSVM are displayed in Fig 10. Note
that the first step of the StepSVM selects a combination of two
variables. Unlike the other two methods, the 3rd to the 5th variables
being selected were entirely consistent (DID, Experience, and
Medicaid). Cumulative frequencies are displayed in Fig 11, and the
accuracies of the 100 repetitions are shown in Fig 12.
Fig 10. Variables being selected by the StepSVM.
Fig 11. Variables being selected by the StepSVM (cumulative frequency).
Fig 12. Accuracy of the StepSVM.
In summary, the top five variables selected with the highest frequency
by the three methods are listed in Table 1. The first variable selected
by each method is highly associated with the outcome: CancerStage, DID,
and Experience are picked by the logistic regression, the SVM-RFE, and
the StepSVM, respectively.
Table 1. Variables being selected with the highest frequency.

Method      Parameter setting  Variables
Stepwise    (none)             CancerStage, Experience, mobility, FamilyHx, School
SVM-RFE^a   C = 0.1            DID, HID, Experience, SmokingHx, LengthofStay
SVM-RFE^a   C = 10             DID, HID, LengthofStay, mobility, FamilyHx
StepSVM^b   C = 10, γ = 1      Experience, Medicaid, Lawsuits, HID, ntumors

^a Using linear kernel.
^b Using Radial basis function kernel.
The performance of the logistic regression is assessed by the
C-statistic (Table 2), while the SVM-RFE and StepSVM are evaluated by
accuracy. With the restriction of up to five variables being selected,
the SVM-RFE came up with the same set of five variables and the same
accuracy, regardless of the regularization parameter C. The logistic
regression yielded an average C-statistic of 70.69%. In contrast, the
average accuracies of the SVM-RFE and the StepSVM were 70.65% and
80.12%, respectively. It is worth noting that the StepSVM provided the
best accuracy.
Table 2. Performance of the three methods.

Method      Parameter setting  C-statistic / Accuracy
Stepwise    (none)             C-statistic = 70.69%
SVM-RFE^a   C = 0.1            Accuracy = 70.65%
SVM-RFE^a   C = 10             Accuracy = 70.65%
StepSVM^b   C = 10, γ = 1      Accuracy = 80.12%

^a Using linear kernel.
^b Using Radial basis function kernel.
Discussion
Machine learning is gaining popularity with astonishing speed in all
kinds of research. In this study, a novel methodology for forward
stepwise model selection based on the well-known SVM is carried out.
Unlike the backward elimination scheme, which may erroneously remove
variables based on conditional relevance, the StepSVM is intuitive,
consists of only a few simple stages, and thus can be implemented
easily.
According to the simulation studies, even with a complicated
hierarchical structure, the StepSVM provided the highest accuracy, and
the variables being selected were much more consistent. This new method
may contribute significantly to research fields such as medicine,
clinical research, public health, or environmental health, where the
selection of a handful of predictors is desired to create an optimal
prediction model.
The only cost of the StepSVM is computer execution time, since it
requires

C_2^m + \frac{1}{2}(m-2)(m-1)

SVM analyses. In contrast, the stepwise logistic regression has only

\frac{1}{2}m(m+1)

choices. The SVM-RFE requires the fewest (m) and is the fastest
procedure.
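These operation counts are easy to verify for the paper's setting of m = 26 predictors; a quick check (variable names are ours):

```python
# Checking the analysis counts quoted above for m = 26 predictors.
from math import comb

m = 26
step_svm_fits = comb(m, 2) + (m - 2) * (m - 1) // 2  # 325 + 300 = 625
stepwise_logistic = m * (m + 1) // 2                 # 351
svm_rfe = m                                          # 26

print(step_svm_fits, stepwise_logistic, svm_rfe)  # 625 351 26
```

So the StepSVM fits roughly twice as many models as the stepwise logistic regression and about 24 times as many as the SVM-RFE, which matches the relative execution times reported here.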
It is worth noting that the simulation studies are quite
time-consuming. Although only 100 repetitions were conducted, the
results and conclusions were very consistent, and further repetitions
do not alter the conclusions. In addition, the StepSVM determines the
most important factor based on the highest accuracy. However, if
sensitivity requires more attention than specificity, one could easily
alter the selection scheme to maximize sensitivity; similarly, the
highest specificity could be adopted in the selection procedure. Hence,
the StepSVM is not only simple but also very flexible and can easily be
extended in further methodological research.
To date, the SVM has been extensively implemented with a variety of
concepts in order to tackle ever more complex problems. Since this
research focuses on statistical methodologies with a stepwise selection
technique, comparisons are limited to the statistical methods related
to this topic. Therefore, future studies are needed to further examine
the application of the StepSVM in various research fields such as
genome-wide association studies (GWAS) or high-dimensional big data.
Supporting information
S1 Appendix
(DOCX)
Acknowledgments