Abstract

Background

   Sparse principal component analysis (PCA) is a popular tool for
   dimensionality reduction, pattern recognition, and visualization of
   high dimensional data. It has been recognized that complex biological
   mechanisms occur through concerted relationships of multiple genes
   working in networks that are often represented by graphs. Recent work
   has shown that incorporating such biological information improves
   feature selection and prediction performance in regression analysis,
   but there has been limited work on extending this approach to PCA. In
   this article, we propose two new sparse PCA methods called Fused and
   Grouped sparse PCA that enable incorporation of prior biological
   information in variable selection.

Results

   Our simulation studies suggest that, compared to existing sparse PCA
   methods, the proposed methods achieve higher sensitivity and
   specificity when the graph structure is correctly specified, and are
   fairly robust to misspecified graph structures. Application to a
   glioblastoma gene expression dataset identified pathways that are
   suggested in the literature to be related with glioblastoma.

Conclusions

   The proposed sparse PCA methods Fused and Grouped sparse PCA can
   effectively incorporate prior biological information in variable
   selection, leading to improved feature selection and more interpretable
   principal component loadings and potentially providing insights on
   molecular underpinnings of complex diseases.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12859-017-1740-7)
   contains supplementary material, which is available to authorized
   users.

   Keywords: Principal component analysis, Sparsity, Structural
   information, Genomic data

Background

   A central problem in high-dimensional genomic research is to identify a
   subset of genes and pathways that can help explain the total variation
   in high-dimensional genomic data with as little loss of information as
   possible. Principal component analysis (PCA) [[29]1] is a popular
   multivariate analysis method which seeks to concentrate the total
   information in data with a few linear combinations of the available
   data, making it an appropriate tool for dimensionality reduction, data
   analysis, and visualization in genomic research. Despite its
   popularity, the traditional PCA is often difficult to interpret as the
   principal component loadings are linear combinations of all available
   variables, the number of which can be very large for genomic data. It
   is therefore desirable to obtain interpretable principal components
   that use a subset of the available data to deal with the problem of
   interpretability of principal component loadings.

   Several alternatives to PCA have been proposed in the literature, most
   of which constrain the size of non-zero principal component loadings.
   An ad hoc approach sets the absolute value of loadings that are smaller
   than a threshold to zero. Though simple to understand, this approach
   has been shown to be misleading in the sense that magnitude of loadings
   is not the only factor to determine the importance of variables in a
   linear combination [[30]2]. Truncating PCs by loadings may result in
   quite different PCs explaining much smaller variation compared with the
   original PCs. Other approaches regularize the loadings to ensure that
   some are exactly zero, which implies that the corresponding variables
   are unimportant in explaining the total variation in the data. For
   instance, Jolliffe et al. [[31]3] proposed the SCotLass method that
   constrains the loadings with a lasso penalty, but their optimization
   problem is nonconvex, which is difficult to solve and does not
   guarantee convergence to a global solution. Zou et al. [[32]4] proposed
   a convex sparse PCA method (SPCA) that reformulates the PCA problem as
   a regression problem and imposes elastic net penalty on the PC
   loadings. Witten and Tibshirani [[33]5] also proposed the penalized
   matrix decomposition (PMD) that approximates the data with its spectral
   decomposition and imposes a lasso penalty on the right singular
   vectors, i.e., the principal component loadings.

   Although the aforementioned methods can effectively produce sparse
   principal component coefficients, their main limitation is that they
   are purely data driven and do not exploit available biological
   information such as gene networks. It has been recognized that complex
   biological mechanisms occur through concerted relationships of multiple
   genes working together in pathways. Recent work [[34]6, [35]7] has
   demonstrated in the regression setting that utilizing prior biological
   information among variables can improve variable selection and
   prediction and help gain a better understanding of analysis results. It
   is therefore desirable to conduct PCA with incorporation of known
   structural information. Allen et al. [[36]8] proposed a generalized
   least-square matrix decomposition framework for PCA that incorporates
   known structure of noise and generate sparse solutions. Although this
   method can flexibly account for noise structure in data, they do not
   utilize prior biological information, and do not consider the
   relationships among the signal variables in PCA. Jenatton et al.
   [[37]9] proposed a structured sparse PCA method that considers
   correlations among groups of variables and imposes a penalty similar to
   group lasso on the principal component loadings, but their method does
   not take into account the complex interactions among variables within a
   group. In this article, we proposed two new sparse PCA methods called
   Fused and Grouped sparse PCA that enable incorporation of prior
   biological information in PCA. The methods will allow for
   identification of genes and pathways. We generalize fused lasso
   [[38]10] and utilize L [γ] norm [[39]7] to achieve automatic variable
   selection and simultaneously account for complex relationships within
   pathways.

   Our work makes several contributions. To the best of our knowledge,
   this is the first attempt to impose both sparsity and smoothing
   penalties on principal component loadings to encourage the selection of
   variables that are connected in a network. Although Jenatton et al.
   [[40]9] and Shiga and Mamitsuka [[41]11] incorporated group information
   of variables when generating sparse PC solutions, they did not consider
   how variables are connected in each group. Our method considers not
   only the group information, but also any interaction structure of
   variables within a group. By utilizing the existing biological
   structure in the data, we are able to obtain sparse principal
   components that are more interpretable and may shed light on the
   underlying complex mechanisms in the data. We also develop an efficient
   algorithm that can handle high-dimensional problems. Simulation studies
   suggest that the methods have higher sensitivity and specificity in
   detecting true signals and ignoring noise variables, and are quite
   effective in improving the performance of sparse PCA methods when the
   graph structure is correctly specified. In addition, the proposed
   methods are robust to misspecified graph structure.

   The remainder of the paper is organized as follows. In “[42]Methods”
   section, we present methods and algorithms for the proposed sparse PCA.
   In “[43]Results” section, we conduct simulation studies to assess the
   performance of our methods in comparison with several existing sparse
   PCA methods. In “[44]Analysis of Glioblastoma data” section, we apply
   the proposed methods to data from a glioblastoma brain multiform study.
   We conclude with some discussion remarks in “[45]Discussions” section.

Methods

   Suppose that we have a random n×p matrix
   [MATH: <mi mathvariant="bold">X</mi><mo>=</mo><mo>(</mo><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>,</mo
   ><mo>…</mo><mo>,</mo><msub><mrow><mi
   mathvariant="bold">x</mi></mrow><mrow><mi>p</mi></mrow></msub><mo>)</mo
   ><mo>,</mo><mi
   mathvariant="bold">x</mi><mo>∈</mo><msup><mrow><mi>ℜ</mi></mrow><mrow><
   mi>n</mi></mrow></msup> :MATH]
   . We also assume that the predictors are centered to have column means
   zero. The network informaton for the p variables in X is represented by
   a weighted undirected graph
   [MATH: <mi
   mathvariant="script">G</mi><mo>=</mo><mo>(</mo><mi>C</mi><mo>,</mo><mi>
   E</mi><mo>,</mo><mi>W</mi><mo>)</mo> :MATH]
   , where C is the set of nodes corresponding to the p features, E={i∼j}
   is the set of edges indicating that features i and j are associated in
   a biologically meaningful way, and W includes the weight of each node.
   For node i, denote by d [i] its degree, i.e., the number of nodes that
   are directly connected to node i and by w [i]=f(d [i]) its weight which
   can depend on d [i]. Our goal is to obtain sparse PCA loadings while
   utilizing available structural information
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   in PCA. Our approach to the sparse PCA problem relies on the eigenvalue
   formulation of PCA, and for completeness sake, we briefly review the
   classical and sparse PCA problems.

Standard and sparse principal component analysis

   Classical PCA finds projections
   [MATH: <mi
   mathvariant="bold-italic">α</mi><mo>∈</mo><msup><mrow><mi>ℜ</mi></mrow>
   <mrow><mi>p</mi></mrow></msup> :MATH]
   such that the variance of the standardized linear combination X α is
   maximized. Mathematically, the first principal component loading α
   solves the optimization problem
   [MATH: <munder><mrow><mo>max</mo></mrow><mrow><mi
   mathvariant="bold-italic">α</mi><mo>≠</mo><mn
   mathvariant="bold">0</mn></mrow></munder><msup><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mrow><mtext>T</mtext></mrow></m
   sup><msup><mrow><mi
   mathvariant="bold">X</mi></mrow><mrow><mtext>T</mtext></mrow></msup><mi
   mathvariant="bold">X</mi><mi mathvariant="bold-italic">α</mi><mspace
   width="1em"></mspace><mspace width="1em"></mspace><mspace
   width="1em"></mspace><mtext>subject to</mtext><mspace
   width="1em"></mspace><mspace width="1em"></mspace><msup><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mrow><mtext>T</mtext></mrow></m
   sup><mi mathvariant="bold-italic">α</mi><mo>=</mo><mn>1</mn><mi>.</mi>
   :MATH]
   1

   For subsequent principal components, additional constraints are added
   to ensure that they are uncorrelated with previous principal
   components, so that each principal component axis captures different
   information in the data. Generally, for the rth PC, we have the
   optimization problem
   [MATH: <mtable class="eqnarray" columnalign="left center
   right"><mtr><mtd
   class="eqnarray-1"><munder><mrow><mtext>max</mtext></mrow><mrow><msub><
   mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mrow><mi>r</mi></mrow></msub><m
   o>≠</mo><mn mathvariant="bold">0</mn></mrow></munder></mtd><mtd
   class="eqnarray-2"><msubsup><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mrow><mi>r</mi></mrow><mrow><mt
   ext>T</mtext></mrow></msubsup><msup><mrow><mi
   mathvariant="bold">X</mi></mrow><mrow><mtext>T</mtext></mrow></msup><mi
   mathvariant="bold">X</mi><msub><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mrow><mi>r</mi></mrow></msub><m
   space width="1em"></mspace><mspace width="1em"></mspace><mspace
   width="1em"></mspace></mtd><mtd
   class="eqnarray-3"></mtd></mtr><mtr><mtd class="eqnarray-1"></mtd><mtd
   class="eqnarray-2"><mtext>subject to</mtext><mspace
   width="1em"></mspace><mspace width="1em"></mspace><msubsup><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mrow><mi>r</mi></mrow><mrow><mt
   ext>T</mtext></mrow></msubsup><msub><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mrow><mi>r</mi></mrow></msub><m
   o>=</mo><mn>1</mn><mo>,</mo><msubsup><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mrow><mi>s</mi></mrow><mrow><mt
   ext>T</mtext></mrow></msubsup><msub><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mrow><mi>r</mi></mrow></msub><m
   o>=</mo><mn>0</mn><mspace width="1em"></mspace></mtd><mtd
   class="eqnarray-3"></mtd><mtd><mtext></mtext></mtd></mtr><mtr><mtd
   class="eqnarray-1"></mtd><mtd
   class="eqnarray-2"><mo>∀</mo><mi>s</mi><mo><</mo><mi>r</mi><mo>,</mo><m
   space width="1em"></mspace><mspace
   width="1em"></mspace><mi>r</mi><mo>=</mo><mn>2</mn><mo>,</mo><mo>…</mo>
   <mo>,</mo><mi>q</mi><mo>≪</mo><mo>min</mo><mo>(</mo><mi>p</mi><mo>,</mo
   ><mi>n</mi><mo>−</mo><mn>1</mn><mo>)</mo><mi>.</mi><mspace
   width="1em"></mspace></mtd><mtd
   class="eqnarray-3"></mtd><mtd><mtext></mtext></mtd></mtr></mtable>
   :MATH]
   2

   Using Lagrangian multipliers, one can show that problem ([46]2) results
   in the eigenvalue problem
   [MATH: <mtable class="eqnarray" columnalign="left center
   right"><mtr><mtd class="eqnarray-1"><msup><mrow><mi
   mathvariant="bold">X</mi></mrow><mrow><mtext>T</mtext></mrow></msup><mi
   mathvariant="bold">X</mi><mi
   mathvariant="bold-italic">α</mi><mo>=</mo><mi>λ</mi><mi
   mathvariant="bold-italic">α</mi><mi>.</mi></mtd></mtr></mtable> :MATH]
   3

   Then the rth principal component loadings of X is the rth eigenvector
   that corresponds to the rth eigenvalue
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>λ</mi></mrow><mo>~</mo></mover></mrow><mrow><mn
   >1</mn></mrow></msub><mo>≥</mo><mo>⋯</mo><mo>≥</mo><msub><mrow><mover
   accent="true"><mrow><mi>λ</mi></mrow><mo>~</mo></mover></mrow><mrow><mi
   >r</mi></mrow></msub><mo>≥</mo><mo>⋯</mo><mo>≥</mo><mn>0</mn> :MATH]
   of the sample covariance matrix X ^T X. Of note, the magnitude, α [rk]
   of each principal component loading
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>~</mo></mover></mrow><mrow><
   mi>r</mi></mrow></msub><mo>=</mo><mo>[</mo><msub><mrow><mi>α</mi></mrow
   ><mrow><mi>r</mi><mn>1</mn></mrow></msub><mo>,</mo><mo>…</mo><mo>,</mo>
   <msub><mrow><mi>α</mi></mrow><mrow><mtext
   mathvariant="italic">rk</mtext></mrow></msub><mo>,</mo><mo>…</mo><mo>,<
   /mo><msub><mrow><mi>α</mi></mrow><mrow><mtext
   mathvariant="italic">rp</mtext></mrow></msub><mo>]</mo> :MATH]
   represents the importance of the kth variable to the rth principal
   component, and these are typically nonzero. When p≫n, interpreting the
   principal components is a difficult task because the principal
   components are linear combinations of all variables. Thus for
   high-dimensional data, a certain type of regularization that ensures
   that some variables have negligible or no effect on the rth principal
   component is warranted to yield interpretable principal components.

   To achieve sparsity of the principal component loadings while
   incorporating structural information
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   , we utilize ideas in Safo and Ahn [[47]12] which is motivated by the
   Dantzig Selector for sparse estimation in regression problems.
   Specifically, we bound a modified version of the eigenvalue difference
   in ([48]3) with a l [∞] norm while minimizing a structured-sparsity
   inducing penalty of the principal component loadings:
   [MATH: <mtable class="align" columnalign="left"><mtr><mtd
   class="align-1"><munder><mrow><mtext>min</mtext></mrow><mrow><mi
   mathvariant="bold-italic">α</mi><mo>≠</mo><mn
   mathvariant="bold">0</mn></mrow></munder><mi
   mathvariant="script">P</mi><mo>(</mo><mi
   mathvariant="bold-italic">α</mi><mo>,</mo><mi>τ</mi><mo>)</mo><mspace
   width="1em"></mspace></mtd><mtd class="align-2"><mtext>subject
   to</mtext><mspace width="1em"></mspace><mo>∥</mo><msup><mrow><mi
   mathvariant="bold">X</mi></mrow><mrow><mtext>T</mtext></mrow></msup><mi
   mathvariant="bold">X</mi><msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>~</mo></mover></mrow><mrow><
   mi>r</mi></mrow></msub><mo>−</mo><msub><mrow><mover
   accent="true"><mrow><mi>λ</mi></mrow><mo>~</mo></mover></mrow><mrow><mi
   >r</mi></mrow></msub><mi
   mathvariant="bold-italic">α</mi><msub><mrow><mo>∥</mo></mrow><mrow><mi>
   ∞</mi></mrow></msub><mo>≤</mo><mi>τ</mi><mspace
   width="2em"></mspace></mtd><mtd><mspace
   width="2em"></mspace></mtd></mtr><mtr><mtd class="align-1"></mtd><mtd
   class="align-2"><mtext>and</mtext><mspace width="1em"></mspace><mspace
   width="1em"></mspace><msubsup><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>r</mi><mo>−</mo><mn>1</mn></m
   row><mrow><mtext>T</mtext></mrow></msubsup><mi
   mathvariant="bold-italic">α</mi><mo>=</mo><mn>0</mn><mi>.</mi><mspace
   width="1em"></mspace><mspace width="2em"></mspace></mtd><mtd><mspace
   width="2em"></mspace></mtd></mtr></mtable> :MATH]

   Here, for a random vector
   [MATH: <mi
   mathvariant="bold">z</mi><mo>∈</mo><msup><mrow><mi>ℜ</mi></mrow><mrow><
   mi>p</mi></mrow></msup> :MATH]
   , ∥z∥[∞] is the l [∞] norm defined as max[1≤i≤p]|z [i]|, τ>0 is a
   tuning parameter that controls how many of the coefficients in the
   principal component loadings will be exactly zero. In addition,
   [MATH: <mi
   mathvariant="bold">A</mi><mo>=</mo><mo>[</mo><msub><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mrow><
   mn>1</mn></mrow></msub><mo>,</mo><mo>…</mo><mo>,</mo><msub><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mrow><
   mi>s</mi></mrow></msub><mo>]</mo><mspace
   width="1em"></mspace><mo>∀</mo><mi>s</mi><mo><</mo><mi>r</mi> :MATH]
   is a concatenation of the previous sparse PCA solutions
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mrow><
   mi>s</mi></mrow></msub> :MATH]
   , and
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>~</mo></mover></mrow><mrow><
   mi>r</mi></mrow></msub> :MATH]
   is the nonsparse rth PCA loading, which is the eigenvector
   corresponding to the rth largest eigenvalue
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>λ</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >r</mi></mrow></msub> :MATH]
   of X ^T X.

   There are a few advantages of this new formulation over the standard
   formulation for PCA. First, the objective function
   [MATH: <mi mathvariant="script">P</mi><mo>(</mo><mi
   mathvariant="bold-italic">α</mi><mo>,</mo><mi>τ</mi><mo>)</mo> :MATH]
   can easily incorporate the prior information about the PC loadings, for
   example, the structural information of variables. Second, this
   optimization problem can be easily solved by any off-the-shelf
   optimization software given
   [MATH: <mi mathvariant="script">P</mi><mo>(</mo><mi
   mathvariant="bold-italic">α</mi><mo>,</mo><mi>τ</mi><mo>)</mo> :MATH]
   is a convex function, e.g. CVX in Matlab. In the next sections, we
   introduce sparse PCA methods that utilize the network information
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   in X.

Grouped sparse PCA

   The first approach we propose is the grouped sparse PCA, similar in
   spirit with Pan et al. [[49]7]. Utilizing the graph structure
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   , we propose the following structured sparse PCA criterion for the rth
   principal component loading:
   [MATH: <mtable class="eqnarray" columnalign="left center
   right"><mtr><mtd
   class="eqnarray-1"><munder><mrow><mo>min</mo></mrow><mrow><mi
   mathvariant="bold-italic">α</mi><mo>≠</mo><mn
   mathvariant="bold">0</mn></mrow></munder><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace></mtd><mtd
   class="eqnarray-2"><mfenced close="}" open="{"
   separators=""><mrow><mo>(</mo><mn>1</mn><mo>−</mo><mi>η</mi><mo>)</mo><
   munder><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>∼</mo><mi>j</mi></mro
   w></munder><msup><mrow><mfenced close=")" open="("
   separators=""><mrow><mfrac><mrow><mo>|</mo><msub><mrow><mi>α</mi></mrow
   ><mrow><mi>i</mi></mrow></msub><msup><mrow><mo>|</mo></mrow><mrow><mi>γ
   </mi></mrow></msup></mrow><mrow><msub><mrow><mi>w</mi></mrow><mrow><mi>
   i</mi></mrow></msub></mrow></mfrac><mo>+</mo><mfrac><mrow><mo>|</mo><ms
   ub><mrow><mi>α</mi></mrow><mrow><mi>j</mi></mrow></msub><msup><mrow><mo
   >|</mo></mrow><mrow><mi>γ</mi></mrow></msup></mrow><mrow><msub><mrow><m
   i>w</mi></mrow><mrow><mi>j</mi></mrow></msub></mrow></mfrac></mrow></mf
   enced></mrow><mrow><mn>1</mn><mo>/</mo><mi>γ</mi></mrow></msup><mo>+</m
   o><mi>η</mi><munder><mrow><mo>∑</mo></mrow><mrow><msub><mrow><mi>d</mi>
   </mrow><mrow><mi>i</mi></mrow></msub><mo>=</mo><mn>0</mn></mrow></munde
   r><mo>|</mo><msub><mrow><mi>α</mi></mrow><mrow><mi>i</mi></mrow></msub>
   <mo>|</mo></mrow></mfenced></mtd><mtd
   class="eqnarray-3"></mtd></mtr><mtr><mtd class="eqnarray-1"></mtd><mtd
   class="eqnarray-2"><mtext>subject to</mtext><mspace
   width="1em"></mspace><mo>∥</mo><msup><mrow><mi
   mathvariant="bold">X</mi></mrow><mrow><mtext>T</mtext></mrow></msup><mi
   mathvariant="bold">X</mi><mover accent="true"><mrow><msub><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mrow><mi>r</mi></mrow></msub></
   mrow><mo>~</mo></mover><mo>−</mo><msub><mrow><mover
   accent="true"><mrow><mi>λ</mi></mrow><mo>~</mo></mover></mrow><mrow><mi
   >r</mi></mrow></msub><mi
   mathvariant="bold-italic">α</mi><msub><mrow><mo>∥</mo></mrow><mrow><mi>
   ∞</mi></mrow></msub><mo>≤</mo><mi>τ</mi><mspace
   width="1em"></mspace><mspace
   width="1em"></mspace><mtext>and</mtext><mspace
   width="1em"></mspace><mspace width="1em"></mspace><msubsup><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>r</mi><mo>−</mo><mn>1</mn></m
   row><mrow><mtext>T</mtext></mrow></msubsup><mi
   mathvariant="bold-italic">α</mi><mo>=</mo><mn>0</mn><mo>,</mo><mspace
   width="1em"></mspace></mtd><mtd
   class="eqnarray-3"></mtd><mtd><mtext></mtext></mtd></mtr></mtable>
   :MATH]
   4

   where ∥·∥[∞] is the l [∞] norm, τ>0 is a tuning parameter, γ>1 and
   0<η<1 are fixed,
   [MATH: <msub><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>r</mi><mo>−</mo><mn>1</mn></m
   row></msub><mo>=</mo><mo>(</mo><msub><mrow><mover
   accent="true"><mrow><mi>α</mi></mrow><mo>^</mo></mover></mrow><mrow><mn
   >1</mn></mrow></msub><mo>,</mo><msub><mrow><mover
   accent="true"><mrow><mi>α</mi></mrow><mo>^</mo></mover></mrow><mrow><mn
   >2</mn></mrow></msub><mo>,</mo><mo>…</mo><mo>,</mo><msub><mrow><mover
   accent="true"><mrow><mi>α</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >r</mi><mo>−</mo><mn>1</mn></mrow></msub><mo>)</mo> :MATH]
   is the matrix constituted of r−1 structured sparse PC loadings, and
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>~</mo></mover></mrow><mrow><
   mi>r</mi></mrow></msub> :MATH]
   is the rth nonsparse PC loading vector, which is the eigenvector
   corresponding to the rth largest eigenvalue of X ^T X.

   The first term in the objective function ([50]4) is the weighted
   grouped penalty of Pan et al. [[51]7], which induces grouped variable
   selection. The penalty encourages both α [i] and α [j] to be equal to
   zero simultaneously, suggesting that two neighboring genes in a network
   are more likely to participate in the same biological process
   simultaneously. The second term in the objective function induces
   sparsity in selection of singletons that are not connected to any other
   variables in the network. The tuning parameter τ enforces some
   coefficients of the principal components to be exactly zero with larger
   values encouraging more sparsity. The selection of τ is usually
   data-driven, and is discussed in section 2.4. The optimization problem
   is convex in α and can be solved with any off the shelf convex
   optimization package such as the CVX package [[52]13] in Matlab.

Fused sparse PCA

   The second structured sparse PCA is the Fused sparse PCA, which
   generalizes fused lasso [[53]10] to account for complex interactions
   within a pathway. Utilizing the graph structure
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   , we propose the following structured sparse PCA for the rth principal
   component loading:
   [MATH: <mtable class="eqnarray" columnalign="left center
   right"><mtr><mtd
   class="eqnarray-1"><munder><mrow><mo>min</mo></mrow><mrow><mi
   mathvariant="bold-italic">α</mi><mo>≠</mo><mn
   mathvariant="bold">0</mn></mrow></munder><mspace
   width="0.3em"></mspace><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace></mtd><mtd class="eqnarray-2"><mfenced close="}"
   open="{"
   separators=""><mrow><mo>(</mo><mn>1</mn><mo>−</mo><mi>η</mi><mo>)</mo><
   munder><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>∼</mo><mi>j</mi></mro
   w></munder><mfenced close="|" open="|"
   separators=""><mrow><mfrac><mrow><msub><mrow><mi>α</mi></mrow><mrow><mi
   >i</mi></mrow></msub></mrow><mrow><msub><mrow><mi>w</mi></mrow><mrow><m
   i>i</mi></mrow></msub></mrow></mfrac><mo>−</mo><mfrac><mrow><msub><mrow
   ><mi>α</mi></mrow><mrow><mi>j</mi></mrow></msub></mrow><mrow><msub><mro
   w><mi>w</mi></mrow><mrow><mi>j</mi></mrow></msub></mrow></mfrac></mrow>
   </mfenced><mo>+</mo><mi>η</mi><munder><mrow><mo>∑</mo></mrow><mrow><msu
   b><mrow><mi>d</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>=</mo><mn>0<
   /mn></mrow></munder><mo>|</mo><msub><mrow><mi>α</mi></mrow><mrow><mi>j<
   /mi></mrow></msub><mo>|</mo></mrow></mfenced></mtd><mtd
   class="eqnarray-3"></mtd></mtr><mtr><mtd class="eqnarray-1"></mtd><mtd
   class="eqnarray-2"><mspace width="1em"></mspace><mtext>subject
   to</mtext><mspace width="1em"></mspace><mo>∥</mo><msup><mrow><mi
   mathvariant="bold">X</mi></mrow><mrow><mtext>T</mtext></mrow></msup><mi
   mathvariant="bold">X</mi><msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>~</mo></mover></mrow><mrow><
   mi>r</mi></mrow></msub><mo>−</mo><msub><mrow><mover
   accent="true"><mrow><mi>λ</mi></mrow><mo>~</mo></mover></mrow><mrow><mi
   >r</mi></mrow></msub><mi
   mathvariant="bold-italic">α</mi><msub><mrow><mo>∥</mo></mrow><mrow><mi>
   ∞</mi></mrow></msub><mo>≤</mo><mi>τ</mi><mspace
   width="1em"></mspace><mspace
   width="1em"></mspace><mtext>and</mtext><mspace
   width="1em"></mspace><mspace width="1em"></mspace><msubsup><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>r</mi><mo>−</mo><mn>1</mn></m
   row><mrow><mtext>T</mtext></mrow></msubsup><mi
   mathvariant="bold-italic">α</mi><mo>=</mo><mn>0</mn><mspace
   width="1em"></mspace></mtd><mtd
   class="eqnarray-3"></mtd><mtd><mtext></mtext></mtd></mtr></mtable>
   :MATH]
   5

   where τ>0 is tuning parameters, 0≤η≤1 is fixed,
   [MATH: <msub><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>r</mi><mo>−</mo><mn>1</mn></m
   row></msub><mo>=</mo><mo>(</mo><msub><mrow><mover
   accent="true"><mrow><mi>α</mi></mrow><mo>^</mo></mover></mrow><mrow><mn
   >1</mn></mrow></msub><mo>,</mo><msub><mrow><mover
   accent="true"><mrow><mi>α</mi></mrow><mo>^</mo></mover></mrow><mrow><mn
   >2</mn></mrow></msub><mo>,</mo><mo>…</mo><mo>,</mo><msub><mrow><mover
   accent="true"><mrow><mi>α</mi></mrow><mo>^</mo></mover></mrow><mrow><mi
   >r</mi><mo>−</mo><mn>1</mn></mrow></msub><mo>)</mo> :MATH]
   is the matrix constituted of r−1 structured sparse PC loadings, and
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>~</mo></mover></mrow><mrow><
   mi>r</mi></mrow></msub> :MATH]
   is the rth nonsparse PC loading vector. This penalty is a combination
   of weighted l [1] penalty on variables that are connected in the
   network and l [1] penalty on singletons that are not connected to any
   genes in the network. The first term in the objective function ([54]5)
   is the fused structured penalty that encourages the difference between
   variable pairs that are connected in the network to be small and hence
   the variables to be selected together.

   This penalty is similar to some existing penalties, but different in a
   number of ways. First, it is similar to the fused lasso—both attempt to
   smooth the coefficients that are connected in
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   . However, the fused lasso does not utilize prior biological
   information. Instead, it uses a data-driven clustering approach to
   order the variables that are correlated and imposes l [1] penalty on
   the difference between coefficients of adjacent variables. It also does
   not weight neighboring features, which may allow one to enforce various
   prior relationships among features. Second, the Fused sparse penalty is
   also similar but different to the network constrained penalty of Li and
   Li [[55]6]. Their penalty
   [MATH:
   <msub><mrow><mi>η</mi></mrow><mrow><mn>1</mn></mrow></msub><munder><mro
   w><mo>∑</mo></mrow><mrow><mi>j</mi></mrow></munder><mo>|</mo><msub><mro
   w><mi>α</mi></mrow><mrow><mi>j</mi></mrow></msub><mo>|</mo><mo>+</mo><m
   sub><mrow><mi>η</mi></mrow><mrow><mn>2</mn></mrow></msub><munder><mrow>
   <mo>∑</mo></mrow><mrow><mi>i</mi><mo>∼</mo><mi>j</mi></mrow></munder><m
   sup><mrow><mfenced close=")" open="("
   separators=""><mrow><mfrac><mrow><msub><mrow><mi>α</mi></mrow><mrow><mi
   >i</mi></mrow></msub></mrow><mrow><msub><mrow><mi>w</mi></mrow><mrow><m
   i>i</mi></mrow></msub></mrow></mfrac><mo>−</mo><mfrac><mrow><msub><mrow
   ><mi>α</mi></mrow><mrow><mi>j</mi></mrow></msub></mrow><mrow><msub><mro
   w><mi>w</mi></mrow><mrow><mi>j</mi></mrow></msub></mrow></mfrac></mrow>
   </mfenced></mrow><mrow><mn>2</mn></mrow></msup> :MATH]
   uses the l [2] norm and it has been shown that this does not produce
   sparse solutions, where sparsity refers to variables that are connected
   in a network. In other words, it does not encourage grouped selection
   of variables in the network [[56]7]. Also, the additional tuning
   parameter η [2] increases computational costs for very large p since it
   requires solving a graph-constrained regression problem with dimension
   (n+p)×p.

   The two proposed methods differ in how the structural information is
   incorporated in the PCA problem. Grouped sPCA is dependent on γ in the
   L [γ] norm and have different sparsity solution in the PC loadings for
   different γ. Unlike the Fused sPCA, the weights in the Grouped sPCA
   allow for two neighboring nodes to have opposite effects, which may be
   relevant in some biological process. However, in the Fused sPCA, it is
   easy to understand that the l [1] norm difference of connected pairs
   allows variables that are connected or behave similarly to be close
   together, which is not so intuitive in the Grouped sPCA.

Algorithms

   We present two algorithms for the proposed structured sparse PCA
   methods. Algorithm 1 obtains the rth principal component loading vector
   for a fixed tuning parameter τ. Algorithm 2 provides a data driven
   approach for selecting the optimal tuning parameter value τ from a
   range of values. The normalization in step (3) of Algorithm 1 eases
   interpretation, and usually facilitates a visual comparison of the
   coefficients. Once the principal component loading vector is obtained,
   the coefficients (in absolute value) can be ranked to assess the
   contribution of the variables to a given PC. Both our methods require
   the data to be centered (column-centered for a n×p matrix) so that PCA
   can be conducted on covariance matrix. If the variables are measured on
   different scales or on a common scale with widely differing ranges, it
   is recommended to center and scale the variables to have unit variance
   before implementing the proposed methods.

   graphic file with name 12859_2017_1740_Figa_HTML.gif

   graphic file with name 12859_2017_1740_Figb_HTML.gif

   Algorithm 1 is developed to obtain r PC loading vectors. For the best
   r, we can introduce tuning parameter selection in step (2) using, for
   example cross validation to maximize the total variance explained by
   the rth principal component, with the smallest r explaining some
   proportion of variance explained selected as the optimal rth principal
   component. This would add extra layer of complexity to the tuning
   parameter selection, however.

   The tuning parameters τ=(τ [1],…τ [r]) control the model complexity and
   their optimal values need to be selected. We use Bayesian information
   criterion (BIC) [[57]8] and implement Algorithm 2 to select τ that
   yields a better rank r approximation to the test data. Compared with
   using cross-validation to select best tuning parameters, BIC can be
   computationally more efficient, especially for large datasets. The
   selection of the other tuning parameters in our experiments are
   described as follows. We fix η=0.5 for an equal likelihood of selecting
   networks and singletons. Since Pan et al. [[58]7] chose gamma=2 and 8
   and showed that these two gamma values achieved good performance, we
   fix γ=2 for both the simulation study and the real data analysis and we
   also compare in a subset of simulations γ=2 and γ=8 (see Additional
   file [59]1: Tables S1 and S2) to assess whether the results are robust
   to the gamma value. We set w [i] and w [j] as the degree of each node
   following the suggestion in Pan et al. [[60]7]. Our paper seeks to
   develop methods for estimating sparse principal components, as such it
   is not the focus of the paper to investigate principled approaches for
   selecting the number of principal components that will be used in
   subsequent analyses. We use the top two principal components in both
   our simulation study and the real data analysis. In practice, some
   ad-hoc approaches, such as choosing the top K PCs with more than 80%
   variation explained, can be used.

Results

   We conduct numerical studies including simiulations and real data
   analysis to assess the performance of the proposed methods in
   comparison with several existing sparse PCA methods. We consider two
   simulation settings that differ by the proportions of variation
   explained by the first two PCs. In the first setting, the first two PCs
   explain 6% of the total variation which indicates that true signals in
   the data are weak. In the second setting, the first two PC’s explain
   30% of the total variation in the data, representing a case where
   signals are strong. Within each setting, we consider the dimensions
   p=500 and p=10,000, and also consider two scenarios that differ by the
   graph structure
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   for the proposed methods.

Simulation settings

   Let X be a n×p matrix and let G [0] be the true covariance matrix used
   to generate X. Let
   [MATH: <msub><mrow><mi
   mathvariant="script">G</mi></mrow><mrow><mn>0</mn></mrow></msub> :MATH]
   be the corresponding graph structure. The true covariance matrix G [0]
   is partitioned as
   [MATH: <mtable class="eqnarray" columnalign="left center
   right"><mtr><mtd class="eqnarray-1"><msub><mrow><mi
   mathvariant="bold">G</mi></mrow><mrow><mn>0</mn></mrow></msub><mo>=</mo
   ><mfenced close=")" open="("
   separators=""><mrow><mtable><mtr><mtd><msub><mrow><mi
   mathvariant="bold">G</mi></mrow><mrow><mn>00</mn></mrow></msub></mtd><m
   td><mn mathvariant="bold">0</mn></mtd></mtr><mtr><mtd><mn
   mathvariant="bold">0</mn></mtd><mtd><mi>ν</mi><mo>×</mo><msub><mrow><mi
   mathvariant="bold">I</mi></mrow><mrow><mi>p</mi><mo>−</mo><mn>36</mn></
   mrow></msub><mspace
   width="1em"></mspace></mtd></mtr></mtable></mrow></mfenced><mo>,</mo></
   mtd><mtd class="eqnarray-2"></mtd><mtd
   class="eqnarray-3"></mtd><mtd><mtext></mtext></mtd></mtr></mtable>
   :MATH]

   where G [00] is block diagonal with ten blocks each of size 18 for
   p=500 and size 250 for p=10,000, and between block correlation 0. We
   set the variance of variables in the first two blocks to be 1, and 0.3
   for the remaining eight blocks. In addition, we set the correlation of
   a main and connecting variable to be 0.9 for the first two blocks and
   0.2 for the other blocks. Meanwhile, we let the correlation ρ
   [ik]∼Uniform(0.7,0.8),i≠k and i,k≥2 for the first two blocks, and ρ
   [ik]∼Uniform(0,0.2),i≠k and i,k≥2 for the other blocks. This type of
   covariance matrix G [0] suggests that data structure is determined by
   ten underlying subnetworks, where the first two PCs of the first two
   subnetworks are mostly important in detecting signals in the data. In
   other words, in both settings, the true PCs has 36 important variables
   and p−36 noise variables when p=500, and p=500 important variables and
   p−500 noise variables for p=10,000. We note that by changing the value
   of ν, we control the proportions of variation explained by the first
   two PCs. The ν values we used in both simulation settings are presented
   in Additional file [61]1: Table S3. For each setting, we specify n=100,
   and simulate X from multivariate normal distribution with mean 0 and
   variance G [0].

   For each setting and dimension, we consider two scenarios that differ
   by the graph structure
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   specified in the proposed sPCA methods. In the first scenario, the
   graph structure is correctly specified, that is
   [MATH: <mi mathvariant="script">G</mi><mo>=</mo><msub><mrow><mi
   mathvariant="script">G</mi></mrow><mrow><mn>0</mn></mrow></msub> :MATH]
   . This corresponds to the situation where all true structural
   information are available in
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   so that
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   is informative. The resulting network includes 500 variables and 170
   edges between each main variable and connecting variable when p equals
   500 (or 10,000 variables and 2490 edges when p equals 10,000), i.e.,
   E={i∼j|i,j=1,⋯,180} in
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   when p equals 500 (or E={i∼j|i, j=1,⋯,2,500} in
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   when p equals 10,000). Figure [62]1 is a graph of the network
   [MATH: <mi mathvariant="script">G</mi> :MATH]
   used in Fused and Grouped sPCA when network information is correctly
   specified.

Fig. 1.

   Fig. 1
   [63]Open in a new tab

   Network structure of simulated data: Correctly specified graph.
   Variables in circle represent signals, and square represent noise. (
   [MATH: <mi mathvariant="script">G</mi><mo>=</mo><msub><mrow><mi
   mathvariant="script">G</mi></mrow><mrow><mn>0</mn></mrow></msub> :MATH]
   )

   In the second scenario, the graph structure is randomly generated and
   does not capture the true information in the data. The resulting
   network includes a total of 170 random edges when p equals 500 (or 2490
   edges when p equals 10,000). We first generate a p×p matrix with each
   element from U(0,1) distribution. The elements with values more than an
   arbitrary cutoff 0.95 are saved as candidates for random edges by
   considering their row numbers and column numbers are connected nodes.
   We then choose a random subset with size 170 (or 2490) as the
   noninformative structure. It is possible that few random edges have
   overlaps with informative edges, but most of them are still noises.
   This setting assesses the performance of the proposed methods in cases
   where the structural information is uninformative and sheds light on
   robustness of the proposed methods. Additional file [64]1: Figure S1
   shows the graph structure for randomly specified edges.

   Performance Metrics We compare the proposed methods Grouped PCA and
   Fused PCA to the traditional PCA [[65]1], SPCA [[66]4] and SPC
   [[67]14]. We implement SPCA and SPC using the R-packages elasticnet and
   PMA respectively. We evaluate the performance of the methods using the
   following criteria.
     * Reconstruction error:
       [MATH: <mo>|</mo><mo>|</mo><msub><mrow><mi
       mathvariant="bold">X</mi></mrow><mrow><mtext
       mathvariant="italic">test</mtext></mrow></msub><mi
       mathvariant="bold">A</mi><msup><mrow><mi
       mathvariant="bold">A</mi></mrow><mrow><mi>T</mi></mrow></msup><mo>−
       </mo><msub><mrow><mi mathvariant="bold">X</mi></mrow><mrow><mtext
       mathvariant="italic">test</mtext></mrow></msub><mover
       accent="true"><mrow><mi
       mathvariant="bold">A</mi></mrow><mo>^</mo></mover><msup><mrow><move
       r accent="true"><mrow><mi
       mathvariant="bold">A</mi></mrow><mo>^</mo></mover></mrow><mrow><mte
       xt>T</mtext></mrow></msup><mo>|</mo><msubsup><mrow><mo>|</mo></mrow
       ><mrow><mi>F</mi></mrow><mrow><mn>2</mn></mrow></msubsup> :MATH]
       , where A=(α [1] α [2]) are the true PC loadings and
       [MATH: <mover accent="true"><mrow><mi
       mathvariant="bold">A</mi></mrow><mo>^</mo></mover><mo>=</mo><mo>(</
       mo><msub><mrow><mover accent="true"><mrow><mi
       mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mr
       ow><mn>1</mn></mrow></msub><mspace
       width="1em"></mspace><msub><mrow><mover accent="true"><mrow><mi
       mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mr
       ow><mn>2</mn></mrow></msub><mo>)</mo> :MATH]
       are the estimated PC loadings. This criterion tests the methods
       ability to approximate the testing data reconstructed using only
       the first two PC loadings.
     * Estimation error:
       [MATH: <mo>|</mo><mo>|</mo><mi
       mathvariant="bold">A</mi><msup><mrow><mi
       mathvariant="bold">A</mi></mrow><mrow><mi>T</mi></mrow></msup><mo>−
       </mo><mover accent="true"><mrow><mi
       mathvariant="bold">A</mi></mrow><mo>^</mo></mover><msup><mrow><move
       r accent="true"><mrow><mi
       mathvariant="bold">A</mi></mrow><mo>^</mo></mover></mrow><mrow><mte
       xt>T</mtext></mrow></msup><mo>|</mo><msubsup><mrow><mo>|</mo></mrow
       ><mrow><mi>F</mi></mrow><mrow><mn>2</mn></mrow></msubsup> :MATH]
       . This criterion tests the methods ability to estimate the linear
       subspace spanned by the true PC loadings [[68]15], with a smaller
       estimate preferred.
     * Selectivity: We also test the methods ability to select the right
       variables while ignoring noise variables using sensitivity and
       specificity which are defined as
       [MATH: <mtext
       mathvariant="italic">Sensitivity</mtext><mo>=</mo><mfrac><mrow><mi>
       #</mi><mspace width="1em"></mspace><mtext>of</mtext><mspace
       width="1em"></mspace><mtext>True</mtext><mspace
       width="1em"></mspace><mtext>Positive</mtext></mrow><mrow><mi>#</mi>
       <mspace width="1em"></mspace><mtext>of</mtext><mspace
       width="1em"></mspace><mtext>True</mtext><mspace
       width="1em"></mspace><mtext>Positive</mtext><mo>+</mo><mi>#</mi><ms
       pace width="1em"></mspace><mtext>of</mtext><mspace
       width="1em"></mspace><mtext>False</mtext><mspace
       width="1em"></mspace><mtext>Negative</mtext></mrow></mfrac><mo>,</m
       o> :MATH]
       [MATH: <mtext
       mathvariant="italic">Specificity</mtext><mo>=</mo><mfrac><mrow><mi>
       #</mi><mspace width="1em"></mspace><mtext>of</mtext><mspace
       width="1em"></mspace><mtext>True</mtext><mspace
       width="1em"></mspace><mtext>Negative</mtext></mrow><mrow><mi>#</mi>
       <mspace width="1em"></mspace><mtext>of</mtext><mspace
       width="1em"></mspace><mtext>True</mtext><mspace
       width="1em"></mspace><mtext>Negative</mtext><mo>+</mo><mi>#</mi><ms
       pace width="1em"></mspace><mtext>of</mtext><mspace
       width="1em"></mspace><mtext>False</mtext><mspace
       width="1em"></mspace><mtext>Positive</mtext></mrow></mfrac><mi>.</m
       i> :MATH]
       Sensitivity and specificity capture the accuracy of estimated PC
       loadings with high values indicating better performance.
     * Proportion of variance explained: The fourth comparison criterion
       is the proportion of variation explained in the testing and
       training data sets by the first two PC loadings, which is defined
       as
       [MATH: <mfrac><mrow><msup><mrow><mover accent="true"><mrow><mi
       mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mr
       ow><mtext>T</mtext></mrow></msup><mi
       mathvariant="bold">X</mi><msup><mrow><mi
       mathvariant="bold">X</mi></mrow><mrow><mtext>T</mtext></mrow></msup
       ><mover accent="true"><mrow><mi
       mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mr
       ow><mtext>trace</mtext><mo>(</mo><mi
       mathvariant="bold">X</mi><msup><mrow><mi
       mathvariant="bold">X</mi></mrow><mrow><mtext>T</mtext></mrow></msup
       ><mo>)</mo></mrow></mfrac><mo>,</mo> :MATH]
       where X is either the centered training or testing data set, and
       [MATH: <mover accent="true"><mrow><mtext></mtext><mspace
       width="1em"></mspace><mi>α</mi><mtext></mtext></mrow><mo>^</mo></mo
       ver> :MATH]
       is the estimated first or second PC.

Simulation results

   Table [69]1 shows the performance of the methods for the first setting
   where the first two PCs explain only 6% of the total variation in the
   data. We observe that the proposed methods are competitive for p=500
   and even more so when p=10,000. In particular, Grouped sPCA has smaller
   reconstruction and estimation errors when the graph structure is
   correctly specified and even when the graph structure is uninformative.
   On the other hand, Fused sPCA shows a suboptimal performance in
   comparison to Grouped sPCA, yet better or competitive performance when
   compared to the traditional PCA and SPCA for correctly specified graph
   structure and mis-specified graph structure. In terms of sensitivity
   and specificity, we observe that both Grouped sPCA and more especially
   Fused sPCA are better in detecting signals even when the graph
   structure is mis-specified, while Grouped sPCA is more competitive at
   not selecting noise variables. We also notice that both Grouped sPCA
   and Fused sPCA have good performance in proportions of cumulative
   variation explained compared with existing sparse PCA methods,
   especially compared with SPCA. In Table [70]2 where the first two PC’s
   explain 30% of the total variation in the data, we observe a similar
   performance of the proposed methods.

Table 1.

   Simulation results of setting 1
   Method RE EE Sensitivity Specificity cPVE
   1stPC 2ndPC 1stPC 2ndPC 1stPC 2ndPC
   P = 500
    PCA 31 (9e-1) 1.1 (3e-2) 1.0 1.0 0.0 0.0 4.3e-2 (2e-3) 8.2e-2 (2e-3)
    SPCA 34 (3) 1.2 (1e-1) 0.54 0.50 0.95 0.90 2.0e-2 (2e-3) 4.0e-2 (4e-3)
    SPC 16 (8) 0.57 (3e-1) 0.57 0.60 0.98 1.0 2.8e-2 (3e-3) 5.5e-2 (6e-3)
   Biological information correctly specified
    Fused sPCA 25 (6) 0.90 (2e-1) 1.0 1.0 0.73 0.70 2.9e-2 (4e-3) 5.1e-2
   (7e-3)
    Grouped sPCA 8.0 (6) 0.29 (2e-1) 0.81 0.80 0.97 1.0 3.2e-2 (2e-3)
   6.0e-2 (3e-3)
   Biological information randomly specified
    Fused sPCA 32 (4) 1.1 (2e-1) 0.95 1.0 0.51 0.51 3.0e-2 (4e-3) 5.2e-2
   (7e-3)
    Grouped sPCA 9.1 (6) 0.33 (2e-1) 0.81 0.80 0.97 1.0 3.2e-2 (2e-3)
   5.9e-2 (3e-3)
   P = 10,000
    PCA 112 (3) 1.3 (2e-2) 1.0 1.0 0.0 0.0 2.6e-2 (1e-3) 5.0e-2 (1e-3)
    SPCA 160 (4) 1.9 (3e-2) 0.15 0.15 0.99 0.99 2.3e-3 (5e-4) 4.5e-3
   (7e-4)
    SPC 172 (4) 2.0 (8e-3) 0.01 0.01 1.0 1.0 1.7e-4 (1e-4) 3.4e-4 (3e-4)
   Biological information correctly specified
    Fused sPCA 81 (50) 0.94 (0.5) 0.62 0.55 0.99 0.99 1.2e-2 (6e-3) 2.2e-2
   (1e-2)
    Grouped sPCA 54 (40) 0.62 (0.4) 0.62 0.58 0.99 1.0 1.4e-2 (3e-3)
   2.6e-2 (6e-3)
   Biological information randomly specified
    Fused sPCA 140 (30) 1.6 (0.4) 0.60 0.60 0.68 0.68 8.9e-3 (5e-3) 1.6e-2
   (1e-2)
    Grouped sPCA 58 (40) 0.67 (0.5) 0.59 0.55 0.99 1.0 1.4e-2 (3e-3)
   2.6e-2 (7e-2)
   [71]Open in a new tab

   Cumulative proportions of variance explained by true PCs are 0.03 for
   PC 1 and 0.06 for PC 1 and 2. P, number of variables. RE,
   reconstruction error, defined as
   [MATH: <mo>|</mo><mo>|</mo><msub><mrow><mi
   mathvariant="bold">X</mi></mrow><mrow><mtext
   mathvariant="italic">test</mtext></mrow></msub><mi
   mathvariant="bold">A</mi><msup><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>T</mi></mrow></msup><mo>−</mo
   ><msub><mrow><mi mathvariant="bold">X</mi></mrow><mrow><mtext
   mathvariant="italic">test</mtext></mrow></msub><mover
   accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mo>^</mo></mover><msup><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mo>^</mo></mover></mrow><mrow><mtext>T
   </mtext></mrow></msup><mo>|</mo><msubsup><mrow><mo>|</mo></mrow><mrow><
   mi>F</mi></mrow><mrow><mn>2</mn></mrow></msubsup> :MATH]
   , where A=(α [1] α [2]). EE, estimation error, defined as
   [MATH: <mo>|</mo><mo>|</mo><mi mathvariant="bold">A</mi><msup><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>T</mi></mrow></msup><mo>−</mo
   ><mover accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mo>^</mo></mover><msup><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mo>^</mo></mover></mrow><mrow><mtext>T
   </mtext></mrow></msup><mo>|</mo><msubsup><mrow><mo>|</mo></mrow><mrow><
   mi>F</mi></mrow><mrow><mn>2</mn></mrow></msubsup> :MATH]
   . cPVE, proportions of cumulative variation explained. ·(·), mean(std)

Table 2.

   Simulation results of setting 2
   Method RE EE Sensitivity Specificity cPVE
   1stPC 2ndPC 1stPC 2ndPC 1stPC 2ndPC
   P = 500
    PCA 31 (0.9) 1.1 (3e-2) 1.0 1.0 0.0 0.0 4.3e-2 (2e-3) 8.2e-2 (2e-3)
    SPCA 35 (2) 1.3 (9e-2) 0.49 0.50 0.95 1.0 1.9e-2 (3e-3) 3.9e-2 (4e-3)
    SPC 15 (7) 0.54 (3e-1) 0.57 0.60 0.98 1.0 2.8e-2 (3e-3) 5.6e-2 (5e-3)
   Biological information correctly specified
    Fused sPCA 27 (4) 0.93 (2e-1) 1.0 1.0 0.70 0.70 3.0e-2 (3e-3) 5.3e-2
   (5e-3)
    Grouped sPCA 7.9 (5) 0.29 (2e-1) 0.80 0.80 0.97 1.0 3.2e-2(2e-3)
   6.0e-2 (3e-3)
   Biological information randomly specified
    Fused sPCA 32 (5) 1.1 (2e-1) 0.96 1.0 0.52 0.50 2.9e-2 (5e-3) 5.1e-2
   (8e-3)
    Grouped sPCA 9.2 (6) 0.33 (0.2) 0.79 0.8 0.97 1.0 3.2e-2 (2e-3) 5.9e-2
   (4e-3)
   P = 10,000
    PCA 112 (3) 1.3 (2e-2) 1.0 1.0 0.0 0.0 2.7e-2 (1e-3) 5.0e-2 (1e-3)
    SPCA 162 (4) 1.9 (3e-2) 0.16 0.16 1.0 1.0 2.0e-3 (5e-4) 4.0e-3 (8e-4)
    SPC 173 (4) 2.0 (5e-3) 5.0e-3 5.0e-3 1.0 1.0 1.6e-4 (1e-4) 3.2e-4
   (2e-4)
   Biological information correctly specified
    Fused sPCA 77 (40) 0.89 (0.5) 0.65 0.57 0.99 1.0 1.3e-2 (5e-3) 2.3e-2
   (9e-3)
    Grouped sPCA 46 (30) 0.53 (0.4) 0.65 0.62 0.99 1.0 1.5e-2 (2e-3)
   2.8e-2 (5e-3)
   Biological information randomly specified
    Fused sPCA 140 (30) 1.6 (0.4) 0.59 0.60 0.68 0.70 9.0e-3 (5e-3) 1.7e-2
   (1e-2)
    Grouped sPCA 53 (40) 0.61 (0.4) 0.63 0.60 0.99 1.0 1.5e-2 (3e-3)
   2.7e-2 (6e-3)
   [72]Open in a new tab

   Cumulative proportions of variance explained by true PCs are 0.15 for
   PC 1 and 0.30 for PC 1 and 2. P, number of variables. RE,
   reconstruction error, defined as
   [MATH: <mo>|</mo><mo>|</mo><msub><mrow><mi
   mathvariant="bold">X</mi></mrow><mrow><mtext
   mathvariant="italic">test</mtext></mrow></msub><mi
   mathvariant="bold">A</mi><msup><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>T</mi></mrow></msup><mo>−</mo
   ><msub><mrow><mi mathvariant="bold">X</mi></mrow><mrow><mtext
   mathvariant="italic">test</mtext></mrow></msub><mover
   accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mo>^</mo></mover><msup><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mo>^</mo></mover></mrow><mrow><mtext>T
   </mtext></mrow></msup><mo>|</mo><msubsup><mrow><mo>|</mo></mrow><mrow><
   mi>F</mi></mrow><mrow><mn>2</mn></mrow></msubsup> :MATH]
   , where A=(α [1] α [2]). EE, estimation error, defined as
   [MATH: <mo>|</mo><mo>|</mo><mi mathvariant="bold">A</mi><msup><mrow><mi
   mathvariant="bold">A</mi></mrow><mrow><mi>T</mi></mrow></msup><mo>−</mo
   ><mover accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mo>^</mo></mover><msup><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold">A</mi></mrow><mo>^</mo></mover></mrow><mrow><mtext>T
   </mtext></mrow></msup><mo>|</mo><msubsup><mrow><mo>|</mo></mrow><mrow><
   mi>F</mi></mrow><mrow><mn>2</mn></mrow></msubsup> :MATH]
   . cPVE, proportions of cumulative variation explained. ·(·), mean(std)

   A comparison between p=500 and p=10,000 scenarios for both settings
   indicates that the gain in reconstruction error, estimation error,
   sensitivity, and proportions of variation explained can be substantial
   for Grouped sPCA and Fused sPCA compared with the existing sparse PCA
   methods, as the number of variables increases. This suggests that
   Grouped sPCA or Fused sPCA can achieve sparse PC loading estimations
   with higher accuracy, better variable selection, and larger proportion
   of variation explained, especially when the number of variables is
   relatively large.

   We evaluate the results on different γ values. Both Tables [73]1 and
   [74]2 use γ=2 and the results of the same settings with γ=8 are
   presented in Additional file [75]1: Tables S1 and S2. A comparison of
   Table [76]1 versus Additional file [77]1: Table S1 (or Table [78]2
   versus Additional file [79]1: Table S2) shows very similar results,
   indicating that the proposed methods are robust to the different
   selection of γ values. We also explore how much the results would be
   impacted by adding noise structural information in both settings with
   P=500. The results are demonstrated in Additional file [80]1: Tables S4
   and S5. We find that the results by both Fused sPCA and Grouped sPCA
   worsen a little as expected after adding 170 noise edges. We also find
   that Grouped sPCA is more robust to noise information than Fused sPCA.
   After noise informtion is added, Grouped sPCA still has good
   performance.

Analysis of Glioblastoma data

   We apply the proposed methods to analyze data from a Glioblastoma
   cancer study. Glioblastoma brain multiform (GBM) is the most common
   malignant brain tumor and is defined as grade IV astrocytoma by the
   Whold Health Organization because of its aggressive and malignant
   nature [[81]16]. The Cancer Genome Atlas Project (TCGA) [[82]17]
   integratively analyzed genome information of patients with glioblastoma
   and expanded the knowledge about the pathways and genes that may relate
   with glioblastoma. In our data analysis, we obtain part of the genomic
   data from TCGA project for glioblastoma, which is explained in detail
   by McLendon et al. [[83]17], Verhaak et al. [[84]18], Cooper et al.
   [[85]19]. This data set contains microarray data of 558 subjects with
   glioblastoma. The GBM subtype of each subject is also given.

   The goal of the analysis is to identify a subset of relevant genes that
   contribute to the variation in the different GBM subtypes, and also
   determine how the first two estimated PCs separate these subtypes. For
   both datasets, we first select 2,000 variables with the largest
   variation following the data preprocessing procedure in Witten et al.
   [[86]14]. In the next step, we select patients with subtype Classical,
   Mesenchymal, Neural, and Proneural following the previous work by
   Verhaak et al. [[87]18] resulting in 481 patients with subtype data. We
   obtain the gene network information for Fused and Grouped sparse PCA
   methods from the Kyoto Encyclopedia of Genes and Genomes (KEGG)
   database [[88]20]. The resulting network has 2000 genes and 1297 edges
   in the network. We center each variable to have mean 0 and standardize
   each variable to have variance one.

   To justify the structural information we use for the proposed methods,
   we conduct exploratory analysis using correlation coefficients of gene
   pairs. We group the gene pairs consisting of the selected 2000 genes
   into three categories: unconnected gene pairs (two genes that are not
   in any pathway), direct-connected gene pairs (two genes that have a
   direct edge connecting them), indirect-connected gene pairs (two genes
   that belong to the same pathway but do not have a direct edge
   connecting them) according to the KEGG Pathway information and we use
   boxplots to demonstrate the correlation coefficients of these three
   types of gene pairs. Additional file [89]1: Figure S2 shows the plot of
   correlation coefficients of gene pairs by their categories. There is a
   small but clear decreasing trend in correlation coefficients as one
   moves from direct-connected gene pairs to unconnected gene pairs. This
   shows that the gene pairs that are directly connected tend to have
   stronger correlations than those that are indirectly connected or
   unconnected, thus justifying the validity of pathway information we use
   in the analysis.

   In the analysis, we equally split each data set into training and
   testing sets, where the training set is used to estimate the optimal
   tuning parameters via BIC. The plots of BIC values versus tuning
   parameters for Grouped sPCA and Fused sPCA are shown in Additional file
   [90]1: Figure S3. We then apply the optimal parameters on the whole
   training set to estimate the first two PC loadings
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub><mo>,</mo><mi>i</mi><mo>=</mo><mn>1</mn><mo>,</m
   o><mn>2</mn> :MATH]
   , and use the testing set to evaluate the estimated loadings using the
   following two criteria:
   [MATH: <mtable class="align" columnalign="left"><mtr><mtd
   class="align-1"><mtext>Number of non-zero loadings
   of</mtext></mtd></mtr><mtr><mtd class="align-1"><msub><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub><mo>=</mo><msubsup><mrow><mi
   mathvariant="bold-italic">Σ</mi></mrow><mrow><mi>j</mi><mo>=</mo><mn>1<
   /mn></mrow><mrow><mn>2000</mn></mrow></msubsup><mi>I</mi><mo>{</mo><msu
   b><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mrow><
   mtext
   mathvariant="italic">ij</mtext></mrow></msub><mo>≠</mo><mn>0</mn><mo>}<
   /mo><mo>,</mo><mspace
   width="1em"></mspace><mi>i</mi><mo>=</mo><mn>1</mn><mo>,</mo><mn>2</mn>
   <mo>;</mo></mtd></mtr></mtable> :MATH]
   [MATH: <mtable class="align" columnalign="left"><mtr><mtd
   class="align-1"><mtext>Proportion of variation explained
   by</mtext></mtd></mtr><mtr><mtd class="align-1"><msub><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub><mo>=</mo><mfrac><mrow><msubsup><mrow><mover
   accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mrow><
   mi>i</mi></mrow><mrow><mtext>T</mtext></mrow></msubsup><mi
   mathvariant="bold">X</mi><msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub></mrow><mrow><mtext>trace</mtext><mo>(</mo><mi
   mathvariant="bold">X</mi><msup><mrow><mi
   mathvariant="bold">X</mi></mrow><mrow><mi>T</mi></mrow></msup><mo>)</mo
   ></mrow></mfrac><mo>,</mo><mspace
   width="1em"></mspace><mi>i</mi><mo>=</mo><mn>1</mn><mo>,</mo><mn>2</mn>
   <mo>,</mo></mtd></mtr></mtable> :MATH]

   where X is the centered training or testing data matrix. We also obtain
   the first two PCs
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover> :MATH]
   by
   [MATH: <msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub><mo>=</mo><mi
   mathvariant="bold">X</mi><msub><mrow><mover accent="true"><mrow><mi
   mathvariant="bold-italic">α</mi></mrow><mo>^</mo></mover></mrow><mrow><
   mi>i</mi></mrow></msub> :MATH]
   , i=1,2 and determine how well they separate patients with different
   GBM subtypes using support vector machine (SVM).

   Table [91]3 shows the number of non-zero loadings, the cumulative
   proportions of variation explained by the first two PC loadings, and
   the classification results using SVM. We find that SPC and SPCA are
   more sparse than the Fused sparse PCA and the Grouped sparse PCA. This
   is consistent with the simulation settings where SPC and SPCA tend to
   be more sparse and have higher false negatives that result in lower
   sensitivity. Regarding cumulative proportions of variation explained,
   we find that the proposed methods explain higher variation in the data,
   but this may be due to the large number of variables selected. The last
   column of Table [92]3 gives the classification results from applying
   SVM on the testing set using the estimated first two PC loadings. The
   Fused and Grouped sparse PCA have the highest number of correctly
   specified subjects. Of the existing methods, PCA and SPCA achieve good
   performance of separating patients with different subtypes, while SPC
   has the lowest number of subjects correctly classified.

Table 3.

   Analysis of the GBM data using Kegg Pathway information. cPVE
   represents proportions of cumulative variation explained
   Method       Non-zero Loadings  cPVE          Subjects correctly classified
                1stPC 2ndPC        1stPC  2ndPC  SVM
   PCA          2000  2000         0.1955 0.3175 97
   SPCA         240   238          0.0333 0.0591 97
   SPC          45    59           0.0215 0.0383 67
   Fused sPCA   1644  1410         0.1792 0.2787 123
   Grouped sPCA 1330  970          0.1731 0.2652 119
   [93]Open in a new tab

   We also conduct pathway enrichment analysis using bioinformatics
   software ToppGene Suite [[94]21]. We take the first PC as an example
   for illustration. We identify the genes that have non-zero loadings in
   the first PC from the proposed sparse PCA methods and existing methods,
   and obtain significantly enriched pathways that are associated with
   glioblastoma for each method. We seek to identify methods that have
   more glioblastoma-associated pathways, and whether these overlap. Table
   [95]4 shows the Glioblastoma-related pathways found by the proposed
   methods and existing sparse PCA methods. Among the existing sparse PCA
   methods, both SPC and SPCA find Spinal Cord Injury pathway. Compared
   with the existing methods, Fused and Grouped sPCA find a few new
   Glioblastoma-related pathways: Proteoglycans in cancer, Transcriptional
   misregulation in cancer, Pathways in cancer, Bladder cancer, and
   Angiogenesis. These pathways have been demonstrated in existing
   literatures to be associated with Glioblastoma [[96]22–[97]27]. We do
   not conduct pathway enrichment analysis with the results of traditional
   PCA because traditional PCA does not perform any variable selection and
   automatically select all variables. We also plot the first two PC
   loadings by Fused and Grouped sPCA in Additional file [98]1: Figure S4
   and the loadings of genes enriched in Glioblastoma-related pathways are
   highlighted in color. These results indicate that the proposed methods
   may be more sensitive in detecting disease related signals and thus can
   identify more biologically important genes.

Table 4.

   Enriched Glioblastoma-related pathways for the genes in first PC by
   different sPCA methods
   Pathway ID Pathway name P-value Gene
   From input In annotation
   Fused sPCA
    739007 Spinal cord injury 7.43E-18 45 112
    782000 Proteoglycans in cancer 5.77E-11 55 225
    523016 Transcriptional misregulation in cancer 3.31E-7 40 179
    83105 Pathways in cancer 3.36E-7 61 327
    83115 Bladder cancer 6.10E-6 14 38
   Grouped sPCA
    739007 Spinal Cord Injury 1.97E-14 36 112
    523016 Transcriptional misregulation in cancer 4.06E-7 34 179
    83105 Pathways in cancer 2.58E-5 46 327
    [99]P00005 Angiogenesis 4.90E-5 26 150
   SPC
    739007 Spinal Cord Injury 1.43E-5 5 112
   SPCA
    739007 Spinal Cord Injury 6.46E-5 8 112
   [100]Open in a new tab

Discussions

   In this paper, we propose two novel structured sparse PCA methods.
   Through extensive simulation studies and an application to Glioblastoma
   gene expression data, we demonstrate that incorporating known
   biological information improves the performance of sparse PCA methods.
   Specifically, our simulation study indicates that the proposed methods
   can decrease reconstruction and estimation errors, and increase
   sensitivity and proportions of variation explained, especially when
   number of variables is large. Compared with Fused sPCA and existing PCA
   methods, Grouped sPCA achieves the lowest reconstruction error and
   estimation error for correctly specified and mis-specified graph
   structure. On the other hand, Fused sPCA has higher sensitivity values.
   Because we utilize prior biological information, the proposed methods
   usually have less sparse PC loadings compared with the existing sPCA
   methods and thus lower specificity. However, there is a trade-off
   between sparsity and the benefit from extra information. Consistent
   with the simulations results, the real data analysis demonstrates that
   the proposed methods generate less sparse PC loadings. However, the
   classification results show the advantages of incorporating biological
   information into sparse PCA.

   The proposed methods require the structure of variables to be known in
   advance and specified during analysis. In real data analysis, this task
   is not trivial and it may take some efforts in searching for a proper
   variable structure to use. Regarding this, we make the following
   comments. First of all, many sources of structural information may be
   available to use including KEGG pathway [[101]20], Panther pathway
   [[102]28], Human protein reference database [[103]29]. It may be
   helpful to conduct some exploratory analysis such as Additional file
   [104]1: Figure S2 to confirm the need for using biological information.
   Additional file [105]1: Figure S2 demonstrates that gene pairs
   connected in the same pathway generally have higher correlation than
   gene pairs unconnected in the same pathway, and further than gene pairs
   in different pathways. Second, our simulation study indicates that even
   if the structural information is irrelevant as in the biological
   information randomly specified section, the proposed methods still
   perform well, especially Grouped sPCA method.

   Our proposed methods have some limitations. First, when structural
   information includes a large number of edges, the proposed methods,
   particularly, Fused sPCA, may generate PC loadings that include more
   false positive selections. To solve this problem, one potential
   approach is to obtain a smaller but more relevant biological structure.
   Second, the proposed methods, especially Grouped sPCA may be
   computationally slow in the presence of a large number of edges. Based
   on our experience with the simulations and the real data set, Fused
   sPCA is computationally more efficient than Grouped sPCA since we are
   able to vectorize the penalty for Fused sPCA in the algorithm. Lastly,
   it has been observed that many studies used gene expression data that
   are inefficiently and insufficiently pre-processed or normalized, which
   leads to failure of eliminating technical noise or batch effects
   [[106]30]. Our proposed methods do not provide steps for pre-processing
   or normalizing data. The users should adequately pre-process gene
   expression data to remove potential technical noises and batch effects
   before applying our methods.

   Our structured sparse PCA methods are aimed for estimating sparse PCs
   and can be considered a dimension reduction technique. Subsequent
   analyses could use the estimated PCs in a number of different ways. For
   example, one could use PCs for visualizing gene expression data,
   clustering, or building prediction model. Following suggestions from a
   reviewer, we conducted one additional set of simulations to assess the
   prediction performance of using the top k PCs that achieve a certain
   proportion of total variation explained, and the impact of different
   threshold values for the proportion of total variation explained. We
   used a simulation setting similar to Setting 2 in the Simulation
   section with 100 subject, 500 variables, and 100 simulated datasets.
   The cumulative proportions of variation explained by the first two PCs
   are 30%. We generated a binary outcome variable using the first PC
   through a logistic regression model: l o g i t(P r(Y [i]=1))=0.5+P C
   [1i]. The simulation results presented in Additional file [107]1: Table
   S6 show that Fused sPCA has the highest prediction accuracy among all
   the sparse PCA methods when 30, 50, and 60% are used as the threshold,
   consistent with our findings in real data analysis. Also, the
   prediction accuracy is not very sensitive to the choice of threshold
   values. Of note, in these simulations, the proportion of total
   variation explained by all PCs estimated using sparse PCA methods fails
   to reach 70% for our method and 60% for other methods, which is likely
   due to regularization/sparsity. It has been reported previously
   [[108]14, [109]31] that sparse PCA generates PC solutions that explain
   smaller proportions of total variation than standard PCA. Future
   research is needed to investigate more principled approaches for
   choosing the top k PCs in subsequent analysis and to understand why the
   proportion of total variation explained by all PCs estimated using
   sparse PCA methods fails to reach certain threshold and potential
   remedy for this limitation.

   Although we apply the proposed methods to analysis of gene expression
   data, our methods are flexible and general enough to be applied to
   other data types, such as epi-genomics data discussed in the review
   paper by Qin et al. [[110]32]. Besides the potential application to
   other data, some extensions are of potential interest. One may use
   alternative convex optimization solvers other than the CVX solver in
   Matlab used in our work, potentially to speed up the computations. In
   addition, Fused and Grouped sPCA only incorporate the edge information
   in a graph. As variables are often grouped into pathways, sPCA using
   hierarchical penalties [[111]33] can be developed to incorporate group
   membership information in addition to edge information.

Conclusions

   The proposed sparse PCA methods Fused and Grouped sparse PCA can
   effectively incorporate prior biological information in variable
   selection, leading to improved feature selection and more interpretable
   principal component loadings and potentially providing insights on
   molecular underpinnings of complex diseases.

Acknowledgements