Abstract

   MicroRNAs are small non-coding RNAs that influence gene expression by
   binding to the 3’ UTR of target mRNAs in order to repress protein
   synthesis. Soon after their discovery, microRNA dysregulation was
   associated with several pathologies. In particular, microRNAs have often been
   reported as differentially expressed in healthy and tumor samples. This
   fact suggested that microRNAs are likely to be good candidate
   biomarkers for cancer diagnosis and personalized medicine. With the
   advent of Next-Generation Sequencing (NGS), measuring the expression
   level of the whole miRNAome at once is now routine. Moreover, the
   collaborative effort of sharing data opens up the possibility of
   population-level analyses. This context motivated us to perform an in silico
   study to distill cancer-specific panels of microRNAs that can serve as
   biomarkers. We observed that the problem of finding biomarkers can be
   modeled as a two-class classification task where, given the miRNAomes
   of a population of healthy and cancerous samples, we want to find the
   subset of microRNAs that leads to the highest classification accuracy.
   We address this task by leveraging a combination of data mining tools.
   In particular, we used differential evolution for candidate selection,
   component analysis to preserve the relationships among miRNAs, and a
   support vector machine (SVM) for sample classification. We identified 10
   cancer-specific panels whose classification accuracy is always higher
   than 92%. These panels have very little overlap, suggesting that
   miRNAs are not only predictive of the onset of cancer but can also be
   used to classify the cancer type. We experimentally validated the
   contribution of each of the employed tools to the selection of
   discriminating miRNAs. Moreover, we tested the significance of each
   panel for the corresponding cancer type. In particular, enrichment
   analysis showed that the selected miRNAs are involved in oncogenesis
   pathways, while survival analysis indicated that miRNAs can be used to
   evaluate cancer severity. In summary, the results show that our method
   can produce cancer-specific panels that are promising candidates for
   subsequent in vitro validation.

Introduction

   Timing and accuracy in cancer diagnosis are among the most critical
   factors that influence the clinical history of a patient. Until
   recently, the histological analysis of a small sample of tumor cells
   was the only tool for cancer classification. The complexity of this
   pathology and the histological similarity of certain sub-classes,
   however, have motivated researchers to look for simpler diagnostic
   techniques that can also be used on a large scale [1].

   MicroRNAs (miRNAs) are short non-coding RNAs, approximately 22 to 25 bp
   in length, that influence the regulation of target genes by imperfect
   binding to complementary regions of messenger transcripts [2]. Because
   miRNAs are involved in several biological processes [3–5], changes in
   their expression profiles have been investigated in connection with
   cancer [5]. A further property that makes miRNAs an attractive target
   for research on non-invasive biomarkers is that they are released
   outside the cell and can be easily quantified in the serum [6] using
   RT-qPCR [7]. Moreover, the relationship between cellular and
   circulating miRNAs has already been elucidated [8]. These facts open
   the way to a new generation of miRNA-based non-invasive biomarkers [9].

   Over the last decade, several studies on cancer have pointed to
   specific miRNAs as putative biomarkers for a variety of purposes (see
   [10] for a survey). Applications range from the classification of the
   tissue of origin [11] or the cancer type [12–14] to personalized
   therapy [15, 16]. However, the discovery process has been complicated
   by the limited amount of data and by the fact that differential
   expression has been the only discriminating feature guiding
   researchers.

   The drop in the cost of NGS technologies for whole miRNAome analysis
   and the availability of public repositories of miRNA profiles of
   cancerous samples (e.g. The Cancer Genome Atlas [17]) have raised the
   question of whether the traditional pipeline, in which in vitro
   exploratory activity is performed upstream of the statistical analysis,
   can be reversed so that machine learning approaches drive subsequent
   in vitro experiments. Although a few preliminary publications [18]
   support a positive answer to this question, an agreed compendium of the
   miRNA profiles associated with all cancer types is still far away.

   In this work, we argue that panels based only on miRNAs differentially
   expressed between tumor and control samples might miss complex
   relationships among patterns of expression levels. Consequently, we
   propose to map the task of discovering putative miRNA biomarkers into
   the machine learning problem of selecting a restricted set of features
   that leads to the highest accuracy in a two-class classification task.

   This problem formulation, however, raises the issue of building a
   pipeline that provides the highest classification accuracy. Since
   feature selection and classification are probably the two most studied
   problems in machine learning [19, 20] and in bioinformatics [21–23],
   exploring all the possible alternatives is infeasible. We thus fixed
   some choices according to the general agreement in the literature and
   followed an exploratory approach for the other components.

   Experiments on the 10 common cancer types from [24] showed that our
   proposed approach improves upon state-of-the-art methods based on
   differential expression not only in terms of accuracy, but also in
   terms of other relevant performance measures (e.g. FDR, sensitivity,
   and specificity). The value of this result is twofold: first, it opens
   the way to a new methodological approach to differential expression
   analysis; second, it can be considered a preliminary piece of evidence
   that relationships among the expression values of miRNAs might not be
   linear.

   To conclude this work, we investigated the biological role of miRNAs in
   our panels. We found that most of them interact with the morphogenesis
   process and are involved in pathways which regulate cellular
   proliferation, growth, and survival.

Materials and methods

   The key idea behind our approach is to map the problem of finding
   putative miRNA biomarkers into a two-class classification task.
   However, there are important differences between the two problems. In
   a canonical classification task, accuracy can be increased by
   providing new training examples, whereas in our setting the number of
   examples is bounded by the limited availability of samples. Another
   important difference is that, in our case, high accuracy is not
   enough: we are interested in finding a restricted panel of miRNAs
   responsible for the correct classification. This last requirement
   suggests using feature selection to maximize classification accuracy.

   In this paper, we used an evolutionary optimization method called
   Differential Evolution (DE) [25–27] to explore the space of the
   subsets of miRNAs. The outcome of the classification performed via a
   Support Vector Machine (SVM) [28] is used to assess the quality of the
   selection, while Kernel Principal Component Analysis (KPCA) [29] and
   Principal Component Analysis (PCA) [30] are used to preserve the
   structural characteristics of the whole dataset during the selection
   process. To prevent the intrinsic randomness of DE from causing the
   inclusion of irrelevant miRNAs in the returned panel, we ran the
   optimization algorithm 50 times and selected the final set of miRNAs
   by means of a finishing algorithm based on a majority voting scheme.
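
   The majority voting step over repeated runs can be illustrated with a
   minimal sketch. The exact finishing algorithm is not specified at this
   point, so the threshold (strictly more than half of the runs), the
   function name majority_vote_panel, and the placeholder miRNA names are
   all assumptions made for illustration.

```python
from collections import Counter


def majority_vote_panel(runs, min_fraction=0.5):
    """Keep the miRNAs selected in more than `min_fraction` of the runs.

    `runs` is a list of panels, one per optimization run (hypothetical input).
    """
    counts = Counter()
    for panel in runs:
        counts.update(set(panel))  # count each miRNA at most once per run
    threshold = min_fraction * len(runs)
    return sorted(m for m, c in counts.items() if c > threshold)


# Toy example with placeholder names (not real results).
runs = [
    ["mirA", "mirB", "mirC"],
    ["mirA", "mirB", "mirD"],
    ["mirA", "mirE", "mirC"],
    ["mirA", "mirB", "mirF"],
]
panel = majority_vote_panel(runs)
print(panel)  # ['mirA', 'mirB']
```

   Here mirC appears in exactly half of the runs and is therefore dropped
   under the strict-majority assumption.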

Component analysis

   Feature selection is a common tool to scale down a high-dimensional
   space. Depending on the purpose of the application, two main
   strategies can be chosen: dimensionality reduction and variable
   elimination (see [31] for an in-depth examination). Component
   analysis, like all orthogonal transformations, belongs to the first
   category. The main advantage of this strategy is that the reduced space
   still maintains the structural properties of the original space.
   However, once a correlation in the reduced space is found, reverting to
   the original correlated dimensions is not possible. The Differential
   Evolution algorithm, instead, is a wrapper-based approach belonging to
   the class of variable elimination. Methods in this category iteratively
   select a subset of dimensions of the original space, attempting to
   maximize a certain objective function (usually classification
   accuracy). Because it works with dimensions of the original space, this
   approach has the advantage that the variables under consideration are
   known exactly. On the other hand, complex relationships among the
   features are not taken into account, so removing a dimension may
   break unknown relationships.

   In our framework, the need to identify the subset of miRNAs that
   maximizes classification accuracy would force us to use variable
   elimination. However, this would mean giving up the possibility of
   involving complex expression patterns in the selection of a panel and,
   in turn, falling back to a standard differential analysis. It is
   therefore necessary to combine the strengths of dimensionality
   reduction and variable elimination. We achieved this goal by building
   on the following two results. In [32] the authors proposed a
   dimensionality reduction scheme in which the dataset is randomly
   partitioned into K homogeneous groups and PCA is applied to each of
   them. The final dataset consists of a matrix in which the i-th column
   is the principal component of the i-th partition. Experiments in [32]
   showed that this reduction strategy still preserves the structural
   characteristics of the original dataset, suggesting that, for
   large-enough partitions, PCA obeys a sort of distributive law. In [33]
   the authors showed that a wrapper-based dimensionality reduction
   approach (i.e. a method that exploits the outcome of classification to
   select features) can be improved by applying a filter-based method
   (i.e. an algorithm independent of the classification algorithm) in
   cascade.
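
   The partition-based reduction scheme can be sketched as follows. This
   is only an illustration under stated assumptions: the partition is
   taken over the columns (features) of the data matrix, "homogeneous" is
   read as "of roughly equal size", and each group is summarized by the
   sample scores on its first principal component. The function names are
   hypothetical and the data are random.

```python
import numpy as np


def first_pc_scores(X):
    """Scores of the samples on the first principal component of X."""
    Xc = X - X.mean(axis=0)                      # center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * S[0]                        # equals Xc @ Vt[0]


def partitioned_pca(E, K, seed=0):
    """Randomly split the columns of E into K groups; one PC per group."""
    rng = np.random.default_rng(seed)
    cols = rng.permutation(E.shape[1])           # random column order
    groups = np.array_split(cols, K)             # K groups of similar size
    return np.column_stack([first_pc_scores(E[:, g]) for g in groups])


# Toy data: 20 samples, 12 features, reduced to K = 4 columns.
rng = np.random.default_rng(1)
E = rng.normal(size=(20, 12))
R = partitioned_pca(E, K=4)
print(R.shape)  # (20, 4)
```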

   Let E be an s × d matrix with the expression levels of d miRNAs on s
   samples and let E(v) be a subset of n < d columns of E as specified by
   an n-dimensional vector v.

   Building upon [32], we pre-processed E by applying component analysis
   and extracting the first n components (for a fixed value of n). The
   resulting matrix $\hat{E}$ represents the projection of E into a new,
   smaller space with s rows and n columns. We observed that, when
   $\hat{E}$ maintains the same structural characteristics as the
   original expression matrix (i.e. it retains most of its variance), it
   can be used as an encoder to compress a miRNA vector of E into a
   smaller one with a marginal loss of information. Consequently, by
   multiplying each vector of E(v) (thus selecting n miRNAs) by the
   matrix $\hat{E}$, we obtain a new matrix $\hat{E}(v)$ that keeps
   most of the structural characteristics of the original expression
   matrix unaltered and highlights the contribution of the selected
   miRNAs.
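
   The projection step can be illustrated with a small sketch that
   computes the reduced s × n matrix from E via plain PCA and reports the
   fraction of variance it retains. The function name pca_project and the
   synthetic data are assumptions. The subsequent multiplication with the
   columns selected by v is deliberately not reproduced, since its exact
   form is not spelled out here.

```python
import numpy as np


def pca_project(E, n):
    """Project E (s x d) onto its first n principal components.

    Returns the s x n projection and the fraction of variance retained.
    """
    Ec = E - E.mean(axis=0)                       # center each column
    U, S, Vt = np.linalg.svd(Ec, full_matrices=False)
    E_hat = U[:, :n] * S[:n]                      # equals Ec @ Vt[:n].T
    retained = float((S[:n] ** 2).sum() / (S ** 2).sum())
    return E_hat, retained


# Synthetic data: 30 samples, 40 features driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 3))
E = latent @ rng.normal(size=(3, 40)) + 0.01 * rng.normal(size=(30, 40))

E_hat, retained = pca_project(E, n=3)
print(E_hat.shape, retained > 0.99)
```

   With three underlying factors, three components capture almost all of
   the variance, which is the situation in which the reduced matrix can
   serve as a faithful summary of E.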

   Choosing a component analysis algorithm able to preserve variance is
   complicated by the fact that datasets can differ profoundly in ways
   that affect the outcome of the methods. PCA is fast and appropriate
   when data do not exhibit non-linear relationships, while KPCA is
   better able to preserve complex relationships. For each dataset, we
   chose the component analysis method empirically by testing the
   classification accuracy after both PCA and KPCA and selecting the more
   promising of the two.
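
   One way to carry out such a per-dataset comparison is sketched below
   with scikit-learn. The RBF kernel for KPCA, the default SVC settings,
   the 5-fold cross-validation, and the helper name
   choose_component_analysis are assumptions made for illustration and
   are not stated in the text. The data are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def choose_component_analysis(E, y, n, cv=5):
    """Return the reducer name with the best cross-validated SVM accuracy."""
    candidates = {
        "PCA": PCA(n_components=n),
        "KPCA": KernelPCA(n_components=n, kernel="rbf"),  # kernel is assumed
    }
    scores = {}
    for name, reducer in candidates.items():
        model = make_pipeline(reducer, SVC())
        scores[name] = float(cross_val_score(model, E, y, cv=cv).mean())
    best = max(scores, key=scores.get)
    return best, scores


# Synthetic two-class data: 60 samples, 50 features, shifted class means.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
E = rng.normal(size=(60, 50)) + y[:, None] * 1.0

best, scores = choose_component_analysis(E, y, n=5)
print(best, scores)
```

   Wrapping the reducer and the classifier in one pipeline ensures that
   the component analysis is refit on each training fold, so the accuracy
   estimate is not biased by the held-out samples.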

   Following a philosophy similar to that of [33], we used the DE
   algorithm to select the miRNAs specified in the vector v, and the
   matrix $\hat{E}(v)$ (in place of E(v)) as input for the classifier
   that computes the fitness function. As a consequence, the fitness
   value no longer reflects the classification accuracy of a panel based
   on E(v). Nevertheless, $\hat{E}(v)$ is likely to be more effective
   than E(v) in quickly leading DE to an optimal solution.

Differential evolution

   Differential Evolution (DE) [25–27] is an evolutionary optimization
   tool aimed at finding a globally optimal solution in an n-dimensional
   real parameter space $\mathcal{R}^n$.

   Like any other evolutionary algorithm, DE maintains a set V = {v_1, …,
   v_I} of I candidate solutions (I = 50 in our experiments) and applies
   a series of operators to evolve to the next generation. Each solution,
   in turn, is a vector v storing n features (miRNAs in our case). For
   clarity, we often add a subscript in the notation to denote the
   generation.
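
   For readers unfamiliar with DE, the generic population loop can be
   sketched as below. This is the classic DE/rand/1/bin scheme on a toy
   continuous function, not the variant used in this work: the mutation
   factor F, the crossover rate CR, the bounds, the number of generations,
   and the sphere objective are all assumptions. Only the population size
   of 50 is taken from the text.

```python
import random


def differential_evolution(f, n, pop_size=50, F=0.5, CR=0.9,
                           generations=200, bounds=(-5.0, 5.0), seed=0):
    """Minimize f over n real parameters with classic DE/rand/1/bin."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(n)] for _ in range(pop_size)]
    fit = [f(v) for v in pop]
    for _ in range(generations):
        new_pop, new_fit = list(pop), list(fit)
        for i in range(pop_size):
            # Mutation: combine three distinct solutions other than i.
            others = [j for j in range(pop_size) if j != i]
            a, b, c = rng.sample(others, 3)
            # Binomial crossover: j_rand guarantees one mutated coordinate.
            j_rand = rng.randrange(n)
            trial = list(pop[i])
            for j in range(n):
                if j == j_rand or rng.random() < CR:
                    x = pop[a][j] + F * (pop[b][j] - pop[c][j])
                    trial[j] = min(max(x, lo), hi)  # clip to the bounds
            # Selection: the trial replaces its parent if it is no worse.
            f_trial = f(trial)
            if f_trial <= fit[i]:
                new_pop[i], new_fit[i] = trial, f_trial
        pop, fit = new_pop, new_fit
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]


def sphere(v):
    """Toy objective with its minimum (zero) at the origin."""
    return sum(x * x for x in v)


best_v, best_f = differential_evolution(sphere, n=5)
print(best_f < 1e-2)
```

   In the setting described here the objective would instead be a
   classification-based fitness to maximize, and each vector would encode
   a selection of miRNAs rather than free real parameters.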

   Let M be the sorted static list of miRNAs involved in a given cancer
   dataset. Say |M| = d. A vector v ∈ V consists of n disjoint references