Abstract MicroRNAs are small non-coding RNAs that influence gene expression by binding to the 3’ UTR of target mRNAs in order to repress protein synthesis. Soon after discovery, microRNA dysregulation has been associated to several pathologies. In particular, they have often been reported as differentially expressed in healthy and tumor samples. This fact suggested that microRNAs are likely to be good candidate biomarkers for cancer diagnosis and personalized medicine. With the advent of Next-Generation Sequencing (NGS), measuring the expression level of the whole miRNAome at once is now routine. Yet, the collaborative effort of sharing data opens to the possibility of population analyses. This context motivated us to perform an in-silico study to distill cancer-specific panels of microRNAs that can serve as biomarkers. We observed that the problem of finding biomarkers can be modeled as a two-class classification task where, given the miRNAomes of a population of healthy and cancerous samples, we want to find the subset of microRNAs that leads to the highest classification accuracy. We fulfill this task leveraging on a sensible combination of data mining tools. In particular, we used: differential evolution for candidate selection, component analysis to preserve the relationships among miRNAs, and SVM for sample classification. We identified 10 cancer-specific panels whose classification accuracy is always higher than 92%. These panels have a very little overlap suggesting that miRNAs are not only predictive of the onset of cancer, but can be used for classification purposes as well. We experimentally validated the contribution of each of the employed tools to the selection of discriminating miRNAs. Moreover, we tested the significance of each panel for the corresponding cancer type. In particular, enrichment analysis showed that the selected miRNAs are involved in oncogenesis pathways, while survival analysis proved that miRNAs can be used to evaluate cancer severity. Summarizing: results demonstrated that our method is able to produce cancer-specific panels that are promising candidates for a subsequent in vitro validation. Introduction Timing and accuracy in cancer diagnosis are among the most critical factors that influence the clinical history of a patient. Until recently, the histological analysis of a small sample of tumor cells has been the only tool for cancer classification. The complexity of this pathology and the histological similarity of certain sub-classes, however, have motivated researchers to find easier diagnosis techniques that can also be used on a large scale [[32]1]. MicroRNAs (miRNAs) are short non-coding RNAs, whose size is approximately ranged between 22 and 25bp, that influence the regulation of target genes by imperfect binding to complementary regions of messenger transcripts [[33]2]. Being involved in several biological processes [[34]3–[35]5], modifications of the expression profile of this class of RNAs have been investigated in conjunction with cancer [[36]5]. A further property that makes miRNAs an attractive target for research of non-invasive biomarkers is that they are released outside the cell and can be easily quantified in the serum [[37]6] using RT-qPCR [[38]7]. Moreover, the relationship between cellular and circulating miRNAs has already been elucidated [[39]8]. These facts open to a new generation of miRNA-based non-invasive biomarkers [[40]9]. Over the last decade, several studies on cancer have pointed out specific miRNAs as putative biomarkers for a variety of purposes (see reference [[41]10] for a survey). Applications range from the classification of the tissue of origin [[42]11] or the cancer type [[43]12–[44]14] to the personalized therapy [[45]15, [46]16]. However, the discovery process has been complicated from the limited amount of data and from the fact that differential expression has been the only discriminating feature to drive researchers. The drop of the cost of NGS technologies for whole miRNAome analysis and the availability of public repositories of miRNA profiles of cancerous samples (i.e. The Cancer Genome Atlas [[47]17]), have raised the question whether the traditional pipeline in which the in-vitro exploratory activity is performed upstream the statistical analysis can be turned, making machine learning approaches to drive subsequent in-vitro experiments. Although few preliminary publications [[48]18] foster a positive answer to this question, an agreed compendium of the miRNA profiles associated with all the cancer types is still far away. In this work, we argue that panels based only on miRNAs differentially expressed between tumor and control samples might lack possible complex relationships among patterns of expression levels. Consequently, we propose to map the task of discovering putative miRNA biomarkers into the machine learning problem of selecting a restricted set of features that lead to the highest classification accuracy of a two-class classification task. This problem formulation, however, raises the issue of building a pipeline that provides the highest classification accuracy. Since feature selection and classification are probably the two most studied problems in machine learning [[49]19] [[50]20] and in bioinformatics [[51]21], [[52]22] [[53]23], exploring all the possible alternatives is impossible. We thus fixed some choices according to the general agreement in the literature and followed an explorative approach for other components. Experiments on the 10 common cancer types from [[54]24] showed that our proposed approach improves upon differential expression-based state-of-the-art methods not only in terms of accuracy, but also in terms of other relevant performance measures (i.e. FDR, sensitivity, specificity, etc.). The value of this result is twofold: firstly, it opens to a new methodological approach to differential expression analysis; moreover, it can be considered as a preliminary piece of evidence that relationships among the expression values of miRNAs might not be linear. To conclude this work, we investigated the biological role of miRNAs in our panels. We found that most of them interact with the morphogenesis process and are involved in pathways which regulate cellular proliferation, growth, and survival. Materials and methods The key idea behind our approach is that of mapping the problem of finding putative miRNA biomarkers into a two-class classification task. However, important differences between the two problems exist. In fact, increasing the accuracy of the canonical classification task can be achieved by providing new training examples; while in our setting the number of involved elements is bounded by the limited availability of samples. Another important difference is that, in our case, a high accuracy is not enough. In fact, we are interested in finding a restricted panel of miRNAs responsible for the correct classification. This last fact suggests that we can leverage on feature selection so as to maximize classification accuracy. In this paper, we used an evolutionary optimization method called Differential Evolution (DE) [[55]25–[56]27] to explore the space of the subsets of miRNAs. The outcome of the classification performed via a Support Vector Machine (SVM) [[57]28] is used to assess the quality of the selection while Kernel Principal Component Analysis (KPCA) [[58]29] and Principal Component Analysis (PCA) [[59]30] are used to keep unaltered the structural characteristics of the whole dataset during the selection process. In order to prevent the intrinsic randomness of DE causes the inclusion of irrelevant results in the returned panel, we have run the optimization algorithm 50 times and selected the final set of miRNAs by means of a finishing algorithm based on a majority voting scheme. Component analysis Feature selection is a common tool to scale down a high-dimensional space. According with the purpose of the application, two main strategies can be chosen: dimensionality reduction and variable elimination (see [[60]31] for an in-depth examination). Component analysis, as all the orthogonal transformations, belongs to the first category. The main advantage of this strategy is that the reduced space still maintains the structural properties of the original space. However, once a correlation in the reduced space is found, reverting to the original correlated dimensions is not possible. The Differential Evolution algorithm, instead, is a wrapper-based approach belonging to the class of variable elimination. Methods of this category iteratively select a subset of dimensions of the original space attempting to maximize a certain objective function (usually classification accuracy). Working with dimensions in the original space, the advantage of this approach is that of exactly knowing the variables under consideration. On the other hand, complex relationships among the features are not taken into account causing the removal of a dimension to brake possible unknown relationships. In our framework, the need to identify the subset of miRNAs that maximize classification would force us to use variable elimination. However, this would mean giving up with the possibility of involving complex expression patterns in the process of selection of a panel and, in turn, reducing to a standard differential analysis. Combining the strengths of dimensionality reduction and variable elimination is hence needed. We achieved this goal by leveraging on the following two results. In [[61]32] the authors proposed a dimensionality reduction scheme in which the dataset is randomly partitioned into K homogeneous groups and PCA is applied to each of them. The final dataset consists in a matrix in which the i-th column is the principal component of the corresponding partition. Experiments in [[62]32] showed that this reduction strategy still preserves the structural characteristics of the original dataset, suggesting that for large-enough partitions, PCA obeys a sort of distributive law. In [[63]33] the authors showed that a wrapper-based dimensionality reduction approach (i.e. a method that exploits the outcome of classification to select features) can be improved applying in cascade a filter-based method (i.e. an algorithm independent of the classification algorithm). Let E be a s × d matrix with the expression levels of d miRNAs on s samples and let E(v) be a subset of n < d columns of E as specified by a n-dimensional vector v. Building upon [[64]32], we pre-processed E by applying component analysis and extracting the first n components (for a fixed value of n). The resulting matrix [MATH: E^ :MATH] represents the projection of E into a new smaller space with s rows and n columns. We observed that, when [MATH: E^ :MATH] maintains the same structural characteristics of the original expression matrix (i.e. saving most of its variance), it can be used as an encoder to compress a miRNA vector of E into a smaller one with a marginal loss of information. Consequently, multiplying each vector of E(v) (thus selecting n miRNAs) with the matrix [MATH: E^ :MATH] we obtain a new matrix [MATH: E^(v) :MATH] that: keeps unaltered most of the structural characteristics of the original expression matrix, and highlights the contribution of the selected miRNAs. Choosing a convenient component analysis algorithm able to preserve variance is complicated by the fact that different datasets can have profound differences that reflect on the outcome of the methods. PCA is fast and appropriate when data do not exhibit non-linear relationships, while KPCA is more able to preserve complex relationships. We empirically chose per dataset component analysis by testing the accuracy of classification either after PCA or KPCA and deciding for the most promising. Following a philosophy similar to that of [[65]33], we used the DE algorithm to select the miRNAs specified in the vector v and the matrix [MATH: E^(v) :MATH] (in place of E(v)) as input for the classifier that computes the fitness function. As a consequence, the fitness value does no longer reflect the classification accuracy of a panel based on E(v). Nevertheless, [MATH: E^(v) :MATH] is likely to be more effective than E(v) to quickly lead DE to an optimal solution. Differential evolution Differential Evolution (DE) [[66]25–[67]27] is an evolutionary optimization tool aimed at finding a global optimum solution in a n-dimensional real parameter space [MATH: Rn :MATH] . Like any other evolutionary algorithm, DE maintains a set V = {v[1], …, v[I]} of I candidate solutions (I = 50 in our experiments) and applies a series of operators to evolve to the next generation. Each solution, in turn, is a vector v storing n features (miRNAs in our case). For clarity, we often add a subscript in the notation to denote the generation. Let M be the sorted static list of miRNAs involved in a given cancer dataset. Say |M| = d. A vector v ∈ V consists of n disjoint references