Abstract
MicroRNAs are small non-coding RNAs that influence gene expression by
binding to the 3’ UTR of target mRNAs in order to repress protein
synthesis. Soon after discovery, microRNA dysregulation has been
associated to several pathologies. In particular, they have often been
reported as differentially expressed in healthy and tumor samples. This
fact suggested that microRNAs are likely to be good candidate
biomarkers for cancer diagnosis and personalized medicine. With the
advent of Next-Generation Sequencing (NGS), measuring the expression
level of the whole miRNAome at once is now routine. Yet, the
collaborative effort of sharing data opens to the possibility of
population analyses. This context motivated us to perform an in-silico
study to distill cancer-specific panels of microRNAs that can serve as
biomarkers. We observed that the problem of finding biomarkers can be
modeled as a two-class classification task where, given the miRNAomes
of a population of healthy and cancerous samples, we want to find the
subset of microRNAs that leads to the highest classification accuracy.
We fulfill this task leveraging on a sensible combination of data
mining tools. In particular, we used: differential evolution for
candidate selection, component analysis to preserve the relationships
among miRNAs, and SVM for sample classification. We identified 10
cancer-specific panels whose classification accuracy is always higher
than 92%. These panels have a very little overlap suggesting that
miRNAs are not only predictive of the onset of cancer, but can be used
for classification purposes as well. We experimentally validated the
contribution of each of the employed tools to the selection of
discriminating miRNAs. Moreover, we tested the significance of each
panel for the corresponding cancer type. In particular, enrichment
analysis showed that the selected miRNAs are involved in oncogenesis
pathways, while survival analysis proved that miRNAs can be used to
evaluate cancer severity. Summarizing: results demonstrated that our
method is able to produce cancer-specific panels that are promising
candidates for a subsequent in vitro validation.
Introduction
Timing and accuracy in cancer diagnosis are among the most critical
factors that influence the clinical history of a patient. Until
recently, the histological analysis of a small sample of tumor cells
has been the only tool for cancer classification. The complexity of
this pathology and the histological similarity of certain sub-classes,
however, have motivated researchers to find easier diagnosis techniques
that can also be used on a large scale [[32]1].
MicroRNAs (miRNAs) are short non-coding RNAs, whose size is
approximately ranged between 22 and 25bp, that influence the regulation
of target genes by imperfect binding to complementary regions of
messenger transcripts [[33]2]. Being involved in several biological
processes [[34]3–[35]5], modifications of the expression profile of
this class of RNAs have been investigated in conjunction with cancer
[[36]5]. A further property that makes miRNAs an attractive target for
research of non-invasive biomarkers is that they are released outside
the cell and can be easily quantified in the serum [[37]6] using
RT-qPCR [[38]7]. Moreover, the relationship between cellular and
circulating miRNAs has already been elucidated [[39]8]. These facts
open to a new generation of miRNA-based non-invasive biomarkers
[[40]9].
Over the last decade, several studies on cancer have pointed out
specific miRNAs as putative biomarkers for a variety of purposes (see
reference [[41]10] for a survey). Applications range from the
classification of the tissue of origin [[42]11] or the cancer type
[[43]12–[44]14] to the personalized therapy [[45]15, [46]16]. However,
the discovery process has been complicated from the limited amount of
data and from the fact that differential expression has been the only
discriminating feature to drive researchers.
The drop of the cost of NGS technologies for whole miRNAome analysis
and the availability of public repositories of miRNA profiles of
cancerous samples (i.e. The Cancer Genome Atlas [[47]17]), have raised
the question whether the traditional pipeline in which the in-vitro
exploratory activity is performed upstream the statistical analysis can
be turned, making machine learning approaches to drive subsequent
in-vitro experiments. Although few preliminary publications [[48]18]
foster a positive answer to this question, an agreed compendium of the
miRNA profiles associated with all the cancer types is still far away.
In this work, we argue that panels based only on miRNAs differentially
expressed between tumor and control samples might lack possible complex
relationships among patterns of expression levels. Consequently, we
propose to map the task of discovering putative miRNA biomarkers into
the machine learning problem of selecting a restricted set of features
that lead to the highest classification accuracy of a two-class
classification task.
This problem formulation, however, raises the issue of building a
pipeline that provides the highest classification accuracy. Since
feature selection and classification are probably the two most studied
problems in machine learning [[49]19] [[50]20] and in bioinformatics
[[51]21], [[52]22] [[53]23], exploring all the possible alternatives is
impossible. We thus fixed some choices according to the general
agreement in the literature and followed an explorative approach for
other components.
Experiments on the 10 common cancer types from [[54]24] showed that our
proposed approach improves upon differential expression-based
state-of-the-art methods not only in terms of accuracy, but also in
terms of other relevant performance measures (i.e. FDR, sensitivity,
specificity, etc.). The value of this result is twofold: firstly, it
opens to a new methodological approach to differential expression
analysis; moreover, it can be considered as a preliminary piece of
evidence that relationships among the expression values of miRNAs might
not be linear.
To conclude this work, we investigated the biological role of miRNAs in
our panels. We found that most of them interact with the morphogenesis
process and are involved in pathways which regulate cellular
proliferation, growth, and survival.
Materials and methods
The key idea behind our approach is that of mapping the problem of
finding putative miRNA biomarkers into a two-class classification task.
However, important differences between the two problems exist. In fact,
increasing the accuracy of the canonical classification task can be
achieved by providing new training examples; while in our setting the
number of involved elements is bounded by the limited availability of
samples. Another important difference is that, in our case, a high
accuracy is not enough. In fact, we are interested in finding a
restricted panel of miRNAs responsible for the correct classification.
This last fact suggests that we can leverage on feature selection so as
to maximize classification accuracy.
In this paper, we used an evolutionary optimization method called
Differential Evolution (DE) [[55]25–[56]27] to explore the space of the
subsets of miRNAs. The outcome of the classification performed via a
Support Vector Machine (SVM) [[57]28] is used to assess the quality of
the selection while Kernel Principal Component Analysis (KPCA) [[58]29]
and Principal Component Analysis (PCA) [[59]30] are used to keep
unaltered the structural characteristics of the whole dataset during
the selection process. In order to prevent the intrinsic randomness of
DE causes the inclusion of irrelevant results in the returned panel, we
have run the optimization algorithm 50 times and selected the final set
of miRNAs by means of a finishing algorithm based on a majority voting
scheme.
Component analysis
Feature selection is a common tool to scale down a high-dimensional
space. According with the purpose of the application, two main
strategies can be chosen: dimensionality reduction and variable
elimination (see [[60]31] for an in-depth examination). Component
analysis, as all the orthogonal transformations, belongs to the first
category. The main advantage of this strategy is that the reduced space
still maintains the structural properties of the original space.
However, once a correlation in the reduced space is found, reverting to
the original correlated dimensions is not possible. The Differential
Evolution algorithm, instead, is a wrapper-based approach belonging to
the class of variable elimination. Methods of this category iteratively
select a subset of dimensions of the original space attempting to
maximize a certain objective function (usually classification
accuracy). Working with dimensions in the original space, the advantage
of this approach is that of exactly knowing the variables under
consideration. On the other hand, complex relationships among the
features are not taken into account causing the removal of a dimension
to brake possible unknown relationships.
In our framework, the need to identify the subset of miRNAs that
maximize classification would force us to use variable elimination.
However, this would mean giving up with the possibility of involving
complex expression patterns in the process of selection of a panel and,
in turn, reducing to a standard differential analysis. Combining the
strengths of dimensionality reduction and variable elimination is hence
needed. We achieved this goal by leveraging on the following two
results. In [[61]32] the authors proposed a dimensionality reduction
scheme in which the dataset is randomly partitioned into K homogeneous
groups and PCA is applied to each of them. The final dataset consists
in a matrix in which the i-th column is the principal component of the
corresponding partition. Experiments in [[62]32] showed that this
reduction strategy still preserves the structural characteristics of
the original dataset, suggesting that for large-enough partitions, PCA
obeys a sort of distributive law. In [[63]33] the authors showed that a
wrapper-based dimensionality reduction approach (i.e. a method that
exploits the outcome of classification to select features) can be
improved applying in cascade a filter-based method (i.e. an algorithm
independent of the classification algorithm).
Let E be a s × d matrix with the expression levels of d miRNAs on s
samples and let E(v) be a subset of n < d columns of E as specified by
a n-dimensional vector v.
Building upon [[64]32], we pre-processed E by applying component
analysis and extracting the first n components (for a fixed value of
n). The resulting matrix
[MATH: E^ :MATH]
represents the projection of E into a new smaller space with s rows and
n columns. We observed that, when
[MATH: E^ :MATH]
maintains the same structural characteristics of the original
expression matrix (i.e. saving most of its variance), it can be used as
an encoder to compress a miRNA vector of E into a smaller one with a
marginal loss of information. Consequently, multiplying each vector of
E(v) (thus selecting n miRNAs) with the matrix
[MATH: E^ :MATH]
we obtain a new matrix
[MATH: E^(v) :MATH]
that: keeps unaltered most of the structural characteristics of the
original expression matrix, and highlights the contribution of the
selected miRNAs.
Choosing a convenient component analysis algorithm able to preserve
variance is complicated by the fact that different datasets can have
profound differences that reflect on the outcome of the methods. PCA is
fast and appropriate when data do not exhibit non-linear relationships,
while KPCA is more able to preserve complex relationships. We
empirically chose per dataset component analysis by testing the
accuracy of classification either after PCA or KPCA and deciding for
the most promising.
Following a philosophy similar to that of [[65]33], we used the DE
algorithm to select the miRNAs specified in the vector v and the matrix
[MATH: E^(v) :MATH]
(in place of E(v)) as input for the classifier that computes the
fitness function. As a consequence, the fitness value does no longer
reflect the classification accuracy of a panel based on E(v).
Nevertheless,
[MATH: E^(v) :MATH]
is likely to be more effective than E(v) to quickly lead DE to an
optimal solution.
Differential evolution
Differential Evolution (DE) [[66]25–[67]27] is an evolutionary
optimization tool aimed at finding a global optimum solution in a
n-dimensional real parameter space
[MATH: Rn :MATH]
.
Like any other evolutionary algorithm, DE maintains a set V = {v[1], …,
v[I]} of I candidate solutions (I = 50 in our experiments) and applies
a series of operators to evolve to the next generation. Each solution,
in turn, is a vector v storing n features (miRNAs in our case). For
clarity, we often add a subscript in the notation to denote the
generation.
Let M be the sorted static list of miRNAs involved in a given cancer
dataset. Say |M| = d. A vector v ∈ V consists of n disjoint references