Abstract
Disease-related protein-coding genes have been widely studied, but
disease-related non-coding genes remain largely unknown. This work
introduces a new vector to represent diseases, and applies the newly
vectorized data for a positive-unlabeled learning algorithm to predict
and rank disease-related long non-coding RNA (lncRNA) genes. This novel
vector representation for diseases consists of two sub-vectors, one is
composed of 45 elements, characterizing the information entropies of
the disease genes distribution over 45 chromosome substructures. This
idea is supported by our observation that some substructures (e.g., the
chromosome 6 p-arm) are highly preferred by disease-related protein
coding genes, while some (e.g., the 21 p-arm) are not favored at all.
The second sub-vector is 30-dimensional, characterizing the
distribution of disease gene enriched KEGG pathways in comparison with
our manually created pathway groups. The second sub-vector complements
with the first one to differentiate between various diseases. Our
prediction method outperforms the state-of-the-art methods on benchmark
datasets for prioritizing disease related lncRNA genes. The method also
works well when only the sequence information of an lncRNA gene is
known, or even when a given disease has no currently recognized long
non-coding genes.
Keywords: chromosome preference, vectorization, long noncoding RNA
INTRODUCTION
Benefiting from the breakthroughs of the next generation sequencing
(NGS) technologies [[34]1], disease related protein coding genes have
been widely studied during the last decades. Non-coding genes have been
also found involved in human disease development by functioning as
regulators of their target protein coding genes [[35]2]. Especially,
long non-coding RNAs (lncRNAs), a kind of RNA that do not encode
proteins and are longer than 200nt, have been found to contain
significant genetic information and functions [[36]3]. The
dysregulation of lncRNAs can result in the dysfunction of their target
protein coding genes or their participated cellular processes, causing
the development of diseases. For example, the lncRNA NEAT1 was reported
to be a potential target of ERα and is an important mediator for
maintenance of prostate cancer [[37]4]. Loc285194, another lncRNA, is
also a tumor suppressor that regulates p53 [[38]5]. Increasing number
of studies have been focusing on the application of disease-lncRNA
associations including disease diagnosis [[39]6], survival prediction
[[40]7] and RNA therapeutics [[41]8]. However, the function annotation
of lncRNA genes such as their roles in disease development is remaining
largely unknown.
Genomic locus inferring methods [[42]9–[43]10], computational methods
including gene-lncRNA co-expression methods [[44]11–[45]12], network
analysis methods [[46]13], similarities analysis or semi-supervised
learning methods [[47]14], supervised learning methods [[48]15] and
others [[49]16] can speed up this area of research for disease gene
prediction. The network analysis heavily relies on the topology
properties of the constructed networks. The semi-supervised learning
methods depend on accurate similarity measurements between diseases and
lncRNAs. The supervised learning approach has not been extensively
explored because of lack of reliable negative samples of disease
related lncRNA genes.
We propose to use a positive-unlabeled learning (PU-learning) method to
predict disease related lncRNA genes. PU learning can well address the
problem of lacking reliable negative samples to gain high prediction
performance. In this work, we also introduce a novel vector
[MATH:
<Vd>
:MATH]
to represent a disease d, and a novel vector
[MATH:
<VLnc> :MATH]
to represent an lncRNA gene Lnc. We merge these two vectors as
[MATH:
<Vd,V<
/mi>Lnc>
:MATH]
to represent the pair of disease d and the lncRNA gene Lnc. The
prediction problem is: whether this merged vector can be mapped to 1 or
0 with a certain level of probability. If it is mapped to 1 with a high
probability (e.g., 90%), then it means that the disease d is related to
the lncRNA gene Lnc under a high probability. Otherwise, the disease d
has little relationship with lncRNA gene Lnc.
The novel disease vector representation
[MATH:
<Vd>
:MATH]
consists of two sub-vectors. The elements of the first sub-vector
[MATH:
<Vdch
r> :MATH]
represent the chromosome substructures’ distribution information
entropies of the genes related to the disease d. We consider 45
chromosome substructures in this work (details presented later).
This idea for disease representation is inspired by a chromosome
substructure enrichment analysis of the disease related protein coding
genes. It is similar to gene pathway enrichment analysis that the
protein gene set of a disease can be enriched at each chromosome
substructure containing the protein gene set. We have observed that
about 16.2% of 2802 diseases' genes can be enriched to chromosome 6
p-arm (with Fisher's exact test, p-value<0.05), implying a strong
chromosome preference of disease genes. This preference is
significantly higher than the second most enriched chromosome 2 q-arm
(containing just 5.92% of the 2802 diseases). Furthermore, no disease
gene set can be enriched to the chromosome 21 p-arm. Our hypothesis is
that genes are located at various positions on chromosomes and
mitochondrion, and the distribution of disease related protein coding
genes on the chromosomes can be used to characterize the differences
between diseases.
The second sub-vector
[MATH:
<Vdpa
th> :MATH]
represents the KEGG pathway groups’ distribution information entropies
of disease d related genes enriched KEGG pathways. Human KEGG pathways
[[50]17] can be divided into 30 groups. By the disease gene KEGG
pathway enrichment analysis on the 2802 diseases, we have observed that
almost all these KEGG pathways are involved in disease developments.
The distribution of disease gene sets on KEGG pathway groups is also
uneven. For example, more than 30% of the 2802 diseases are associated
with 6 pathways including hsa04933: AGE-RAGE signaling pathway in
diabetic complications and hsa05321: Inflammatory bowel disease (IBD).
In comparison, as many as 61 kinds of pathways are related to less than
1% of these diseases.
Comparing with existing disease characterization methods through
computing similarities of disease related coding or non-coding genes
[[51]18], semantics [[52]19], phenotypes [[53]20, [54]21], symptoms
[[55]22] and ontology [[56]23], our disease vectorization
[MATH:
<Vdch
r,Vdpath> :MATH]
is much simpler. It does not need repeated set operations such as union
and intersection or large scale of text mining. Our disease vectors are
also effective to capture unique disease characteristics. The disease
similarity can reach to the average area under ROC curve (AUC) of
0.9458 when the diseases are represented by our vectors. However,
FunSim [[57]18] and a disease symptom representation method [[58]22]
have only 0.9202 and 0.7674 AUC respectively on the same set of
diseases.
The vector
[MATH:
<VLnc> :MATH]
representing an lncRNA gene Lnc consists of two sub-vectors
[MATH:
<VLnc
seq><
/mrow> :MATH]
and
[MATH:
<VLnc
prof<
mo>> :MATH]
as well.The first one represents its sequence’s k-mer frequencies, and
the second one represents its expression profiles. Merging the two
disease sub-vectors
[MATH:
<Vdch
r> :MATH]
and
[MATH:
<Vdpa
th> :MATH]
, the two lncRNA sub-vectors
[MATH:
<VLnc
seq><
/mrow> :MATH]
and
[MATH:
<VLnc
prof<
mo>> :MATH]
, we can represent a disease-lncRNA gene pair (denoted d-Lnc) as
[MATH:
<Vd,V<
/mi>Lnc>
:MATH]
Procedures for constructing the main sub-vectors are shown in Figure
[59]1.
Figure 1. The flowchart for the vectorization representation of a
disease-lncRNA gene pair.
[60]Figure 1
[61]Open in a new tab
A disease-lncRNA gene pair can be represented by the integration of
four sub-vectors including disease gene chromosome substructures’
distribution information entropy vector (disease gene distribution
vector), the disease gene enriched pathway groups’ distribution
information entropy vector (disease pathway distribution vector), the
lncRNA gene sequence’s k-mer frequency vector and the lncRNA gene
expression profile.
Disease related lncRNA genes should also prefer to co-expressing with
other genes that are associated with this disease (such as those lncRNA
genes which regulate some of the disease related protein coding genes).
With this hypothesis, we add these co-expression features as the fifth
sub-vector
[MATH:
<Vco−exp> :MATH]
to the merged vector
[MATH:
<Vd,V<
/mi>Lnc>
:MATH]
. From our baseline classifier selection experiments, we have proved
that this new sub-vector can further improve the prediction
performance.
A bagging SVM for PU learning algorithm [[62]24] is adopted to
prioritize disease related lncRNA genes. This model was trained on a
set of disease-lncRNA vectors. On three data sets retrieved from three
disease-lncRNA association databases: LncRNADisease [[63]9], Lnc2Cancer
[[64]25] and MNDR [[65]26], the overall AUC scores of leave-one-out
cross validation (LOOCV) by our method are 0.8016, 0.8335 and 0.7527
respectively. This performance is significantly superior to two
state-of-the-art methods: LRLSLDA [[66]14] (0.6882, 0.7308 and 0.6346)
and LRLSLDA-ILNCSIM [[67]27] (0.6949, 0.7390 and 0.6435). Especially
when only the sequence information of the lncRNA genes is available,
our method can still work well for the prediction. The overall LOOCV
AUC scores for the three datasets are 0.7889, 0.8266 and 0.7216. The
results of the following leave-one-disease-out cross-validation
(LODOCV) experiments show the ability of our method to predict without
known disease related lncRNA genes for a given disease as the average
AUC value is 0.7356 for the LncRNADisease dataset. There are 68 out of
162 diseases can achieve the AUC values bigger than 0.9.
RESULTS
Chromosome preference and disfavor of disease genes
In the understanding of unique characteristics of disease genes on the
chromosomes, we constructed chromosome enrichment analysis of disease
genes. The process is similar to the implementation of Fisher’s exact
test for pathway enrichment analysis which we have described in
Algorithm [68]1. The main difference is that the pathway genes are
replaced with the chromosome involved genes. We note that only protein
coding genes are considered for the chromosome preference analysis of
disease genes as the non-coding disease genes are under prediction.
Algorithm 1. Disease vectorization.
Input disease d related gene set d[g], Approved genes G, human pathway
set P, each pathway p[j] contained gene set g[pj], parameter k1, k2, T;
1: Sort G according to the chromosome location of g[i], sort P
according to the ids of p[j];
2: Separate G according to the natural chromosome structure such as
chr1 p-arm, chr1 q-arm,…. There are totally S chromosome substructures,
i.e. chr[1], chr[2], …, chr[u], …, chr[S];
3: Divide P into T groups, i.e. p[g1], p[g2], …, p[gw], …, p[gT];
4: Generate the initial status series of each S_chr[u] = (0,0, …, 0)
and S_p[gw] = (0, 0, …, 0);
5: Map d[g] to chr[u] according to its location and change the
corresponding status in S_chr[u] as 1;
6: Set k = k1
7: for u = 1 to S do
8: Scan S_chr[u] with window size of k and step size 1;
9: Compute the frequency of qth k-mer sub-status series as f[q];
10: Compute the information entropy of the k-mer sub-status series for
S_chr[u] as
[MATH: IEchru=−fq×∑q=1
2klo
g(fq); :MATH]
11: end for
12: for j = 1 to M do
13: Count genes in g[pj] as L = Length (g[pj]);
14: Count genes mapped into p[j] as
[MATH:
B=Lengt
h(dg∩gpj)<
mo>; :MATH]
15: Do the fisher’s exact test:
16: [h, p, stats] = fishertest ([L- B, B;A - L - n + B; n - B]), where
p is the p-value;
17: if p <= 0.05 then
18: Change the status of p[j] in S_p[gw] as 1;
19: end if
20: end for
21: Set k = k2
22: for w = 1 to T do
23: Scan S_p[gw] with window size of k and step size 1;
24: Compute the frequency of vth k-mer sub-status series as f[v]
25: Compute the information entropy of the k- mer sub-status series for
S_p[gw] as
[MATH: IEPgw=−fv×∑v=1
2klog(fv); :MATH]
26: end for
Output The disease gene distribution entropy vector:
[MATH: <Vdchr>=<IEchr1,IEchr2,…,IEchrS><
/mrow> :MATH]
Output The disease pathway distribution entropy vector:
[MATH: <Vdpath>=<IEpg1,IEpg2,…,IEpgr><
/mrow> :MATH]
[69]Open in a new tab
The 24 chromosomes of human genome can be naturally divided into 48
substructures with the p-arm and the q-arm as two substructures for
each chromosome. However, for chromosome 13 (chr13), there is only one
protein gene on the centromere and there is no approved protein gene
located at its p-arm; for chromosome 14, only one gene is located at
its p-arm; and there is no gene located at the p-arm of chr15 or chr22.
Thus, these four chromosomes were not divided. We consider the
mitochondrion as a special chromosome which cannot be divided into two
substructures. In total, we have 45 chromosome substructures, namely
S=45 in Algorithm [70]1. Figure [71]2 and Figure [72]3 show the
statistics of the chromosome substructure enrichment analysis for the
disease genes of each of the 2802 diseases.
Figure 2. The disease chromosome enrichment analysis pie graph.
[73]Figure 2
[74]Open in a new tab
Subchr means chromosome substructure. We did the statistics of how many
chromosomes a disease gene set enriches. More than a half (53%) of the
2802 diseases are just enriched to only one chromosome substructure,
while just 3% of these diseases can be enriched to more than 4
chromosome substructures.
Figure 3. The disease chromosome enrichment analysis results.
[75]Figure 3
[76]Open in a new tab
The y-axis are percentages of diseases that enriched to each of the
chromosome substructures. The x-axis are the indexes of the chromosome
substructures. The bar graph at the top right shows the statistics of
the numbers of chromosome substructures that contained by diseases with
given percentages scopes.
There are about 75% of the diseases whose related gene sets can be
enriched to no more than 1 chromosome substructure (Figure [77]2).
There are just 3% of the diseases whose related gene sets can be
enriched to more than 4 chromosome substructures. This distribution of
disease genes on the chromosome substructures reveals that the disease
genes of a given disease are very likely located at a neighborhood
region. As indicated by Figure [78]3, the p-arm of chromosome 6 is the
most preferred substructure of disease genes - about 16.2% of the
disease related gene sets can be enriched here. This percentage is
significantly higher than the other substructures (all no more than
6%). Interestingly, there is no disease related gene set that can be
enriched to the p-arm of chromosome 21. From the top-right bar graph of
Figure [79]3, we can also see that 16 out of the 45 chromosome
substructures are enriched by only 1% -2% of the 2802 gene sets. There
are 10 and 8 substructures can be enriched by 2%-3% and 3%-4% of the
2802 gene sets. Thus, most of the chromosome substructures (38 out of
45) can be enriched by no more than 3% of the 2802 gene sets. These
observations suggest a phenomenon that disease genes are unevenly
distributed in the 45 chromosome substructures. The genes related to a
disease are preferred at a physical neighborhood close to each other in
the chromosomes. This observation of chromosome preference lays down
the foundation for our disease vector representation.
We also conducted pathway enrichment analysis to understand the
distribution of disease genes in human KEGG pathways. We found that
disease genes are also unevenly enriched in these pathways. More than
30% of the 2802 diseases are associated with one of the top 6 pathways
such as hsa04933: AGE-RAGE signaling pathway in diabetic complications,
and hsa05321: Inflammatory bowel disease (IBD). In contrast, 61 out of
303 pathways are related to less than 1% of these diseases. More
details are reported in [80]Supplementary File 1.
Performance on the prediction of highly similar diseases using our disease
vector representation
We tested the performance of our vectorization model for computing
disease similarities on the dataset downloaded from the supplementary
files of Cheng's paper [[81]18]. It contains a candidate disease set
and a benchmark set of similar disease pairs. The disease set is
composed of 2802 diseases and their related genes. There are 70 similar
disease pairs in the benchmark set. Zhou et al. [[82]22] proposed a
symptom representation method for measuring disease similarities. To
compare this method with ours, we downloaded their similarity scores
between 1596 diseases and mapped these diseases to the 2802- disease
set. Totally 1012 diseases and 56 similar disease pairs in the
benchmark set can be mapped. These two disease sets have been stored in
[83]Supplementary File 2.
Following cheng's method, we drew a ROC curve to display how our method
can rank the similar pairs in the benchmark set comparing with those
randomly selected unknown disease pairs. That means, for a given
threshold, if the similarity of a pair in the benchmark set exceeds
this threshold, it is defined as a true positive, otherwise, as a false
negative. Inversely, an unknown disease pair exceeds the threshold is
defined as a false positive. A total of 560 testing disease-disease
pairs were randomly selected from the 1012 candidate diseases (but not
overlapping with the benchmark set). This process was repeated 100
times.
There are three parameters, i.e. k1, k2 and T, for Algorithm [84]1 and
one parameter θ for equation [85](3) need to be tuned. According to the
HGNC database, there are 19025 approved protein coding genes. Because
the minimum length of the chromosome substructure is 9 (only 9 protein
coding genes on this substructure), thus the parameter k1 was changed
from 1 to 9 with the step size of 1. There are 303 different human KEGG
pathways. To simplify our model, we set T=30 with the first 29 groups
containing 10 pathways while the last group has 13 pathways. Finally,
the disease genes chromosome substructures’ distribution information
entropy (disease gene distribution entropy) feature is represented as a
45-dimensional vector while the disease gene enriched pathway groups’
distribution information entropy (disease pathway distribution entropy)
feature is a 30-dimensional vector. k2 was changed from 1 to 10 with
the step size of 1. The integration parameter θ was in the range of [0,
1].
When k1=9, we can get the biggest average AUC=0.9429. Meanwhile, when
k2=8, the AUC value with just pathway distribution entropy vectors can
achieve 0.8872. Thus, we set k1=9 and k2=8 for the subsequent
experiments.
We also compared the performances of our methods (namely the entropy
vector methods and the status series vector methods), the FunSim
[[86]18] and symptom representation method [[87]22]. We implemented the
FunSim according to the published paper. Then, the AUC values were
computed according to the scores via different methods. During the
comparison, θ was set to be 0 to 1 with the step size of 0.1. When
θ=0.8, the integrated similarity method can work the best with average
AUC=0.9458. We drew the corresponding overall ROC curves (all the 100
times repeat experiments’ results are combined together to compute the
False Positive Rate and True Positive Rate; thus, the overall AUC
values are smaller than the average AUC values) of the 100 times
experiments in Figure [88]4. More comparison results for the original
2802 disease set can be found in the [89]Supplementary File 1.
Figure 4. The ROC curves of different methods for computing the disease
similarities.
[90]Figure 4
[91]Open in a new tab
There are 7 ROC curves: the disease pathway distribution entropy vector
method (θ=0, AUC=0.8555); the disease gene distribution entropy vector
method (θ=1, AUC=0.9067); the integrated similarity method (θ=0.8,
AUC=0.9094); the pathway status series vector method (AUC=0.7867); the
disease gene status series vector method (AUC=0.5882); FunSim
(AUC=0.8858) and Symptom representation method (AUC=0.7455).
Figure [92]4 shows that the integrated similarity method is better than
the other methods. However, it just improves 0.0027 on the AUC value
comparing with just disease gene distribution entropy vector method
(k1=9, θ=1). It implies that there are not much complementary between
the disease gene distribution entropy and disease pathway distribution
entropy features as to compute the similarities of diseases. The
integrated similarity method with θ=0.8 outperforms the FunSim and
symptom representation method by improving AUC values of 0.0236 and
0.1639 respectively. In comparison, the status series vector methods
cannot work as well as the entropy vector methods. The entropy vector
methods (disease gene distribution entropy vs. disease gene status
series and disease pathway distribution entropy vs. pathway status
series) improve the overall AUC values by 0.3185 and 0.0688. This
proves our “part overcomes the whole” hypothesis that our dividing and
information entropy strategy for representing diseases is more
effective than the original status series.
Performance on the prediction and prioritization of disease related lncRNA
genes
The performance of our disease vectorization method for predicting and
prioritizing disease related lncRNA genes was tested and evaluated on
three data sets: the lncRNADisease dataset (454 positive samples, i.e.,
454 known associations between some diseases and some lncRNA genes),
the lnc2cancer dataset (594 positive samples) and the MNDR dataset (176
positive samples). See details of these data sets at the section
Materials and Methods.
Classifier and parameter selection for final prediction model with the
lncRNADisease dataset
We used both liner and RBF kernel for the SVM-based positive-unlabled
learning method to conduct cross validation on the lncRNADisease data
set. The number of positive samples is 454, and the number of unlabeled
samples (i.e., the number of unknown associations) is 29840, derived by
exhaustively pairing the 162 diseases and 187 lncRNAs in the
lncRNADisease data set after the deduction of the number of 454
positive samples. Recall that our vector representation for a pair of
disease and lncRNA gene consists of five sub-vectors. Here, we choose
different combination of these sub-vectors to understand that all of
these sub-vectors are important for the prediction. The steps are
presented in Algorithm [93]2.
Algorithm 2. A bagging SVM for prioritizing the disease related lncRNA genes.
Input Positive dataset PO, unlabeled dataset UN, bootstrap sample size
R, bootstrap number V, SVM parameters, feature type W;
1: for a=1 to 100 do
2: Randomly select |PO| of unlabeled samples as negative samples;
3: Implement a 5-fold cross validation on the positive-negative dataset
with feature type W and do grid search of SVM parameters;
4: end for
5: Use F1 score as the metric, determine the optimal SVM parameters
opPara and the optimal feature type W[op];
6:
[MATH: ∀ xϵUN,n(x)←0,f(x)←0; :MATH]
7: for b=1 to V do
8: Draw a bootstrap sample UN[b] of size R in UN;
9: Train a classifier f[b] to discriminate PO against UN[b] with opPara
and W[op];
10: For any
[MATH: xϵ(UN\UNb)
:MATH]
, update:
11:
[MATH: f(x)←f(x)+fb(x);
:MATH]
12:
[MATH: n(x)←n(x)+1; :MATH]
13: end for
Output The score
[MATH: s(x)=f(x)n(x),
mo>for xϵ UN. :MATH]
[94]Open in a new tab
The basic combination of the sub-vectors is to merge the disease gene
distribution entropy sub-vector
[MATH:
<Vdch
r> :MATH]
and lncRNA sequence's k-mer frequency sub-vector
[MATH:
<VLnc
seq><
/mrow> :MATH]
Here, we set k=3 (k-mer size for lncRNA sequence) and k1=9 (k-mer size
for disease gene series) in the previous section. This basic feature
vector is a 109-dimensional (45+64) feature vector, simply denoted by
sf1+sf3. We name it the type-0 feature vector. Adding other sub-vectors
such as the disease pathway distribution entropy vector
[MATH:
<Vdpa
th> :MATH]
(sf2, 30-dimensional), lncRNA expression profile
[MATH:
<VLnc
prof<
mo>> :MATH]
(sf4, 16-dimensional), the basic feature vector can be expanded into
another three feature types, i.e., the feature type 1∼3 in Table [95]1.
Furthermore, the co-expression feature namely the fifth sub-vector
[MATH:
<Vco−exp> :MATH]
(sf5, 3-dimensional) was added to each of the former combinations to
form four more feature types which are showed in the last four lines of
Table [96]1.
Table 1. Feature types and their corresponding performance.
Type Combination of sub-vectors Liner kernel RBF kernel
0 sf1, sf3 C = 7, F1 = 0.6668 C = -1, G = -1, F1 = 0.6734
1 sf1, sf2, sf3 C = 2, F1 = 0.6895 C = 3, G = -5, F1 = 0.7024
2 sf1, sf3, sf4 C = 7, F1 = 0.6692 C = 6, G = -2, F1 = 0.6734
3 sf1, sf2, sf3, sf4 C = 0, F1 = 0.6942 C = 5, G = -7, F1 = 0.7058
4 sf1, sf3, sf5 C = 8, F1 = 0.6658 C = 0, G = -2, F1 = 0.6768
5 sf1, sf2, sf3, sf5 C = 0, F1 = 0.6906 C = 4, G = -6, F1 =0.7032
6 sf1, sf3, sf4, sf5 C = 1, F1 = 0.6708 C = 0, G = -2, F1 = 0.6748
7 sf1, sf2, sf3, sf4, sf5 C = 2, F1 = 0.7004 C = 3, G = -5, F1 = 0.7114
[97]Open in a new tab
Under these different types of vector representation, we tested the
liner kernel and RBF kernel with parameter c = 2^C, where C ranges from
-8 to 8 with the step size of 1 (liner kernel and RBF kernel) and g =
2^G, where G ranges from -8 to 8 with the step size of 1 (RBF kernel).
The best 5-fold cross validation results (F1 scores) on the
lncRNADisease dataset are shown in the third and fourth columns of
Table [98]1.
Adding the disease pathway distribution entropy sub-vector
[MATH:
<Vdpa
th> :MATH]
(i.e., sf2) can improve the performance for predicting disease-lncRNA
associations (type1 vs. type0, type3 vs. type2, type7 vs. type6,
averagely improved by 0.0257 for liner SVM and 0.0307 for RBF SVM
respectively). However, the improvement by adding the lncRNA expression
profile is not as high as adding the disease pathway distribution
entropy sub-vector (0.0052 for liner SVM, 0.0021 for RBF SVM
averagely). The co-expression feature vector
[MATH:
<Vco−exp> :MATH]
can further improve the prediction performance averagely by 0.0039 and
0.0033 for liner SVM and RBF svm respectively. The combination of all
the 5 sub-vectors (i.e., the type 7 feature vector) worked the best
among the 8 types of feature vectors (on average improving by 0.0223
for liner SVM and 0.0243 for RBF SVM). Furthermore, the RBF kernel
outperforms the liner kernel (on average improving by 0.0092). Thus,
our baseline classifier is the RBF SVM (C = 3, G = -5) with the type 7
feature vector representation (W=7).
Using all the sub-vectors (i.e., the type 7 feature vector) to
represent a pair of disease and lncRNA gene, the 5-fold cross
validation AUC results on the lncRNADisease dataset by bagging SVM are
shown in Figure [99]5, using different bootstrap sample size R and the
bootstrap number V. Here, we repeated the experiment 10 times. The AUC
values were computed via comparing the scores of known pairs (set to be
unknown during the cross validation) with those unknown ones. We note
that we simply set R=|PO| as Mordelet [[100]24] had proved that setting
R to be the same as the size of positive samples is a safe choice for
the bagging SVM.
Figure 5. The boxplot graph of the AUC values for the 5-fold cross validation
experiments.
[101]Figure 5
[102]Open in a new tab
The x-axis is the value of V, and the y-axis is the corresponding AUC
values. The changes of the AUC values with different V are tiny. For a
given V, the prediction results are stable.
The AUC values change in a narrow scope (0.79-0.81) when the bootstrap
number V varies from 10 to 400. In fact, the running time for computing
the scores of unknown samples increases significantly when V is
increasing. As bigger V achieves weak improvement of the performance
but results in significant increase of time cost, we suggest to fix
V=10. This is consistent with the conclusion of Mordelet's report that
when R is large, the SVM usually rarely benefits from bagging. Thus,
our final PU learning classifier is built with following parameters:
RBF kernel SVM with C = 3, G = -5, V = 10, R= |PO| and W = 7.
Performance comparison and case studies
In comparison with two state-of-the-art disease-lncRNA association
prediction methods LRLSLDA [[103]14] and LRLSLDA_ILNCSIM [[104]27], our
leave-one-out cross-validation AUC performance is much better on the
three datasets (Figure [105]6). We note that the source codes of these
two existing methods are not available, but we implemented their
algorithms for a fair comparison. Their datasets are not available
either.
Figure 6. The leave-one-out cross validation results based on three datasets
with different methods.
[106]Figure 6
[107]Open in a new tab
Four methods were compared, our method with type 7 (W=7) feature and
type 1 (W=1) feature, LRLSLDA method and the LRLSLDA_ILNCSIM method.
Our type 7 method works best for all three datasets.
Our method with type 7 feature vector has a superior performance
(AUC=0.8016, 0.8335 and 0.7527 on the three datasets) over the other
three methods: the type 1 vector method (AUC=0.7889, 0.8266 and
0.7216), the LRLSLDA (AUC=0.6882, 0.7308 and 0.6346) and the
LRLSLDA_ILNCSIM (AUC=0.6949, 0.7390 and 0.6435). We note that our type
1 vector needs just the accessible information such as disease genes
and lncRNA sequences, but it can achieve close performance as the type
7 vector method did.
We also did the leave-one-disease-out cross validation when assuming
that all the related lncRNAs of a given disease are unknown. Then we
computed the possibilities of the lncRNAs to be associated with the
disease. The AUC value was used to test how are those already known
related lncRNAs ranked comparing with the unknown ones. There are more
than 40% (68 out of 162 diseases) of the diseases can achieve an AUC
value higher than 0.9. The average AUC of all the diseases is 0.7356.
This suggests that our method is capable of predicting disease-lncRNA
associations even without knowing any association with a given disease.
We did an experiment to predict disease related lncRNAs using the known
454 positive samples and the 29840 unlabeled samples by PU learning.
The predicted results were validated using two other datasets (166
lnc2cancer samples and 29 MNDR samples overlap with the 29840 unlabeled
samples). The ranking scores of the 29840 unlabeled samples and a ROC
curve are plotted in Figure [108]7.
Figure 7. The final prediction test on the lncRNADisease dataset.
[109]Figure 7
[110]Open in a new tab
The x-axis is the unknown disease-lncRNA pairs' predicted ranks. The
y-axis are the predicted scores which means the possibilities of the
samples to be positive. The predicted results were validated via the
lnc2cancer and MNDR datasets. The validated samples were marked on the
score curve. The ROC curve that compares the scores of the validated
samples and the remain unknown samples is drawn at the top right of
this figure. The AUC value achieves 0.9005.
Figure [111]7 shows that most of the validated samples are ranked at
good positions. We regarded those 166 lnc2cancer samples and 29 MNDR
samples as positive while remaining unknown samples as negative and
draw the ROC curve at the top right of Figure [112]7. It achieves the
AUC value of 0.9005, which reveals that our prediction can always rank
the positive samples well. We also did case studies for breast cancer
and prostate cancer. Breast cancer is the leading type of cancer in
women, accounting for 25% of all women cancer patients [[113]28].
Prostate cancer is the second most common type of cancer and the fifth
leading cause of cancer-related death in men [[114]28]. We list in
Table [115]2 top 5 lncRNAs that are (possibly) related to these two
cancer types.
Table 2. Case studies for predicting breast cancer and prostate cancer
related lncRNAs.
Cancer type Predicted related lncRNA Predicted scores Validated status
breast cancer UCA1 0.8685 validated by lnc2cancer
breast cancer DLEU2 0.8375 validated by literatures
breast cancer EPB41L4A-AS1 0.8356 not validated
breast cancer LINC00271 0.8297 validated by literatures
breast cancer 7SK 0.8280 validated by literatures
prostate cancer UCA1 0.9220 validated by lnc2cancer
prostate cancer BCYRN1 0.8983 not validated
prostate cancer HOTAIR 0.8952 validated by lnc2cancer
prostate cancer ZFAS1 0.8810 not validated
prostate cancer BOK-AS1 0.8800 not validated
[116]Open in a new tab
The most-top ranked lncNA that is related to breast cancer is UCA1.
This annotation has been already recorded in the lnc2cancer database.
The second highest ranked lncRNA is DLEU2. In fact, DLEU2 is frequently
deleted in malignancy [[117]29]. It functions as a critical host gene
of the cell cycle inhibitory microRNAs miR-15a and miR-16-1. Both of
these two microRNAs are related to the breast cancer [[118]30]. The
4^th and 5^th top-ranked lncRNAs LINC0271 [[119]31] and 7SK [[120]32]
are related to breast cancers. As to prostate cancer, two top-ranked
lncRNAs UCA1 and HOTAIR have been actually stored in the lnc2cancer
database. These case studies support that our disease vector
representation and PU learning methods are effective to prioritize
disease related lncRNA genes.
MATERIALS AND METHODS
Datasets of diseases and disease related genes were collected and
relevant human KEGG pathways were collected as well for the
construction of the disease vectorization model and the disease gene
prediction method. The details of the datasets and prediction
algorithms are presented below.
Diseases, disease genes and KEGG pathways
The Medical Subject Headings (MeSH) [[121]33], Comparative
Toxicogenomics Database (CTD) [[122]34], Disease Ontology (DO)
[[123]35] and Online Mendelian Inheritance in Man (OMIM) [[124]36] are
widely visited databases containing massive amount of disease related
information. However, there is no standard for the adoption of disease
names or ids between these databases. We mapped disease names to DO ids
using the DO, MeSH and CTD as dictionaries. Similarly, for genes, we
did id or name conversion using the data records from the HUGO Gene
Nomenclature Committee (HGNC) [[125]37] database. It contains reference
records of genes among a great number of widely used databases. In this
work, we mainly mapped the genes obtained from various resources to
entrez gene ids [[126]38]. We downloaded the HGNC database on Jun 17,
2016. There are totally 39670 approved gene records with entrez gene
ids including 19025 protein coding genes and 20645 non-protein coding
genes.
We downloaded disease-gene associations from the supplementary file of
a published article [[127]18] which contains 117,190 associations
between 2817 diseases and 12063 genes. The authors collected these data
records from database SIDD [[128]39]. Each of the diseases has a unique
id from database DO. After data correction and redundancy removal
according to the latest version of the databases DO, MeSH, CTD and
HGNC, we obtained a set of 114754 disease-gene associations between
2802 diseases and 10893 genes (including 10321 protein coding genes and
572 non-protein coding genes). The human KEGG pathways were extracted
from the KEGG database on June 21, 2016. There are 303 unique pathways
containing 7060 unique genes (all have an entrez gene id). All these
datasets are listed in [129]Supplementary File 3 and [130]Supplementary
File 4.
Associations between diseases and lncRNAs
The disease-lncRNA associations were obtained from three databases:
lncRNADisease (downloaded on April 18, 2016), lnc2cancer (downloaded on
July 4, 2016) and MNDR (downloaded on June 30, 2016). There are 1102,
1239 and 754 disease-lncRNA associations (redundant and unclear
information are existing). For the diseases, we mapped them to DO. To
construct our PU learning model for disease related lncRNA prediction,
we collected the sequences and expression profiles of the lncRNAs.
We mapped each of these lncRNAs to its corresponding ensembl gene id,
RefSeq accession id, entrez gene id and other detail information. This
process was manually finished via searching and comparing the lncRNA
related databases such as ensembl [[131]40], NONCODE [[132]41],
Lncipedia [[133]42], lncRNAdb [[134]43], and HGNC. Then, lncRNA
sequences were extracted from the RefSeq [[135]44]. We downloaded the
expression level of 60245 genes (coding or non-coding genes matched
with an ensembl id and gene symbol) in 16 tissues from the Expression
Atlas [[136]45].
Finally, we obtained 454 disease-lncRNA associations from lncRNADisease
(between 162 diseases having known disease genes and 187 lncRNAs with
known sequences and expression levels). Those 594 (79 cancers, 310
lncRNAs) and 176 (86 diseases, 57 lncRNAs) more pairs were extracted
from lnc2cancer and MNDR respectively. For those diseases that not
exist in the above 2802 ones, disease genes were obtained from the CTD,
DisgeNet [[137]46], OMIM and malaCard [[138]47]. The datasets are
stored in [139]Supplementary File 5.
Disease gene chromosome preference analysis and disease vectorization method
Human genes are located on mitochondrion and 24 unique chromosomes
including 22 autosomes and two sex chromosomes. The genes’ locations on
the chromosomes or mitochondrion have been labeled by the HGNC
database. As disease related genes are distributed at varies locations
and have a different number of each disease, we hypothesize that the
gene distribution differences between diseases on the chromosomes or
mitochondrion may reflect the divergences of the diseases. We also
hypothesize that disease genes may have some preferred chromosomes for
some diseases. This hypothesis can be investigated by the disease
genes’ chromosomes enrichment analysis via fisher's exact test
[[140]48]. Thus, it is better to characterize the distribution
properties of disease genes on each of the chromosomes instead of on
the whole known gene set (we call it a “part overcomes the whole”
hypothesis).
On the basis of these hypotheses, we considered to vectorize a disease
via modeling the distribution property of its related gene set.
However, with just the gene distribution characteristics, there may be
no gene function information involved. Thus, we considered to extract
the distribution properties of disease gene enriched KEGG pathways
comparing to all the known pathways to inject complementary information
for our vector representation of diseases. This vectorization process
includes the following steps:
Step1: Initialization. Sorting all known genes according to their
chromosome locations and sorting all the human KEGG pathways by their
ids.
Step2: Grouping. Dividing the genes and pathways into groups. Producing
a status series for each group with the length equals to the number of
genes or pathways it contains. These statuses are initialized to be 0
(inactivated).
Step3: Mapping. For a given disease related gene set, mapping it to the
gene groups and mapping its enriched pathways to the pathway groups.
Then, setting the corresponding status of a gene or pathway in the
status series to be 1 (activated) if it has been mapped.
Step4: Vectorization. Calculating the status series’ k-mer information
entropy of each gene group or pathway group to quantify them and
constructing two sub-vectors for a given disease.
Here, dividing all the genes and pathways into groups is to apply our
“part overcomes the whole” hypothesis. In our Results section, we
demonstrate that this strategy (part) is more effective for
characterizing diseases comparing to the status series without dividing
(the whole). As a chromosome always contains a p-arm and a q-arm, we
divide the genes into groups according to the natural chromosome
substructures. For the pathway status series, we divide it into T
groups on average. (There is no guidance for us to divide pathways
similar to chromosome structure). Finally, this vectorization model
includes two parts: disease gene set vectorization and disease gene
enriched pathway set vectorization.
Let d represent a disease, and d[g] = {g[1], g[2], …, g[c], …, g[n]} be
its related gene set. Let all of the approved genes from HGNC be G =
{g[1], g[2], …, g[i], …, g[N]}, and the pathway set from KEGG database
be P = {p[1], p[2], …, p[j], …, p[M]}. Let the unique genes in P be
represented as G[p] = {g[1], g[2], …, g[A]} while each pathway related
gene set as g[pj]. We define a k-mer sub-status series as (s1, s2, …,
s[r], …, s[k]), where s[r] = 0 or 1. By definition, there can be 2^k
possible k-mer sub-status series. The detail process is described as a
pseudo codes in Algorithm [141]1 and outlined in Figure [142]1. The
source codes can be referred to Supplementary Codes.
The outputs of our vectorization algorithm are two fix length vectors:
the disease gene distribution entropy vector
[MATH:
<Vdch
r> :MATH]
with length of S and the disease pathway distribution entropy vector
[MATH:
<Vdpa
th> :MATH]
with length of T. To test whether our disease vectors are effective and
to determine the parameters such as the k-mer sizes, we apply them to
compute the disease similarities. For two given diseases d1 and d2, we
derive their disease gene distribution entropy vectors
[MATH:
<Vd1chr>
:MATH]
,
[MATH:
<Vd2chr>
:MATH]
and their disease pathway distribution entropy vectors
[MATH:
<Vd1path><
/mrow> :MATH]
,
[MATH:
<Vd2path><
/mrow> :MATH]
with Algorithm [143]1. Then we compute the similarity between d1 and d2
with their vectors. The similarity between d1 and d2 is denoted as Sim
(d1, d2) and computed by:
[MATH:
{sim<
mi>Ge(d1,
d2)=0
if‖Vd1ch
r‖×
‖Vd2
mrow>chr‖<
/mo>=0;subspace(Vd1chr,Vd1chr) else :MATH]
(1)
[MATH:
{sim<
mi>Pe(d1,
d2)=0
if‖
Vd1path‖×‖Vd
1path‖=0;subspace(Vd1pa
th,Vd1path
mi>) e<
mi>lse :MATH]
(2)
[MATH:
Sim(d1,d2)=
e−[θ×
mo>simGe+(1−θ)×simPe]
mo> :MATH]
(3)
where θ is a parameter to mediate the ratio of each vector's
contribution to the similarity. || ● || means the norm. subspace(x, y)
is the function to obtain the angle between two vectors x and y. A
larger value of Sim(d1, d2) shows more similar between the two
diseases.
The four parameters k1(the size of k-mer for gene series), k2 (the size
of k-mer for pathway series), T and θ can be determined via a
performance test through comparing the disease similarity with a
benchmark dataset. We first set θ=1 to optimize k1 and set θ=0 to
optimize k2, T with the objective of achieving the best performance.
Then, the three parameters are set as the optimal values to select the
best θ. Similarly, we can also apply subspaces between the disease gene
status series (a disease is represented as a fixed-length vector with
the elements equal to 0 or 1) or the pathway status series themselves
instead to measure the similarities of diseases. We call them the
disease gene status series vector method and the pathway status series
vector method. In the Results section, we compare the performances of
our status series methods and our entropy vector methods to prove our
“part overcomes the whole” hypothesis.
Prioritizing disease related lncRNA genes
We always just have the positive samples of disease-lncRNA
associations, as the negative samples, namely the lncRNAs that do not
relate to the diseases, are neglected or even cannot be obtained.
Supervised learning algorithms are unable to deal with these
situations. However, the Positive Unlabeled learning (PU learning)
method [[144]49] can address this issue effectively. PU learning has
been an effective method for solving similar problems in bioinformatics
such as disease gene prediction [[145]50], predicting conformational
B-cell epitopes [[146]51], splicing elucidation [[147]52] and kinase
substrate prediction [[148]53]. These PU learning approaches are mainly
derived from two types of PU learning algorithms: the biased-svm
[[149]54] and Elkan et al's lemmas [[150]55]. The application of Elkan
et al's lemmas requires the satisfaction of “selected completely at
random assumption”, while the biased-svm methods need to tune a set of
parameters. Mordelet et al [[151]24] proposed a bagging svm model for
PU learning and proved that their model can match and even outperform
the biased-svm algorithm. Especially, the bagging svm for PU learning
algorithm can run considerably faster. We adopt this bagging svm PU
learning to prioritize disease related lncRNA genes.
Let Lnc be a lncRNA gene, represented as Lnc=l[1]l[2]… l[e] …l[O]. We
calculate its k-mer frequency
[MATH:
<VLnc
seq><
/mrow> :MATH]
and its expression profile
[MATH:
<VLnc
prof<
mo>> :MATH]
. As there are four kinds of nucleotides in a lncRNA sequence (i.e.,
l[e] ϵ {A, G, C, T} ), there are 4^k possible k-mers. These k-mers are
sorted by their alphabetic order. Their frequencies are counted via the
window sliding mechanism with the window size of k and a step size 1,
which are then the elements of the vector
[MATH:
<VLnc
seq><
/mrow> :MATH]
. The expression profile of Lnc can be extracted from the Expression
Atlas [[152]45]. The expression levels of the lncRNA gene in the 16
tissues are the elements of the vector
[MATH:
<VLnc
prof<
mo>> :MATH]
.
Then for a disease-lncRNA pair, e.g. d-Lnc, we construct another
feature namely vector
[MATH:
<Vco−exp> :MATH]
, called the co-expression levels. This sub-vector can be constructed
on the basis of the principle that a disease related lncRNA gene may
show the preference of co-expressing with other genes associating with
this disease (such as the lncRNA’s targets). This sub-vector contains
three elements, i.e. the maximum, minimum and average spearman
correlation coefficients
[MATH:
(<Vc<
mi>o−exp>=<<
/mo>maxco−exp,minco
−exp,avgco−<
mi>exp>)<
/mrow> :MATH]
between the expression profile of Lnc and all the known disease d
related genes’ expression profiles.
The whole disease-lncRNA feature vector is formed by combining the five
sub-vectors: the disease gene distribution entropy vector
[MATH:
<Vdch
r> :MATH]
(sf1), disease pathway distribution entropy vector
[MATH:
<Vdpa
th> :MATH]
(sf2), lncRNA sequence's k-mer frequency
[MATH:
<VLnc
seq><
/mrow> :MATH]
(sf3), lncRNA expression profile
[MATH:
<VLnc
prof<
mo>> :MATH]
(sf4), and the co-expression features
[MATH:
<Vco−exp> :MATH]
(sf5). The pseudo codes for prioritizing the disease related lncRNAs
with the bagging SVM for PU learning model are shown as Algorithm
[153]2.
In Algorithm [154]2, |PO| means the sample size of the positive
dataset. The feature type means the type of combination of the five
sub-vectors. The feature vector
[MATH:
<Vdch
r,VLncseq
> :MATH]
is used as the basic feature type. Adding the remaining sub-features to
this basic type makes new feature types. The best one can be identified
via comparing the results of the cross-validation experiments. After
obtaining the scores for the unlabeled samples, we sort them. The
larger scores imply that the samples are more likely to be positive
ones.
CONCLUSION
In this article, we proposed a novel disease vectorization
representation and applied for a PU learning method to predict and
prioritize disease related lncRNA genes. A disease is newly
characterized here using the distribution properties of disease genes
on the chromosome substructures and its related KEGG pathways to all
the pathways. Our vectorization model can be applied to compute the
disease similarities effectively. Testing on the benchmark datasets,
our method can work better than the state-of-the-art methods.
Especially, it can also work with only lncRNA sequences information or
without known related association.
Future work has been planned to improve the performance of our
vectorization model. First, more accurate disease genes will be
collected as our model critically relies on the reliability of disease
genes. Secondly, more information will be introduced to decrease the
disease gene dependency such as the disease symptom, the disease
semantics and so on. Furthermore, the relationship between the disease
genes and lncRNA targets will be considered to extract more effective
features to predict disease-lncRNA gene associations.
SUPPLEMENTARY MATERIALS FIGURES AND TABLES
[155]oncotarget-08-78901-s001.pdf^ (2.2MB, pdf)
[156]oncotarget-08-78901-s002.xlsx^ (148KB, xlsx)
[157]oncotarget-08-78901-s003.xls^ (7.2MB, xls)
[158]oncotarget-08-78901-s004.xlsx^ (1.9MB, xlsx)
[159]oncotarget-08-78901-s005.xlsx^ (4.8MB, xlsx)
[160]oncotarget-08-78901-s006.zip^ (763.3KB, zip)
Acknowledgments