Abstract
Measuring conditional relatedness between a pair of genes is a
fundamental technique and still a significant challenge in
computational biology. Such relatedness can be assessed by gene
expression similarities while suffering high false discovery rates.
Meanwhile, other types of features, e.g., prior-knowledge based
similarities, is only viable for measuring global relatedness. In this
paper, we propose a novel machine learning model, named Multi-Features
Relatedness (MFR), for accurately measuring conditional relatedness
between a pair of genes by incorporating expression similarities with
prior-knowledge based similarities in an assessment criterion. MFR is
used to predict gene-gene interactions extracted from the COXPRESdb,
KEGG, HPRD, and TRRUST databases by the 10-fold cross validation and
test verification, and to identify gene-gene interactions collected
from the GeneFriends and DIP databases for further verification. The
results show that MFR achieves the highest area under curve (AUC)
values for identifying gene-gene interactions in the development, test,
and DIP datasets. Specifically, it obtains an improvement of 1.1% on
average of precision for detecting gene pairs with both high expression
similarities and high prior-knowledge based similarities in all
datasets, comparing to other linear models and coexpression analysis
methods. Regarding cancer gene networks construction and gene function
prediction, MFR also obtains the results with more biological
significances and higher average prediction accuracy, than other
compared models and methods. A website of the MFR model and relevant
datasets can be accessed from [40]http://bmbl.sdstate.edu/MFR.
Introduction
Biological functions of a gene are cooperating with others when they
are in a common cellular environment or the same pathway. Measuring
relatedness between a pair of genes is increasingly crucial for
understanding the underlying complex interactions and functional
relationships of a biological system. Measured relatedness between a
pair of genes has been routinely used to construct biological
networks^[41]1–[42]5 and to predict novel genomic
functions^[43]6–[44]8. The gene-gene interaction is usually modeled as
a 0/1 (non-interacting/interacting) binary relation between a pair of
genes, while the relatedness implies a degree of the relationship
between a pair of genes.
The relatedness can be measured by two types of features: expression
similarities and prior-knowledge based similarities. The first type of
features is usually used to measure the conditional relatedness that is
the coexpression level between a pair of genes under a certain
condition, such as in inflammation or tumor tissues, according to the
correlation between their expression patterns^[45]9–[46]15, including
but not limited to, Pearson correlation coefficient (PCC)^[47]16,
Spearman rank correlation (SRC)^[48]17, mutual information
(MI)^[49]18–[50]21, partial Pearson correlation (PPC)^[51]22–[52]24,
and conditional mutual information (CMI)^[53]25. Several coexpression
databases have been constructed based on a wild range of available
expression data, e.g., the COXPRESdb^[54]26 and the GeneFriends^[55]27.
The second type of features is usually used to measure gene-gene
relatedness using the documented biological data and functional
annotations in public domain^[56]28–[57]30, e.g., the gene function
database Gene Ontology (GO)^[58]31, the homologous gene database
orthoDB^[59]32, the biological pathway databases KEGG^[60]33 and
Reactome^[61]34,[62]35, the protein-protein interaction (PPI) databases
HPRD^[63]36 and DIP^[64]37, and the transcriptional regulatory
databases HTRIdb^[65]38 and TRRUST^[66]39.
However, there is still a considerable room for improvement of accuracy
and robustness in measuring gene-gene conditional relatedness by
expression similarities and prior-knowledge based similarities,
respectively. First, the accuracy of using expression similarities need
to be improved. For example, PCC is known to have a high false
discovery rate, especially when the sample size is small, as mentioned
in refs^[67]40,[68]41, which severely impacts the results of further
computational analysis and biological interpretations. As there are
over 400 million gene pairs in human, a slight increase in false
discovery rate would bring an over-estimated number of gene-gene
interactions. Second, the robustness of using prior-knowledge based
similarities need to be improved as they are only viable for measuring
global relatedness^[69]28–[70]30. Their experiments are usually
conducted in a common environment, making prior-knowledge based
similarities are not suitable for measuring gene-gene conditional
relatedness.
Here, we propose a novel machine learning model, Multi-Features
Relatedness (MFR), for measuring conditional relatedness between a pair
of genes with an assessment criterion. The goal of MFR is to accurately
measure conditional relatedness between genes by integrating expression
similarities and prior-knowledge based similarities. Specifically, a
gene pair with a low expression similarity will be given a low rank
even though they have a high prior-knowledge based similarity, as their
relations are not specified under current condition from our point of
view; and a gene pair with a high expression similarity and a low
prior-knowledge based similarity will also be scored a low rank, as it
tends to be a false discovery prediction in coexpression analysis. Gene
pairs with both high expression similarities and high prior-knowledge
based similarities will be kept and recommended by this model.
Intuitively, the problem can be formulated into a single-objective
generalized linear logit regression problem under the following
hypotheses: (i) fitting of relatedness supported by expression
similarities is equal to fitting of relatedness supported by
prior-knowledge based similarities; (ii) both features contribute to
fitting on the same level; and (iii) the fitting target relatedness are
0/1 (non-interacting/interacting). We use support vector machine
(SVM)^[71]42 with the linear kernel to solve this regression problem
and optimize suitable parameters of relevant features. MFR is used to
predict gene-gene interactions extracted from the COXPRESdb, KEGG, and
TRRUST databases and a benchmark dataset of Pan et al.’s study^[72]43
by the 10-fold cross validation and test verification, and to identify
gene-gene interactions collected from the GeneFriends and DIP databases
for further verification. The results show that MFR achieves the
highest area under curve (AUC) values for identifying gene-gene
interactions in the development, test and DIP datasets. Specifically,
it obtains an improvement of 1.1% on average of precision for detecting
gene pairs with both high expression similarities and high
prior-knowledge based similarities in all datasets, comparing to other
linear models and coexpression analysis methods. In terms of cancer
gene networks construction and gene function prediction, MFR also
obtains the results with more biological significances and higher
average prediction accuracy than other compared models and methods.
Materials and Methods
MFR workflow
There are five steps in the MFR workflow as shown in Fig. [73]1: (i)
gene pair samples collection from the COXPRESDdb, KEGG and TRRUST
databases and a benchmark dataset from a published study^[74]43; (ii)
gene features extraction from the GEO, GO and orthoDB databases for
assessing similarity-based gene pair features; (iii) 12
similarity-based gene pair features calculation using four gene
features and the Reactome and HTRIdb databases; (iv) SVM-based model
construction by a 10-fold cross validation, where our model is
repeatedly trained by 81% gene pairs and developed by other 9% in 10
times; and (v) application of the developed model to detect gene-gene
interactions in the remaining 10% gene pairs and the other two
verification datasets (the GeneFrineds and DIP datasets), construct
cancer gene network, and predict gene functions. The results are
compared with other linear models and coexpression analysis methods,
including logit regression, linear discriminant analysis (LDA)^[75]44,
PCC, SRC, MI, PPC, and CMI. The trained MFR model is saved as an R
data, and the datasets and the results of the current study can be
freely downloaded at [76]http://bmbl.sdstate.edu/MFR for academic uses,
further verification, and biological analysis.
Figure 1.
[77]Figure 1
[78]Open in a new tab
Workflow of MFR model. Five steps are in the workflow, including (i)
gene pair samples collection, (ii) gene features extraction, (iii) gene
pair features calculation, (iv) SVM model construction and (v)
verification and discussion.
Model construction dataset
The gene-pair dataset for MFR model training, development and test is
composed of the coexpression and prior-knowledge sub datasets. The
former one is retrieved from the COXPRESdb database, where the
positives and negatives are the coexpressed and discoexpressed gene
pairs, respectively; and the latter one is made up by the KEGG, PPI,
and TRRUST sub-sub datasets, where the positives are the gene pairs
composed by genes involved in the same pathways, with PPIs or
transcriptional regulation relationships, and negatives are the gene
pairs composed by genes involved in different pathways, without PPIs
and transcriptional regulation relationships. The structure of each sub
dataset and sub-sub dataset are listed in Table [79]1. Some of the
negative gene pairs are obtained by permutation of the positives, and
then selected randomly to make sure the same number of positives for
construction of a model with high generalization. To keep the bias from
random permutation and selection, we repeat the process of the dataset
generation by 100 times giving rise to 100 datasets. Each of these
datasets is used to train, develop and test, and the average AUC value
and positive predictive value (PPV) are calculated to develop suitable
hyperparameters and compared to other models or methods, where the
training, development and test sets are obtained according to the
detailed proportion of the sub and sub-sub datasets. In each of the 100
datasets, we obtain 67,000 positive gene pairs and 74,560 negative gene
pairs. It notices that the numbers of the positive and negative gene
pairs are counted after remove the gene pairs without enough gene-pair
features. Also, the fitting target MFR values for the positive gene
pairs are marked as 1 s and those for the negatives as 0 s. The
detailed information can be found in the following sub-sections.
Table 1.
Structure of MFR dataset.
sub dataset Coexpression Prior-knowledge based
sub-sub dataset KEGG PPI TURRUST
Resource database The COXPRESdb^[80]26 The KEGG^[81]33 Ref.^[82]43 The
TRRUST^[83]39
Type of gene pair Positive Negative Positive Negative Positive Negative
Positive Negative
Sample size 30,353 29,607 13,386 13,386 18,227 26,533 5,034 5,034
[84]Open in a new tab
The coexpression sub dataset
In the COXPRESdb database, PCC for each gene pair is transferred to the
Mutual Rank (MR) value^[85]45. The smaller of an MR value, the higher
coexpression intensity of the corresponding gene pairs have, and the
coexpressed genes of a specific target gene are ranked by their MR
values in increasing order. For each gene, we select the first 50 genes
in its coexpressed gene list to compose 50 coexpressed gene pairs from
the Hsa-m.c4-1 and Mmu-m.c3-1 datasets, respectively. Then the commonly
coexpressed gene pairs in both datasets are used as the positive gene
pairs. In avoid of the imbalance issue between the positive and
negative gene pairs, we select 80 genes in the middle of the
coexpressed gene list for each gene to compose 80 discoexpressed gene
pairs from the Hsa-m.c4-1 and Mmu-m.c3-1 datasets, respectively. Then
the commonly discoexpressed gene pairs in both datasets are used as the
negative gene pairs, where PCC of each negative gene pair is around 0.
In total, there are 30,353 positive gene pairs and 29,607 negative gene
pairs generated in the coexpression sub dataset.
The prior-knowledge sub dataset
The prior-knowledge sub dataset is composed by the KEGG, PPI, and
TRRUST sub-sub datasets, and the collection of gene pairs are listed as
follows.
(A) The KEGG sub-sub dataset. The genes and pathways in metabolism,
genetic information processing, environmental information processing,
and cellular processes are downloaded from the KEGG database. The
13,386 positive gene pairs are composed by the genes involved in at
least three same KEGG pathways, and the 13,386 negatives are randomly
selected gene pairs composed by the genes involved in different KEGG
pathways to keep the balanced number between the positive and the
negative gene pairs.
(B) The PPI sub-sub dataset. This dataset is collected from the
study^[86]43, which has been used as the standard test set for PPI
prediction^[87]46–[88]48, as its reasonable sampling and the balanced
number between the positive and negative gene pairs. The 18,227
positive gene pairs are the ones with PPIs from the HPRD database, and
the 26,533 negatives are composed of genes located in different
organelles, in addition to those gene pairs without PPIs proved by
experiments, which are collected from the Negatome database^[89]49.
(C) The TRRUST sub-sub dataset. The 5,034 gene pairs with
transcriptional regulatory relationships from the TRRUST dataset are
used as the positive gene pairs. Then we randomly permutate
transcription-factor genes with regulated genes as the negative gene
pairs, making sure to obtain 5,034 negatives as the same number as the
positives.
Gene features
MFR uses 12 similarity-based gene pair features to assess conditional
relatedness between a pair of genes. Ten out of these 12 features are
calculated using four gene features, including gene-expression level,
GO annotation, homologous profile, and subcellular localization. More
details are listed as follows.
Expression data
Six hundred two datasets with 15,679 samples from the GEO
database^[90]50 based on the unique Affymetrix Human Genome U133 Plus
2.0 Array platform (released on Dec. 2017) are used as expression data
source. Then the pre-processing steps are executed, including log2
scale and quantile normalization. After removing genes without the
UniProt IDs^[91]51, 16,391 protein-coding genes in human are retained
for further expression data analysis.
Gene ontology data
The GO annotations for human genes are obtained from the GO database
(435,975 annotations released on Dec. 2017). Only 43,340 biological
process GO terms with experimental evidence are used as functional
annotations for genes in our study. The structure of these GO terms can
be described as a tree, where the relationships among GO terms fall
into four categories: “is a”, “part of”, “has part” and “regulates”.
However, we only use 456,781 “is a” relations to assess the GO
similarity between genes
Homologous data
Over 22 million genes from over 5,000 species, including 169,376 human
homologous genes from 20 species, are used to construct the homologous
profile data by the orthoDB database (version 9.1).
Subcellular localization data
A total of 160,537 cellular component annotations of human genes from
the GO database (released on Dec. 2017) are used as the subcellular
source to measure subcellular localization similarity between a pair of
genes.
Verification and discussion resources
Besides a test verification, we compare MFR with other linear models
(logit regression, LDA) and coexpression analysis methods (PCC, SRC,
MI, PPC, and CMI), regarding performances in further verification for
the GeneFriends and DIP datasets, construction of cancer gene network,
and prediction of KEGG metabolomic gene functions. These resources are
described as follows.
The GeneFriends and DIP datasets
With the elimination of gene pairs without enough gene-pair features,
overall 9,146 coexpressed gene pairs with top 20 PCC values for each
gene from the GeneFriends database are used as the positive gene pairs.
Considering real coexpressed gene pairs are rare in the whole human
genome, the 9,146 randomly selected negative gene pairs generated by
permutation of the first and the second genes in the positive gene
pairs. Similarly, a total of 1,489 gene pairs with the direct-PPIs from
the DIP database (leased on Dec. 2017) are used as the positive gene
pairs. The negative gene pairs are 1,489 randomly selected gene pairs
composed by permutating the first and the second genes in positive gene
pairs. Because the negative gene pairs in the GeneFriends and DIP
datasets are both generated by random permutation and selection. To
avoid the bias of such random sampling, we repeat the whole dataset
generation process for 100 times giving rise to 100 GeneFriends and 100
DIP datasets, respectively. The average AUC value and PPV of each of
the 100 datasets are used to compare models or methods in verification.
Cancer gene-expression data
The RNA-seq data of four cancer types are downloaded from the TCGA
database^[92]52, each having at least ten cancer samples and ten normal
samples, with more details showcased in Table [93]2. Before further
analysis, this expression data is pre-processed, including log2 scale
and quantile normalization.
Table 2.
Sample size of RNA-seq data for four cancer types.
Type Cancer (Samples) Normal (Samples)
Bladder urothelial carcinoma (BLCA) 408 19
Breast invasive carcinoma (BRCA) 1095 113
Colon adenocarcinoma (COAD) 285 41
Lung adenocarcinoma (LUAD) 515 59
[94]Open in a new tab
KEGG metabolic genes
In total, 1,403 genes of 84 metabolic pathways from the KEGG database
are used to compare different models and methods regarding predicting
gene functions. Specifically, 100 out of these genes are randomly
selected as the genes without any prior knowledge, and then their
functions are predicted by analyzing functional annotations of other
1,303 genes. Such a process is repeated for 100 times, and the average
prediction rates are used to indicate the capability for gene function
prediction.
Gene pair feature calculation
While traditional coexpression analysis methods use a signal type of
features to measure conditional relatedness between genes, MFR uses
multi-features including both expression similarities and
prior-knowledge based similarities. Twelve similarity-based gene pair
features are used in MFR which are defined as follows.
Seven features based on expression similarities
We firstly use average expression levels of each gene, exp1, and exp2,
as the first two features for a gene pair. The following five features
are a gene pair’s coexpression levels measured by PCC, SRC, PPC, MI,
and CMI. PCC is used to measure linear coexpression relationship; SRC
and MI are used to measure non-linear coexpression relationship, where
different from SRC based on ranks, MI determines how similar the joint
distribution of two genes’ expression levels is to the products of
factored marginal distribution for indicating the association between
their expressions. PPC is used to measure direct linear coexpression,
which is the coexpression relationship between a pair of genes measured
avoiding any influence of other genes; Similarly, CMI is used to
measure direct non-linear coexpression.
One feature based on the gene ontology similarity
The GO similarity (goSim) is used as the eighth feature because the
genes with interaction are considered being involved in the similar
biological process. It can be defined as:
[MATH:
goSimi,j=maxo∈Oi,q<
mo>∈Oj2×log(Pms(o,q))log(P(o))+log(P(q)) :MATH]
1
[MATH: Pms(o,q)=min<
/mi>c∈A(o,q)P(c) :MATH]
2
[MATH: P(o)=|D(o)|+1|D(root)|+1 :MATH]
3
where O[i] and O[j] indicates the GO term sets used for annotating gene
i and j, respectively; A(o, q) is the common ancestor set of GO term o
and q; P(o) is the probability of a gene annotated by an instance of GO
term o^[95]53; D(o) and D(root) indicate the descendant GO term sets of
GO term o, and the root GO term, respectively.
One feature based on subcellular localization similarity
The ninth feature, subcellular localization similarity (lcSim), is used
to calculate the probability for two protein-coding genes appearing in
a common organelle. It can be defined as:
[MATH:
lcSimi,j=
|Li∩Lj||Li∪Lj| :MATH]
4
where L[i] and L[j] are the subcellular localization sets of proteins
encoded by the genes i and j, repressively.
One feature based on homology similarity
Since common presence and absence of two genes in many species suggest
a potential functional relatedness between them, the homology
similarity (hgSim) is used as the 10th feature calculated using an
improved Pearson correlation method^[96]54 as:
[MATH: hgSim<
/mrow>i,j=N×M−n<
/mrow>i×nj(N×ni−ni2)×(N×nj−nj2) :MATH]
5
where n[i] and n[j] are the numbers of species whose genome contains
the orthologous genes of gene i and j, respectively; N = 21 is the
total number of species we use, and M is the number of species whose
genome contains both orthologous genes of gene i and j.
One feature based on Reactome similarity
Overall, 202,772 gene-gene interactions derived from the Reactome
pathways are used to construct an unweighted graph, in which nodes
represent genes and edges represent interactions between genes. The
normalized distance of a gene pair is used as the 11th feature named as
Reactome similarity (rxSim), which is defined as:
[MATH:
rxSimi,j=1−disi,jd
ismax<
/mi> :MATH]
6
where dis[i,j] is the shortest distance between gene i and j, and
dis[max] is the shortest distance between the farthest gene pair in the
graph.
One feature based on transcriptional regulatory similarity
Totally 284 transcription factors, 18,302 regulated genes, and 51,871
transcriptional regulatory interactions between them are obtained from
the HTRIdb database. If there is a record that a gene pair has a
transcriptional regulatory interaction, the transcriptional regulatory
similarity (trSim) used as the 12th feature of this gene pair is 1,
otherwise is 0.
SVM model construction
MFR is designed based on SVM, which is a supervised learning model,
with associated learning algorithms for classification and regression
analysis. The motivation is to classify data by using the best
hyperplane that is the one that represents the most extensive
separation, or margin, between two classes. We take a total of 12
similarity-based gene pair features as input, and the output value as
an assessment criterion, namely MFR, for detecting the conditional
relatedness between a pair of genes (see Fig. [97]2). For model
training, we provide the target MFR values (labels) marked as 1 s and
0 s for the positive and the negative gene pairs, respectively. Given
X = {x[1], x[2], …, x[n]} and Y = {y[1], y[2], …, y[n]}, where x[i] and
y[i] indicates the vector of 12 similarity-based gene-pair features and
the target MFR value (label) of the ith gene pair, repressively, the
MFR model can be construction by conduction Formula (7):
[MATH: maxα<
mo
stretchy="true">(∑i=
1nαi−12<
/mfrac>∑i,j=<
/mo>1nαiα
jyiyjxiT⋅xj
mrow>)s.
t.∑i=1nαiyi<
/mrow>=0;0≤αi≤C
,i=1,2,<
mi>…,n :MATH]
7
where α = {α[1],α[2], …, α[n]} indicates Lagrange multipliers, which
are solved by SMO (sequential minimal optimization)^[98]55. Then
predicted
[MATH: MFR^i :MATH]
value of ith gene pairs is defined as:
[MATH: MFR^i=sigmoid(∑j=
1nαjyjxjT⋅xi+bˆ) :MATH]
8
where
[MATH: bˆ :MATH]
indicates the bias defined as ref.^[99]42.
Figure 2.
[100]Figure 2
[101]Open in a new tab
Structure of the MFR model. The model is based on SVM and uses 12
similarity-based gene pair features as input; and the output value,
namely MFR, is applied as an assessment criterion for measuring
conditional relatedness between genes.
Because there are not enough positive gene pairs with both high
expression similarities and high prior-knowledge based similarities for
directly training, we collect positive gene pairs with high expression
similarities and the corresponding negatives to compose the
coexpression sub dataset. Similarly, we collect positive gene pairs
with high priori-knowledge based similarities and the corresponding
negatives to compose the priori-knowledgesub dataset. Then MFR is
trained by gene pairs in the whole dataset including both coexpression
and prior-knowledge sub datasets at the same time to provide our model
the capability for identification of gene pairs with both high
expression similarities and high prior-knowledge based similarities,
rather than trained by coexpression sub dataset or prior-knowledge sub
dataset separately. And a higher MFR value indicates that two genes are
more likely to be interacting with each other. In detail, we employ
LIBSVM^[102]56 with the linear kernel to implement our model.
MFR is constructed by the 10-fold cross validation, in which we use 81%
of the gene pairs for training and 9% for development. The procedure is
repeated by 10 times. The hyperparameters with the highest average AUC
value of the whole cross-validation are selected. Then we use the rest
10% gene pairs to conduct test verification. The result of our model in
the 10-fold cross validation and test verification is compared with
those of other linear models or coexpression analysis methods as shown
in Results (see Figs [103]3 and [104]4). After training and
development, the weight w[S] of the gene-pair feature S is finalized as
w[exp1] = −0.810, w[exp2] = −0.807, w[PCC] = −0.017, w[SRC] = 0.840,
w[MI] = 4.875, w[PPC] = 2.414, w[CMI] = −0.055, w[goSim] = 0.972,
w[lcSim] = 1.198, w[hgSim] = 0.433, w[rxSim] = 0.544 and
w[trSim] = 0.668, indicating MI, PPC, goSim and lcSim are the most
important gene-pair features for MFR model, while PCC and CMI are the
least important. The top four important features contain two expression
similarities and two priori-knowledge based similarities indicating
both kinds of features contribute to accurately measuring relatedness
of a pair of genes. MI and PPC obtain the largest weights among
expression similarities maybe because, before calculation of MI, the
expression levels of genes are discretized according to study^[105]57,
making MI get stronger robustness on the noise of gene expressions, and
PPC has more complementarity with MI compared with other expression
similarities, as other expression similarities, specially PCC and CMI,
have some resemblance with MI^[106]58,[107]59. The larger weights of
goSim and lcSim than other priori-knowledge based similarities indicate
two genes with the related functions and the similar organelle
locations mostly have a strong relatedness. The negative weights of
exp1 and exp2 indicate the punishment of the exorbitant expression, as
two of the genes in a gene pair are very hard to have exorbitant
expressions both, and the exorbitant expression of a gene usually
implies a gap of expression with the other gene, indicating a low
relatedness between these genes.
Figure 3.
[108]Figure 3
[109]Open in a new tab
(A) ROCs of nine models or methods for identifying gene-gene
interactions by the 10-fold cross-validation. (B) Average PPVs of nine
models or methods for detecting B0/B1 matched gene pairs by 10-fold
cross-validation.
Figure 4.
[110]Figure 4
[111]Open in a new tab
ROCs of nine models or methods for identifying gene-gene interactions
in the (A) test, (C) GeneFriends and (E) DIP datasets. Average PPVs of
nine models or methods for detecting B0/B1 matched gene pairs in the
(B) test, (D) GeneFriends and (F) DIP datasets.
Performance evaluation
We compare the performances of MFR with other two linear models, i.e.,
logit regression and LDA^[112]44, and five coexpression analysis
methods (PCC, SRC, PPC, MI, and CMI). We choose logit regression and
LDA because they are widely used multi-features generalized linear
logit regression models^[113]60–[114]63. And the five coexpression
analysis methods are selected since they are traditional methods in
measuring conditional relatedness between a pair of
genes^[115]16–[116]25. To make a fair comparison with linear models
with multi-features, we also add the sixth coexpression analysis
method, so-called CXP, which is the integration of PCC, SRC, PPC, MI,
and CMI. Specially, the average value of these five methods is used as
the assessment criterion of CXP, comparable with the result from other
multi-features methods, such as MFR, logit, and LDA. First, we compare
different models and methods in detecting gene-gene interactions on
verification datasets using the receiver operating characteristic curve
(ROC)^[117]64, where gene-gene interactions indicate positive gene
pairs with high expression similarities or high prior-knowledge based
similarities. And we use PPV^[118]65 to compare different models and
methods in identifying gene pairs with both high expression
similarities and high prior-knowledge similarities as defined in
Section 2.7.2. Then, we conduct pathway enrichment analysis to identify
the pathways significantly influenced by the increased glutamine and
glutamate metabolism, on gene modules identified in cancer gene
networks, where nodes represent up-regulated genes and edges show
relatedness measured by each model or method. Finally, the
shortest-path method^[119]66 is applied to predict functions of genes
pretending to have no prior knowledge, on the KEGG metabolic gene
networks, where nodes represent genes involved in KEGG metabolism
pathways and edges represent relatedness calculated using different
models and methods, respectively.
Receiver operating characteristic curve
The ROC curve with its area under the curve (AUC) is a widely used
evaluation tool for performance comparison of different methods. It is
made by plotting true positive rate (TPR) against false positive rate
(FPR), which are defined as:
[MATH: TPR(n)=TP(n)P :MATH]
9
[MATH: FPR(n)=FP(n)N :MATH]
10
where TP(n) indicates the true positive among top n ranked gene-gene
interactions, FP(n) indicates the false positive among top n ranked
gene-gene interactions, P indicates the total number of interacting
gene pairs, and N indicates the total number of non-interacting gene
pairs.
Positive predictive value
The positive predictive value (PPV), so-called precision, is an
intuitive indicator for evaluating prediction results among models, and
a high value of PPV indicates the accuracy of a model. PPV is defined
as:
[MATH:
PPV=TPTP+FP :MATH]
11
where TP and FP are the true positive and the false positive among gene
pairs, respectively.
As it is very hard to give a precise definition of a gene pair with
both high expression similarities and high prior-knowledge based
similarities, we define a gene pair labeled a B0 match if its PCC or
SRC values larger than 0.5 and the goSim and lcSim values larger than
0.5; and labeled a B1 match if its PCC or SRC values larger than 0.3
and the goSim and lcSim values larger than 0.3. And then the PPV of top
5% ranked gene pairs against B0 matched gene pairs, and the PPV of the
top 10% ranked gene pairs against B1 match gene pairs are used to
approximately compare models in terms of prediction of gene pairs with
both high expression similarities and high prior-knowledge based
similarities.
Up-regulated genes identification
A gene is identified to be up-regulated if the fold-change between the
average expression level in cancer samples and that in normal samples
is greater than 1.5 and with a q-value < 0.05 measured by the limma
t-test^[120]67.
Fast greedy modularity optimization method
In the study^[121]68, a method was proposed to find modules in networks
by greedy optimization of modularity^[122]69. The fast-greedy
modularity optimization method^[123]70 performs the same greedy
optimization as the method of^[124]68, but it runs much faster due to
the lower computational cost.
Pathway enrichment analysis
Pathway enrichment analysis is conducted over a given set C of
up-regulated genes against the pathways in KEGG. The statistic
significant p-value of gene set C with n genes enriching pathway P with
K genes can be defined as:
[MATH: Pvalue(k)=1−∑<
/mo>i=0k−1(Ki)(N−Kn−i
)(Nn) :MATH]
12
where N = 18,420 is the total number of human genes and k is the number
of genes in
[MATH: C∩P :MATH]
. Then the p-value is adjusted to be a q-value to restrict the false
discovery rate^[125]71. And we consider the C enriches P if
q-value < 0.01.
Shortest-path method
For identifying all the genes with GO annotations on the shortest path,
the shortest-path method^[126]66 is applied to find the lowest common
ancestor of their GO annotations. If the ancestor is less than three
levels below the root of the GO tree, it is assigned to the genes
without any GO annotation on the shortest path as their functions. A
gene is labeled a L0 match if one of the predicted GO annotations is
its known GO annotation and labeled a L1 match if one of the predicted
GO annotations is its known GO annotations’ direct parents^[127]66.
Then L0 and L1 match ratios relative to the total number of genes
without any GO annotations are used to compare each model or method
regarding gene function prediction.
Results
10-fold cross-validation
We compare the precision of identifying gene-gene interactions by MFR
with the other linear models and coexpression analysis methods on the
development datasets. The ROC results by the 10-fold cross-validation
of different models and methods are showcased in Fig. [128]3. The
linear models including MFR, logit, and LDA are more suitable for
detecting gene-gene interactions, as their average values of AUC are
all larger than those of coexpression analysis methods. Among these
linear models, our model based on SVM performs the best and obtains the
largest average AUC value of 0.819. In terms of prediction of the gene
pairs with both high expression similarities and high prior-knowledge
based similarities, the average PPVs of B0 and B1 matched gene pairs
for linear models are also larger than those for coexpression analysis
methods, where MFR obtains the best performance for the largest average
PPVs of B0 and B1 matched gene pairs of 0.988 and 0.866, respectively.
Verifications on the test, GeneFriends and DIP datasets
The robustness evaluation is carried out through examining the
performances of different models and methods in detecting gene-gene
interactions, and in identifying gene pairs with both high expression
similarities and high prior-knowledge based similarities on three kinds
of verification datasets, including the test datasets, GeneFriends
datasets, and DIP datasets. Specially, the results on GeneFriends
datasets indicate the robustness in detecting gene-gene interactions
and gene pairs from coexpression data, and those on DIP datasets
indicate the robustness in identifying gene-gene interactions and gene
pairs from prior-knowledge based data. As showcased in Fig. [129]4, the
linear models (MFR, logit, and LDA) are better from the result of
verification, as their average AUC values and PPVs are all larger than
those of coexpression analysis methods. MFR obtains the largest average
AUC values on all verification datasets except the GeneFriends dataset,
and the largest average PPVs on all verification datasets, indicating
our SVM-based model has the best robustness.
Cancer gene network construction
The relatedness between a pair of genes can be used as a similarity
between the corresponding nodes in a constructed biological network,
where genes in a set of highly interconnected genes (module) tend to be
involved with relative biological processes. We utilize this property
to predict metabolic pathways significantly influenced by increased
glutamine and glutamate metabolism in four cancer types, which are
BLAC, BRCA, COAD, and LUAD. Glutamine and glutamate metabolism are
reported to be increased in various cancers^[130]72,[131]73, especially
in bladder cancer^[132]74, breast cancer^[133]75–[134]81, colon
cancer^[135]76,[136]78,[137]79,[138]82, and lung
cancer^[139]76,[140]78,[141]79,[142]83. They are also considered to be
closely related to cancer’s proliferation, invasion, and
metastasis^[143]84. For each cancer type, we measure relatedness
between up-regulated metabolic genes using MFR, other linear models,
and coexpression analysis methods, respectively. Then the up-regulated
metabolic genes and their relatedness in each cancer type are used to
construct networks for each model and methods, where nodes represent
genes, and two genes are connected if the MR of their relatedness is
among top three. We collect 21 genes, including eight rate-limiting
enzyme genes for glutaminolysis and 13 genes directly catalyzing
reactions of glutamine or glutamate, defined as the gene markers for
glutamine and glutamate metabolism (see Table [144]3), inspired by a
recent study^[145]85. After identifying modules containing up-regulated
gene markers, the pathway enrichment analysis is conducted on such
modules to predict metabolic pathways directly influenced by increased
glutamine and glutamate metabolism, which are the enriched with
up-regulated gene markers, as shown in Fig. [146]5 and Supplement
Figures [147]S1–[148]S3.
Table 3.
Gene markers for glutamine and glutamate metabolism.
Gene Description Go Term
ASNS Asparagine Synthetase asparagine biosynthetic process
ALDH18A1 Aldehyde Dehydrogenase 18 Family Member A1 proline
biosynthetic process
CAD Carbamoyl-Phosphate Synthetase 2, Aspartate Transcarbamylase, And
Dihydroorotase ‘de novo’ pyrimidine nucleobase biosynthetic process
CS Citrate Synthase tricarboxylic acid cycle
CTPS CTP Synthase 1 ‘de novo’ CTP biosynthetic process
CTPS2 CTP Synthase 2 ‘de novo’ CTP biosynthetic process
DLD Dermcidin 2-oxoglutarate metabolic process
DLST Dihydrolipoamide S-Succinyltransferase tricarboxylic acid cycle
GFPT1 Glutamine-Fructose-6-Phosphate Transaminase 1
UDP-N-acetylglucosamine biosynthetic process
GFPT2 Glutamine-Fructose-6-Phosphate Transaminase 2
UDP-N-acetylglucosamine biosynthetic process
GLUL Glutamate-Ammonia Ligase glutamine biosynthetic process
GLS Glutaminase glutamate biosynthetic process
GLS2 Glutaminase 2 glutamate biosynthetic process
OGDH Oxoglutarate Dehydrogenase tricarboxylic acid cycle
GGDHL Oxoglutarate Dehydrogenase-like tricarboxylic acid cycle
PFAS Phosphoribosylformylglycinamidine Synthase ‘de novo’ IMP
biosynthetic process
PPAT Phosphoribosyl Pyrophosphate Amidotransferase ‘de novo’ IMP
biosynthetic process
PSAT1 Phosphoserine Aminotransferase 1 ‘de novo’ IMP biosynthetic
process
GCLC Glutamate-Cysteine Ligase Catalytic Subunit glutathione
biosynthetic process
GCLM Glutamate-Cysteine Ligase Modifier Subunit glutathione
biosynthetic process
GSS Glutathione Synthetase glutathione biosynthetic process
[149]Open in a new tab
Figure 5.
[150]Figure 5
[151]Open in a new tab
Metabolic pathways are predicted to be directly influenced by increased
glutamine and glutamate metabolism in nine BRCA gene networks.
As shown in Fig. [152]6, we obtain the best prediction results from
MFR-based networks. We predict 15 pathways directly influenced by
increased glutamine and glutamate metabolism in all four cancer types,
which is the most among all the models and methods. For example, in
BRCA, there are three pathways are predicted to be directly related to
increased glutamine and glutamate metabolism, agreeable with
studies^[153]74,[154]86. However, only one or two of the three pathways
are predicted by other models or methods. For MFR, the prediction of
the glycine, serine, and threonine metabolism pathway is further
confirmed as PSPH (phosphoserine phosphatase) found to be up-regulated
in BRCA. Especially, PSPH acts as a rate-limiting enzyme involved in
serine synthesis from glutamate^[155]87.
Figure 6.
[156]Figure 6
[157]Open in a new tab
Number of metabolic pathways predicted to be directly influenced by
increased glutamine and glutamate metabolism in four cancer types.
These pathways were predicted in cancer gene networks, where nodes
represent up-regulated metabolic genes and edges represent relatedness
between genes, measured by the five linear models and six coexpression
analysis methods.
Gene function prediction
We randomly select 100 out of 1,403 genes involved in the KEGG
metabolism pathways and pretend that there is no prior knowledge with
them, and then we predict their functions through analyzing GO
annotations of other 1,303 genes. This process repeats for 100 times.
First, we use MFR, the other linear models, and coexpression analysis
methods to measure the relatedness between each pair from the 1,403
genes, respectively. For each linear model, as selected genes pretend
to be without prior knowledge, we mainly use expression similarities
(PCC, SRC, MI, PPC, and MI) to calculated their relatedness with other
genes, and set other gene-pair features to be 0.5. The relatedness
measured by different models and methods are normalized as follow: (i)
rank the values of each model or method; and (ii) for each model or
method, replace its values with the corresponding PCC values according
to the ranks. Then the 1,403 genes and their relatedness are used to
construct gene networks for different models and methods, respectively.
In the constructed networks, nodes represent genes, edges represent
relatedness between genes, and measured values of relatedness are used
as the weights of edges. The edges with weights less than 0.6 are
removed based on the procedure in the previous study^[158]66. Finally,
each network contains 1,403 nodes and 14,067 edges.
A broadly used shortest-path method is applied to predict the function
of selected genes. As shown in Fig. [159]7, in MFR-based gene network,
the shortest-path method achieves notably accurate results, where it
successfully calls average 39.81%/26.33% of the selected genes at the
L1/L0 levels. However, it only calls average 39.62%/25.28%,
39.62%/25.28%, 5.08%/4.58%, 10.38%/7.39%, 7.43%/5.41%, 7.71%/6.65%,
0.13%/0.19% and 2.43%/1.35% of the selected genes at the L1/L0 levels
in logit-regression-, LDA-, PCC-, SRC-, PPC-, MI- CMI- and CXP- based
networks, respectively. Overall, the results suggest that MFR
outperforms other models and methods regarding gene function
prediction, as it constructs better networks on genes with prior
knowledge and benefits functional prediction of genes.
Figure 7.
[160]Figure 7
[161]Open in a new tab
Percentages of L0- and L1-matched selected genes in the nine KEGG
metabolic gene networks. In these networks, nodes represent genes
involved in KEGG metabolism pathways, and edges represent relatedness
between genes, measured by the nine models or methods.
Discussion and Conclusion
In this paper, we propose a novel machine learning model for measuring
conditional relatedness between genes, named MFR, by integrating seven
expression similarities and five prior-knowledge based similarities.
Specifically, gene pairs with both high expression similarities and
high prior-knowledge based similarities will be kept and recommended by
our model. At first, we conduct the MFR model in 10-fold
cross-validation. Then we used the MFR model in a test verification and
two further verifications on the GeneFriends and DIP datasets. Finally,
the MFR model is used to construct cancer gene networks and predict
gene functions. All the results are compared with those of other models
or methods (see Table [162]4).
Table 4.
Performances of the nine models or methods for different applications.
Application Evaluation MFR Logit LDA PCC SRC MI PPC CMI CXP
10-fold cross-validation AUC 0.819 0.818 0.818 0.699 0.692 0.664 0.695
0.484 0.686
B0 + B1 0.927 0.916 0.916 0.495 0.469 0.456 0.366 0.152 0.455
Test verification AUC 0.823 0.822 0.822 0.696 0.690 0.658 0.691 0.484
0.682
B0 + B1 0.873 0.856 0.856 0.440 0.518 0.428 0.270 0.172 0.477
GeneFriends verification AUC 0.816 0.821 0.821 0.815 0.764 0.733 0.823
0.484 0.782
B0 + B1 0.962 0.957 0.957 0.571 0.471 0.483 0.613 0.091 0.485
DIP verification AUC 0.727 0.724 0.724 0.604 0.617 0.586 0.602 0.487
0.600
B0 + B1 0.727 0.713 0.713 0.544 0.507 0.519 0.438 0.142 0.463
Constructing a cancer gene network NPP 15 12 14 10 10 11 12 8 10
Predicting gene function L0 + L1 33.07 32.45 32.45 4.83 8.89 6.42 7.18
0.16 1.89
[163]Open in a new tab
B0 + B1 indicates the average value of PPVs of B0- and B1-matched
genes; NPP indicates the number of predicted metabolic pathways;
L0 + L1 indicates the average number of L0- and L1-matched genes
In terms of identifying gene-gene interactions, multi-features models,
such as MFR, logit and LDA performance better than coexpression
analysis methods including PCC, SRC, MI, PPC, CMI, and CXP in the
10-fold cross-validation and verifications. Hence, the models
integrating both expression similarities and prior-knowledge based
similarities can avoid the shortage of using only one kind of
expression similarities. And among those multi-feature models, MFR
performances best in the 10-fold cross-validation, test verification,
and one further verification on DIP datasets (except GeneFriends
datasets), indicating the SVM-based model is more suitable for
resolving the conflict of fitting relatedness supported by coexpression
and those supported by prior knowledge at the same time. It also
notices that MFR has better performances in the datasets containing the
gene pairs extracted from both coexpression data and prior-knowledge
based data (i.e., development and test datasets) and datasets
containing the gene pairs extracted from prior-knowledge based data
(e.g., the DIP datasets). On the contrary, logit and LDA models have
better performances in the datasets containing the gene pairs extracted
from only coexpression data, such as the GeneFriends datasets. In other
words, logit and LDA models prefer gene pairs with high expression
similarities, comparable with MFR. As a result, MFR is relatively good
at detecting gene pairs with both high expression similarities and
prior-knowledge based similarities and obtains the best results in all
the datasets. For a real biological problem, some of the important gene
pairs usually having attributes such as coexpression, like positive
gene pairs collected from coexpression data, and the others typically
have attributes such as PPI, like positive gene pairs collected from
prior-knowledge based data. Additionally, gene pairs with both high
expression similarities and high prior-knowledge based similarities are
more likely the real important interacting gene pairs. So MFR is more
suitable for practical applications, such as biological network
construction and genomic function prediction, and can perform the best
as our results show.
The MFR is fundamentally a regression model, including two kinds of
core elements, features, and model. So, for the next step, we plan to
improve the MFR model on these core elements. First, we will improve
the MFR model through obtaining and using more available and more
accurate prior knowledge, as the MFR has high accuracy and robustness,
and its dependency on the prior-knowledge based similarities make it
adaptable. Second, with the development of deep learning technology,
recently more and more computational methods are constructed based on
deep learning models. As deep learning models automatically learn the
complex functions for mapping input features to output results,
deep-learning-based methods achieve to state-of-the-art accuracy of
many prediction tasks, including image recognition^[164]88–[165]90 and
natural language processing^[166]91–[167]93. Therefore, we will use
deep learning models, such as the deep belief network, to replace SVM
for MFR to improve accuracy and robustness.
Supplementary information
[168]41598_2019_40780_MOESM1_ESM.pdf^ (783.5KB, pdf)
Using Machine Learning to Measure Relatedness Between Genes: A
Multi-Features Model
Acknowledgements