Abstract

   Measuring conditional relatedness between a pair of genes is a
   fundamental technique and still a significant challenge in
   computational biology. Such relatedness can be assessed by gene
   expression similarities while suffering high false discovery rates.
   Meanwhile, other types of features, e.g., prior-knowledge based
   similarities, is only viable for measuring global relatedness. In this
   paper, we propose a novel machine learning model, named Multi-Features
   Relatedness (MFR), for accurately measuring conditional relatedness
   between a pair of genes by incorporating expression similarities with
   prior-knowledge based similarities in an assessment criterion. MFR is
   used to predict gene-gene interactions extracted from the COXPRESdb,
   KEGG, HPRD, and TRRUST databases by the 10-fold cross validation and
   test verification, and to identify gene-gene interactions collected
   from the GeneFriends and DIP databases for further verification. The
   results show that MFR achieves the highest area under curve (AUC)
   values for identifying gene-gene interactions in the development, test,
   and DIP datasets. Specifically, it obtains an improvement of 1.1% on
   average of precision for detecting gene pairs with both high expression
   similarities and high prior-knowledge based similarities in all
   datasets, comparing to other linear models and coexpression analysis
   methods. Regarding cancer gene networks construction and gene function
   prediction, MFR also obtains the results with more biological
   significances and higher average prediction accuracy, than other
   compared models and methods. A website of the MFR model and relevant
   datasets can be accessed from [40]http://bmbl.sdstate.edu/MFR.

Introduction

   Biological functions of a gene are cooperating with others when they
   are in a common cellular environment or the same pathway. Measuring
   relatedness between a pair of genes is increasingly crucial for
   understanding the underlying complex interactions and functional
   relationships of a biological system. Measured relatedness between a
   pair of genes has been routinely used to construct biological
   networks^[41]1–[42]5 and to predict novel genomic
   functions^[43]6–[44]8. The gene-gene interaction is usually modeled as
   a 0/1 (non-interacting/interacting) binary relation between a pair of
   genes, while the relatedness implies a degree of the relationship
   between a pair of genes.

   The relatedness can be measured by two types of features: expression
   similarities and prior-knowledge based similarities. The first type of
   features is usually used to measure the conditional relatedness that is
   the coexpression level between a pair of genes under a certain
   condition, such as in inflammation or tumor tissues, according to the
   correlation between their expression patterns^[45]9–[46]15, including
   but not limited to, Pearson correlation coefficient (PCC)^[47]16,
   Spearman rank correlation (SRC)^[48]17, mutual information
   (MI)^[49]18–[50]21, partial Pearson correlation (PPC)^[51]22–[52]24,
   and conditional mutual information (CMI)^[53]25. Several coexpression
   databases have been constructed based on a wild range of available
   expression data, e.g., the COXPRESdb^[54]26 and the GeneFriends^[55]27.
   The second type of features is usually used to measure gene-gene
   relatedness using the documented biological data and functional
   annotations in public domain^[56]28–[57]30, e.g., the gene function
   database Gene Ontology (GO)^[58]31, the homologous gene database
   orthoDB^[59]32, the biological pathway databases KEGG^[60]33 and
   Reactome^[61]34,[62]35, the protein-protein interaction (PPI) databases
   HPRD^[63]36 and DIP^[64]37, and the transcriptional regulatory
   databases HTRIdb^[65]38 and TRRUST^[66]39.

   However, there is still a considerable room for improvement of accuracy
   and robustness in measuring gene-gene conditional relatedness by
   expression similarities and prior-knowledge based similarities,
   respectively. First, the accuracy of using expression similarities need
   to be improved. For example, PCC is known to have a high false
   discovery rate, especially when the sample size is small, as mentioned
   in refs^[67]40,[68]41, which severely impacts the results of further
   computational analysis and biological interpretations. As there are
   over 400 million gene pairs in human, a slight increase in false
   discovery rate would bring an over-estimated number of gene-gene
   interactions. Second, the robustness of using prior-knowledge based
   similarities need to be improved as they are only viable for measuring
   global relatedness^[69]28–[70]30. Their experiments are usually
   conducted in a common environment, making prior-knowledge based
   similarities are not suitable for measuring gene-gene conditional
   relatedness.

   Here, we propose a novel machine learning model, Multi-Features
   Relatedness (MFR), for measuring conditional relatedness between a pair
   of genes with an assessment criterion. The goal of MFR is to accurately
   measure conditional relatedness between genes by integrating expression
   similarities and prior-knowledge based similarities. Specifically, a
   gene pair with a low expression similarity will be given a low rank
   even though they have a high prior-knowledge based similarity, as their
   relations are not specified under current condition from our point of
   view; and a gene pair with a high expression similarity and a low
   prior-knowledge based similarity will also be scored a low rank, as it
   tends to be a false discovery prediction in coexpression analysis. Gene
   pairs with both high expression similarities and high prior-knowledge
   based similarities will be kept and recommended by this model.
   Intuitively, the problem can be formulated into a single-objective
   generalized linear logit regression problem under the following
   hypotheses: (i) fitting of relatedness supported by expression
   similarities is equal to fitting of relatedness supported by
   prior-knowledge based similarities; (ii) both features contribute to
   fitting on the same level; and (iii) the fitting target relatedness are
   0/1 (non-interacting/interacting). We use support vector machine
   (SVM)^[71]42 with the linear kernel to solve this regression problem
   and optimize suitable parameters of relevant features. MFR is used to
   predict gene-gene interactions extracted from the COXPRESdb, KEGG, and
   TRRUST databases and a benchmark dataset of Pan et al.’s study^[72]43
   by the 10-fold cross validation and test verification, and to identify
   gene-gene interactions collected from the GeneFriends and DIP databases
   for further verification. The results show that MFR achieves the
   highest area under curve (AUC) values for identifying gene-gene
   interactions in the development, test and DIP datasets. Specifically,
   it obtains an improvement of 1.1% on average of precision for detecting
   gene pairs with both high expression similarities and high
   prior-knowledge based similarities in all datasets, comparing to other
   linear models and coexpression analysis methods. In terms of cancer
   gene networks construction and gene function prediction, MFR also
   obtains the results with more biological significances and higher
   average prediction accuracy than other compared models and methods.

Materials and Methods

MFR workflow

   There are five steps in the MFR workflow as shown in Fig. [73]1: (i)
   gene pair samples collection from the COXPRESDdb, KEGG and TRRUST
   databases and a benchmark dataset from a published study^[74]43; (ii)
   gene features extraction from the GEO, GO and orthoDB databases for
   assessing similarity-based gene pair features; (iii) 12
   similarity-based gene pair features calculation using four gene
   features and the Reactome and HTRIdb databases; (iv) SVM-based model
   construction by a 10-fold cross validation, where our model is
   repeatedly trained by 81% gene pairs and developed by other 9% in 10
   times; and (v) application of the developed model to detect gene-gene
   interactions in the remaining 10% gene pairs and the other two
   verification datasets (the GeneFrineds and DIP datasets), construct
   cancer gene network, and predict gene functions. The results are
   compared with other linear models and coexpression analysis methods,
   including logit regression, linear discriminant analysis (LDA)^[75]44,
   PCC, SRC, MI, PPC, and CMI. The trained MFR model is saved as an R
   data, and the datasets and the results of the current study can be
   freely downloaded at [76]http://bmbl.sdstate.edu/MFR for academic uses,
   further verification, and biological analysis.

Figure 1.

   [77]Figure 1
   [78]Open in a new tab

   Workflow of MFR model. Five steps are in the workflow, including (i)
   gene pair samples collection, (ii) gene features extraction, (iii) gene
   pair features calculation, (iv) SVM model construction and (v)
   verification and discussion.

Model construction dataset

   The gene-pair dataset for MFR model training, development and test is
   composed of the coexpression and prior-knowledge sub datasets. The
   former one is retrieved from the COXPRESdb database, where the
   positives and negatives are the coexpressed and discoexpressed gene
   pairs, respectively; and the latter one is made up by the KEGG, PPI,
   and TRRUST sub-sub datasets, where the positives are the gene pairs
   composed by genes involved in the same pathways, with PPIs or
   transcriptional regulation relationships, and negatives are the gene
   pairs composed by genes involved in different pathways, without PPIs
   and transcriptional regulation relationships. The structure of each sub
   dataset and sub-sub dataset are listed in Table [79]1. Some of the
   negative gene pairs are obtained by permutation of the positives, and
   then selected randomly to make sure the same number of positives for
   construction of a model with high generalization. To keep the bias from
   random permutation and selection, we repeat the process of the dataset
   generation by 100 times giving rise to 100 datasets. Each of these
   datasets is used to train, develop and test, and the average AUC value
   and positive predictive value (PPV) are calculated to develop suitable
   hyperparameters and compared to other models or methods, where the
   training, development and test sets are obtained according to the
   detailed proportion of the sub and sub-sub datasets. In each of the 100
   datasets, we obtain 67,000 positive gene pairs and 74,560 negative gene
   pairs. It notices that the numbers of the positive and negative gene
   pairs are counted after remove the gene pairs without enough gene-pair
   features. Also, the fitting target MFR values for the positive gene
   pairs are marked as 1 s and those for the negatives as 0 s. The
   detailed information can be found in the following sub-sections.

Table 1.

   Structure of MFR dataset.
   sub dataset Coexpression Prior-knowledge based
   sub-sub dataset KEGG PPI TURRUST
   Resource database The COXPRESdb^[80]26 The KEGG^[81]33 Ref.^[82]43 The
   TRRUST^[83]39
   Type of gene pair Positive Negative Positive Negative Positive Negative
   Positive Negative
   Sample size 30,353 29,607 13,386 13,386 18,227 26,533 5,034 5,034
   [84]Open in a new tab

The coexpression sub dataset

   In the COXPRESdb database, PCC for each gene pair is transferred to the
   Mutual Rank (MR) value^[85]45. The smaller of an MR value, the higher
   coexpression intensity of the corresponding gene pairs have, and the
   coexpressed genes of a specific target gene are ranked by their MR
   values in increasing order. For each gene, we select the first 50 genes
   in its coexpressed gene list to compose 50 coexpressed gene pairs from
   the Hsa-m.c4-1 and Mmu-m.c3-1 datasets, respectively. Then the commonly
   coexpressed gene pairs in both datasets are used as the positive gene
   pairs. In avoid of the imbalance issue between the positive and
   negative gene pairs, we select 80 genes in the middle of the
   coexpressed gene list for each gene to compose 80 discoexpressed gene
   pairs from the Hsa-m.c4-1 and Mmu-m.c3-1 datasets, respectively. Then
   the commonly discoexpressed gene pairs in both datasets are used as the
   negative gene pairs, where PCC of each negative gene pair is around 0.
   In total, there are 30,353 positive gene pairs and 29,607 negative gene
   pairs generated in the coexpression sub dataset.

The prior-knowledge sub dataset

   The prior-knowledge sub dataset is composed by the KEGG, PPI, and
   TRRUST sub-sub datasets, and the collection of gene pairs are listed as
   follows.

   (A) The KEGG sub-sub dataset. The genes and pathways in metabolism,
   genetic information processing, environmental information processing,
   and cellular processes are downloaded from the KEGG database. The
   13,386 positive gene pairs are composed by the genes involved in at
   least three same KEGG pathways, and the 13,386 negatives are randomly
   selected gene pairs composed by the genes involved in different KEGG
   pathways to keep the balanced number between the positive and the
   negative gene pairs.

   (B) The PPI sub-sub dataset. This dataset is collected from the
   study^[86]43, which has been used as the standard test set for PPI
   prediction^[87]46–[88]48, as its reasonable sampling and the balanced
   number between the positive and negative gene pairs. The 18,227
   positive gene pairs are the ones with PPIs from the HPRD database, and
   the 26,533 negatives are composed of genes located in different
   organelles, in addition to those gene pairs without PPIs proved by
   experiments, which are collected from the Negatome database^[89]49.

   (C) The TRRUST sub-sub dataset. The 5,034 gene pairs with
   transcriptional regulatory relationships from the TRRUST dataset are
   used as the positive gene pairs. Then we randomly permutate
   transcription-factor genes with regulated genes as the negative gene
   pairs, making sure to obtain 5,034 negatives as the same number as the
   positives.

Gene features

   MFR uses 12 similarity-based gene pair features to assess conditional
   relatedness between a pair of genes. Ten out of these 12 features are
   calculated using four gene features, including gene-expression level,
   GO annotation, homologous profile, and subcellular localization. More
   details are listed as follows.

Expression data

   Six hundred two datasets with 15,679 samples from the GEO
   database^[90]50 based on the unique Affymetrix Human Genome U133 Plus
   2.0 Array platform (released on Dec. 2017) are used as expression data
   source. Then the pre-processing steps are executed, including log2
   scale and quantile normalization. After removing genes without the
   UniProt IDs^[91]51, 16,391 protein-coding genes in human are retained
   for further expression data analysis.

Gene ontology data

   The GO annotations for human genes are obtained from the GO database
   (435,975 annotations released on Dec. 2017). Only 43,340 biological
   process GO terms with experimental evidence are used as functional
   annotations for genes in our study. The structure of these GO terms can
   be described as a tree, where the relationships among GO terms fall
   into four categories: “is a”, “part of”, “has part” and “regulates”.
   However, we only use 456,781 “is a” relations to assess the GO
   similarity between genes

Homologous data

   Over 22 million genes from over 5,000 species, including 169,376 human
   homologous genes from 20 species, are used to construct the homologous
   profile data by the orthoDB database (version 9.1).

Subcellular localization data

   A total of 160,537 cellular component annotations of human genes from
   the GO database (released on Dec. 2017) are used as the subcellular
   source to measure subcellular localization similarity between a pair of
   genes.

Verification and discussion resources

   Besides a test verification, we compare MFR with other linear models
   (logit regression, LDA) and coexpression analysis methods (PCC, SRC,
   MI, PPC, and CMI), regarding performances in further verification for
   the GeneFriends and DIP datasets, construction of cancer gene network,
   and prediction of KEGG metabolomic gene functions. These resources are
   described as follows.

The GeneFriends and DIP datasets

   With the elimination of gene pairs without enough gene-pair features,
   overall 9,146 coexpressed gene pairs with top 20 PCC values for each
   gene from the GeneFriends database are used as the positive gene pairs.
   Considering real coexpressed gene pairs are rare in the whole human
   genome, the 9,146 randomly selected negative gene pairs generated by
   permutation of the first and the second genes in the positive gene
   pairs. Similarly, a total of 1,489 gene pairs with the direct-PPIs from
   the DIP database (leased on Dec. 2017) are used as the positive gene
   pairs. The negative gene pairs are 1,489 randomly selected gene pairs
   composed by permutating the first and the second genes in positive gene
   pairs. Because the negative gene pairs in the GeneFriends and DIP
   datasets are both generated by random permutation and selection. To
   avoid the bias of such random sampling, we repeat the whole dataset
   generation process for 100 times giving rise to 100 GeneFriends and 100
   DIP datasets, respectively. The average AUC value and PPV of each of
   the 100 datasets are used to compare models or methods in verification.

Cancer gene-expression data

   The RNA-seq data of four cancer types are downloaded from the TCGA
   database^[92]52, each having at least ten cancer samples and ten normal
   samples, with more details showcased in Table [93]2. Before further
   analysis, this expression data is pre-processed, including log2 scale
   and quantile normalization.

Table 2.

   Sample size of RNA-seq data for four cancer types.
                  Type                 Cancer (Samples) Normal (Samples)
   Bladder urothelial carcinoma (BLCA) 408              19
   Breast invasive carcinoma (BRCA)    1095             113
   Colon adenocarcinoma (COAD)         285              41
   Lung adenocarcinoma (LUAD)          515              59
   [94]Open in a new tab

KEGG metabolic genes

   In total, 1,403 genes of 84 metabolic pathways from the KEGG database
   are used to compare different models and methods regarding predicting
   gene functions. Specifically, 100 out of these genes are randomly
   selected as the genes without any prior knowledge, and then their
   functions are predicted by analyzing functional annotations of other
   1,303 genes. Such a process is repeated for 100 times, and the average
   prediction rates are used to indicate the capability for gene function
   prediction.

Gene pair feature calculation

   While traditional coexpression analysis methods use a signal type of
   features to measure conditional relatedness between genes, MFR uses
   multi-features including both expression similarities and
   prior-knowledge based similarities. Twelve similarity-based gene pair
   features are used in MFR which are defined as follows.

Seven features based on expression similarities

   We firstly use average expression levels of each gene, exp1, and exp2,
   as the first two features for a gene pair. The following five features
   are a gene pair’s coexpression levels measured by PCC, SRC, PPC, MI,
   and CMI. PCC is used to measure linear coexpression relationship; SRC
   and MI are used to measure non-linear coexpression relationship, where
   different from SRC based on ranks, MI determines how similar the joint
   distribution of two genes’ expression levels is to the products of
   factored marginal distribution for indicating the association between
   their expressions. PPC is used to measure direct linear coexpression,
   which is the coexpression relationship between a pair of genes measured
   avoiding any influence of other genes; Similarly, CMI is used to
   measure direct non-linear coexpression.

One feature based on the gene ontology similarity

   The GO similarity (goSim) is used as the eighth feature because the
   genes with interaction are considered being involved in the similar
   biological process. It can be defined as:
   [MATH:
   <mi>g</mi><mi>o</mi><mi>S</mi><mi>i</mi><msub><mrow><mi>m</mi></mrow><m
   row><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub><mo>=</mo><mi>m</mi><mi
   >a</mi><msub><mrow><mi>x</mi></mrow><mrow><mi>o</mi><mo>∈</mo><msub><mr
   ow><mi>O</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>,</mo><mi>q</mi><
   mo>∈</mo><msub><mrow><mi>O</mi></mrow><mrow><mi>j</mi></mrow></msub></m
   row></msub><mfrac><mrow><mn>2</mn><mo>×</mo><mspace
   width=".25em"></mspace><mi>log</mi><mo
   stretchy="false">(</mo><mi>P</mi><mi>m</mi><mi>s</mi><mo
   stretchy="false">(</mo><mi>o</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo><mo
   stretchy="false">)</mo></mrow><mrow><mi>log</mi><mo
   stretchy="false">(</mo><mi>P</mi><mo
   stretchy="false">(</mo><mi>o</mi><mo stretchy="false">)</mo><mo
   stretchy="false">)</mo><mo>+</mo><mspace
   width=".25em"></mspace><mi>log</mi><mo
   stretchy="false">(</mo><mi>P</mi><mo
   stretchy="false">(</mo><mi>q</mi><mo stretchy="false">)</mo><mo
   stretchy="false">)</mo></mrow></mfrac> :MATH]
   1
   [MATH: <mi>P</mi><mi>m</mi><mi>s</mi><mo
   stretchy="false">(</mo><mi>o</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo><mo>=</mo><mi>m</mi><mi>i</mi><msub><mrow><mi>n<
   /mi></mrow><mrow><mi>c</mi><mo>∈</mo><mi>A</mi><mo
   stretchy="false">(</mo><mi>o</mi><mo>,</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msub><mi>P</mi><mo
   stretchy="false">(</mo><mi>c</mi><mo stretchy="false">)</mo> :MATH]
   2
   [MATH: <mi>P</mi><mo stretchy="false">(</mo><mi>o</mi><mo
   stretchy="false">)</mo><mo>=</mo><mspace
   width=".25em"></mspace><mfrac><mrow><mrow><mo
   stretchy="true">|</mo><mrow><mi>D</mi><mo
   stretchy="false">(</mo><mi>o</mi><mo stretchy="false">)</mo></mrow><mo
   stretchy="true">|</mo></mrow><mo>+</mo><mn>1</mn></mrow><mrow><mrow><mo
   stretchy="true">|</mo><mrow><mi>D</mi><mo
   stretchy="false">(</mo><mi>r</mi><mi>o</mi><mi>o</mi><mi>t</mi><mo
   stretchy="false">)</mo></mrow><mo
   stretchy="true">|</mo></mrow><mo>+</mo><mn>1</mn></mrow></mfrac> :MATH]
   3

   where O[i] and O[j] indicates the GO term sets used for annotating gene
   i and j, respectively; A(o, q) is the common ancestor set of GO term o
   and q; P(o) is the probability of a gene annotated by an instance of GO
   term o^[95]53; D(o) and D(root) indicate the descendant GO term sets of
   GO term o, and the root GO term, respectively.

One feature based on subcellular localization similarity

   The ninth feature, subcellular localization similarity (lcSim), is used
   to calculate the probability for two protein-coding genes appearing in
   a common organelle. It can be defined as:
   [MATH:
   <mi>l</mi><mi>c</mi><mi>S</mi><mi>i</mi><msub><mrow><mi>m</mi></mrow><m
   row><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub><mo>=</mo><mfrac><mrow>
   <mo
   stretchy="false">|</mo><msub><mrow><mi>L</mi></mrow><mrow><mi>i</mi></m
   row></msub><mo>∩</mo><msub><mrow><mi>L</mi></mrow><mrow><mi>j</mi></mro
   w></msub><mo stretchy="false">|</mo></mrow><mrow><mo
   stretchy="false">|</mo><msub><mrow><mi>L</mi></mrow><mrow><mi>i</mi></m
   row></msub><mo>∪</mo><msub><mrow><mi>L</mi></mrow><mrow><mi>j</mi></mro
   w></msub><mo stretchy="false">|</mo></mrow></mfrac> :MATH]
   4

   where L[i] and L[j] are the subcellular localization sets of proteins
   encoded by the genes i and j, repressively.

One feature based on homology similarity

   Since common presence and absence of two genes in many species suggest
   a potential functional relatedness between them, the homology
   similarity (hgSim) is used as the 10th feature calculated using an
   improved Pearson correlation method^[96]54 as:
   [MATH: <mi
   mathvariant="normal">hg</mi><mi>S</mi><mi>i</mi><msub><mrow><mi>m</mi><
   /mrow><mrow><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub><mo>=</mo><mfra
   c><mrow><mi>N</mi><mo>×</mo><mi>M</mi><mo>−</mo><msub><mrow><mi>n</mi><
   /mrow><mrow><mi>i</mi></mrow></msub><mo>×</mo><msub><mrow><mi>n</mi></m
   row><mrow><mi>j</mi></mrow></msub></mrow><mrow><msqrt><mrow><mo
   stretchy="false">(</mo><mi>N</mi><mo>×</mo><msub><mrow><mi>n</mi></mrow
   ><mrow><mi>i</mi></mrow></msub><mo>−</mo><msubsup><mrow><mi>n</mi></mro
   w><mrow><mi>i</mi></mrow><mrow><mn>2</mn></mrow></msubsup><mo
   stretchy="false">)</mo><mo>×</mo><mo
   stretchy="false">(</mo><mi>N</mi><mo>×</mo><msub><mrow><mi>n</mi></mrow
   ><mrow><mi>j</mi></mrow></msub><mo>−</mo><msubsup><mrow><mi>n</mi></mro
   w><mrow><mi>j</mi></mrow><mrow><mn>2</mn></mrow></msubsup><mo
   stretchy="false">)</mo></mrow></msqrt></mrow></mfrac> :MATH]
   5

   where n[i] and n[j] are the numbers of species whose genome contains
   the orthologous genes of gene i and j, respectively; N = 21 is the
   total number of species we use, and M is the number of species whose
   genome contains both orthologous genes of gene i and j.

One feature based on Reactome similarity

   Overall, 202,772 gene-gene interactions derived from the Reactome
   pathways are used to construct an unweighted graph, in which nodes
   represent genes and edges represent interactions between genes. The
   normalized distance of a gene pair is used as the 11th feature named as
   Reactome similarity (rxSim), which is defined as:
   [MATH:
   <mi>r</mi><mi>x</mi><mi>S</mi><mi>i</mi><msub><mrow><mi>m</mi></mrow><m
   row><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub><mo>=</mo><mn>1</mn><mo
   >−</mo><mfrac><mrow><mi>d</mi><mi>i</mi><msub><mrow><mi>s</mi></mrow><m
   row><mi>i</mi><mo>,</mo><mi>j</mi></mrow></msub></mrow><mrow><mi>d</mi>
   <mi>i</mi><msub><mrow><mi>s</mi></mrow><mrow><mi>m</mi><mi>a</mi><mi>x<
   /mi></mrow></msub></mrow></mfrac> :MATH]
   6

   where dis[i,j] is the shortest distance between gene i and j, and
   dis[max] is the shortest distance between the farthest gene pair in the
   graph.

One feature based on transcriptional regulatory similarity

   Totally 284 transcription factors, 18,302 regulated genes, and 51,871
   transcriptional regulatory interactions between them are obtained from
   the HTRIdb database. If there is a record that a gene pair has a
   transcriptional regulatory interaction, the transcriptional regulatory
   similarity (trSim) used as the 12th feature of this gene pair is 1,
   otherwise is 0.

SVM model construction

   MFR is designed based on SVM, which is a supervised learning model,
   with associated learning algorithms for classification and regression
   analysis. The motivation is to classify data by using the best
   hyperplane that is the one that represents the most extensive
   separation, or margin, between two classes. We take a total of 12
   similarity-based gene pair features as input, and the output value as
   an assessment criterion, namely MFR, for detecting the conditional
   relatedness between a pair of genes (see Fig. [97]2). For model
   training, we provide the target MFR values (labels) marked as 1 s and
   0 s for the positive and the negative gene pairs, respectively. Given
   X = {x[1], x[2], …, x[n]} and Y = {y[1], y[2], …, y[n]}, where x[i] and
   y[i] indicates the vector of 12 similarity-based gene-pair features and
   the target MFR value (label) of the ith gene pair, repressively, the
   MFR model can be construction by conduction Formula (7):
   [MATH: <mtable columnalign="left"><mtr><mtd><mi
   mathvariant="italic">ma</mi><msub><mrow><mi
   mathvariant="italic">x</mi></mrow><mrow><mi>α</mi></mrow></msub><mrow><
   mo
   stretchy="true">(</mo><mrow><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=
   </mo><mn>1</mn></mrow><mi>n</mi></munderover><msub><mrow><mi>α</mi></mr
   ow><mrow><mi>i</mi></mrow></msub><mo>−</mo><mfrac><mn>1</mn><mn>2</mn><
   /mfrac><munderover><mo>∑</mo><mrow><mi>i</mi><mo>,</mo><mi>j</mi><mo>=<
   /mo><mn>1</mn></mrow><mi>n</mi></munderover><msub><mrow><mi>α</mi></mro
   w><mrow><mi>i</mi></mrow></msub><msub><mrow><mi>α</mi></mrow><mrow><mi>
   j</mi></mrow></msub><msub><mrow><mi>y</mi></mrow><mrow><mi>i</mi></mrow
   ></msub><msub><mrow><mi>y</mi></mrow><mrow><mi>j</mi></mrow></msub><msu
   bsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow><mrow><mi
   >T</mi></mrow></msubsup><mo>⋅</mo><msub><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>j</mi></mrow></msub></
   mrow><mo
   stretchy="true">)</mo></mrow></mtd></mtr><mtr><mtd><mi>s</mi><mo>.</mo>
   <mi>t</mi><mo>.</mo><munderover><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn
   >1</mn></mrow><mi>n</mi></munderover><msub><mrow><mi>α</mi></mrow><mrow
   ><mi>i</mi></mrow></msub><msub><mrow><mi>y</mi></mrow><mrow><mi>i</mi><
   /mrow></msub><mo>=</mo><mn>0</mn><mo>;</mo><mn>0</mn><mo>≤</mo><msub><m
   row><mi>α</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>≤</mo><mi>C</mi>
   <mo>,</mo><mi>i</mi><mo>=</mo><mn>1</mn><mo>,</mo><mn>2</mn><mo>,</mo><
   mi>…</mi><mo>,</mo><mi>n</mi></mtd></mtr></mtable> :MATH]
   7

   where α = {α[1],α[2], …, α[n]} indicates Lagrange multipliers, which
   are solved by SMO (sequential minimal optimization)^[98]55. Then
   predicted
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>M</mi><mi>F</mi><mi>R</mi></mrow><mrow><mo
   stretchy="true">^</mo></mrow></mover></mrow><mrow><mi>i</mi></mrow></ms
   ub> :MATH]
   value of ith gene pairs is defined as:
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>M</mi><mi>F</mi><mi>R</mi></mrow><mrow><mo
   stretchy="true">^</mo></mrow></mover></mrow><mrow><mi>i</mi></mrow></ms
   ub><mo>=</mo><mi>s</mi><mi>i</mi><mi>g</mi><mi>m</mi><mi>o</mi><mi>i</m
   i><mi>d</mi><mrow><mo
   stretchy="true">(</mo><mrow><munderover><mo>∑</mo><mrow><mi>j</mi><mo>=
   </mo><mn>1</mn></mrow><mi>n</mi></munderover><msub><mrow><mi>α</mi></mr
   ow><mrow><mi>j</mi></mrow></msub><msub><mrow><mi>y</mi></mrow><mrow><mi
   >j</mi></mrow></msub><msubsup><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>j</mi></mrow><mrow><mi
   >T</mi></mrow></msubsup><mo>⋅</mo><msub><mrow><mi
   mathvariant="bold-italic">x</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>+</mo><mover accent="true"><mi>b</mi><mo>ˆ</mo></mover></mrow><mo
   stretchy="true">)</mo></mrow> :MATH]
   8

   where
   [MATH: <mover accent="true"><mi>b</mi><mo>ˆ</mo></mover> :MATH]
   indicates the bias defined as ref.^[99]42.

Figure 2.

   [100]Figure 2
   [101]Open in a new tab

   Structure of the MFR model. The model is based on SVM and uses 12
   similarity-based gene pair features as input; and the output value,
   namely MFR, is applied as an assessment criterion for measuring
   conditional relatedness between genes.

   Because there are not enough positive gene pairs with both high
   expression similarities and high prior-knowledge based similarities for
   directly training, we collect positive gene pairs with high expression
   similarities and the corresponding negatives to compose the
   coexpression sub dataset. Similarly, we collect positive gene pairs
   with high priori-knowledge based similarities and the corresponding
   negatives to compose the priori-knowledgesub dataset. Then MFR is
   trained by gene pairs in the whole dataset including both coexpression
   and prior-knowledge sub datasets at the same time to provide our model
   the capability for identification of gene pairs with both high
   expression similarities and high prior-knowledge based similarities,
   rather than trained by coexpression sub dataset or prior-knowledge sub
   dataset separately. And a higher MFR value indicates that two genes are
   more likely to be interacting with each other. In detail, we employ
   LIBSVM^[102]56 with the linear kernel to implement our model.

   MFR is constructed by the 10-fold cross validation, in which we use 81%
   of the gene pairs for training and 9% for development. The procedure is
   repeated by 10 times. The hyperparameters with the highest average AUC
   value of the whole cross-validation are selected. Then we use the rest
   10% gene pairs to conduct test verification. The result of our model in
   the 10-fold cross validation and test verification is compared with
   those of other linear models or coexpression analysis methods as shown
   in Results (see Figs [103]3 and [104]4). After training and
   development, the weight w[S] of the gene-pair feature S is finalized as
   w[exp1] = −0.810, w[exp2] = −0.807, w[PCC] = −0.017, w[SRC] = 0.840,
   w[MI] = 4.875, w[PPC] = 2.414, w[CMI] = −0.055, w[goSim] = 0.972,
   w[lcSim] = 1.198, w[hgSim] = 0.433, w[rxSim] = 0.544 and
   w[trSim] = 0.668, indicating MI, PPC, goSim and lcSim are the most
   important gene-pair features for MFR model, while PCC and CMI are the
   least important. The top four important features contain two expression
   similarities and two priori-knowledge based similarities indicating
   both kinds of features contribute to accurately measuring relatedness
   of a pair of genes. MI and PPC obtain the largest weights among
   expression similarities maybe because, before calculation of MI, the
   expression levels of genes are discretized according to study^[105]57,
   making MI get stronger robustness on the noise of gene expressions, and
   PPC has more complementarity with MI compared with other expression
   similarities, as other expression similarities, specially PCC and CMI,
   have some resemblance with MI^[106]58,[107]59. The larger weights of
   goSim and lcSim than other priori-knowledge based similarities indicate
   two genes with the related functions and the similar organelle
   locations mostly have a strong relatedness. The negative weights of
   exp1 and exp2 indicate the punishment of the exorbitant expression, as
   two of the genes in a gene pair are very hard to have exorbitant
   expressions both, and the exorbitant expression of a gene usually
   implies a gap of expression with the other gene, indicating a low
   relatedness between these genes.

Figure 3.

   [108]Figure 3
   [109]Open in a new tab

   (A) ROCs of nine models or methods for identifying gene-gene
   interactions by the 10-fold cross-validation. (B) Average PPVs of nine
   models or methods for detecting B0/B1 matched gene pairs by 10-fold
   cross-validation.

Figure 4.

   [110]Figure 4
   [111]Open in a new tab

   ROCs of nine models or methods for identifying gene-gene interactions
   in the (A) test, (C) GeneFriends and (E) DIP datasets. Average PPVs of
   nine models or methods for detecting B0/B1 matched gene pairs in the
   (B) test, (D) GeneFriends and (F) DIP datasets.

Performance evaluation

   We compare the performances of MFR with other two linear models, i.e.,
   logit regression and LDA^[112]44, and five coexpression analysis
   methods (PCC, SRC, PPC, MI, and CMI). We choose logit regression and
   LDA because they are widely used multi-features generalized linear
   logit regression models^[113]60–[114]63. And the five coexpression
   analysis methods are selected since they are traditional methods in
   measuring conditional relatedness between a pair of
   genes^[115]16–[116]25. To make a fair comparison with linear models
   with multi-features, we also add the sixth coexpression analysis
   method, so-called CXP, which is the integration of PCC, SRC, PPC, MI,
   and CMI. Specially, the average value of these five methods is used as
   the assessment criterion of CXP, comparable with the result from other
   multi-features methods, such as MFR, logit, and LDA. First, we compare
   different models and methods in detecting gene-gene interactions on
   verification datasets using the receiver operating characteristic curve
   (ROC)^[117]64, where gene-gene interactions indicate positive gene
   pairs with high expression similarities or high prior-knowledge based
   similarities. And we use PPV^[118]65 to compare different models and
   methods in identifying gene pairs with both high expression
   similarities and high prior-knowledge similarities as defined in
   Section 2.7.2. Then, we conduct pathway enrichment analysis to identify
   the pathways significantly influenced by the increased glutamine and
   glutamate metabolism, on gene modules identified in cancer gene
   networks, where nodes represent up-regulated genes and edges show
   relatedness measured by each model or method. Finally, the
   shortest-path method^[119]66 is applied to predict functions of genes
   pretending to have no prior knowledge, on the KEGG metabolic gene
   networks, where nodes represent genes involved in KEGG metabolism
   pathways and edges represent relatedness calculated using different
   models and methods, respectively.

Receiver operating characteristic curve

   The ROC curve with its area under the curve (AUC) is a widely used
   evaluation tool for performance comparison of different methods. It is
   made by plotting true positive rate (TPR) against false positive rate
   (FPR), which are defined as:
   [MATH: <mi>T</mi><mi>P</mi><mi>R</mi><mo
   stretchy="false">(</mo><mi>n</mi><mo
   stretchy="false">)</mo><mo>=</mo><mfrac><mrow><mi>T</mi><mi>P</mi><mo
   stretchy="false">(</mo><mi>n</mi><mo
   stretchy="false">)</mo></mrow><mi>P</mi></mfrac> :MATH]
   9
   [MATH: <mi>F</mi><mi>P</mi><mi>R</mi><mo
   stretchy="false">(</mo><mi>n</mi><mo
   stretchy="false">)</mo><mo>=</mo><mfrac><mrow><mi>F</mi><mi>P</mi><mo
   stretchy="false">(</mo><mi>n</mi><mo
   stretchy="false">)</mo></mrow><mi>N</mi></mfrac> :MATH]
   10

   where TP(n) indicates the true positive among top n ranked gene-gene
   interactions, FP(n) indicates the false positive among top n ranked
   gene-gene interactions, P indicates the total number of interacting
   gene pairs, and N indicates the total number of non-interacting gene
   pairs.

Positive predictive value

   The positive predictive value (PPV), so-called precision, is an
   intuitive indicator for evaluating prediction results among models, and
   a high value of PPV indicates the accuracy of a model. PPV is defined
   as:
   [MATH:
   <mi>P</mi><mi>P</mi><mi>V</mi><mo>=</mo><mfrac><mrow><mi>T</mi><mi>P</m
   i></mrow><mrow><mi>T</mi><mi>P</mi><mo>+</mo><mi>F</mi><mi>P</mi></mrow
   ></mfrac> :MATH]
   11

   where TP and FP are the true positive and the false positive among gene
   pairs, respectively.

   As it is very hard to give a precise definition of a gene pair with
   both high expression similarities and high prior-knowledge based
   similarities, we define a gene pair labeled a B0 match if its PCC or
   SRC values larger than 0.5 and the goSim and lcSim values larger than
   0.5; and labeled a B1 match if its PCC or SRC values larger than 0.3
   and the goSim and lcSim values larger than 0.3. And then the PPV of top
   5% ranked gene pairs against B0 matched gene pairs, and the PPV of the
   top 10% ranked gene pairs against B1 match gene pairs are used to
   approximately compare models in terms of prediction of gene pairs with
   both high expression similarities and high prior-knowledge based
   similarities.

Up-regulated genes identification

   A gene is identified to be up-regulated if the fold-change between the
   average expression level in cancer samples and that in normal samples
   is greater than 1.5 and with a q-value < 0.05 measured by the limma
   t-test^[120]67.

Fast greedy modularity optimization method

   In the study^[121]68, a method was proposed to find modules in networks
   by greedy optimization of modularity^[122]69. The fast-greedy
   modularity optimization method^[123]70 performs the same greedy
   optimization as the method of^[124]68, but it runs much faster due to
   the lower computational cost.

Pathway enrichment analysis

   Pathway enrichment analysis is conducted over a given set C of
   up-regulated genes against the pathways in KEGG. The statistic
   significant p-value of gene set C with n genes enriching pathway P with
   K genes can be defined as:
   [MATH: <mi>P</mi><mi>v</mi><mi>a</mi><mi>l</mi><mi>u</mi><mi>e</mi><mo
   stretchy="false">(</mo><mi>k</mi><mo
   stretchy="false">)</mo><mo>=</mo><mn>1</mn><mo>−</mo><munderover><mo>∑<
   /mo><mrow><mi>i</mi><mo>=</mo><mn>0</mn></mrow><mrow><mi>k</mi><mo>−</m
   o><mn>1</mn></mrow></munderover><mfrac><mrow><mrow><mo
   stretchy="true">(</mo><mrow><mtable><mtr><mtd><mi>K</mi></mtd></mtr><mt
   r><mtd><mi>i</mi></mtd></mtr></mtable></mrow><mo
   stretchy="true">)</mo></mrow><mrow><mo
   stretchy="true">(</mo><mrow><mtable><mtr><mtd><mrow><mi>N</mi><mo>−</mo
   ><mi>K</mi></mrow></mtd></mtr><mtr><mtd><mrow><mi>n</mi><mo>−</mo><mi>i
   </mi></mrow></mtd></mtr></mtable></mrow><mo
   stretchy="true">)</mo></mrow></mrow><mrow><mo
   stretchy="true">(</mo><mrow><mtable><mtr><mtd><mi>N</mi></mtd></mtr><mt
   r><mtd><mi>n</mi></mtd></mtr></mtable></mrow><mo
   stretchy="true">)</mo></mrow></mfrac> :MATH]
   12

   where N = 18,420 is the total number of human genes and k is the number
   of genes in
   [MATH: <mi>C</mi><mo>∩</mo><mi>P</mi> :MATH]
   . Then the p-value is adjusted to be a q-value to restrict the false
   discovery rate^[125]71. And we consider the C enriches P if
   q-value < 0.01.

Shortest-path method

   For identifying all the genes with GO annotations on the shortest path,
   the shortest-path method^[126]66 is applied to find the lowest common
   ancestor of their GO annotations. If the ancestor is less than three
   levels below the root of the GO tree, it is assigned to the genes
   without any GO annotation on the shortest path as their functions. A
   gene is labeled a L0 match if one of the predicted GO annotations is
   its known GO annotation and labeled a L1 match if one of the predicted
   GO annotations is its known GO annotations’ direct parents^[127]66.
   Then L0 and L1 match ratios relative to the total number of genes
   without any GO annotations are used to compare each model or method
   regarding gene function prediction.

Results

10-fold cross-validation

   We compare the precision of identifying gene-gene interactions by MFR
   with the other linear models and coexpression analysis methods on the
   development datasets. The ROC results by the 10-fold cross-validation
   of different models and methods are showcased in Fig. [128]3. The
   linear models including MFR, logit, and LDA are more suitable for
   detecting gene-gene interactions, as their average values of AUC are
   all larger than those of coexpression analysis methods. Among these
   linear models, our model based on SVM performs the best and obtains the
   largest average AUC value of 0.819. In terms of prediction of the gene
   pairs with both high expression similarities and high prior-knowledge
   based similarities, the average PPVs of B0 and B1 matched gene pairs
   for linear models are also larger than those for coexpression analysis
   methods, where MFR obtains the best performance for the largest average
   PPVs of B0 and B1 matched gene pairs of 0.988 and 0.866, respectively.

Verifications on the test, GeneFriends and DIP datasets

   The robustness evaluation is carried out through examining the
   performances of different models and methods in detecting gene-gene
   interactions, and in identifying gene pairs with both high expression
   similarities and high prior-knowledge based similarities on three kinds
   of verification datasets, including the test datasets, GeneFriends
   datasets, and DIP datasets. Specially, the results on GeneFriends
   datasets indicate the robustness in detecting gene-gene interactions
   and gene pairs from coexpression data, and those on DIP datasets
   indicate the robustness in identifying gene-gene interactions and gene
   pairs from prior-knowledge based data. As showcased in Fig. [129]4, the
   linear models (MFR, logit, and LDA) are better from the result of
   verification, as their average AUC values and PPVs are all larger than
   those of coexpression analysis methods. MFR obtains the largest average
   AUC values on all verification datasets except the GeneFriends dataset,
   and the largest average PPVs on all verification datasets, indicating
   our SVM-based model has the best robustness.

Cancer gene network construction

   The relatedness between a pair of genes can be used as a similarity
   between the corresponding nodes in a constructed biological network,
   where genes in a set of highly interconnected genes (module) tend to be
   involved with relative biological processes. We utilize this property
   to predict metabolic pathways significantly influenced by increased
   glutamine and glutamate metabolism in four cancer types, which are
   BLAC, BRCA, COAD, and LUAD. Glutamine and glutamate metabolism are
   reported to be increased in various cancers^[130]72,[131]73, especially
   in bladder cancer^[132]74, breast cancer^[133]75–[134]81, colon
   cancer^[135]76,[136]78,[137]79,[138]82, and lung
   cancer^[139]76,[140]78,[141]79,[142]83. They are also considered to be
   closely related to cancer’s proliferation, invasion, and
   metastasis^[143]84. For each cancer type, we measure relatedness
   between up-regulated metabolic genes using MFR, other linear models,
   and coexpression analysis methods, respectively. Then the up-regulated
   metabolic genes and their relatedness in each cancer type are used to
   construct networks for each model and methods, where nodes represent
   genes, and two genes are connected if the MR of their relatedness is
   among top three. We collect 21 genes, including eight rate-limiting
   enzyme genes for glutaminolysis and 13 genes directly catalyzing
   reactions of glutamine or glutamate, defined as the gene markers for
   glutamine and glutamate metabolism (see Table [144]3), inspired by a
   recent study^[145]85. After identifying modules containing up-regulated
   gene markers, the pathway enrichment analysis is conducted on such
   modules to predict metabolic pathways directly influenced by increased
   glutamine and glutamate metabolism, which are the enriched with
   up-regulated gene markers, as shown in Fig. [146]5 and Supplement
   Figures [147]S1–[148]S3.

Table 3.

   Gene markers for glutamine and glutamate metabolism.
   Gene Description Go Term
   ASNS Asparagine Synthetase asparagine biosynthetic process
   ALDH18A1 Aldehyde Dehydrogenase 18 Family Member A1 proline
   biosynthetic process
   CAD Carbamoyl-Phosphate Synthetase 2, Aspartate Transcarbamylase, And
   Dihydroorotase ‘de novo’ pyrimidine nucleobase biosynthetic process
   CS Citrate Synthase tricarboxylic acid cycle
   CTPS CTP Synthase 1 ‘de novo’ CTP biosynthetic process
   CTPS2 CTP Synthase 2 ‘de novo’ CTP biosynthetic process
   DLD Dermcidin 2-oxoglutarate metabolic process
   DLST Dihydrolipoamide S-Succinyltransferase tricarboxylic acid cycle
   GFPT1 Glutamine-Fructose-6-Phosphate Transaminase 1
   UDP-N-acetylglucosamine biosynthetic process
   GFPT2 Glutamine-Fructose-6-Phosphate Transaminase 2
   UDP-N-acetylglucosamine biosynthetic process
   GLUL Glutamate-Ammonia Ligase glutamine biosynthetic process
   GLS Glutaminase glutamate biosynthetic process
   GLS2 Glutaminase 2 glutamate biosynthetic process
   OGDH Oxoglutarate Dehydrogenase tricarboxylic acid cycle
   GGDHL Oxoglutarate Dehydrogenase-like tricarboxylic acid cycle
   PFAS Phosphoribosylformylglycinamidine Synthase ‘de novo’ IMP
   biosynthetic process
   PPAT Phosphoribosyl Pyrophosphate Amidotransferase ‘de novo’ IMP
   biosynthetic process
   PSAT1 Phosphoserine Aminotransferase 1 ‘de novo’ IMP biosynthetic
   process
   GCLC Glutamate-Cysteine Ligase Catalytic Subunit glutathione
   biosynthetic process
   GCLM Glutamate-Cysteine Ligase Modifier Subunit glutathione
   biosynthetic process
   GSS Glutathione Synthetase glutathione biosynthetic process
   [149]Open in a new tab

Figure 5.

   [150]Figure 5
   [151]Open in a new tab

   Metabolic pathways are predicted to be directly influenced by increased
   glutamine and glutamate metabolism in nine BRCA gene networks.

   As shown in Fig. [152]6, we obtain the best prediction results from
   MFR-based networks. We predict 15 pathways directly influenced by
   increased glutamine and glutamate metabolism in all four cancer types,
   which is the most among all the models and methods. For example, in
   BRCA, there are three pathways are predicted to be directly related to
   increased glutamine and glutamate metabolism, agreeable with
   studies^[153]74,[154]86. However, only one or two of the three pathways
   are predicted by other models or methods. For MFR, the prediction of
   the glycine, serine, and threonine metabolism pathway is further
   confirmed as PSPH (phosphoserine phosphatase) found to be up-regulated
   in BRCA. Especially, PSPH acts as a rate-limiting enzyme involved in
   serine synthesis from glutamate^[155]87.

Figure 6.

   [156]Figure 6
   [157]Open in a new tab

   Number of metabolic pathways predicted to be directly influenced by
   increased glutamine and glutamate metabolism in four cancer types.
   These pathways were predicted in cancer gene networks, where nodes
   represent up-regulated metabolic genes and edges represent relatedness
   between genes, measured by the five linear models and six coexpression
   analysis methods.

Gene function prediction

   We randomly select 100 out of 1,403 genes involved in the KEGG
   metabolism pathways and pretend that there is no prior knowledge with
   them, and then we predict their functions through analyzing GO
   annotations of other 1,303 genes. This process repeats for 100 times.
   First, we use MFR, the other linear models, and coexpression analysis
   methods to measure the relatedness between each pair from the 1,403
   genes, respectively. For each linear model, as selected genes pretend
   to be without prior knowledge, we mainly use expression similarities
   (PCC, SRC, MI, PPC, and MI) to calculated their relatedness with other
   genes, and set other gene-pair features to be 0.5. The relatedness
   measured by different models and methods are normalized as follow: (i)
   rank the values of each model or method; and (ii) for each model or
   method, replace its values with the corresponding PCC values according
   to the ranks. Then the 1,403 genes and their relatedness are used to
   construct gene networks for different models and methods, respectively.
   In the constructed networks, nodes represent genes, edges represent
   relatedness between genes, and measured values of relatedness are used
   as the weights of edges. The edges with weights less than 0.6 are
   removed based on the procedure in the previous study^[158]66. Finally,
   each network contains 1,403 nodes and 14,067 edges.

   A broadly used shortest-path method is applied to predict the function
   of selected genes. As shown in Fig. [159]7, in MFR-based gene network,
   the shortest-path method achieves notably accurate results, where it
   successfully calls average 39.81%/26.33% of the selected genes at the
   L1/L0 levels. However, it only calls average 39.62%/25.28%,
   39.62%/25.28%, 5.08%/4.58%, 10.38%/7.39%, 7.43%/5.41%, 7.71%/6.65%,
   0.13%/0.19% and 2.43%/1.35% of the selected genes at the L1/L0 levels
   in logit-regression-, LDA-, PCC-, SRC-, PPC-, MI- CMI- and CXP- based
   networks, respectively. Overall, the results suggest that MFR
   outperforms other models and methods regarding gene function
   prediction, as it constructs better networks on genes with prior
   knowledge and benefits functional prediction of genes.

Figure 7.

   [160]Figure 7
   [161]Open in a new tab

   Percentages of L0- and L1-matched selected genes in the nine KEGG
   metabolic gene networks. In these networks, nodes represent genes
   involved in KEGG metabolism pathways, and edges represent relatedness
   between genes, measured by the nine models or methods.

Discussion and Conclusion

   In this paper, we propose a novel machine learning model for measuring
   conditional relatedness between genes, named MFR, by integrating seven
   expression similarities and five prior-knowledge based similarities.
   Specifically, gene pairs with both high expression similarities and
   high prior-knowledge based similarities will be kept and recommended by
   our model. At first, we conduct the MFR model in 10-fold
   cross-validation. Then we used the MFR model in a test verification and
   two further verifications on the GeneFriends and DIP datasets. Finally,
   the MFR model is used to construct cancer gene networks and predict
   gene functions. All the results are compared with those of other models
   or methods (see Table [162]4).

Table 4.

   Performances of the nine models or methods for different applications.
   Application Evaluation MFR Logit LDA PCC SRC MI PPC CMI CXP
   10-fold cross-validation AUC 0.819 0.818 0.818 0.699 0.692 0.664 0.695
   0.484 0.686
   B0 + B1 0.927 0.916 0.916 0.495 0.469 0.456 0.366 0.152 0.455
   Test verification AUC 0.823 0.822 0.822 0.696 0.690 0.658 0.691 0.484
   0.682
   B0 + B1 0.873 0.856 0.856 0.440 0.518 0.428 0.270 0.172 0.477
   GeneFriends verification AUC 0.816 0.821 0.821 0.815 0.764 0.733 0.823
   0.484 0.782
   B0 + B1 0.962 0.957 0.957 0.571 0.471 0.483 0.613 0.091 0.485
   DIP verification AUC 0.727 0.724 0.724 0.604 0.617 0.586 0.602 0.487
   0.600
   B0 + B1 0.727 0.713 0.713 0.544 0.507 0.519 0.438 0.142 0.463
   Constructing a cancer gene network NPP 15 12 14 10 10 11 12 8 10
   Predicting gene function L0 + L1 33.07 32.45 32.45 4.83 8.89 6.42 7.18
   0.16 1.89
   [163]Open in a new tab

   B0 + B1 indicates the average value of PPVs of B0- and B1-matched
   genes; NPP indicates the number of predicted metabolic pathways;
   L0 + L1 indicates the average number of L0- and L1-matched genes

   In terms of identifying gene-gene interactions, multi-features models,
   such as MFR, logit and LDA performance better than coexpression
   analysis methods including PCC, SRC, MI, PPC, CMI, and CXP in the
   10-fold cross-validation and verifications. Hence, the models
   integrating both expression similarities and prior-knowledge based
   similarities can avoid the shortage of using only one kind of
   expression similarities. And among those multi-feature models, MFR
   performances best in the 10-fold cross-validation, test verification,
   and one further verification on DIP datasets (except GeneFriends
   datasets), indicating the SVM-based model is more suitable for
   resolving the conflict of fitting relatedness supported by coexpression
   and those supported by prior knowledge at the same time. It also
   notices that MFR has better performances in the datasets containing the
   gene pairs extracted from both coexpression data and prior-knowledge
   based data (i.e., development and test datasets) and datasets
   containing the gene pairs extracted from prior-knowledge based data
   (e.g., the DIP datasets). On the contrary, logit and LDA models have
   better performances in the datasets containing the gene pairs extracted
   from only coexpression data, such as the GeneFriends datasets. In other
   words, logit and LDA models prefer gene pairs with high expression
   similarities, comparable with MFR. As a result, MFR is relatively good
   at detecting gene pairs with both high expression similarities and
   prior-knowledge based similarities and obtains the best results in all
   the datasets. For a real biological problem, some of the important gene
   pairs usually having attributes such as coexpression, like positive
   gene pairs collected from coexpression data, and the others typically
   have attributes such as PPI, like positive gene pairs collected from
   prior-knowledge based data. Additionally, gene pairs with both high
   expression similarities and high prior-knowledge based similarities are
   more likely the real important interacting gene pairs. So MFR is more
   suitable for practical applications, such as biological network
   construction and genomic function prediction, and can perform the best
   as our results show.

   The MFR is fundamentally a regression model, including two kinds of
   core elements, features, and model. So, for the next step, we plan to
   improve the MFR model on these core elements. First, we will improve
   the MFR model through obtaining and using more available and more
   accurate prior knowledge, as the MFR has high accuracy and robustness,
   and its dependency on the prior-knowledge based similarities make it
   adaptable. Second, with the development of deep learning technology,
   recently more and more computational methods are constructed based on
   deep learning models. As deep learning models automatically learn the
   complex functions for mapping input features to output results,
   deep-learning-based methods achieve to state-of-the-art accuracy of
   many prediction tasks, including image recognition^[164]88–[165]90 and
   natural language processing^[166]91–[167]93. Therefore, we will use
   deep learning models, such as the deep belief network, to replace SVM
   for MFR to improve accuracy and robustness.

Supplementary information

   [168]41598_2019_40780_MOESM1_ESM.pdf^ (783.5KB, pdf)

   Using Machine Learning to Measure Relatedness Between Genes: A
   Multi-Features Model

Acknowledgements