Abstract
Background
Machine learning can effectively nominate novel genes for various
research purposes in the laboratory. On a genome-wide scale, we
integrated multiple databases and algorithms to predict and prioritize
human aging genes (PPHAGE).
Results
We fused data from 11 databases and used a Naïve Bayes classifier
together with three positive unlabeled learning (PUL) methods, NB, Spy,
and Rocchio-SVM, to rank human genes with respect to their implication
in aging. The PUL methods enabled us to identify a list of negative
(non-aging) genes to use alongside the seed (known age-related) genes
in the ranking process. Comparison of the PUL algorithms revealed that
none of the methods for identifying a negative sample was advantageous
over the others, and that their simultaneous use in the form of fusion
was critical for obtaining optimal results (PPHAGE is publicly
available at [33]https://cbb.ut.ac.ir/pphage).
Conclusion
We predict and prioritize over 3,000 candidate age-related genes in
humans, based on significant ranking scores. The identified candidate
genes are associated with pathways, ontologies, and diseases that are
linked to aging, such as cancer and diabetes. Our data offer a platform
for future experimental research on the genetic and biological aspects
of aging. Additionally, we demonstrate that fusion of PUL methods and
data sources can be successfully used for aging and disease candidate
gene prioritization.
Keywords: Genome-wide, Prioritization, Human aging genes, Positive
unlabeled learning, Machine learning
Background
Prior understanding of the genetic basis of a disease is a crucial step
toward better diagnosis and treatment of the disease [[34]1]. Machine
learning methods help specialists and biologists use functional or
intrinsic properties of genes in the selection of candidate genes
[[35]2]. One may ask why research is aimed at identifying pathogenic
rather than non-pathogenic genes. The answer may lie in the fact that
genes currently regarded as non-pathogenic may be documented as disease
genes later on.
Biologists apply computational and mathematical methods and algorithms
to develop machine learning methods for identifying novel candidate
disease genes [[36]3]. Based on the principle of “guilt by
association”, similar or identical diseases share genes that are very
similar in function or intrinsic properties, or that have direct
physical protein-protein interactions [[37]4]. Most methods of
predicting candidate genes employ various biological data, such as
protein sequence, functional annotation, gene expression,
protein-protein interaction networks, regulatory data, and even
orthology and conservation data, to identify similarities in line with
this principle [[38]5]. These methods are categorized as unsupervised,
supervised, and semi-supervised [[39]6].
Unsupervised methods cluster the genes based on their proximity and
similarity to the known disease genes, and rank them by various
methods. Supervised methods create a boundary between disease genes and
non-disease genes, and utilize this boundary to select candidate genes.
Several studies have been performed to address different aspects of the
methodology and have expanded the use of various methods and tools
[[40]3, [41]7–[42]12].
The tools that are available for candidate gene prioritization can be
classified with respect to efficiency, computational algorithms, data
sources, and availability [[43]13–[44]15]. Available prioritization
tools can be categorized into specific and general tools [[45]16].
Specific tools are used to prioritize candidate genes associated with a
specific disease. In these methods, information related to a specific
tissue involved in the disease or other information related to the
disease is employed. General tools can be applied for most diseases,
and various data sources are often used in these tools. Gene
prioritization tools can also be divided into two types: single-species
and multi-species. Single-species tools are only usable for a specific
species, such as human or mouse. Multi-species tools can prioritize
candidate genes in several different species. For example, the
Endeavour software can prioritize candidate genes in six different
species [[46]17]. With respect to computational algorithms, candidate
prioritization tools are primarily divided into two groups: complex
network-based methods and similarity-based methods [[47]5]. The
inevitable incompleteness of, and errors in, biological data sources
necessitate fusion of multiple data sources [[48]18]. Most gene
prioritization methods, therefore, use multiple data sources to improve
performance.
The purpose of this study was to design a machine learning method to
identify and prioritize novel candidate aging genes in humans. We
examined existing machine learning methods for identifying human
non-aging (negative) genes, and then built a binary classifier for
predicting novel candidate genes, based on the positive and negative
genes. Gene ranking was based on the principle of similarity among
positive genes through “guilt by association”. Thus, among the
unlabeled genes, those that were least similar to the known genes were
employed as the negative sample.
Results
The three positive unlabeled learning (PUL) algorithms, Naïve Bayes
(NB), Spy, and Rocchio-SVM, were first evaluated on eight benchmark
data sets to compare their performance. In each data set, all samples
of the more frequent class were treated as unlabeled, and we applied
the algorithms to predict their labels. These methods utilize a
two-step strategy and are intended to extract a reliable negative
sample from the main data (Table [49]1).
Table 1.
Datasets used to evaluate reliable negative sample extraction
algorithms
Number of instances Number of attributes Data set names
756 754 Parkinson’s Disease Classification Data Set [[50]19]
345 7 Liver Disorders Data Set [[51]20]
1024 10 Cloud Data Set [[52]21]
351 34 Ionosphere Data Set [[53]22]
19,020 11 MAGIC Gamma Telescope Data Set [[54]23]
961 6 Mammographic Mass Data Set [[55]24]
569 32 Breast Cancer Wisconsin (Diagnostic) Data Set [[56]25]
208 60 Connectionist Bench (Sonar, Mines vs. Rocks) Data Set [[57]26]
We also randomly selected 70% of the positive samples as the training
set, and the remainder as the test set. To train the classifier, equal
numbers of positive and negative samples were selected to ensure that
the classifier was not biased at the training step. In this way, we
compared the three algorithms on eight data sets extracted from the UCI
database (Additional file [59]1).
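The sampling described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the authors' code:
the function names `split_positives` and `balance`, the fixed random
seed, and the toy sample identifiers are all hypothetical.

```python
import random

def split_positives(positives, train_fraction=0.7, seed=0):
    """Randomly split the positive samples into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(positives)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def balance(train_pos, negatives, seed=0):
    """Draw equally many positives and negatives for classifier training."""
    rng = random.Random(seed)
    n = min(len(train_pos), len(negatives))
    return rng.sample(list(train_pos), n), rng.sample(list(negatives), n)

# Toy identifiers, used only to exercise the functions.
positives = [f"pos{i}" for i in range(10)]
negatives = [f"neg{i}" for i in range(25)]

train_pos, test_pos = split_positives(positives)
bal_pos, bal_neg = balance(train_pos, negatives)
```

Drawing the same number of samples from each class is one simple way to
keep the class prior from dominating the trained classifier.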
Comparison of the parameters of the three algorithms for all data sets
revealed similar results in F_measure. For example, in data set 1, the
precision of the Roc-SVM method was better (by approximately 2–3%) than
those of the other two methods. However, the recall of the NB method
was better (by approximately 4–6%) than those of the other two methods,
and the Roc-SVM method had a lower false positive rate than the other
two methods (Table [60]2). In addition, comparison of the parameters of
the three algorithms for data set 2 revealed that the precision of the
NB method was better than that of the other two methods, the recall of
the Spy method was 5% better than that of the other two methods, and
the NB method had a lower false positive rate than the other two
methods. Therefore, none of the methods had an absolute superiority.
Since the results were very similar, the outputs of the three methods
were combined.
Table 2.
Performance evaluation of the reliable negative sample extraction
algorithms
Data set Algorithm FPR% FNR% Precision % Recall % F_measure %
Parkinson’s Disease NB 37.25 4.57 95.43 89.78 92.52
SPY 8.70 16.11 97.42 83.89 90.15
Roc-SVM 6.52 15.00 98.08 85.00 91.07
Liver Disorders NB 17.65 5.71 73.33 94.29 82.50
SPY 36.14 0 40.00 100 57.14
Roc-SVM 31.33 5.00 42.22 95.00 58.46
Cloud NB 18.88 7.93 84.83 92.07 88.30
SPY 9.52 14.92 92.77 85.08 88.76
Roc-SVM 6.32 16.51 96.72 83.49 89.62
Ionosphere NB 47.62 8.33 88.51 91.67 90.06
SPY 26.32 6.98 94.12 93.02 93.57
Roc-SVM 33.33 8.89 94.25 91.11 92.66
MAGIC Gamma Telescope NB 10.49 44.44 68.18 55.56 61.22
SPY 17.88 36.22 53.88 63.78 58.42
Roc-SVM 6.68 47.18 77.65 52.82 62.87
Mammographic Mass NB 7.25 33.72 85.07 66.28 74.51
SPY 11.96 10.00 62.07 90.00 73.47
Roc-SVM 1.95 28.57 94.34 71.43 81.30
Breast Cancer Wisconsin NB 13.85 12.26 91.18 87.74 89.42
SPY 9.09 10.48 94.00 89.52 91.71
Roc-SVM 22.50 22.14 91.89 77.86 84.30
Connectionist Bench (Sonar, Mines vs. Rocks) NB 13.85 12.26 91.18 87.74 89.42
SPY 16.67 7.69 80.00 92.31 85.71
Roc-SVM 22.50 22.14 91.89 77.86 84.30
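For reference, the metrics reported in Table 2 follow the standard
confusion-matrix definitions, which can be sketched as below. The
function names and the confusion counts are hypothetical; only the
precision and recall values in the final line are taken from Table 2
(NB on the Parkinson's Disease data set).

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)

def rates(tp, fp, tn, fn):
    """FPR, FNR, precision, recall and F-measure, all as percentages.

    Assumes every denominator is non-zero.
    """
    fpr = 100.0 * fp / (fp + tn)
    fnr = 100.0 * fn / (fn + tp)
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return fpr, fnr, precision, recall, f_measure(precision, recall)

# Hypothetical confusion counts, used only to exercise the function.
fpr, fnr, precision, recall, f1 = rates(tp=80, fp=10, tn=90, fn=20)

# Precision and recall reported for NB on the Parkinson's Disease data.
print(round(f_measure(95.43, 89.78), 2))  # 92.52
```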
The three PUL algorithms were then applied to the aging data to extract
reliable negative samples. Only the 303 positive samples were given as
input, which enabled extraction of reliable negative samples from the
remaining data. Subsequently, a new classifier was trained on the
positive and negative data to identify novel candidate genes for
prioritization and ranking. A total of 328 negative genes were
extracted, with a threshold of 11 replicates per negative gene
(Additional file [62]2), and the Naïve Bayes binary classifiers were
trained with 10-fold cross-validation (Table [63]3). Additional file
[64]2 contains results for all thresholds. The ROC chart for training
and test data is shown in Fig. [65]1.
Table 3.
Model performance evaluation by Naïve Bayes on the aging data
Precision % Recall % F measure % Accuracy % AUC %
Train 80.78 76.95 78.81 78.52 83.81
Test 87.09 81.82 84.37 84.13 88.99
Fig. 1.
ROC curves. ROC was performed to evaluate the performance of the Naïve
Bayes model at the training and test steps, which resulted in similar
values for both curves
We trained multiple binary classifiers using all features in the
positive genes and reliable negative data to compare the NB classifier
to other classifiers. We investigated the performance of binary SVM
[[69]27], NB, and libD3C [[70]28] classifiers in the dataset with
10-Fold cross validation, using Weka [[71]29]. All classifiers had
similar performance in the main data set (Table [72]4).
Table 4.
Performance evaluation comparison by multiple binary classifier in the
aging data
TP rate % FP rate% Precision % Recall % F measure % AUC %
SVM 80 21.1 82 80 79.6 79.5
libD3C 85.1 15.3 85.3 85.1 85 91.9
NB 81.1 19.7 82.4 81.1 80.9 86
A major challenge in classification is to reduce the dimensionality of
the feature space. Some methods, such as PCA, build new features as
linear combinations of the original features. In this research, we
investigated the PCA method in the final model, which eliminated some
of the original input features and retained a minimum subset of
features that yielded the best classification performance. In addition,
feature selection was used to select the subset of the main features
that best suited the model. A fixed number of top-ranked features were
selected to design a classifier. A suitable technique for feature
selection is minimal-redundancy-maximal-relevance (mRMR) [[74]30]. We
used mRMR for feature selection in the main data, and then compared
multiple binary classifiers on the positive and reliable negative
genes, using the top 500 ranked features extracted by the mRMR tool.
All of the selected classifiers yielded acceptable results
(Table [75]5).
Table 5.
Performance evaluation comparison by multiple binary classifier in the
aging data after feature selection
TP rate % FP rate% Precision % Recall % F measure % AUC %
SVM 83.5 17.1 84.2 83.5 83.4 83.2
libD3C 84.6 15.7 84.8 84.6 84.6 92.3
NB 81.9 18.5 82.1 81.9 81.9 86.8
Assessing model accuracy is very difficult when the separate test set
to which the model is applied contains only positive and unlabeled
samples. This challenge is critical in instances that lack a negative
sample. Thus, we compared the evaluation metric with the data. We used
all 10 models from the training step to predict the remaining genes,
and extracted the genes that were identified by the 10 models as
positive genes, yielding a total of 3531 final candidate genes.
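One way to read this selection step is as a consensus across the
cross-validation models. The sketch below assumes that a gene is kept
only if every model labels it positive; the text does not state the
exact rule, so this is an assumption. The function name and the toy
predictions are hypothetical.

```python
def consensus_positives(model_predictions):
    """Genes labelled positive (1) by every model.

    model_predictions: list of dicts mapping gene -> 0/1, one per model.
    """
    positive_sets = [
        {gene for gene, label in prediction.items() if label == 1}
        for prediction in model_predictions
    ]
    if not positive_sets:
        return set()
    return set.intersection(*positive_sets)

# Toy predictions from three models (the study used 10 models).
predictions = [
    {"geneA": 1, "geneB": 1, "geneC": 0},
    {"geneA": 1, "geneB": 0, "geneC": 0},
    {"geneA": 1, "geneB": 1, "geneC": 1},
]
candidates = consensus_positives(predictions)
```

Requiring agreement among all models is a conservative choice; a
majority vote would be a looser alternative.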
To compare the output of the method with known gene prioritization
tools, the output of the model was compared with that of two tools,
Endeavour [[77]17] and ToppGene [[78]31], using the seed genes (the
list of seed genes was split into K folds, with K = 3, for these
tools). Two metrics were considered for comparing the tools with the
proposed model. The first metric was the average rank of the seed
genes, and the second metric was the number of seed genes ranked within
the top 10, 50, 100, 500, and 1000 of the list.
A tool that had more seed genes at the top of the list and a lower
average rank than the remaining tools received a higher ranking.
Table [79]6 shows the number of test genes that the tools and the
PPHAGE method placed within each list size. Table [80]7 shows the
average rank of the seed genes for the tools and the PPHAGE method.
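The two comparison metrics can be sketched as follows. This is an
illustrative sketch, not the evaluation code used in the study: the
function names and the toy ranking are hypothetical, seed genes absent
from a ranked list are simply ignored (an assumption), and the strict
`rank < cutoff` comparison mirrors the "< 10", "< 50" notation of
Table 6.

```python
def seed_ranks(ranked_genes, seed_genes):
    """1-based positions of the seed genes that appear in the ranked list."""
    position = {gene: rank for rank, gene in enumerate(ranked_genes, start=1)}
    return [position[g] for g in seed_genes if g in position]

def seeds_within(ranks, cutoffs=(10, 50, 100, 500, 1000)):
    """Number of seed genes whose rank is below each cutoff."""
    return {k: sum(1 for r in ranks if r < k) for k in cutoffs}

def average_rank(ranks):
    """Mean rank of the seed genes found in the list."""
    return sum(ranks) / len(ranks)

# Toy ranking of 20 genes with three held-out seed genes.
ranked = [f"g{i}" for i in range(1, 21)]
held_out_seeds = ["g2", "g5", "g15"]

ranks = seed_ranks(ranked, held_out_seeds)
counts = seeds_within(ranks, cutoffs=(3, 10, 20))
mean_rank = average_rank(ranks)
```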
Table 6.
Number of detected seed genes in comparison to the output of tools
Tools Rank Fold1 Fold2 Fold3
Endeavour < 10 1 0 1
< 50 2 0 2
< 100 4 1 2
< 500 11 12 17
< 1000 24 25 25
ToppGene < 10 2 0 1
< 50 11 0 2
< 100 16 1 2
< 500 44 12 17
< 1000 62 25 25
PPHAGE < 10 2 2 0
< 50 7 4 5
< 100 12 12 9
< 500 50 35 38
< 1000 66 61 67
Table 7.
Average rank of the seed genes in comparison to the output of tools
Fold1 Fold2 Fold3
Endeavour 1851 1918 1877
ToppGene 926 849 1024
PPHAGE 833 919 930
The top 25 genes that received the highest weight among all candidate
aging genes (Table [83]8), were validated in a number of instances,
based on experimental evidence, age-related diseases, and genome-wide
association studies (GWAS). A list of all candidate positive aging
genes is provided in Additional file [84]3.
Table 8.
The top 25 human candidate aging genes
Rank | Gene symbol | Relevance | Reference | Database reference
1 | NAP1L4 | Nucleosome Assembly | [[85]32, [86]33] |
2 | CCNI (CYC1) | Parkinson Disease | [[87]34] | BEFREE
3 | RPL3 | Ribosomal Protein | [[88]35] |
4 | FZD5 | Alzheimer’s Disease | [[89]36] | BEFREE
5 | BRD2 | Diabetes Mellitus, Non-Insulin-Dependent; Osteoporosis, Postmenopausal; Colorectal Cancer | [[90]37–[91]40] | BEFREE
6 | ATP8A2 | ATPase Phospholipid Transporting | [[92]41] |
7 | SRSF11 | Serine And Arginine Rich Splicing Factor | [[93]42] |
8 | BBIP1 | | |
9 | IL10 | Cardiovascular Diseases; Diabetes Mellitus, Non-Insulin-Dependent; Colorectal Cancer; Atherosclerosis; Parkinson Disease; Alzheimer’s Disease; Arthritis; Heart failure | [[94]43, [95]44]; [[96]45–[97]47]; [[98]48, [99]49]; [[100]50, [101]51]; [[102]52–[103]54]; [[104]55–[105]57]; [[106]58–[107]60]; [[108]61–[109]63] | CTD_human; RGD; LHGDN; BEFREE; HPO
10 | FYCO1 | Cataract, autosomal recessive congenital 2; Cataract | [[110]64, [111]65] | UNIPROT; GENOMICS_ENGLAND; HPO; CTD_human
11 | PSMB2 | | |
12 | NSF | Parkinson Disease | [[112]66–[113]70] | GWASDB; GWASCAT; BEFREE
13 | OAZ1 | | |
14 | ZFP36L1 | | |
15 | PCLO | Diabetes Mellitus, Non-Insulin-Dependent | [[114]71] | BEFREE
16 | GAB2 | Alzheimer’s Disease; Colorectal Cancer; Osteopetrosis | [[115]72–[116]75]; [[117]76, [118]77]; [[119]78] | BEFREE; GWASDB; GWASCAT
17 | QKI | Coronary heart disease; Colorectal Cancer | [[120]79] | BEFREE; UNIPROT
18 | ZNF638 | | |
19 | RGS3 | | |
20 | XPO6 | | |
21 | ATP8B1 | Colorectal Cancer | [[121]80] | BEFREE
22 | ITM2C | | |
23 | RBFOX1 | Heart failure; Colorectal Cancer | [[122]81]; [[123]82] | BEFREE
24 | DLC1 | Colorectal Cancer; Hereditary Diffuse Gastric Cancer; Coronary heart disease; Increased gastric cancer | [[124]83]; [[125]84]; [[126]85] | BEFREE; CTD_human; HPO
25 | MVK | Arthritis; Cataract | | HPO; HPO
Discussion
On a genome-wide scale, we used three PUL methods to create a method
for the isolation of human aging genes from other genes. The combined
use of several methods as a fusion of their output was advantageous
over using one single method.
The following are examples of the identified genes and of the
experimental or GWAS links between these genes and aging. Among the 25
top genes,
NAP1L4 encodes a member of the nucleosome assembly protein (NAP)
family, which interacts with both core and linker histones, and
shuttles between the cytoplasm and nucleus, suggesting a role as
histone chaperone. Histone protein levels decline during aging, and
dramatically affect chromatin structure. Remarkably, the lifespan can
be extended by manipulations that reverse the age-dependent changes to
chromatin structure, indicating the pivotal role of chromatin structure
in aging [[128]32]. In another example, gene expression of NAP1L4
increases with age in the skin tissue [[129]33]. Findings of GWAS link
a number of the identified genes to age-related disorders, such as GAB2
and late onset Alzheimer’s disease [[130]86], and QKI and coronary
heart disease/myocardial infarction [[131]79]. Interestingly, GWAS
reports also link QKI to successful aging [[132]87].
RPL3 encodes a ribosomal protein that is a component of the 60S
subunit. The encoded protein belongs to the L3P family of ribosomal
proteins, and its gene expression increases during aging of skeletal
muscle [[133]88]. In another example, FZD5 is involved in prostate
cancer, which is the most common malignancy in older men. ATP8A2 is
another gene subject to deterioration and loss of function over time.
RYR2 (Additional file [134]3) encodes a ryanodine receptor found in
cardiac muscle sarcoplasmic reticulum. Mutations in this gene are
associated with stress-induced polymorphic ventricular tachycardia and
arrhythmogenic right ventricular dysplasia. In addition, methylation
analysis of CpG sites in DNA from blood cells showed a positive
correlation between RYR2 and age [[135]89]. In additional examples,
differential expression with age was identified for BCAS3, TUFM, and
DST in the skin [[136]33]. Gene expression analysis revealed
significantly higher expression of hippocampal TLR3 in cells from
elderly individuals (aged 69–99 years) than in cells from younger
individuals (aged 20–52 years) [[137]90]. Similarly, differential
expression with age was identified for RORA in the adipose tissue
[[138]33].
To investigate the implication of the identified candidate genes in
aging, we conducted a comprehensive analysis of 330 human pathways in
KEGG. Each of the pathways was examined for the seed and candidate
genes, and direct association was detected in a number of instances.
For example, IL10 activates STAT3 in the FOXO signaling pathway. In
another example, GAB2 has a regulatory role for PLCG2 in the osteoclast
differentiation pathway, as well as an activating role in the chronic
myeloid leukemia pathway. Likewise, FOS is an expression target for
IL10 in the T cell receptor signaling pathway.
Enrichment analysis was performed on the candidate genes and the
negative genes, using the Enrichr tool [[139]91], to examine whether
the candidate and negative genes were correctly selected with respect
to aging. The analysis of candidate genes was performed on the 3531
genes selected from the rest of the test genes (i.e. excluding the
positive seed and reliable negative genes). Most diseases that were
associated with the candidate genes were diseases that occur with aging
(e.g. colorectal cancer and diabetes) (Table [140]9).
Table 9.
Indicative diseases associated with the candidate aging genes
Index Name P-value Adjusted p-value Z-score Combined score
1 Colorectal cancer 1.43e-08 0.000001256 −1.94 35.07
2 Leukemia 6.71e-07 0.00002953 −1.64 23.32
3 Breast_cancer 0.000009246 0.0002357 −1.45 16.76
4 Diabetes 0.00002362 0.0002986 −0.92 9.85
5 Anemia 0.00002185 0.0002986 −0.9 9.68
6 Cardiomyopathy 0.00002757 0.0002986 −0.59 6.23
Ontology analysis of the candidate genes was performed by FUNRICH
[[142]92] (Fig. [143]2), which revealed enrichment for the aging
process and apoptosis. A list of all biological processes associated
with the candidate aging gene is provided in Additional file [144]4.
Fig. 2.
Significant biological processes associated with the candidate aging
genes
In the analysis of the enriched biological pathways, using Enrichr
(Table [147]10), cancer pathways had the highest score. Interestingly,
viral pathways (e.g. EBV and HSV) were enriched in the positive aging
genes compartment, which is in line with the previously reported
immunosenescence and activation of such viruses as a result of aging
[[148]93]. A list of all biological pathways of the candidate genes
extracted by FUNRICH is provided in Additional file [149]5.
Table 10.
Indicative biological pathways associated with the candidate aging
genes
Index Name P-value Adjusted p-value Z-score Combined score
1 Pathways in cancer_Homo sapiens_hsa05200 4.07e-41 1.19e-38 −2.11 196.21
2 Proteoglycans in cancer_Homo sapiens_hsa05205 1.91e-31 2.78e-29 −1.99 140.58
3 Epstein-Barr virus infection_Homo sapiens_hsa05169 3.24e-30 3.15e-28 −1.9 128.92
4 Endocytosis_Homo sapiens_hsa04144 1.19e-28 8.70e-27 −1.89 121.38
5 Regulation of actin cytoskeleton_Homo sapiens_hsa04810 4.30e-26 2.51e-24 −1.82 106.42
6 HTLV-I infection_Homo sapiens_hsa05166 1.01e-25 4.21e-24 −1.79 103.2
7 Protein processing in endoplasmic reticulum_Homo sapiens_hsa04141 7.55e-26 3.68e-24 −1.69 98.04
8 Herpes simplex infection_Homo sapiens_hsa05168 1.24e-25 4.54e-24 −1.61 92.36
9 PI3K-Akt signaling pathway_Homo sapiens_hsa04151 1.79e-22 4.96e-21 −1.83 91.82
10 Focal adhesion_Homo sapiens_hsa04510 1.12e-22 3.63e-21 −1.72 86.98
No specific age-related diseases were detected for the identified
negative genes (Table [151]11), which supports the validity of the
model training used. Ontology analysis of the reliable negative genes
(Fig. [152]3), which was also performed by FUNRICH, revealed that most
of the extracted processes had a general role in all cells and could
not be related to specific aging processes. Analysis of the biological
pathways in the negative genes indicated pathways that were
predominantly unrelated to the aging processes.
Table 11.
Indicative diseases associated with the reliable negative genes
Index Name P-value Adjusted p-value Z-score Combined score
1 Cardiomyopathy,_dilated 0.01658 0.2321 −1.69 6.93
2 Cardiomyopathy 0.03134 0.2416 −1.61 5.57
3 Zellweger_syndrome 0.01588 0.2321 −1.06 4.41
4 Dystonia 0.03451 0.2416 −0.37 1.25
Fig. 3.
Significant biological processes associated with the reliable negative
genes
Based on the principle that similar disease genes are likely to have
similar characteristics, some machine learning methods have been
employed to predict new disease genes from known disease genes.
Previous approaches developed a binary classification model that used
known disease genes as a positive training set and unknown genes as a
negative training set. However, such negative sets were often noisy
because the unknown genes could include both non-disease genes and as
yet unidentified disease genes. Therefore, the results presented by
these methods may not be reliable. Using computational machine learning
methods and similarity metrics, we identified reliable negative
samples, and then used them together with the positive samples in a
two-class classifier to identify novel positive aging genes in humans.
Conclusion
We integrated 11 databases and several machine learning methods to rank
all human genes, and predicted and prioritized over 3,000 novel
candidate age-related genes based on significant ranking scores.
These genes were supported by biological, ontology, and disease
enrichment analyses. Future experimental research is warranted to
verify the significance of the identified genes in human aging.
Methods
Algorithms
PUL is a similarity-based classification approach in which reliable
negative samples are extracted from unlabeled data. A binary classifier
can then be designed and used to identify the candidate genes
(Fig. [156]4). PUL methods are divided into three general categories.
The first category follows a two-stage strategy that selects reliable
negative samples from within the unlabeled instances and then runs a
supervised algorithm on the data [[157]94]. The second category
estimates the probability of positive samples by weighting positive and
unlabeled data. The third category considers unlabeled data as negative
samples with noise.
Fig. 4.
The overall learning scheme based on positive and unlabeled samples,
and extraction of reliable negative samples (step 1), construction of
the binary Classifier (step 2), and prediction and prioritization of
candidate genes (step 3)
In this paper, a two-stage strategy was used to find a reliable
negative sample, and three different algorithms, Rocchio [[160]95], NB
[[161]94], and Spy [[162]96], were selected for implementation.
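As an illustration of the first stage, a Rocchio-style extraction of
reliable negatives can be sketched as below. This is a minimal sketch
of the general technique, not the implementation used in this study:
prototype vectors are built for the positive and unlabeled sets, and an
unlabeled sample is kept as a reliable negative when it is at least as
similar to the unlabeled prototype as to the positive one. The function
names, the toy vectors, and the weights `alpha=16` and `beta=4` (values
often used in the PU-learning literature) are assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def mean_vector(vectors):
    """Component-wise mean of equally long vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector has zero length."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def rocchio_reliable_negatives(positive, unlabeled, alpha=16.0, beta=4.0):
    """Indices of unlabeled samples closer to the negative prototype."""
    mp = mean_vector([normalize(v) for v in positive])
    mu = mean_vector([normalize(v) for v in unlabeled])
    c_pos = [alpha * p - beta * u for p, u in zip(mp, mu)]
    c_neg = [alpha * u - beta * p for p, u in zip(mp, mu)]
    return [i for i, v in enumerate(unlabeled)
            if cosine(c_pos, v) <= cosine(c_neg, v)]

# Toy binary feature vectors (hypothetical).
positive = [[1, 1, 0, 0], [1, 0, 0, 0]]
unlabeled = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1], [1, 0, 0, 0]]
reliable_negative_idx = rocchio_reliable_negatives(positive, unlabeled)
```

In the second stage, a classifier such as an SVM would be trained on
the positive samples and the extracted reliable negatives.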
Bayesian classifiers, which work explicitly with the probabilities of
different hypotheses, have provided useful practical solutions; the NB
classifier is one of the most efficient and effective algorithms
available for certain learning problems [[163]97]. The NB classifier
can compete with other algorithms and, in some cases, works better than
them [[164]98]. An NB classifier can be considered a simple Bayesian
network that assumes the features are independent of one another given
the class. We chose NB based on the structure and nature of the data,
the independent nature of each data source, the high volume of the
data, and the binary features.
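To make the independence assumption concrete, a Naïve Bayes classifier
for binary features can be sketched as below. This is a minimal
Bernoulli variant with Laplace smoothing, written for illustration; it
is not the classifier configuration used in the study (which relied on
existing tools), and the function names and toy data are hypothetical.

```python
import math

def train_bernoulli_nb(X, y):
    """Fit a Bernoulli Naive Bayes model with Laplace smoothing.

    X: list of binary feature vectors; y: list of 0/1 labels.
    Assumes both classes occur at least once in y.
    """
    n_features = len(X[0])
    model = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        log_prior = math.log(len(rows) / len(X))
        # Smoothed estimate of P(feature_j = 1 | class = label).
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[label] = (log_prior, probs)
    return model

def positive_probability(model, x):
    """Posterior probability that x belongs to class 1."""
    log_scores = {}
    for label, (log_prior, probs) in model.items():
        # Features contribute independently given the class.
        log_scores[label] = log_prior + sum(
            math.log(p if v else 1.0 - p) for p, v in zip(probs, x))
    top = max(log_scores.values())
    weights = {k: math.exp(s - top) for k, s in log_scores.items()}
    return weights[1] / (weights[0] + weights[1])

# Toy training data: two positive and two negative samples.
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 0, 0]
model = train_bernoulli_nb(X, y)
p_pos = positive_probability(model, [1, 0, 0])
p_neg = positive_probability(model, [0, 0, 1])
```

Working in log space avoids numerical underflow when the number of
features is large, as it is for the 11,698 binary features here.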
An NB classifier with 4-fold cross-validation was used to assess the
diagnostic value of every data source. In this assessment, we measured
how well each data source alone could identify aging genes
(Table [165]12). The diagnostic value of every data source was
estimated at about 70–80%, except for the Literature source (about
60%). We used data fusion to obtain a higher diagnostic value. Because
the F_Measure values were similar, a fusion kernel with equal weight
was selected for each data source.
Table 12.
Comparison of the evaluation metric across data sources
Data source Recall Specificity Precision Accuracy F_Measure
Literature 0.58098 0.61453 0.5888 0.5981 0.58478
Annotation 0.77685 0.78668 0.76645 0.78165 0.77133
Pathways 0.73268 0.74538 0.7204 0.73893 0.72605
Gene Ontology 0.79303 0.78843 0.76315 0.78958 0.77703
Phenotype 0.7946 0.81968 0.8158 0.80695 0.80488
Intrinsic properties 0.67963 0.77035 0.78945 0.71835 0.72965
Sequence 0.6901 0.72828 0.71713 0.70885 0.70305
Interaction 0.7378 0.7724 0.76645 0.7543 0.75135
Gene expression 0.75635 0.82148 0.82235 0.7864 0.78735
Regulatory 0.77355 0.79203 0.77633 0.78163 0.77393
Since our main data did not contain any negative samples, training a
model to identify and prioritize new positive genes was based on the
three PUL algorithms. An NB classifier was trained on the extracted
reliable negative sample and the positive genes. Genes that were
assigned positive labels were passed to the final ranking, which used
the weighting method according to the available data [[167]7].
The same weight was considered for ranking the candidate genes based on
the selected sources. Similarities among the features were weighted in
the seed genes and candidate genes, using the following formula, and
then sorted based on their total weight:
\[
W_i = \sum_{i=1}^{C} \sum_{j=1}^{F} \mathrm{CandidateGeneFeature}_{ij} \ast \sum_{p=1}^{S} \mathrm{SeedGenes}_{pj},
\]
where (C) was the number of candidate genes (n = 3531), (F) was the
number of features (n = 11,698), (S) was the number of seed genes
(n = 303) in the problem case, and (W) was the weight of each candidate
gene.
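The weighting can be sketched in code as follows. The sketch reads the
formula as producing one weight per candidate gene, that is, the outer
index i is taken to run over the candidates rather than being summed
away; this reading is an assumption. Each feature of a candidate is
weighted by the number of seed genes that share it. The function name
and the toy matrices are hypothetical.

```python
def candidate_weights(candidate_features, seed_features):
    """Weight of each candidate: sum over features of x_ij * seed_count_j.

    candidate_features: list of binary vectors, one per candidate gene.
    seed_features: list of binary vectors, one per seed gene.
    """
    # seed_counts[j] = number of seed genes that carry feature j.
    seed_counts = [sum(col) for col in zip(*seed_features)]
    return [sum(x * c for x, c in zip(row, seed_counts))
            for row in candidate_features]

# Toy data: three seed genes and three candidates over three features.
seeds = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]
names = ["geneA", "geneB", "geneC"]
candidates = [[1, 1, 0], [0, 0, 1], [1, 1, 1]]

weights = candidate_weights(candidates, seeds)
# Sort candidates by decreasing total weight.
ranking = [g for _, g in sorted(zip(weights, names), reverse=True)]
```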
Dataset
Aggregate data from 11 human biology databases (Table [168]13),
including 11,698 binary gene features, were collected for 19,462 genes,
of which only 303 genes (seed genes) had positive labels for genes
involved in aging, derived from the GeneAge database [[169]99].
Table 13.
Data sources used in Naïve Bayes classifier for candidate aging genes
Data source name | Dataset name | Features detail | Web address
Literature | OBO; AgeFactDB | The ageing-related information included both by manual and automatic information extraction from the scientific literature. | [170]https://lov.linkeddata.es/dataset/lov/vocabs/obo ; [171]http://agefactdb.jenage.de/
Functional annotation | David | The list of all functional annotation. | [172]https://david.ncifcrf.gov/
Biological pathways | Reactome; Kegg | The list of biological pathway. | [173]https://reactome.org/ ; [174]https://www.genome.jp/kegg/pathway.html
Gene Ontology | GO | The Biological Process, Molecular Function, and Cellular Component vocabularies. | [175]http://www.geneontology.org/
Phenotype | HPO; OMIM | The list of all ageing-related phenotype and associated gene. | [176]https://hpo.jax.org/ ; [177]https://www.omim.org/
Intrinsic properties | Pfam; PDB | The chromosome number, location, gene segment, gene type, etc. | [178]https://pfam.xfam.org/ ; [179]https://www.rcsb.org/
Sequence | RefSeq | The list of all known active site, binding site, chain, etc. | [180]https://www.ncbi.nlm.nih.gov/refseq/
Protein-Protein Interaction | HPRD; String | The list of each gene had a physical interaction with each of the positive genes. | [181]http://www.hprd.org/ ; [182]https://string-db.org/
Gene expression | GEO; HAGR | The ageing-related expression included tissue type, overexpressed and under expressed, etc. | [183]https://www.ncbi.nlm.nih.gov/geo/ ; [184]http://genomics.senescence.info/gene_expression/index.php
Regulatory | RegNetwork | The list of all regulatory relationship, such as miRNA, Transcription factor, etc. | [185]http://www.regnetworkweb.org/
Orthologues | CDD; HomoloGene; OrthoDB | The catalog of orthologous protein-coding genes across vertebrates and known conserved domain. | [186]https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml ; [187]https://www.ncbi.nlm.nih.gov/homologene ; [188]https://www.orthodb.org/
The vector of binary features consisted of 11 main parts, each of which
corresponded to one of the data sources. Each feature of a data source
was a Boolean value: if a gene had the feature, it scored 1, and
otherwise, it scored 0 (Table [190]2). For example, one part of the
biological pathway data contained 330 attributes, each of which
corresponded to a human pathway in KEGG. If the intended gene was
located in a pathway, it scored 1 for that attribute, and otherwise, it
scored 0. Similarly, for the interaction network data, a gene scored 1
for each positive gene with which it had a physical interaction, and
otherwise, 0. These data were extracted from the String and HPRD
databases.
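The encoding described above can be sketched as below. The pathway
names, gene identifiers, and interaction partners are made-up
placeholders rather than real annotation data, and the function names
are hypothetical; the sketch only shows how membership is turned into
0/1 features and how the parts are concatenated into one vector.

```python
def encode_gene(gene, feature_sets):
    """Binary vector: 1 where the gene is a member of the feature's set."""
    return [1 if gene in members else 0 for members in feature_sets.values()]

# Pathway part: one feature per pathway (toy memberships).
pathways = {
    "pathway_A": {"GENE1", "GENE2"},
    "pathway_B": {"GENE2", "GENE3"},
    "pathway_C": {"GENE4"},
}

# Interaction part: one feature per seed gene, listing its partners.
seed_partners = {
    "SEED1": {"GENE1"},
    "SEED2": {"GENE2", "GENE4"},
}

def gene_vector(gene):
    """Concatenate the per-source parts into one binary feature vector."""
    return encode_gene(gene, pathways) + encode_gene(gene, seed_partners)

vector_gene2 = gene_vector("GENE2")
vector_gene5 = gene_vector("GENE5")
```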
Due to the large number of features, we employed the PCA method to
reduce the number of features. Following PCA implementation, our total
data set was reduced to 4689 attributes, and the Percentage of Variance
(POV) equaled 98% (Fig. [191]5).
Fig. 5.
The Percentage of Variance in Principal Component Analysis
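The relation between the number of retained components and the POV can
be sketched as follows: given the variance explained by each principal
component, the smallest number of leading components whose cumulative
share reaches the target is selected. The function name and the toy
eigenvalues are hypothetical; the 0.98 target mirrors the 98% POV
reported above.

```python
def components_for_variance(eigenvalues, target=0.98):
    """Smallest number of leading components reaching the target POV."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= target:
            return k
    return len(ordered)

# Toy eigenvalues (variance explained by each component).
eigenvalues = [5.0, 3.0, 1.5, 0.4, 0.1]
n_components = components_for_variance(eigenvalues)
```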
In addition, eight valid data sets from the UCI database
([194]https://archive.ics.uci.edu/ml/index.php) were used to evaluate
the efficiency of the algorithms. In each data set, the samples of the
class with the higher frequency were treated as unlabeled data. Using
the algorithms, we identified negative samples and compared them to the
original data (Table [195]3).
Supplementary information
[196]12864_2019_6140_MOESM1_ESM.docx (32.4KB, docx)
Additional file 1: Comparison of evaluation metric of three algorithms
in the UCI databases.
[197]12864_2019_6140_MOESM2_ESM.docx (404.6KB, docx)
Additional file 2: Results of 10-fold cross-validation in the trained
and test data.
[198]12864_2019_6140_MOESM3_ESM.docx (65.9KB, docx)
Additional file 3: A list of all human candidate positive aging genes.
[199]12864_2019_6140_MOESM4_ESM.xlsx (1.3MB, xlsx)
Additional file 4: A list of all biological processes associated with
the candidate aging genes.
[200]12864_2019_6140_MOESM5_ESM.xlsx (191.3KB, xlsx)
Additional file 5: A list of all biological pathways of the candidate
genes extracted by FUNRICH.
Acknowledgements