Abstract

   Proteins can move from blood circulation into salivary glands through
   active transportation, passive diffusion or ultrafiltration, some of
   which are then released into saliva and hence can potentially serve as
   biomarkers for diseases if accurately identified. We present a novel
   computational method for predicting salivary proteins that come from
   circulation. The basis for the prediction is a set of physiochemical
   and sequence features we found to be discerning between human proteins
   known to be movable from circulation to saliva and proteins deemed to
   be not in saliva. A classifier was trained based on these features
   using a support-vector machine to predict protein secretion into
   saliva. The classifier achieved 88.56% average recall and 90.76%
   average precision in 10-fold cross-validation on the training data,
   indicating that the selected features are informative. Considering the
   possibility that our negative training data may not be highly reliable
   (i.e., proteins predicted to be not in saliva), we have also trained a
   ranking method, aiming to rank the known salivary proteins from
   circulation as the highest among the proteins in the general
   background, based on the same features. This prediction capability can
   be used to predict potential biomarker proteins for specific human
   diseases when coupled with the information of differentially expressed
   proteins in diseased versus healthy control tissues and a prediction
   capability for blood-secretory proteins. Using such integrated
   information, we predicted 31 candidate biomarker proteins in saliva for
   breast cancer.

Introduction

   Human blood has long been used as an information source for detection
   of human diseases such as liver enzymes for detecting hepatitis,
   white-blood cell counts for infection detection and prostate-specific
   antigen (PSA) for diagnosing prostate cancer. In comparison, human
   saliva has not been used for the same purposes nearly as much. Recent
   large-scale proteomic analyses have revealed that human saliva is also
   rich in proteins [[35]1], some of which come from the blood circulation
   and hence can potentially serve as a general information pool for
   disease biomarker identification. This study is on the development of a
   computational method for identification of the distinct features of
   salivary proteins that come from circulation and an application of the
   identified features to predict proteins that can get into saliva from
   circulation.

   The earliest work on using salivary proteins as disease biomarkers of
   distal organs can be traced back to 1986 when the Kallikreina salivary
   biomarkers for detection of breast cancer and gastrointestinal cancer
   were published [[36]2]. Since then, a number of salivary proteins have
   been found to have elevated levels in patients of specific cancer types
   compared to the healthy population such as PSA for prostate cancer
   [[37]3], c-tumor protein erbB-2 and p53 for breast cancer [[38]4].
   While a few salivary proteins have been found to be relevant to
   specific diseases, there has not been a general and effective approach
   for identifying disease markers in saliva, to the best of our
   knowledge.

   The current understanding about how biomolecules can move from
   circulation into saliva can be summarized as follows. Three mechanisms
   have been identified for biomolecules to travel from circulation into
   saliva [[39]5,[40]6]: active transportation for various proteins such
   as secretory IgA and immunoglobulin E, passive transportation for drugs
   and steroids, and ultrafiltration for small polar molecules such as
   creatinine. The basis of our prediction method is that some of the
   disease-associated proteins in circulation can get into saliva through
   one of these three mechanisms, hence making it possible for us to
   identify them in saliva even for diseases of distal organs.

   Two large datasets for salivary proteins are publicly available. One
   consists of 1,166 proteins and 657 of them are also found in human
   blood [[41]1]. Another one has approximately 2,000 proteins and 26% of
   them are also found in blood [[42]6]. We hypothesize that salivary
   proteins are secreted by the salivary glands either from circulation or
   in response to the biomolecules that get into the glands from
   circulation. In this study, we focus on proteins that come from the
   circulation and leave the prediction work of proteins secreted by
   salivary glands in response to blood proteins that get into the glands
   as a future study.

   We have collected 62 human salivary proteins coming from circulation
   from the published literature, which have been experimentally detected
   by multiple salivary proteomic studies, and used them as the initial
   positive training data. We then expanded this dataset by including
   additional proteins based on Pfam family information [[43]7]. A total
   of 261 proteins are selected at the end as the positive training data.
   We then identified a set of proteins that are deemed not to be able to
   get into saliva, totaling 6,816, and used them as the negative training
   data. We then examined a number of sequence and structure-based
   features to identify those with discerning power between the two sets
   of proteins. Using these features, we have trained a classifier using a
   support vector machine (SVM) to predict proteins that can travel to
   saliva from circulation via salivary glands. In addition, we have also
   trained a ranking method aiming to rank the known blood-originated
   salivary proteins the highest among the background proteins, knowing
   that our negative training data may not be the most reliable. The
   flowchart of the approach is shown in [44]Figure 1.

Figure 1. A flowchart of the approach.

   Figure 1
   [45]Open in a new tab

   We believe that this prediction capability can serve as a general tool
   for predicting proteins that can travel from circulation to saliva.
   Hence when applied in conjunction with capabilities for predicting
   proteins that may be present in circulation of patients of a specific
   disease, this capability can suggest candidate biomarkers in saliva for
   that disease. Using this tool along with gene-expression data of breast
   cancers and a prediction tool for blood-secretory proteins [[46]8], we
   predicted 31 candidate proteins in breast cancer patients’ saliva.

Results

Features of blood-originated salivary proteins

   With the aim of training a SVM-based classifier and rank the predicted
   proteins, we have examined a total of 34 protein features (see
   [47]Table S1 and Material and Methods), represented as a feature vector
   of 1,523 dimension. We then trained a classifier with a linear kernel
   using these features calculated on proteins in both the positive and
   negative training sets, aiming to derive a classifier that can best
   distinguish the positive from the negative samples. We then checked
   which feature elements are relevant to the final classification
   performance by using a feature selection procedure, and removed all the
   irrelevant ones, giving rise to 55 final feature elements. Then a
   manifold ranking method [[48]8] is trained based on the selected
   feature elements with the performance given in [49]Table S2. We have
   assessed the contributions by the 55 feature elements to the
   classification accuracy, using a statistical significance q-value
   [[50]9], and found that the q-values for the 55 feature elements are
   less than 4.0E-5, as shown in [51]Figure 2(A). We have also compared
   the classification performance based on the 55 features versus the top
   10 features, and noted that there is a clear difference in performance,
   as shown in [52]Figure 2(B). The following features are the most
   important ones to our classification accuracy, ranked in the decreasing
   order of their contribution to the classification results: radius,
   Moran autocorrelation, hydrophobicity, Geary autocorrelation, amino
   acid composition, normalized Moreau-Broto autocorrelation, dipeptide
   composition, secondary structure composition and polarity. This
   observation is consistent with our general understanding of secretory
   proteins and salivary proteins. For example, the diffusion coefficient
   is inversely proportional to the molecular radius [[53]10].

Figure 2. The q-value and accuracy of the 55 selected features.

   Figure 2
   [54]Open in a new tab

Performance of the SVM model

   Based on the 55 selected feature elements, we trained a classifier and
   evaluated the performance using 10-fold cross validation by repeating
   the prediction 100 times to derive a performance distribution of the
   classifier. On the training data, the classifier achieved an average
   recall and precision at 88.56% and 90.76%, respectively. We applied the
   general recall-precision curve shown in [55]Figure 3 to the training
   data with 10-fold cross validation to examine the prediction precision
   at each recall level. The AUC of the recall-precision curve is 80.96%.

Figure 3. The recall-precision curve.

   [56]Figure 3
   [57]Open in a new tab

   We also used 41 of the 62 collected proteins as the training data that
   have been reported in the literature before 2000. Out of the 62
   collected proteins, 21 are used as the testing data, which have been
   reported after 2000. On these 21 salivary proteins, our model predicted
   14 (76.19%) to be salivary proteins from blood.

Predicting and ranking the known salivary proteins

   We have run the trained classifier on all 20,209 human proteins in the
   UniProt database [[58]11], among which 5,456 are annotated as secretory
   proteins according to the Uniprot, SPD [[59]12] and LOCATE [[60]13]
   databases, and 1,823 have been detected in saliva from previous
   experiments [[61]1,[62]5,[63]14]. We predicted 2,498 of the Uniprot
   proteins as salivary proteins from circulation, accounting for 12.36%
   of the 20,209 Uniprot proteins. Of the 2,498 proteins, 239 (13.11%) are
   among the 1,823 proteins that have been previously identified in saliva
   experimentally.

   We have also ranked the human proteins in UniProt using a manifold
   ranking method as done in our previous work [[64]8]. 62 known salivary
   proteins coming from circulation were assessed in terms of their
   ranking to be salivary proteins. By using the 62 proteins as positive
   dataset, 27 (43.55%) of the 62 proteins and 136 salivary proteins are
   ranked among the top 1,000 proteins. By using the expanding 261
   proteins as positive dataset, 34 (54.84%) of the 62 proteins are ranked
   among the top 1,000 proteins. Among these 1,000 proteins, 155 are known
   to be salivary proteins identified from other sources (see [65]Table S3
   for the list of the protein names; and also see Material and Methods).
   While we do not know if the prediction of the remaining 845 proteins
   being salivary proteins is correct or not, we suspect that some of them
   are indeed salivary proteins. For example, protein Endothelin-1
   ([66]P05305), ranked the 728th, has been implicated in cancer[[67]15],
   and could be a good salivary biomarker for OSCC development in oral
   lichen planus patients [[68]16]. Tissue inhibitor of metalloproteinases
   1 (TIMP-1) ([69]P01033), ranked the 434th, has been identified as a
   potential biomarker in diseases such as cancer, cardiovascular diseases
   and diabetes. Moreover, this protein has been reported to be a salivary
   protein [[70]17].

   After the training of our classifier, we did another round of
   literature search for additional salivary proteins that have been
   associated with human diseases and do not overlap with our training
   data. Overall 47 salivary proteins are found, shown in [71]Table S4.
   These proteins are relevant to different diseases such as periodontal
   disease [[72]18,[73]19], oral squamous cell carcinoma [[74]20,[75]21],
   Sjögren's syndrome [[76]22-[77]24], breast cancer [[78]25-[79]29],
   malignant pelvic tumors, and malignant ovarian tumors [[80]30]. We
   found that 3 (6.38%) of these 47 proteins are ranked among the top
   1,000, 8 (17.02%) among the top 2,000 and 12 (25.53%) among the top
   3,000, as shown in [81]Table 1. The p-values for having such rankings
   if assuming that the ranking is random are 0.211, 0.088 and 0.038,
   respectively.

Table 1. Comparison of the ranking result with human saliva biomarkers for
many sorts of diseases.

   Total protein number Known salivary biomarker number Top number
   Salivary biomarker included in top number P-vlaue
   20209 47 500 1 0.367
   20209 47 1000 3 0.211
   20209 47 1500 6 0.076
   20209 47 2000 8 0.088
   20209 47 2500 11 0.026
   20209 47 3000 12 0.038
   20209 47 3500 13 0.052
   20209 47 4000 13 0.123
   20209 47 4500 15 0.082
   20209 47 5000 16 0.098
   20209 47 5500 19 0.017
   20209 47 6000 20 0.020
   [82]Open in a new tab

   We then carried out a pathway enrichment analysis among the top 1,000
   ranked proteins, using DAVID [[83]31] against the Gene Ontology, KEGG
   [[84]32], BBID [[85]33] and BIOCARTA [[86]34] databases to gain an
   understanding about the cellular functions and subcellular locations of
   these predicted salivary proteins, using the whole set of human
   proteins as the background. We noted that the most significantly
   enriched biological processes are immune response, antigen processing
   and presentation, cell adhesion, defense response, response to
   wounding, and inflammatory response. In addition, the most
   significantly enriched cellular components are extracellular region,
   membrane and MHC protein complex which all make biological sense (see
   [87]Table S5).

Application to breast cancer for identification of salivary biomarkers

   Based on a public transcriptomic dataset collected on breast cancer and
   matching control samples (see Materials and Methods), we identified
   1,502 consistently differentially expressed genes in breast cancer
   versus control tissue samples. We then used the gene expression data as
   an approximate protein-expression data here; and applied our trained
   classifier to these proteins and predicted 248 of them to be blood
   secretory using a prediction tool for blood secretory proteins that we
   previously developed [[88]8]. Out of these proteins, we predicted 31
   are movable to saliva. [89]Table 2 provides the detailed information of
   these 31 proteins as candidate salivary biomarkers for breast cancer.

Table 2. Proteins as candidate salivary biomarkers for breast cancer.

   Gene symbol UniProt ID  Manifold ranking Fold change
   F10         [90]P00742  195              0.667
   CFD         [91]P00746  227              0.593
   TIMP2       [92]P16035  241              0.573
   CCL14       [93]Q16627  297              0.595
   FBLN1       [94]P23142  324              0.663
   FBLN5       [95]Q9UBX5  336              0.542
   EFEMP2      [96]O95967  363              0.622
   IGF1        [97]P05019  394              0.525
   EFEMP1      [98]Q12805  439              0.474
   AZGP1       [99]P25311  440              1.563
   WISP2       [100]O76076 613              0.581
   CLEC3B      [101]P05452 720              0.570
   CD93        [102]Q9NPY3 724              0.586
   LEPR        [103]P48357 1034             0.638
   FABP5       [104]Q01469 1072             0.633
   IL6R        [105]P08887 1111             0.551
   ALCAM       [106]Q13740 1115             1.764
   MCAM        [107]P43121 1119             0.527
   CFB         [108]P00751 1148             1.647
   PDCD6       [109]O75340 1382             1.562
   BCHE        [110]P06276 1416             0.665
   DMBT1       [111]Q9UGM3 1531             0.632
   CD163       [112]Q86VB7 1539             0.628
   NCAM1       [113]P13591 1681             0.640
   LTF         [114]P02788 1693             1.583
   SRPX        [115]P78539 1959             0.556
   FBN1        [116]P35555 2192             0.625
   CFH         [117]P08603 2400             0.550
   VWF         [118]P04275 2518             0.537
   CD99        [119]P14209 2867             0.623
   TF          [120]P02787 2907             0.578
   [121]Open in a new tab

   As of now, very little data is available regarding salivary proteins
   that can be indicative of breast cancer. The only data we can get hold
   of is the salivary proteins considered by Streckfus et al. to be
   informative for diagnosing breast cancer [[122]27]. Their predicted
   list consists of 37 proteins given in [123]Table S6. We have compared
   our prediction of 31 proteins with this list, 4 of the 31 proteins are
   in their list [[124]27-[125]29], as shown in [126]Table 3, which has a
   p-value at 2.89e-7. The relatively low level of overlap between the two
   sets of predictions is not particularly surprising, which is consistent
   with previously published studies by different groups on blood
   biomarkers for different cancers. This is possibly caused by the
   differences in detailed conditions under which the biological samples,
   i.e., cancer tissues and saliva, are collected, as well as the
   less-than-perfect prediction methods employed, on top of the overall
   very challenging nature of the problem.

Table 3. Prediction Proteins used as salivary biomarkers for the detection
breast cancer.

   Not included in the training positive dataset
     __________________________________________________________________

   Accession Protein Name Ratio P-value Blood Secretory
   [127]Q01469 Epidermal fatty acid-binding protein 0.633 0.000257 Yes
   [128]P02788 Lactotransferrin 1.583 0.000244 Yes
   Included in the training positive dataset
   Accession Protein Name Ratio P-value
   [129]P02787 Transferrin 0.578 0.000013 Yes
   [130]P25311 Zinc-alpha-2-glycoprotein 1.563 0.000940 Yes
   [131]Open in a new tab

   We have also carried out a pathway and subcellular location enrichment
   analysis similar to that in the above. We noted that the most enriched
   biological processes by these 31 proteins are response to wounding,
   acute inflammatory response, cell adhesion, biological adhesion and
   immune response, which are all known to be involved in the development
   of or in defense of cancer. Besides, the most enriched cellular
   locations are extracellular region and cell surface ([132]Table S7).
   The most enriched pathways are complement and coagulation cascades, and
   the second enriched pathways are cell adhesion molecules (CAMs)
   ([133]Table S8).

Discussion and Conclusion

   A reliable prediction capability for proteins that can travel from
   circulation to saliva will represent a highly useful tool as it can
   provide a candidate list of biomarkers specific to a particular
   disease. This will allow targeted searches for effective biomarkers in
   saliva using antibody-based techniques, in comparison with the
   traditional search strategies by direct comparisons among proteomic
   data collected from saliva samples of multiple patients and healthy
   controls, which have proved to be ineffective in searches for
   biomarkers in blood [[134]8,[135]35] and urine [[136]36]. Here we
   demonstrated that it is possible to develop one such tool, which by no
   means represents the possibly most reliable tool for such a prediction.
   The key contribution of work is the proof of principle that we can
   possibly identify distinguishing features between proteins that can
   move to saliva from circulation and proteins that cannot get into
   saliva. In addition the identified features can also provide useful
   information to the mechanism studies of how proteins move between blood
   and saliva. In the future study, we hope that our method could be used
   in conjunction with the technology platforms for saliva diagnostics,
   and identify the definitive disease-associated salivary biomarkers.

Materials and Methods

Collecting salivary proteins coming from blood and generating negative
training data

   There is no existing dataset about proteins that can move from
   circulation to saliva. Proteins that have been found in both salivary
   proteome and blood proteome cannot serve this purpose since some of the
   salivary proteins may not come from circulation, instead are secreted
   from the salivary glands in response to other biomolecules that get
   into the glands from circulation. Therefore, we collected proteins that
   can move from circulation to saliva and have been experimentally
   validated and reported in the literature, such as IgA [[137]6,[138]37],
   albumin and Zn-alpha2-glycoprotein [[139]38]. 62 such proteins are
   found from the literature and used as the positive training data, shown
   in [140]Table S9. Considering the relatively small size of this
   positive training dataset, we added additional proteins from the same
   Pfam families of these 62 proteins with sequence similarities lower
   than 30% to our training set, assuming that proteins in the same Pfam
   family have the same properties in getting into saliva. To avoid the
   issue of over-representing any particular family, we limit to have at
   most five additional members per family, specifically the most distant
   five members of each of the 62 proteins. This gives rise to a total of
   261 proteins, which are used as the positive training data.

   Generating the negative training data is a challenge since our
   information about which proteins are movable or not is clearly
   incomplete at this point. We employed a method similar to that proposed
   by Cui et al. [[141]35] by choosing proteins from the Pfam families not
   containing any proteins that have been detected in saliva. For each
   such family, we choose five members as the negative training data. In
   addition, we keep only those with at least five peptides in the Plasma
   Proteome Project (PPP) database [[142]39], the largest human plasma
   protein database. As a result, 6,816 proteins are selected as the
   negative dataset.

Feature construction

   To train a classifier for proteins that are movable from circulation to
   saliva, we consider the following features, which can be grouped into
   four categories: (i) general sequence features such as sequence length,
   amino acid composition and di-peptide composition; (ii) physicochemical
   properties such as hydrophobicity, normalized Van der Waals volume,
   polarity, polarizability, charges, solubility, unfoldability and
   disordered regions; (iii) domains/motifs such as signal peptides,
   transmembrane domains and twin-arginine signal peptides motif (TAT);
   and (iv) structural properties such as secondary structural content and
   radius of gyration, totaling 34 features, represented by 1,523 feature
   elements. The details of these features are provided in [143]Table S1.

Feature selection and classification

   For each protein, we calculated a feature vector of 1,523 dimensions
   defined above. We first trained a classifier using all the 1,523
   feature values on the training data, and then applied a two-stage
   feature-selection procedure to remove those irrelevant and redundant
   features. A permutation test and q-value [[144]9] are used to identify
   and remove the irrelevant features. 10,000 permutations are generated
   and used to calculate the statistical significance on the relevance of
   individual feature elements to the prediction accuracy. Then, we used
   the approach proposed by Storey and Tibshirani [[145]9] to calculate
   the q-value, which is used to control the False Discovery Rate (FDR)
   [[146]40], in terms of the p-value obtained from the permutation test.
   We used 0.005 as the q-value cutoff to remove less relevant features,
   giving rise to 1,087 retained feature elements. In the second step, an
   improved feature selection method (SVM-RFE) that considers dependence
   relationships among features [[147]41] is applied to rank these
   features. Then we went through an iterative classification and feature
   removal procedure to have kept only 55 feature elements, which give
   essentially the same classification result as using the larger feature
   set.

   A SVM-based classifier is trained on the training data using the 55
   feature elements for each protein, and the output is 1 or -1
   representing if the input protein is movable to saliva or not. The
   following parameters are used to evaluate the prediction performance:
   recall, precision and the area under curve (AUC) of the
   recall-precision curve [[148]42], defined as follows:
   [MATH:
   <mrow><mrow><mtext>recall</mtext><mo>=</mo><mtext> </mtext><mfrac><mrow
   ><mi>T</mi><mi>P</mi></mrow><mrow><mi>T</mi><mi>P</mi><mo>+</mo><mi>F</
   mi><mi>N</mi></mrow></mfrac></mrow></mrow> :MATH]
   (1)
   [MATH:
   <mrow><mrow><mtext>precision</mtext><mo>=</mo><mtext> </mtext><mfrac><m
   row><mi>T</mi><mi>P</mi></mrow><mrow><mi>T</mi><mi>P</mi><mo>+</mo><mi>
   F</mi><mi>P</mi></mrow></mfrac></mrow></mrow> :MATH]
   (2)

   where TP is the number of true positives, FP refers to the number of
   false positives, and FN is the number of false negatives.

A method for ranking predicted salivary proteins

   We have also ranked the predicted salivary proteins using the manifold
   ranking algorithm as in our previous work [[149]8]. The essence of a
   manifold ranking algorithm [[150]43,[151]44] can be intuitively
   explained as follows: the problem is defined on two datasets, a true
   sample set and a background set. Our goal is to rank the individual
   members of the background set according to their relevance to the true
   samples. A weighted graph is used to represent the combined true and
   the background set, with each sample represented as a node of the graph
   and each pair of nodes being represented as an edge with a weight
   defined as the similarity between the two nodes in the feature space.
   Then an evidence propagation process starts, in which each true sample
   propagates its presence to its neighboring nodes to increase their
   relevance to the true sample set, where the increased relevance is
   valued proportionally to the corresponding edge weight in the graph. An
   overall relevance score of each node is summed over all the scores
   propagated to it from all the relevant true samples, by which elements
   in the background set can be ranked at the end. For our problem, the
   true sample set is the same as the positive training dataset defined in
   the previous section, and the background contains all the 20,209 human
   proteins in UniProt minus the positive set.

Identification of genes differentially expressed in breast cancer

   The microarray gene expression datasets [152]GSE15852 for 43 paired
   samples of breast cancer and adjacent normal tissues are downloaded
   from the GEO database of NCBI [[153]45]. For these samples, we applied
   t test and fold-change to identify differentially expressed genes in
   cancer versus control samples. The expression fold changes of each gene
   can be calculated using the following formula:
   graphic file with name pone.0080211.e003.jpg (3)

   where fc [i] is the ratio of the gene expression value on cancer sample
   versus control sample of gene i. c [ij] is the expression value of gene
   i of cancer sample in patient j, and n [ij] is the expression value of
   gene i of normal sample in patient j. m = 43 is the sample number. The
   fc [i] value is greater than one for up-regulated genes and less than
   one for down-regulated genes. To identify differentially expressed
   genes, we choose 1.5 as the threshold of fold change (1/1.5 for
   down-regulation). Then we can obtain the differentially expressed genes
   between cancer samples versus control samples.

P-value calculation for comparison of the ranking result with human saliva
biomarkers

   We calculated the statistical significance p-value assuming the
   underlying distribution for our problem follows a hypergeometric
   distribution [[154]46], i.e., the probability of selecting s tails in n
   draws without replacement from a finite population of size N coins each
   with an equal probability in selecting a head versus a tail containing
   exactly S tails, calculated as follows:
   [MATH:
   <mrow><mrow><mi>P</mi><mo>(</mo><mi>x</mi><mo>=</mo><mi>s</mi><mo>)</mo
   ><mo>=</mo><mfrac><mrow><mi>C</mi><mrow><mo>(</mo><mrow><mi>S</mi><mo>,
   </mo><mi>s</mi></mrow><mo>)</mo></mrow><mo>⋅</mo><mi>C</mi><mrow><mo>(<
   /mo><mrow><mi>N</mi><mo>−</mo><mi>S</mi><mo>,</mo><mi>n</mi><mo>−</mo><
   mi>s</mi></mrow><mo>)</mo></mrow></mrow><mrow><mi>C</mi><mrow><mo>(</mo
   ><mrow><mi>N</mi><mo>,</mo><mi>n</mi></mrow><mo>)</mo></mrow></mrow></m
   frac><mo>=</mo><mfrac><mrow><mrow><mo>(</mo><mrow><mtable><mtr><mtd><mi
   >S</mi></mtd></mtr><mtr><mtd><mi>s</mi></mtd></mtr></mtable></mrow><mo>
   )</mo></mrow><mrow><mo>(</mo><mrow><mtable><mtr><mtd><mrow><mi>N</mi><m
   o>−</mo><mi>S</mi></mrow></mtd></mtr><mtr><mtd><mrow><mi>n</mi><mo>−</m
   o><mi>s</mi></mrow></mtd></mtr></mtable></mrow><mo>)</mo></mrow></mrow>
   <mrow><mrow><mo>(</mo><mrow><mtable><mtr><mtd><mi>N</mi></mtd></mtr><mt
   r><mtd><mi>n</mi></mtd></mtr></mtable></mrow><mo>)</mo></mrow></mrow></
   mfrac></mrow></mrow> :MATH]
   (4)

   Where C(a, b)=a!/[b!(a−b)!], N is the number of human proteins, n is
   the number of the selected top proteins, S is the number of proteins
   used as salivary biomarkers, and s is the number of proteins that are
   among the 47 known salivary biomarkers and among the top n predicted
   candidate proteins. N is 20,209 and S is 47. [155]Table 1 shows the
   p-values, for different s and n.

Supporting Information

   Table S1

   A list of initial features for prediction of salivary proteins from
   blood circulation.

   (XLS)
   [156]Click here for additional data file.^ (17KB, xls)
   Table S2

   Features of blood-originated salivary proteins as selected by recursive
   feature elimination method.

   (XLS)
   [157]Click here for additional data file.^ (26.5KB, xls)
   Table S3

   A list of top 1000 blood-originated salivary proteins that ranked by
   manifold ranking method.

   (XLS)
   [158]Click here for additional data file.^ (91.5KB, xls)
   Table S4

   Salivary proteins that have been associated with human diseases.

   (XLS)
   [159]Click here for additional data file.^ (39KB, xls)
   Table S5

   Result of GO enrichment analysis among the top 1,000 ranked proteins.

   (XLS)
   [160]Click here for additional data file.^ (75KB, xls)
   Table S6

   A list of candidate up- and down-regulated salivary proteins in breast
   cancer.

   (XLS)
   [161]Click here for additional data file.^ (27.5KB, xls)
   Table S7

   Result of GO enrichment analysis among the 31 predicted proteins.

   (XLS)
   [162]Click here for additional data file.^ (44.5KB, xls)
   Table S8

   Result of Pathway enrichment analysis among the 31 predicted proteins.

   (XLS)
   [163]Click here for additional data file.^ (34KB, xls)
   Table S9

   A list of proteins that can move from circulation to saliva and have
   been experimentally validated and reported in the literature.

   (XLS)
   [164]Click here for additional data file.^ (32KB, xls)

Acknowledgments