Abstract Background Feature extraction (FE) is difficult, particularly if there are more features than samples, as small sample numbers often result in biased outcomes or overfitting. Furthermore, multiple sample classes often complicate FE because evaluating performance, which is usual in supervised FE, is generally harder than the two-class problem. Developing sample classification independent unsupervised methods would solve many of these problems. Results Two principal component analysis (PCA)-based FE, specifically, variational Bayes PCA (VBPCA) was extended to perform unsupervised FE, and together with conventional PCA (CPCA)-based unsupervised FE, were tested as sample classification independent unsupervised FE methods. VBPCA- and CPCA-based unsupervised FE both performed well when applied to simulated data, and a posttraumatic stress disorder (PTSD)-mediated heart disease data set that had multiple categorical class observations in mRNA/microRNA expression of stressed mouse heart. A critical set of PTSD miRNAs/mRNAs were identified that show aberrant expression between treatment and control samples, and significant, negative correlation with one another. Moreover, greater stability and biological feasibility than conventional supervised FE was also demonstrated. Based on the results obtained, in silico drug discovery was performed as translational validation of the methods. Conclusions Our two proposed unsupervised FE methods (CPCA- and VBPCA-based) worked well on simulated data, and outperformed two conventional supervised FE methods on a real data set. Thus, these two methods have suggested equivalence for FE on categorical multiclass data sets, with potential translational utility for in silico drug discovery. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0574-4) contains supplementary material, which is available to authorized users. Keywords: Unsupervised feature extraction, Principal component analysis, Variational Bayes, Posttraumatic stress disorder, Heart disease, In silico drug discovery, chooseLD, FAMS Background Feature extraction (FE) is an important task in bioinformatic analyses, as there are often more features than samples. The number of bases spanning linear space is at most equivalent to the number of independent vectors. Accordingly, more features than samples inevitably leads to redundancy. Although dimensional reduction is often used to eliminate redundancy, it is far from true redundancy elimination as reconstructed bases are usually the linear combination of all features, which is not always necessary for spanning entire linear space. Instead of dimensional reduction, FE can be used to eliminate redundancy, and is often performed to maximize performance of targeted tasks (supervised FE), e.g., discrimination between samples or regression analysis, although fewer samples than features often creates difficulties due to overfitting and/or bias. Multiple class samples commonly provide additional problems when supervised FE is used, complicating performance evaluations compared with two-class samples. Although pairwise evaluations (e.g., one versus one or one versus others) are possible, they are time-consuming. FE using a set of pairwise evaluations is often even more difficult, because it is hard to maximize performance simultaneously for all pairwise evaluations with commonly selected features for all pairs. In addition, supervised FE is often heavily sample-dependent, and alternative sample sets often provide alternative optimal FE. Moreover, with categorical classes the problems are greater, since frequently used FE with regression analyses, e.g., lasso [[29]1], cannot be directly applied to categorical multiclass data sets. In order to avoid these difficulties, unsupervised FE is useful as it is assumed to be more robust and stable, although this has not been extensively studied because of implementation difficulties, e.g., supervision is not based upon performance, therefore FE has nothing suitable for optimization. Variational Bayesian approaches eliminate redundancy in an unsupervised manner, and have recently been proposed as promising unsupervised FE methods. In this paper, we used variational Bayes (VB) principal component analysis (PCA) [[30]2,[31]3] to perform unsupervised FE. In addition, a simpler and more conventional PCA (CPCA)-based unsupervised FE method (CPCAFE) that worked well with a simulated data set was proposed as an alternative to VBPCA. As VBPCA is more computationally challenging, CPCAFE is an even more promising alternative unsupervised FE candidate than the VB approach. To demonstrate applicability to translational research, we used CPCA-based, and partially, VBPCA-based unsupervised FE (VBPCAFE) to examine heart disease associated with posttraumatic stress disorder (PTSD). PTSD [[32]4] is caused by exposure to high-magnitude, life-threatening stressors, i.e., traumatic events. PTSD patients typically develop acute stress responses that include symptoms of arousal, anxiety, sadness, grief, agitation, irritability, and sleep disturbances. Heart disease is associated with PTSD, with a meta-analysis showing association between coronary heart disease (CHD) and PTSD [[33]5]. Moreover, hospitalization due to cardiovascular disease is associated with PTSD caused by the September 11, 2001, World Trade Center disaster [[34]6]. PTSD was also associated with CHD using a prospective twin study design [[35]7]. However, the underlying genetic background of the association is not well known. By applying CPCAFE and VBPCAFE to publically available mRNA and microRNA (miRNA) expression data [[36]8] from stressed mouse hearts, we identified aberrantly expressed miRNAs and mRNAs. Biological feasibility of the identified miRNAs and mRNAs was determined (by negative correlation between miRNAs and mRNAs, and Kyoto Encyclopedia of Genes and Genomes (KEGG) [[37]9] pathway enrichment of miRNA target genes), showing suitability of our selections. Two supervised FE methods, specifically, regression analysis using categorical pseudo variables and backward elimination using Hilbert-Schmidt norm of the cross-covariance operator (BAHSIC), identified unstable and biologically less feasible sets of mRNAs and miRNAs that were not always negatively correlated, and suggested superiority of unsupervised FE methods. Among the aberrantly expressed genes identified by CPCAFE, fatty acid binding protein 3 (FABP3) was considered a potential drug target candidate by structural investigations, and inhibitory drugs were sought using the in silico drug discovery tool, chooseLD, a profile-based drug discovery program. Results and discussion FE methods applied to a simulated categorical multiclass data set We performed FE using a simulated categorical multiclass data set, to determine the limitations of FE and usefulness of our proposed methods. The data set consisted of 100 simulated ensembles of 20 samples with 100 features, of which only 10 features were distinct between four classes, and with each class consisting of 5 samples (see [38]Methods). As sample values of each feature were obtained from an identical mixture of four Gaussian distributions, FE is difficult but discrimination between four classes varies between three cases (shown in Figure [39]1; a typical feature boxplot with distinct expression among classes). We named these three cases as easy (s=2), medium (s=1), and hard (s=0.5). Furthermore, we did not use order between the four classes, so the data could be treated as a categorical data set. Figure 1. Figure 1 [40]Open in a new tab Boxplot of typical features with distinct values between four classes (1≤i≤5) in the simulated data set; s=2 (easy), 1 (medium), and 0.5 (hard). In order to demonstrate the limitations of FE, we first tested one vs one t test based FE (see [41]Methods). No significantly different features were detected between the four classes in any of the three cases (Figure [42]2; confusion tables are available in Additional file [43]1), demonstrating that FE on a categorical multiclass data set is difficult. Figure 2. Figure 2 [44]Open in a new tab Matthews correlation coefficients and F measure for various FE methods applied to the simulated data set.t test, one vs one t test based FE; CRP, categorical regression based FE (using adjusted P-values); CRR, categorical regression based FE (using ranked P-values); BAHSIC, backward elimination using Hilbert-Schmidt norm of the cross-covariance operator; VBPCAFE, variational Bayes principal component analysis based unsupervised FE; CPCAFE, conventional principal component analysis based unsupervised FE. Next, we applied a more sophisticated method, specifically, categorical regression based FE (using significance based on adjusted P-values, see [45]Methods). The number of correctly extracted features improved, with Matthews correlation coefficients of 0.95, 0.35, and 0.05, and F measures of 0.98, 0.15, and 0.005 for easy, medium, and hard cases, respectively (Figure [46]2; confusion tables are available in Additional file [47]1). However, for the hard case, both values are almost zero. In order to improve performance, we selected the top 10 ranked features with the smallest P-values, independent of significance (using P-value ranks, see [48]Methods). This resulted in Matthews correlation coefficients of 0.97, 0.61, and 0.22, and F measures of 0.97, 0.65, and 0.30 for easy, medium, and hard cases, respectively (Figure [49]2; confusion tables are available in Additional file [50]1). Thus, FE accuracy is reasonable, but selecting features using non-significant P-values is not ideal, highlighting the problem of using FE on a categorical multiclass data set. In order to overcome this, we next tested BAHSIC [[51]10] (see [52]Methods), a recently proposed non-parametric FE method that has been used on high-dimensional microarray data sets. The obtained Matthews correlation coefficients were 0.84, 0.49, and 0.18, and F measures were 0.86, 0.54, and 0.27 for easy, medium, and hard cases, respectively (Figure [53]2; confusion tables are available in Additional file [54]1). Although performance is relatively improved (without using highly ranked but non-significant features), BAHSIC does not use P-values, and again the results are not completely satisfactory, especially for the easy case: a Matthews coefficient of 0.84 is less than the 0.95 achieved by categorical regression based FE. Therefore, we next considered VBPCAFE. In order to perform FE with VBPCA, original VBPCA was extended to incorporate feature dependence (see [55]Methods). Prior distribution in this extension has feature-dependent parameters, and therefore automatic elimination of irrelevant features was expected. The [MATH: CBi 1 :MATH] histogram reflects the prior distribution parameter of the first principal component (PC) (Figure [56]3), with smaller values assumed to be irrelevant features. Originally, C [B] was introduced to eliminate irrelevant PCs, and therefore has only q (PC) and not i (feature) dependence. However, we needed to eliminate irrelevant features as well; therefore, in our study C [B] must also have i dependence. Although we did not use sample labelling information, as expected, irrelevant features had smaller [MATH: CBi 1 :MATH] . Averaged A [j1] (contribution of the jth sample to the first PC) over 100 ensembles (Figure [57]4), demonstrates that A [j1] represents the distinction between the four classes. Thus, features not coincident with A [j1] (and with smaller [MATH: CBi 1 :MATH] ) were eliminated, producing Matthews correlation coefficients of 0.91, 0.39, and 0.04, and F measures of 0.92, 0.46, and 0.14 for easy, medium, and hard cases, respectively (Figure [58]2; confusion tables are available in Additional file [59]1). Figure 3. Figure 3 [60]Open in a new tab Relationship between [MATH: Cbi :MATH] and features with distinct expression among four classes. Left column: histogram obtained from logarithmic [MATH: Cbi :MATH] for 100 independent ensembles. Red color indicates features with distinct expression between four classes. Right column: proportion of features with distinct expression between four classes in each bin. Top: s=2, easy; middle: s=1, medium; bottom: s=0.5, hard cases. Figure 4. Figure 4 [61]Open in a new tab A [j1] boxplot with distinct values between four classes in 100 independent ensembles of the simulated data set; s=2 (easy), 1 (medium), and 0.5 (hard). Although performance has been successfully improved for the easy case, VBPCAFE is computationally challenging and is a potential drawback to its use. Iteration involves updates proportional to the number of features, which can often be as many as several tens of thousands. Thus, a more computationally effective method is of value, and a relatively straightforward idea is to replace [MATH: CBiq :MATH] with B [iq], the qth PC score of ith feature, because a larger B [iq] is expected to result in relevant (thus larger) [MATH: CBiq :MATH] (for example, see eq. ([62]1) in [63]Methods). Indeed, replacing [MATH: CBi 1 :MATH] with B [1i], we obtained Matthews correlation coefficients of 0.88, 0.34, and 0.02, and F measures of 0.89, 0.41, and 0.12 for easy, medium and hard cases, respectively (Figure [64]2, confusion tables are available in Additional file [65]1). These values are comparable with those using [MATH: CBi 1 :MATH] . Considering that CPCAFE requires minimal computational resources, CPCAFE is a promising candidate for applying to real data that often consists of more than several thousand features. In summary, there are at least four methods that achieve comparable FE performance on a categorical multiclass data set: categorical regression based FE (using ranked P-values), BAHSIC, VBPCAFE, and CPCAFE. Each has its own advantages and disadvantages: although categorical regression based FE achieved the best overall performance, it used features with non-significant adjusted P-values. BAHSIC does not have this problem, but as it does not use P-values, its performance using the easy case was poorest of the tested methods. VBPCAFE and CPCAFE were more effective using the easy case, but their performance was poor for the hard case. VBPCAFE is computationally challenging, while CPCAFE is not. However, these conclusions may be viewed as slightly subjective. The methods recommended for the hard case (i.e., categorical regression based FE and BAHSIC) have strict label dependency. Harder class discrimination often means label uncertainty (or incorrectness). Of course, there are many objective labels (e.g., gender or age), but those related to experimental conditions often include subjective criterion or insufficient controls for experimental conditions. Hard class discrimination often originates from these subjective labels, and consequently, robustness to mislabeling is desired, especially for hard cases. In order to test robustness of the four methods towards partial mislabeling, we determined performance after mislabeling (Table [66]1). Figure [67]5 shows performance degradation caused by little, medium, and heavy partial mislabeling. Little, medium, and heavy were defined by the distance between true and wrong labels. The number of samples with wrong labels included only a proportion of the data, and never more than 20%. Although only 20% of samples were mislabeled, performance degradation with BAHSIC and categorical regression based FE was considerable for medium and heavy mislabeling. Conversely, VBPCAFE and CPNAFE performance were not affected by mislabeling as these two methods do not require labeling information. This indicates that the two methods recommended for hard cases are not particularly robust, even against partial mislabeling, and therefore not always recommended. Table 1. Artificial partial mislabeling introduced to the simulated data set to check robustness of FE (a) (b) (c) Corr. coef. 0.92 0.68 0.60 Class 1 2 3 4 1 2 3 4 1 2 3 4 1 4 1 0 0 4 0 1 0 4 0 0 1 2 1 4 0 0 0 4 0 1 0 4 1 0 3 0 0 4 1 1 0 4 0 0 1 4 0 4 0 0 1 4 0 1 0 4 1 0 0 4 [68]Open in a new tab Rows and columns represent true and modified labels. Numbers represent the number of samples with correct and modified labels. (a) little; (b) medium; and (c) heavy mislabeling. Correlation coefficients represent the amount of mislabeling (larger correlation coefficients correspond to less mislabeling). Figure 5. Figure 5 [69]Open in a new tab Matthews correlation coefficients and F measure for various FE methods applied to the simulated data set with partial mislabeling, indicated in Table [70]1. Correlation coefficients between mislabeled and true labeling were (a) 0.92, (b) 0.68, and (c) 0.60, as shown in Table [71]1. Other notations are the same as in Figure [72]2. In conclusion, CPCAFE or VBPCAFE are the best methods for FE from categorical multiclass data sets, as they maintain robustness and show relatively stable and good performance for most cases. Translational use of CPCAFE In order to determine the effectiveness of CPCAFE in a real application, we examined PTSD associated heart disease. Our data set consisted of multiclass categorical samples, including several experimental conditions with no pre-defined rank orders (see [73]Methods), and is suitable for estimating CPCAFE performance. Two-dimensional embedding of miRNA profiles are shown (Figure [74]6). Two probes (corresponding to mmu-miR-302c and -370) with extremely large values were deemed to be erroneous signals and excluded prior to embedding. Since each miRNA was attributed to multiple probes, removing these two probes did not mean exclusion of these miRNAs from the analysis. In addition, probes with relatively large PC2 projections were excluded and the 100 top-ranked outliers along PC1 selected. This strategy ensured selection of PC1 enhanced probes, as outliers along both PC1 and PC2 were not expected to be PC1 specific (PCs other than PC1 or PC2 were excluded, with our rationale discussed in Additional file [75]2). CPCAFE assumes that if a set of probes are outliers along a PC, they behave in a group oriented manner. If not, they are not outliers and are instead located on the origin, where the majority of probes without biological feasibility are presumed to be located. This criterion requires no user input and probe biological relevance (e.g., distinct expression between treatment and control samples) is unknown. Using these assumptions, we further investigated a selected 100 probes, representing 27 unique miRNAs, specifically, mmu-miR-451, -22, -133b, -709, -126-3p, -30c, -29a, -143, -24, -23b, -133a, -378, -30b, -29b, -125b-5p, -675-5p, -16, -26a, -30e, -1983, -691, -23a, -690, -207, and -669l, and mmu-let-7b and -7g. We investigated differential expression of these miRNAs between treatment and control samples using various statistical tests (e.g., t-test, Wilcoxon rank sum test, and Kolmogorov-Smirnov test), but obtained no significant results. However, owing to the small number of samples (four samples for treatment and control conditions), the significance criterion (i.e., P<0.05) was not satisfied by any miRNA examined, including the 27 selected miRNAs. Therefore, we compared the selected and other miRNAs as two separate groups. Boxplots of logarithmic P-values obtained by t tests between miRNAs in treatment and control samples are shown (Figure [76]7). Figure 6. Figure 6 [77]Open in a new tab Two-dimensional embedding of miRNA expression determined by PCA. Each dot represents a probe. Red dots indicate the 100 top-ranked probes with larger PC1 scores, selected as outliers based on the criterion of CPCAFE. A few probes with relatively large PC2 scores were excluded to allow selection of PC1 enhanced probes. See Additional file [78]2 for more detail. Figure 7. Figure 7 [79]Open in a new tab Boxplots of logarithmic P-values obtained from t tests comparing control and treatment samples.P-values < 0.5 indicate greater miRNA upregulation in treatment samples than controls (justification for using logarithmic P-values is available in Additional file [80]2). From left to right, the experimental conditions were 2, 5, and 10 days of stress and 1 day of rest; 5 days of stress and 10 days of rest; and 10 days of stress and 42 days of rest. Right-hand boxes (“selected”) are the 27 miRNAs identified by PCA-based unsupervised FE, and left-hand boxes (“others”) the remaining miRNAs. P-values shown above each plot were calculated using t tests to compare logarithmic P-values between selected and other miRNAs. P [>]s (P [<]s) is the rejection probability that rejects the null hypothesis (both sets have equal means) in favor of the alternative hypothesis (“others” have a larger or smaller mean than “selected”). Based on these criteria, selected miRNAs are upregulated in treatment samples with 2 days of stress and 1 day of rest, and downregulated in treatment samples for the other conditions. The 27 selected miRNAs have larger or smaller P-values than the other miRNAs, indicating that CPCAFE successfully selected differentially expressed miRNAs between treatment and control samples (justification for using logarithmic P-values is provided in Additional file [81]2). Although the selected miRNAs were expressed distinctly from the other miRNAs, this does not demonstrate biological relevance. As our aim was to investigate the underlying transcriptomic background of PTSD-mediated heart disease, preliminary validation of our approach can be provided by prior involvement of the selected miRNAs with heart disease. To address this, we performed literature searches (Table [82]2). Of the 27 selected miRNAs, 23 (excluding miR-1983, -691, -690, and -207) were previously reported to be related to heart disease, suggesting our miRNA selection is biologically useful. Table 2. Summary of studies with association between the 27 selected miRNAs and heart disease miRNAs Ref. Description miR-451 [[83]39] Upregulated in heart due to ischemia miR-22 [[84]40] Elevated serum levels in patients with stablechronic systolic heart failure miR-133 [[85]41] Downregulated in transverse aortic constrictionand isoproterenol-induced hypertrophy miR-709 [[86]42] Upregulated in rat heart four weeks after chronicdoxorubicin treatment miR-126 [[87]43] Association with outcome of ischemic andnonischemic cardiomyopathy in patients withchronic heart failure miR-30 [[88]44] Inversely related to CTGF in two rodent modelsof heart disease, and human pathological leftventricular hypertrophy miR-29 [[89]45] Downregulated in the heart region adjacent toan infarct miR-143 [[90]46] Molecular key to switching of the vascular smoothmuscle cell phenotype that plays a critical role incardiovascular disease pathogenesis miR-24 [[91]47] Regulates cardiac fibrosis after myocardial infarction miR-23 [[92]48] Upregulated during cardiac hypertrophy miR-378 [[93]49] Cardiac hypertrophy control miR-125 [[94]50] Important regulator of hESC differentiation to cardiacmuscle(potential therapeutic application) miR-675 [[95]51] Elevated in plasma of heart failure patients let-7 [[96]52] Aberrant expression of let-7 members incardiovascular disease miR-16 [[97]53] Circulating prognostic biomarker in critical limbischemia miR-26 [[98]54] Downregulated in a rat cardiac hypertrophy model miR-669 [[99]55] Prevents skeletal muscle differentiation in postnatalcardiac progenitors [100]Open in a new tab To further confirm biological suitability of the identified miRNAs, we examined KEGG pathway enrichment using miRNA target genes (see [101]Methods). Associations between enriched KEGG pathways and heart disease are summarized (Table [102]3). Next, we examined the 21 top-ranked pathways (the complete list is available in Additional file [103]3), with 17 found to be related to heart disease. Overall, these findings validate both our selection of miRNAs, and utility of CPCAFE. Table 3. Summary of studies with association between heart disease and KEGG pathways enriched by miRNA target genes KEGG pathway Enrichment P -value Ref. Description Axon guidance 3.61×10^−14 [[104]56] Axon guidance of sympathetic neurons to cardiomyocytes is a promising target for regulation of cardiac function in diseased hearts Colorectal cancer 1.92×10^−10 [[105]57] Significant association between colorectal neoplasm and coronary artery disease Chronic myeloid leukaemia 1.92×10^−10 [[106]58] Imatinib mesylate (a therapeutic agent for chronic myeloid leukaemia) has cardiotoxicity Glutamatergic synapse 3.35×10^−9 — — Hepatitis B 5.86×10^−9 [[107]59] Hepatitis B virus causes varied forms of heart disease Pancreatic cancer 6.60×10^−9 — — Acute myeloid leukaemia 2.54×10^−8 [[108]60] Cause of acute ischemic heart disease Focal adhesion 6.65×10^−8 [[109]61] Focal adhesion kinase deletion attenuates pressure overload-induced hypertrophy MAPK signaling pathway 3.50×10^−7 [[110]62] Plays an important role in cardiac and vascular disease pathogenesis Endometrial cancer 4.55×10^−7 [[111]63] Cardiovascular disease is the leading cause of death among endometrial cancer patients Chagas disease 6.33×10^−7 [[112]64] Association with heart disease T cell receptor signaling pathway 9.82×10^−7 — — ErbB signaling pathway 1.01×10^−6 [[113]65] ErbB-signaling pathway proteins are potential drug targets for heart failure treatment Prostate cancer 2.47×10^−6 [[114]66] Coronary artery disease and prostate cancer are both common diseases sharing many risk factors Neurotrophin signaling pathway 3.08×10^−6 — — Toxoplasmosis 3.16×10^−6 [[115]67] Toxoplasmosis is a cause of heart disease Bacterial invasion of epithelial cells 3.84×10^−6 — — TGF-beta signaling pathway 7.35×10^−6 [[116]68] Upregulated in infarcted myocardium Nonsmall-cell lung cancer 9.00×10^−6 — — VEGF signaling pathway 1.17×10^−5 [[117]69] A VEGF inhibitor has cardiotoxicity Dopaminergic synapse 2.19×10^−5 [[118]70] Dopamine agonists affect the cardiovascular system [119]Open in a new tab Moreover, our results suggest that the aberrantly expressed miRNAs are likely involved in PTSD-mediated heart disease. Hence, further investigation of the miRNA target genes may clarify the underlying molecular biology and transcriptomic background of PTSD-mediated heart disease. In this regards, we applied PCA to mRNA expression. Two-dimensional embedding of mRNA expression is shown (Figure [120]8). To determine correlation between miRNA and mRNA expression, we compared their expression profiles and determined the contribution of each sample to PC1 (Figure [121]9(a)), showing negative correlation between PC1s. Additionally, to determine if the observed correlation is coincident with experimental conditions, scatterplots of averaged values within each condition were examined (Figure [122]9(b)). This strengthened the correlations but did not change significance, indicating that the observed negative correlation between mRNA and miRNA expression is reliable. Next, we selected mRNAs using CPCAFE. To select PC1 enhanced probes, the 100 top-ranked outliers along PC1 were selected, excluding probes with relatively large projections to PC2. In total, 59 unique mRNAs were identified (RefSeq mRNA IDs are provided in Additional file [123]4). To confirm negative correlation between miRNAs and miRNA target mRNAs, correlation coefficients were determined (see Additional file [124]4). There were no targets common to the selected 59 mRNAs and TarBase, employed by DNA intelligent Analysis (DIANA)-mirpath [[125]11] and used in KEGG pathway analysis (see [126]Methods), possibly because of TarBase’s experiment-oriented, thus context-dependent, nature (i.e., TarBase does not include PTSD). Thus, instead we used seed matching to identify miRNA target genes, with so called 7mer-m8 [[127]12] detecting exact matches to positions 2-8 of mature miRNAs (seed + position 8). Among the 59 mRNAs, 24 were targeted by at least one of the 27 selected miRNAs. In addition, 47 pairs of miRNAs and miRNA target genes were identified. In total, there were 45/47 negative correlation coefficients between miRNAs and miRNA target genes. We also examined correlation coefficient significance (see [128]Methods), with 26/47 pairs (more than half) associated with significant correlations (two positive correlations were judged insignificant), and confirming negative correlation between miRNAs and miRNA target genes. Figure 8. Figure 8 [129]Open in a new tab Two-dimensional embedding of mRNA expression by PCA. Notations are the same as in Figure [130]6. See Additional file [131]2 for more detail. Figure 9. Figure 9 [132]Open in a new tab Comparison of sample contributions to PC1 between mRNA and miRNA.(a) Scatterplot comparing mRNA and miRNA expression profiles for sample contribution to PC1 (i.e., components of the first loading vector). Pearson’s correlation coefficient =−0.37 (P=0.01). (b) Averaged contributions within each condition. Pearson’s correlation coefficient =−0.69 (P=0.01). XY-Zd with X = C (control); X = T (treatment); Y, stress days; and Z, rest days in experimental conditions. See Additional file [133]2 for more detail. In order to determine if the selected 59 mRNAs were differentially expressed between control and treated samples, we examined the logarithmic ratio. Using t tests, the averaged logarithmic ratio of the 59 samples was significantly negative or positive compared with that averaged over other mRNAs, excluding only one condition (see Table [134]4). Thus, our results show that the 59 selected mRNAs are mostly distinctly expressed between control and treated samples. Table 4. P -values calculated by t tests from logarithmic ratios between treated and control samples C2-1d C5-1d C10-1d C5-10d C10-42d P -values vs vs vs vs vs T2-1d T5-1d T10-1d T5-10d T10-42d Control < treated 6.9×10^−4 7.4×10^−8 1.0 0.22 0.03 Control > treated 1.0 1.0 2.35×10^−8 0.78 0.97 [135]Open in a new tab Smaller P-values indicate that the difference in mean logarithmic ratio is more significant in the selected 59 mRNAs than that in the other mRNAs. For descriptions of each experimental condition, see the caption for Figure [136]9. For more methodological detail, see Additional file [137]2. Although the selected mRNAs include many targeted and negatively regulated by miRNAs, and are differentially expressed between control and treated samples, published associations between the selected mRNAs and heart disease would provide direct biological relevance. Association between the identified genes and heart disease are summarized (Table [138]5). In total, 9 genes were both targeted by at least one of the 27 miRNAs and related to heart disease. In addition, 17 genes were related to heart disease but not targeted by any of the 27 miRNAs (see Additional file [139]5). Among the genes not associated with heart disease, 15 genes were targeted by at least one of the 27 miRNAs, and only 18 genes were not targeted by any. Because approximately half (26 in total) of the selected 59 genes are related to heart disease, our gene selection is feasible from a biological point of view. Table 5. Summary of studies with association between heart disease and genes selected by CPCAFE Gene RefSeq mRNA miRNAs Heart disease association ( P -values from the Gendooserver) Fabp3 [140]NM_010174 miR-709 Cardiomegaly (0.031) Cox4i1 [141]NM_009941 miR-709 Cardiac output, low (5×10^−4) Atp5g1 [142]NM_001161419 miR-29a/b-3p, Cardiomyopathies (1.5×10^−3) miR-16 Cox6a2 [143]NM_009943 miR-23a/b-3p Heart disease (1.3×10^−3); Cardiomyopathies (2.3×10^−3);Heart failure (5.0×10^−3) Aldoa [144]NM_001177307 miR-16 Cardiomyopathy, dilated (2.0×10^−3) Cox5a [145]NM_007747 miR-26a-5p Cardiomyopathies (5.4×10^−3); Myocardial ischemia(7.4×10^−4) Myl2 [146]NM_010861 miR-1983 Heart defects, congenital (4.9×10^−48); Cardiomyopathy, dilated (5.2×10^−21); Cardiomegaly (1.1×10^−17); Heart failure (2.1×10^−9) Myl3 [147]NM_010859 miR-691 Heart defects, congenital (1.1×10^−7); Cardiomegaly(1.2×10^−4); Myocardial infarction (2.4×10^−4); Heart septal defects, ventricular (6.9×10^−4) Tcap [148]NM_011540 miR-207 Cardiomyopathy, dilated (1.1×10^−8); Cardiomyopathy, hypertrophic (1.7×10^−3); Heart defects, congenital(6.0×10^−3); Heart failure (1.2×10^−2) [149]Open in a new tab Finally, we compared the selected 59 genes with KEGG pathways. KEGG pathway analysis had been performed for the miRNAs, but as they are not the only mRNA regulatory mechanism, mRNA expression may be, at least partially, distinct from miRNA expression. We found many genes were concentrated to specific KEGG pathways (see Additional file [150]6). For example, Uqcrh, Uqcrq, Cox6a2, Cox7a1, Cox5a, Cox4i1, Cox8b, Cox6c, Myl2, Myl3, Myh6, Myh8, Tpm1, Tnni3, and Tnnt2 belong to the KEGG pathway “Cardiac muscle contraction” (mmu04260), and most are also components of cytochrome c oxidase, involved in mitochondrial proton transfer. Thus, it is biologically reasonable that heart disease is associated with aberrant expression of these genes. Mybpc3, Tnni3, and Tnnt2 belong to the KEGG pathway “Hypertrophic cardiomyopathy” (mmu05410), also coincident with a role in heart disease. Atp5g1, Atp5b, Atp5g3, Atp5h, Atp5a1, Atp5e, Atp5j2, Ndufa13, and Ndufs6 belong to “Oxidative phosphorylation” (mmu00190 + 11951). This KEGG pathway is an essential part of mitochondrial energy metabolism, and malfunction is likely to be directly related to muscle functionality. Mdh2 and Aco2 belong to the “Citrate cycle” (mmu00020), and are involved in energy production, while Fabp3 belongs to the “peroxisome proliferator-activated receptor (PPAR) signaling pathway” (mmu03320), involved in muscle function. Thus, based on these KEGG pathway functions, the genes identified by CPCAFE are related to heart disease. In order to further confirm the validity of our methodology, i.e., integrated analysis of mRNA and miRNA expression, we have also performed KEGG pathway analysis by the Database for Annotation, Visualization and Integrated Discovery (DAVID) [[151]13] restricted to 24 genes targeted by at least one of 27 miRNAs (see [152]Methods). DAVID identified five KEGG pathways associated with significant P-values adijutsted by Benjamini and Hochberg (BH) critetion (Table [153]6). Two out of five were related to cradiadic diseases (“Cardiac muscle contraction” and “Oxidative phosphorylation”[[154]14]) and three were related to neurodegenerative diseases (“Parkinson’s disease”, “Alzheimer’s disease” and “Huntington’s disease”). Thus, these are very coincident with those causing PTSD mediated heart diseases. Figure [155]10 summarizes the analyses performed in the above. Table 6. KEGG pathway enriched by 24 mRNAs targeted by 27 miRNAs, identified by DAVID KEGG pathway Number of genes % P -values Adjusted P -values Cardiac muscle contraction 7 30.4 2.30E-09 3.20E-08 Parkinson’s disease 7 30.4 5.80E-08 4.10E-07 Oxidative phosphorylation 6 26.1 2.30E-06 1.10E-05 Alzheimer’s disease 6 26.1 1.20E-05 4.20E-05 Huntington’s disease 6 26.1 1.20E-05 3.50E-05 [156]Open in a new tab Number of genes are genes included in pathway, % is the ratio genes included in pathway among 24 genes. P-values and those adjusted by BH criterion were provided by DAVID. Figure 10. Figure 10 [157]Open in a new tab Schematic that illustrates biological validations towards 27 miRNAs and 59 mRNAs. Our findings suggest that PTSD-mediated heart disease may be caused by malfunction in “Cardiac muscle contraction” and/or “Oxidative phosphorylation” of energy metabolism. These malfunctions are related to aberrant gene expression likely mediated by aberrant miRNA expression. From a therapeutic point of view, PTSD-mediated heart disease may be treated based upon this knowledge. To perform in silico drug discovery for these genes, the 26 genes associated with heart disease were investigated. First, tertiary structures were predicted (see [158]Methods), and found to be similar to predicted or experimentally determined tertiary structures available in the Protein Data Bank (PDB), indicating our tertiary structures are reliable (see Additional file [159]7). Among the proteins with tertiary structures predicted or available in PDB, FABP3 has a “pocket” to which inhibitors can bind, and was therefore selected as a candidate drug target for in silico drug discovery. FABP3 is an acid binding protein and its function can be blocked by inhibition of acid binding. Moreover, FABP3 is upregulated in patients with ventricular-septal defects in comparison to normal controls [[160]15]. FABP3 is also a member of FABPs that play critical roles in the PPAR signaling pathway identified above. Thus, it is likely that FABP3 is a key protein in PTSD-mediated heart disease. The 10 top-ranked drug candidate compounds obtained by in silico drug discovery (see [161]Methods) are listed (Table [162]7). The complete list of compounds ranked by FPAScores is available in Additional file [163]8. The compounds include promising drug candidates, for example, two heat shock protein 90 (HSP90) inhibitors are upregulated in dilated cardiomyopathy [[164]16], and two PPAR inhibitors are regarded as pharmacological therapeutic targets [[165]17]. Furthermore, overexpression of two Cyclin-dependent kinase 2 (CDK2) inhibitors results in smaller mononuclear cardiomyocytes [[166]18]. All of these candidates should be investigated further. Table 7. Top ranked FABP3 inhibitor compounds identified by in silico drug discovery Rank FPAScore Drug name DrugBank/CHEMBL No. Target/Activity reported in DrugBankand CHEMBL 1 907.4 Oxaprozin DB00991/CHEMBL1071 PTGS1 (Cox1) & PTGS2 (Cox2) inhibitor 2 901.9 — DB08539/CHEMBL249736 Inhibits human CDK2 3 866.2 — DB06964/CHEMBL252124 HSP90 β binding affinity 4 864.4 — DB08702/— Metallo- β-lactamase L1* 5 826.7 — DB08396/— Ig heavy chain V-III region CAM* 6 809.3 — DB06908/CHEMBL209749 PPAR α/γ binding affinity 7 790.5 — DB08483/CHEMBL47590 Agonist activity at human PPAR δ/γ/α; CDK2/4 inhibition 8 773.8 Flavoxate DB01148/CHEMBL1493 CHRM1/2 antagonist; PDE 4/7/8 inhibitor 9 771.6 — DB07594/CHEMBL399530 Inhibits HSP90 activity 10 761.2 Rolitetracycline DB01301/CHEMBL1237046 Inhibitor of 30S ribosomal protein S9 & 16S rRNA; Inhibits synthetic amyloid β-42 fibrillization [167]Open in a new tab *pharmacological action unknown. Stability and comparison with other methods We have shown biological feasibility of CPCAFE when applied to PTSD. However, it is also important to compare its performance against other methods, as CPCAFE superiority is doubtful if any other method can achieve comparable performance. Hence, we used a modified data set to ensure performance is due to the method and not the data set. If slight modifications of the sample data set drastically decrease performance, CPCAFE superiority is doubtful. In addition, VBPCAFE performance on the PTSD data set is unknown. If VBPCAFE derives completely different outcomes from CPCAFE, proposing CPCAFE as an alternative to VBPCAFE (with a firmer theoretical base, albeit a more time-consuming method) would be less convincing. First, we examined equivalence between CPCAFE and VBPCAFE. VBPCAFE is too time-consuming to be directly applied to the whole PTSD data set, therefore we created a data set small enough to be used directly that consisted of 200 features, with 100 features selected by CPCAFE and 100 distinct features. If VBPCAFE and CPCAFE are equivalent, the 100 features selected by CPCAFE should have larger [MATH: CBi 1 :MATH] s than those attributed to the other 100 features (see [168]Methods). Feature frequencies selected from the 100 top-ranked probes with larger [MATH: CBi 1 :MATH] values after investigation of 100 independent ensembles are shown (Figure [169]11). Scatterplots between [MATH: CBi 1 :MATH] and B [i1] are also shown (Figure [170]12). As expected (from eq. ([171]1) in [172]Methods), [MATH: CBi 1 :MATH] is quadratically dependent upon B [i1], definitely demonstrating equivalence between CPCAFE and VBPCAFE when applied to a real data set. Of course, there is still the possibility that VBPCAFE applied to the whole PTSD data set would select a completely different set of features from the 100 features selected by CPCAFE. However, we believe it is unlikely as there are almost no features not selected by CPCAFE, within the 100 top-ranked in any of the 100 ensembles. Thus, CPCAFE and VBPCAFE show equivalence when applied to not only simulated data, but also a real data set. Figure 11. Figure 11 [173]Open in a new tab Frequency of 100 probes selected by CPCAFE and also identified by VBPCAFE within the 100 top-ranked probes with largest [MATH: CBi 1 :MATH] values over 100 independent ensembles.(a) miRNA; and (b) mRNA. Red and black circles correspond to probes selected or not selected, respectively, by CPCAFE. Figure 12. Figure 12 [174]Open in a new tab Scatterplots between B [i1] (horizontal axis) and [MATH: CBi 1 :MATH] (vertical axis). Red and black open circles correspond to 100 features selected or not selected, respectively, by CPCAFE. Solid green lines indicate quadratic regression lines. (a) miRNA; and (b) mRNA. Next, we examined CPCAFE stability, i.e., sample independence, by performing a 4-fold cross-validation study (see [175]Methods). We specifically used a 4-fold cross-validation as there are only four biological replicates in each experimental condition, therefore using any other cross-validation would not be straightforward. The results from 100 independent ensembles are shown (see Additional file [176]9). For mRNAs, 78 probes were always selected by CPCAFE and only 10 probes not always selected, demonstrating high sample independence. Among the 78 probes always selected, 53 had associated RefSeq IDs that were part of the 59 RefSeq IDs selected by applying CPCAFE to the whole data set. For miRNAs, 27 probes were always selected and no probe not always selected (see Additional file [177]9). Among the 27 selected mature miRNAs, 25 were part of the 27 mature miRNAs selected by CPCAFE previously. Thus, CPCAFE selected almost all features 100% of the time, demonstrating stability and suggesting that performance is not likely due to the selected data set. As CPCAFE and VBPCAFE show potential equivalence, it is likely that VBPCAFE also shares this stability. Third, we determined if other conventional supervised, and thus possibly sample dependent, FE methods can achieve a comparable performance to CPCAFE. As there are as many as 12 experimental setups without any pre-defined rankings or orderings, it is not straightforward to use a popular multiclass oriented FE with regression analyses, e.g., lasso [[178]1]. Not only is there no way to attribute labels with actual values to each experimental setup, but it is also not known if it is possible, in principal, to align these 12 experimental conditions in one dimensional order. Thus, the number of usable supervised FE methods that can be tested on the present PTSD data set is limited. Nevertheless, we identified and examined two methods. The first was regression analysis, which attributes a 0 or 1 to each experimental setup with a linear combination assumed to represent mRNA/miRNA expression (categorical regression-based FE, see [179]Methods). Each feature is then ranked by significance (small P-values are attributed to regression), and the 100 top-ranked features selected. Using this method, we identified 100 mRNA/miRNA probes and investigated biological feasibility and stability of selected ones. For mRNAs, among the 100 selected probes, only 23 had associated RefSeq IDs (see Additional file [180]4), a smaller number than CPCAFE, suggesting that CPCAFE more readily identifies biologically important genes (with the underlying assumption that biologically important genes are more likely to have RefSeq IDs). Disease association of the selected genes was also poorer than those selected by CPCAFE (Table [181]8). Only 4/23 genes (Ttn, Ubr2, Gata5, and Hs2st1) were associated with heart failure-related diseases, in contrast with the 26 heart disease associated genes (out of 59 genes) selected by CPCAFE. Even extending the target species to human, only one additional gene (Cxc11) was identified. Altogether, these results show that CPCAFE has greater power for identifying biologically feasible genes. For miRNAs, among the 100 selected probes, 77 were identified with unique mature miRNA names (see Additional file [182]4), a much larger number than CPCAFE. Again, this demonstrates the biological feasibility of CPCAFE, as each miRNA is attributed to multiple probes and should therefore have been selected simultaneously. Thus with miRNAs, a smaller number of identified unique mature miRNA names reflects more plausible FE. The identified miRNAs were uploaded to DIANA-mirpath for KEGG pathway analysis (see Additional file [183]3), and 74 KEGG pathways identified (66 with CPCAFE), although the number of miRNAs related to each pathway is much smaller: using CPCAFE, on average seven miRNAs contributed to each pathway, while with categorical regression-based FE, only four miRNAs contributed, despite almost three times larger (77 vs 27) miRNAs uploaded (P = 1.67×10^−17, difference between mean values computed by t test). This suggests that categorical regression-based FE is less able to identify biologically important miRNAs than CPCAFE. We also determined if miRNA target mRNAs were negatively correlated with miRNA expression, and identified 180 mRNA-miRNA pairs but with only 35 pairs negatively correlated (see Additional file [184]4). This again contrasts with CPCAFE, with almost all mRNA-miRNA pairs negatively correlated. Correlation coefficient significance was examined, and only 20 significantly correlated pairs were identified. Since all significant correlation coefficients were negative, categorical correlation-based FE can extract correct features, but is less able to confidently screen more features than CPCAFE; more than half of the pairs identified by CPCAFE were associated with significant negative correlations. This is consistent with the limited number of miRNAs selected by categorical regression-based FE contributing to KEGG pathway analysis. Stability was also compared with CPCAFE (see Additional file [185]9). For mRNAs, only 24 probes were always selected among 100 cross-validation ensembles, with 122 probes selected only once. For miRNAs, only eight probes were always selected (see Additional file [186]9). The next highest frequencies selected were 93 and 95, corresponding to only two probes each. Conversely, 33 and 29 probes were selected only once and twice, respectively. Thus, with regards to stability, CPCAFE outperforms FE with categorical regression analysis. Table 8. Disease association of mRNAs selected by categorical regression-based FE and BAHSIC RefSeq mRNA Gene Heart disease association ( P -values: mouse, human) Categorical regression-based FE [187]NM_011652 Ttn Cardiomyopathy, dilated (5.49×10^−7, 1.80×10^−10); Cardiomyopathies (1.5×10^−4, 9.19×10^−5); Cardiomyopathy, hypertrophic (5.6×10^−3, 4.74×10^−10); Heart defects, congenital (1.90×10^−2, —); Heart failure (—, 1.02×10^−5); Cardiomyopathy, hypertrophic, familial (—, 1.02×10^−5) [188]NM_019494 Cxcl11 Cardiomyopathy, hypertrophic (—, 3.05×10^−2) [189]NM_001177374 Ubr2 Heart Defects, congenital (8.64×10^−4, —); Cardiomegaly (1.46×10^−3, —) [190]NM_008093 Gata5 Heart defects, congenital (1.40×10^−7, —); Heart septal defects (4.12×10^−3, —); Hypertrophy, right ventricular (1.36×10^−4, —); Cardiovascular abnormalities (1.49×10^−3, —); Cardiomyopathy, dilated (8.14×10^−3, —); Cardiomyopathies (9.27×10^−3, —); Cardiomegaly (1.75×10^−2, 7.60×10^−3) [191]NM_011828 Hs2st1 Cardiomyopathies (3.87×10^−3, —); BAHSIC [192]NM_009722 Atp2a2 Cardiomegaly (7×10^−21, 1.33×10^−3); Cardiomyopathy, dilated (2.68^−12, 7.37^−9); Cardiomyopathies (5.87×10^−12, 9.69 × 10^−4); Heart Failure (6.50 × 10^−12, —); Cardiac Output, low (2.01 × 10^−7, —); Arrhythmias, cardiac (8.85 ×10^−7, —); Hypertrophy, left ventricular (3.03×10^−6, 3.38×10^−2); Myocardial reperfusion injury (2.33×10^−5, —) Myocardial stunning (3.42×10^−3, —); Atrial fibrillation (5.42×10^−3, —); Myocardial infarction (6.11×10^−3, —); Heart disease (2.38×10^−2, 7.00×10^−4); Ventricular dysfunction, left (2.74×10^−2, 3.4×10^−4); Cardiomyopathy, restrictive (—, 2.12×10^−3); Myocardial ischemia (—, 2.91×10^−3); Mitral valve insufficiency (—, 3.68×10^−3); Cardiomyopathy, hypertrophic (—, 3.49×10^−2) [193]NM_001164171 Myh6 (Already identified by CPCAFE) [194]NM_008725 Nppa Heart Defects, congenital (2.53×10^−55, 9.09×10^−9); Cardiomegaly (1.11×10^−28, —); Cardiomyopathies (7,57×10^−17, —); Hypertrophy, left ventricular (9.16×10^−12, 1.19×10^−13); Cardiomyopathy, dilated (6.47×10^−11, 2.41×10^−8); Heart disease (9,72×10^−8, 1.11×10^−4); Endocardial cushion defects (3.51×10^−6, —); Heart septal defects (4.63×10^−6, —); Cardiovascular abnormalities (6.34×10^−5, —); Tachycardia, ectopic atrial (1.97×10^−4, —); Arrhythmias, cardiac (4.35×10^−4, —); Cardiomyopathy, hypertrophic, familial (4.90×10^−3, —); Heart septal defects, ventricular (5.70×10^−3, —); Hypertrophy, right ventricular (1.04×10^−2, —); Heart failure (1.04×10^−2, 5.07×10^−29); Cardiomyopathy, hypertrophic (2.20×10^−2, 4.40×10^−2) Ventricular dysfunction, left (—, 3.04×10^−10); Myocardial reperfusion injury (— 1.76×10^−7); Cardiovascular disease (—, 4.21×10^−6); Myocardial infarction (—, 9.73×10^−6); Mitral valve insufficiency (—, 1.04×10^−5); Heart valve disease (—, 3.27×10^−5); Myocardial ischemia (—, 1.49×10^−4); Ventricular dysfunction (—,.198×10^−3); Cardiomyopathy, restrictive (—, 2.68×10^−2); Ventricular dysfunction, right (—, 3.53×10^−3) [195]NM_008103 Gcm1 Heart defects, congenital (6×10^−3, —) [196]NM_009608 Actc1 Heart defects, congenital (2.54×10^−38, —); Cardiomyopathy, dilated (6.55×10^−9, 2.16×10^−14); Heart septal defects (5.15×10^−7, 6.80×10^−4); Arrhythmias, cardiac (4.90×10^−5, —); Cardiomyopathies (2.75×10^−4, 1.23×10^−2); Heart septal defects, atrial (1.33×10^−3); Cardiovascular abnormalities (3.85×10^−2, —); Cardiomegaly (4.45×10^−2, 1.01×10^−4); Cardiomyopathy, hypertrophic (—, 4.91×10^−13); Cardiomyopathy, hypertrophic, familial (—, 1.97×10^−6); Cardiomyopathy, restrictive (—, 5.87×10^−4); Heart septal defects, atrial (—, 1.73×10^−3); Hypertrophy, left ventricular (—, 1.09×10^−2) [197]NM_013468 Ankrd1 Heart defects, congenital (3.98×10^−12, 4.70×10^−5); Cardiomegaly (2.52×10^−6, 1.00×10^−2); Cardiomyopathies (9.11×10^−5, —); Heart septal defects (6.18×10^−4, —); Cardiomyopathy, hypertrophic, familial (9.66×10^−4, —); Cardiomyopathy, dilated (1.22×10^−2, 1.22×10^−2); Heart failure (—, 3.04×10^−4); Hypertrophy, left ventricular (—, 7.55×10^−3) [198]NM_011540 Tcap (Already identified by CPCAFE) [199]NM_177369 Myh8 (Already identified by CPCAFE) [200]NM_007450 Slc25a4 (Already identified by CPCAFE) [201]NM_010174 Fabp3 (Already identified by CPCAFE) [202]NM_008084 Gapdh (Already identified by CPCAFE) [203]NM_019494 Cxcl11 Already identified by categorical regression based FE [204]NM_013463 Gla Cardiomyopathy, hypertrophic (—, 2.46×10^−2); Heart disease (—, 2.64×10^−2); Cardiomyopathies (—, 3.10×10^−2) [205]NM_001038592 Glrx2 Myocardial reperfusion injury (—, 1.56×10^−3) [206]Open in a new tab P-values obtained from the Gendoo server for both mouse and human. Genes annotated as “Already identified by CPCAFE” are listed in Table [207]5. It is possible that CPCAFE outperforms FE with categorical regression analysis because it is too simple (naive) an alternative. Therefore to compare CPCAFE performance with a more sophisticated method, we used BAHSIC (see [208]Methods). For mRNAs, among the 100 selected probes, only 37 had RefSeq IDs, again less than CPCAFE (see Additional file [209]4). Of these 37 mRNAs, only 16 were associated with heart failure-related disease (Table [210]8), even when the target species was extended to human. Thus, 43% of extracted RefSeq associated mRNAs are related to heart disease in the Gendoo server. This is similar to the 45% reported for genes selected by CPCAFE. Considering that the total number of selected RefSeq associated mRNAs is larger, this suggests that CPCAFE outperforms BAHSIC as well as categorical regression based FE. Interestingly, 15/37 mRNAs (not always including the 16 disease associated genes) were also selected by CPCAFE, despite use of distinct algorithms by BAHSIC and CPCAFE. Nine genes selected by CPCAFE (in addition to the six genes listed in Table [211]8 or Additional file [212]5), but with missing disease association were Synrg, Milr1, Exosc2, Medag, 2610028H24Rik, pbld1, Hbb-bt, 1700047I17Rik2, and Hba-a2. Thus, mRNAs extracted by CPCAFE and BAHSIC overlap significantly. Furthermore, if we also consider PC2 outliers (see Additional file [213]2) that were excluded in the previous analyses, an additional 15 genes (Rhox8, Kctd14, Ttc38, G l r x2, Ndor1, Dcc, E l k1, G c m1, Hccs, Nppa, A c t c1, Pigr, Slc10a1, C x c l11, and Med16) have been identified by CPCAFE (Bolditalic six genes are also associated with heart failure-related disease, see Additional file [214]2). Consequently, there is a substantial increase in overlap significance between mRNAs extracted by CPCAFE and BAHSIC. Figure [215]13 shows the relationship between CPCAFE, BAHSIC, and heart failure-related disease association. The many genes identified by both methods and associated with disease show that our method (i.e., treating expression data as a categorical multiclass data set and not a collection of pairwise comparisons) can identify biologically relevant genes. Figure 13. Figure 13 [216]Open in a new tab Venn diagram of mRNAs extracted by CPCAFE and BAHSIC, and heart failure-related disease association. Numbers in parentheses correspond to those with outliers along PC2 (see Additional file [217]2). For miRNAs, among the 100 probes, 47 unique mature miRNAs were selected (see Additional file [218]4), a much larger number than CPCAFE (27). The 47 miRNAs were uploaded to DIANA-mirpath (see Additional file [219]4), and 65 KEGG pathways identified as significant. Although this number is similar to that obtained by CPCAFE (66), considering that the number of uploaded miRNAs was almost twice (47 vs 27), CPCAFE is still superior to BAHSIC. In fact, the averaged number of miRNAs with target mRNAs enriched in each pathway, was approximately seven, similar to that obtained by CPCAFE despite two-times more uploaded miRNAs. We also compared biological significance of the KEGG pathways obtained by BAHSIC and CPCAFE. In contrast to 17 reported pathways related to heart-failure related disease (from the top most significant 21 KEGG pathways) using CPCAFE (Table [220]3), only 11 identified by BASHSIC (of which, seven were also identified by CPCAFE; Table [221]3) were related to heart-failure related disease by literature searching (Table [222]9). Thus, from a biological point of view, CPCAFE outperforms BAHSIC. We also determined if targeted mRNAs negatively correlate with miRNA expression, and identified 164 mRNA-miRNA pairs, although only 73 were negatively correlated (see Additional file [223]4). Conversely, with CPCAFE, almost all pairs are negatively correlated. Correlation coefficient significance was examined, with only 33 significantly correlated pairs identified. Since all significant correlation coefficients were negative, this shows that BAHSIC cannot correctly extract features, and is less able to confidently screen more features than CPCAFE. Again this coincides with the limited number of miRNAs selected by BAHSIC that contribute to KEGG pathway analysis. Table 9. Summary of studies with association between heart disease and KEGG pathways enriched by BAHSIC identified miRNA target genes KEGG pathway Enrichment P -value Ref. Description Long-term depression 5.37×10^−19 [[224]71] Depression has been linked with additional health problems, including heart disease Prion diseases 3.13×10^−17 [[225]72] Prion-induced amyloid heart disease with high blood infectivity in transgenic mice Axon guidance 5.99×10^−14 (Already identified by CPCAFE) Fatty acid biosynthesis 2.01×10^−12 [[226]73] A relationship between fatty acid synthase andcardiac calcium signaling has been suggested Calcium signaling pathway 2.04×10^−10 (Already identified by CPCAFE) Gap junction 5.03×10^−10 [[227]74] Gap junction alterations in human cardiac disease Prostate cancer 6.04×10^−10 (Already identified by CPCAFE) Pancreatic cancer 1.63×10^−9 — — Glutamatergic synapse 2.02×10^−8 — — Endometrial cancer 2.02×10^−8 (Already identified by CPCAFE) Long-term potentiation 2.65×10^−8 — — Neurotrophin signaling pathway 2.81×10^−8 — — Focal adhesion 8.34×10^−8 (Already identified by CPCAFE) TGF-beta signaling pathway 2.00×10^−7 (Already identified by CPCAFE) Endocytosis 2.27×10^−7 — — Cholinergic synapse 9.60×10^−7 — — Regulation of actin cytoskeleton 1.33×10^−6 — — Colorectal cancer 3.52×10^−6 (Already identified by CPCAFE) Salmonella infection 5.65×10^−6 — — PI3K-Akt signaling pathway 1.01×10^−5 — — [228]Open in a new tab Pathways annotated as “Already identified as CPCAFE” are listed in Table [229]3. BAHSIC stability was compared (see Additional file [230]9). For mRNAs, only one probe was selected only 14 times among the 100 independent ensembles, and was also the most frequently selected. The second most frequent four probes were selected only 13 times, and 2133 probes were selected only once. For miRNAs, 68 probes were always selected among 100 cross-validation ensembles (see Additional file [231]9). The second most frequent probes were selected 99 times, but this only included three probes. The third, fourth, and fifth frequent probes were selected 98, 97, and 96 times, but included only 2, 3 and 1 probes, respectively. Thus, with stability, CPCAFE definitely outperforms BAHSIC. In conclusion, CPCAFE outperformed two conventional supervised FE methods on both stability and biological feasibility, and performed equally well to VBPCAFE, which has more theoretical confidence. Thus, CPCAFE and VBPCAFE are the most suitable FE methods for categorical multiclass problems. We have included diagrams summarizing the discussed points (Figure [232]14). Figure [233]14(a) shows the number of unique miRNAs. The same 100 probes were extracted using all methods, and miRNAs assigned to multiple probes. As all probes to which the same miRNAs are assigned should be extracted at the same time, smaller numbers of unique miRNAs assigned to 100 probes shows better extraction. In this context, CPCAFE outperformed the other two methods (Venn diagram is also available in Figure [234]15). Figure [235]14(b) shows the number of mRNAs with RefSeq IDs. As mRNAs with RefSeq IDs are more likely to have known biological significance, a larger number of RefSeq IDs accompanying mRNAs shows selection of more biologically relevant mRNAs. In this context, CPCAFE outperformed the other two methods. Figure [236]14(c) shows the number of RefSeq miRNA genes related to heart failure-related disease. Larger numbers suggest that FE has correctly extracted mRNAs related to the target disease, therefore CPCAFE outperformed the other two methods. Figure [237]14(d) shows the number of KEGG pathways identified by uploading the miRNAs shown in Figure [238]14(a) to DIANA-mirpath. Although categorical regression based FE outperformed the other two methods, it may have incorrectly extracted the most unique miRNAs in Figure [239]14(a), therefore increased number of KEGG pathways does not always reflect method superiority. Indeed, counting the average number of miRNAs contributing to each pathway (Figure [240]14(e)), indicates that CPCAFE and BAHSIC are comparable while categorical regression based FE performs worse. This suggests that categorical regression based FE cannot identify as many KEGG pathways as CPCAFE or BAHSIC if the same number of unique miRNAs are extracted (albeit experimentally). Thus, it is reasonable to assume that CPCAFE and BAHSIC outperformed categorical regression based FE in this context. Figure [241]14(f) shows the possible number of mRNA/miRNAs pairs, i.e., product of the unique miRNAs (Figure [242]14(a)) and RefSeq accompanying mRNAs (Figure [243]14(b)). Although these numbers are comparable among the three FE methods, the number of miRNA and targeted mRNA pairs is distinct (Figure [244]14(g)). Considering the ratio (Figure [245]14(h)) (obtained by dividing the numbers in Fig. [246]14(g) by those in Figure [247]14(f)), CPCAFE appears worst (i.e., smallest), but this is reversed if significance is considered. Figure [248]14(i) shows the number of miRNA and targeted mRNA pairs with significant correlations. Converting these numbers to a ratio (Figure [249]14(j)) by dividing the number of pairs of miRNAs and targeted mRNAs (Figure [250]14(g)), CPCAFE outperformed the other two methods. The same result is obtained when considering negative correlations (Figures [251]14(k) and (l)), which are desirable as miRNAs suppress target mRNA expression. Overall, these results (from Figure [252]14(g)-(l)) suggest that CPCAFE extracted significantly more miRNA and targeted mRNA pairs than the other two methods, which had apparently extracted more. In conclusion, based on our summarized discussions, we conclude that CPCAFE outperforms two other FE methods. Figure 14. Figure 14 [253]Open in a new tab Barplots comparing CPCAFE, categorical regression based FE, and BAHSIC. See text for details about panels included, from (a) to (l). Figure 15. Figure 15 [254]Open in a new tab Venn diagram of miRNAs extracted by CPCAFE and BAHSIC. Conclusions We have extended VBPCA for FE. Although VBPCAFE is an effective method when applied to simulated data, feature-dependent extension inevitably makes it computationally challenging. Thus, we replaced VBPCA with the simpler CPCA, and achieved reasonable FE performance on simulated data. In order to demonstrate CPCAFE effectiveness on real data, we performed an integrated analysis of mRNA and miRNA expression from stressed mouse heart, investigating the underlying molecular biology and transcriptomic background of PTSD-mediated heart disease. CPCAFE successfully identified aberrantly expressed miRNAs and their negatively regulated target mRNAs. Biological significance of the identified miRNAs and mRNAs was confirmed. Equivalence between CPCAFE and VBPCAFE was demonstrated by applying both methods to the PTSD data set. Two conventional supervised FE methods were also tested, with CPCAFE outperforming them both. In silico drug discovery was performed for a selected gene, FABP3, with the top-ranked compounds identified including protein inhibitors reportedly upregulated in heart disease. Methods Simulations based on synthetic data of categorical multiclass samples To simulate multiclass samples with N features and M samples, expression values were modeled using Gaussian distributions with μ [k] mean (k class). Standard deviations were fixed at 0.5. μ [k] was [MATH: μk=< /mo>k1< mrow>K112 s,k=1,, K, :MATH] with K indicating class number. Considering x [ij] to be an expression of the ith feature of the jth sample, it obeys: [MATH: xijNμk,12 ,1i N2,< mtd>(k1) MK<jkMKNμk< /mi>,12,N< mn>2<iN ,(k1)MK<jkMKNε,1< mrow>2,N<i N, :MATH] where [MATH: N(μ,σ ) :MATH] is the normal distribution with a mean of μ and standard deviation of σ. The reason for including positive and negative μ [k] signs was to simulate coexistence of up/downregulated genes among multiple classes. The s parameter represents difficulty of distinguishing among multiple classes. Larger (smaller) s indicates easier (harder) samples with fixed K. ε is a random number satisfying the probability, P: [MATH: P(ε=μk)= 1K :MATH] In the present study, N=100,N ^′=10,K=4, and M=20 were used, and performances averaged over 100 independent ensembles. One vs one t test based FE of a simulated categorical multiclass data set P values, [MATH: Pik ,k :MATH] , attributed to each i were computed using t tests to compare between [MATH: xij;(k< /mi>1)M< mi>K<jk MK :MATH] and [MATH: xij;( k1)MK<jkMK :MATH] , for all K(K−1)/2 pairs of k and k ^′. [MATH: Pik ,k :MATH] was adjusted by BH criterion [[255]19], with each k,k ^′ pair and the ith gene adjusted. [MATH: Pik ,k :MATH] s <0.05 for all (k,k ^′) pairs were identified as distinct features between multiple classes. Categorical regression-based FE Categorical regression-based FE was defined as follows, x [ij] reflects “expression” of the jth feature of the ith samples, therefore x [ij] can be represented as: [MATH: xij= Ci0+ kCikδjk, :MATH] with δ [jk] = 1 only when the jth sample belongs to the kth category, otherwise it is 0. Category summation was taken and C [ik]s are fitting parameters. Because independent variables are categorical, the above regression equation belongs to a category of equations often named as categorical regression. For each ith feature, P-values were computed using the lm function implemented in R[[256]20] (this can be easily performed if factors corresponding to experimental setups are used as independent variables in lm). FE was used for simulated categorical multiclass regression in two ways. The first selected features with BH criterion [[257]19] adjusted P-values <0.05. The second selected the top 10 features with smallest P-values. Finally, for the PTSD data set, the 100 top-ranked features with smallest P-values were selected as extracted features. BAHSIC BAHSIC uses the Hilbert-Schmidt norm of the cross-covariance operator (HSIC), defined as follows: [MATH: ||Cjk|< mo>|HS2 ixijxij δjkδjk< msub>δkk< mrow>jj+ixijxij jjδjkδjk< msub>δkk< mrow>jj2ixijxij′′j′′δ jkδj′ ′′kδkk< /mrow>j′′′ jj, :MATH] where 〈·〉[j] and [MATH: ·j j :MATH] are averaged over all j and j,j ^′ pairs, respectively, and [MATH: δkk :MATH] is cronecker’s delta. The reason why linear kernel was employed was because it was shown to achieve best performances when applied to microarray gene expression [[258]10]. In BAHSIC, features with smaller HSIC are iteratively discarded until the desired number of features remain. The number of features discarded at each step was 1 for the simulated categorical multiclass data set and 10% of remaining features for the PTSD data set. R code for this algorithm is available in Additional file [259]2. Extended VBPCA Before extending VBPCA, conventional VBPCA is briefly explained. X={x [ij]} was modeled [[260]2,[261]3] as: [MATH: X=BA T+E, :MATH] where B and A are N×Q and M×Q matrices, respectively, and E is a N×M matrix obeying a Gaussian distribution with zero mean and standard deviation, σ [E]. The purpose of VBPCA is to obtain an optimal approximation using Q≪N,M. Thus, following conventional notation and after substituting X=V, the conditional probability is: [MATH: p(V|A,B )exp1σE2||VBAT ||Fro2, :MATH] where ||·||[Fro] is the Frobenius norm matrix. In order to obtain optimal Q values, a prior distribution was assumed: [MATH: P(A)exp12 trAC A1ATP(B)exp12 trBC B1BT, :MATH] where C [A] and C [B] are Q×Q diagonal positive matrices qth (q=1,…,Q) is a diagonal element of C [A] and C [B] that expresses the importance of the obtained qth principal component. Free energy was minimized as: [MATH: F=||V||Fro< mn>22σE2+NM2logσE2+M2log|CA||ΣA< /msub>|+N 2log|CB| |ΣB|(CBi)1+12trCA1ÂTÂ+MΣA+CB1< /msubsup>B^TB^+N< /mi>ΣB< /mfenced>+σE22ÂTVTB^+ÂTÂ+MσAB^TB^+NΣB+const. :MATH] where Σ [A] and Σ [B] are variance matrices of [MATH: Â :MATH] and [MATH: B^ :MATH] , estimated from A and B by the variational method. Locally optimal [MATH: Â,B^, ΣA,ΣB,CB :MATH] , and σ [E] were obtained by performing the following iterative updates in this order: [MATH: ΣA σE2B^TB^+N< /mi>ΣB+σE2CA1< mrow>1 ÂVT B^ΣAσE2ΣB σE2ÂTÂ+MΣA+σE2CB1< /msubsup>1,B^VÂΣ< /mi>BσE2 ,(Σ< mrow>A)qq,< mi>CBq||B^q||2 N+(ΣB< /msub>)qq,q=< /mo>1,,Q,< mtd>σE21NM||V||Fro2tr2VT< /mi>B^ ÂT+trÂTÂ+MΣAB^TB^+N< /mi>ΣB< /mfenced>, :MATH] where [MATH: B^q :MATH] is the qth columnar vector of matrix [MATH: B^ :MATH] , and [MATH: CBq :MATH] is the qth diagonal element of C [B]. ith row V vector. In addition to the above iteration, which should be repeated from top to bottom, i.e., Σ [A] in the left hand side of the first equation substituted to Σ [A] in the right hand side of the second equation, C [A] should be I in order to simulate conventional PCA [[262]21]. It is also known that C [B] can be used to estimate the optimal number of PCs [[263]2]. Next, we extended VBPCA so that it can extract features. In order to do this, C [B] was assumed to have i,(i=1,…,N) dependence, and should be denoted as [MATH: Cbi :MATH] . Thus, P(B) should also have i dependence as: [MATH: P(Biq)ex p( Biq)22CBiq. :MATH] Furthermore, it is required that Σ [B] has i dependence, and should also be denoted as [MATH: ΣBi :MATH] . For example, in the above equations, N C [B] and N Σ [B] are replaced with [MATH: iCBi< /msubsup> :MATH] and [MATH: iΣBi< /msubsup> :MATH] , respectively. Then [MATH: N2log|CB< /mrow>||ΣB| :MATH] in F is replaced with [MATH: 12ilog|CBi||ΣBi | :MATH] . [MATH: trCB1B^TB^ :MATH] in F is replaced with [MATH: iBi^T< mfenced close=")" open="(" separators="">CBi1Bi^ :MATH] , where [MATH: B^i :MATH] is the ith row vector of [MATH: B^ :MATH] . [MATH: CB 1N.ΣB :MATH] and N Σ [B] in F are replaced with [MATH: iCBi1ΣBi :MATH] and [MATH: iΣBi< /msubsup> :MATH] , respectively. As a result: [MATH: F=||V||Fro22σE2+NM2logσE2+M2log|CA||ΣA< /msub>|+1 2ilog|CB< /mi>i|< mo>|ΣBi|+12 iBi^T< mfenced close=")" open="(" separators="">CBi1Bi^+1 2trCA1ÂTÂ+MΣA+iCBi1ΣBi< /mtr>+σE22ÂTVTB^+ÂTÂ+MσAB^TB^+iΣBi+const. :MATH] is obtained. Reflecting these extensions, the above iteration rules are modified as: [MATH: ΣA σE2B^TB^+iΣBi+σE2CA11ÂVT B^ΣAσE2ΣBiσE2ÂTÂ+MΣA+σE2CBi1< mn>1,i=1,,N< mtr>B^iVi ÂΣBiσE2,i=1,< mo>,NC~Aq||A^q||2 M+(ΣA< /msub>)qq,q=< /mo>1,,QCAqC~Aqq C~Aq,q=< mn>1,,QCBiq(B< mrow>iq)2+ΣBiqq,q=< /mo>1,,Q,i=1,,N :MATH] (1) [MATH: σE 21NM||V||Fro2tr2VT< /mi>B^ ÂT+trÂTÂ+MΣAB^TB^+iΣBi, :MATH] where V [i] is the ith row V vector, and [MATH: A^q :MATH] is the qth columnar vector of matrix [MATH: Â :MATH] . Because of extension, divergence was suppressed by introducing C [A] normalization instead of using constant values. Therefore, [MATH: Cbi :MATH] expresses ith feature importance, and C [A] (instead of the former C [B]) represents importance of the qth PC, and must be non-constant. The order of iterations in eq. ([264]1) is arbitrary since each iteration is independent of one another. In the present study, the above iterations began by substituting pre-computed PCs (using the prcomp function in R[[265]20]) in A and B, to compensate for slow convergence. After a suitable number of iterations, parameters with the smallest F values were used for further analysis. Convergence was judged if extracted features change after more than 100 iterations. R code for this algorithm is available in Additional file [266]2. Statistical analysis of CPCAFE and VBPCAFE applied to simulated data For VBPCAFE, the top-ranked features with larger [MATH: CBi 1 :MATH] values were extracted after convergence, judged by changes in extracted features after more than 100 iterations. For CPCAFE [[267]22-[268]29], expression data ({x [ij]}) was embedded into low dimensional space using PCA. After selecting the PC for FE, outliers along the PC were extracted i. e., the top 10 features with larger absolute projections to the selected PC. Evaluation of FE performance In order to evaluate FE performance in a simulated categorical multiclass data set, two measures were used for discrimination of binary classes with unequal member numbers. In this computation, true positive (TP) represents the number of features with distinct expression between classes (i≤N ^′), which are identified by the specified feature. Conversely, true negative (TN) is the number of features with no distinct expression between classes (i>N ^′), and are not identified by the specified feature. False positive (FP) is the number of features with no distinct expression between classes (i>N ^′), but are identified by the specified feature. False negative (FN) is the number of features with distinct expression between classes (i≤N ^′), but are not identified by the specified feature. Accordingly, the Matthews correlation coefficient is defined as: [MATH: TP·TNFP·FN(FP+FN)(FP+TP)(TN+TP)(TN+FN) , :MATH] while the F measure is: [MATH: 2TP(TP+FP)+(TP+FN), :MATH] where T P+F P corresponds to the number of features identified by specific FE, and T P+F N represents the number of features with distinct expression between classes (i≤N ^′), thus representing the harmonic mean of sensitivity [MATH: TPTP+FN :MATH] and precision [MATH: TPTP+FP :MATH] . Translational application to PTSD associated heart disease The overall work flow is illustrated in Figure [269]16. Figure 16. Figure 16 [270]Open in a new tab Overall study work flow. The methods used for data processing and evaluation are indicated in red and blue, respectively. Dotted lines in sample descriptions (i.e., 1 or 2 day(s) stress vs 1 day rest) indicate missing control samples (see [271]Methods for more detail). mRNA and miRNA expression mRNA and miRNA expression data were obtained from the Gene Expression Omnibus (GEO) using the GEO ID: [272]GSE52875 [[273]8]. Expression was observed in stressed mouse hearts. The length of time subservient mice were exposed to aggressor mice varied, with variable rest times after exposure. Exposure and rest times were 1, 2, 3, or 10 days of exposure and 1 day of rest, 5 days of exposure and 1 or 10 days of rest, and 10 days of exposure and 42 days of rest (in total, seven distinct treatment conditions). Controls were housed separately from the aggressor mice in personal cages, and prepared for all treatment conditions except for 1 or 3 days of exposure and 1 day of rest. Thus, there were only five control conditions. For each condition there were four biological replicates and therefore 48 samples in total were available. mRNA expression was included in the subseries [274]GSE52866 and [275]GSE52871. GSE52866_RAW.tar and GSE52871_RAW.tar (provided as Supplementary Data in GEO) were downloaded, and the gProcessedSignal in each of 48 GSM files used for analysis. miRNA expression was included in the subseries [276]GSE52869 and [277]GSE52872. From [278]GSE52872, eight raw Exiqon files (provided as Supplementary Data in GEO) were downloaded, and gProcessedSignal and rProcessedSignal used as miRNA profiles for further analysis. From [279]GSE52869, 16 files (provided as Supplementary Data in GEO) were downloaded, and “Spot Mean Intensity Cyanine3” and “Spot Mean Intensity Cyanine5H” used as miRNA profiles for further analysis. The conditions attributed to each sample were provided by the file names. mRNA and miRNA expression were not normalized, apart from for CPCAFE, where mRNA expression was normalized to have zero mean and a standard deviation of 1. KEGG pathway analysis of miRNAs using DIANA-mirpath For CPCAFE, the 27 identified miRNAs were uploaded to the DIANA-mirpath server [[280]11]. Although DIANA-mirpath requires mature miRNA names used in miRBase (rel. 18), the mature miRNA names in GEO ID: [281]GSE52875 were used in the previous release, therefore the uploaded mature miRNAs are not exactly the same as the 27 identified miRNAs. Target gene prediction was performed using Tarbase [[282]30], a database using experimentally validated targets that are more biologically plausible. Combined target genes sets were used for KEGG pathway enrichment analysis. False discovery rate corrected P-values were used to screen KEGG pathways. Direct weblinks to DIANA-mirpath results are provided in Additional file [283]10. The same analyses were performed for categorical regression-based FE and BAHSIC, excluding the number of uploaded miRNAs. Disease association analysis of genes using the Gendoo server The Gendoo server [[284]31], a literature-based disease association database, was used to identify associations between selected mRNAs and disease. As mRNAs were identified using RefSeq mRNA IDs, these were interpreted to gene symbols. Obtained gene symbols were uploaded to the Gendoo server with mouse being the specified target species. Human was also tested for categorical regression-based FE and BAHSIC, to compensate for the small number of hits. KEGG pathway analysis of mRNAs by DAVID Employed 24 genes were mRNAs that have non zero values in the column named as “target” of the “CPCA based” sheet in Additional file [285]4. Refseq IDs for these 24 genes were identified and were uploaded to DAVID [[286]13] server using default setting. Then KEGG pathways identified were extracted. Tertiary protein structure prediction Tertiary protein structures were predicted using two prediction servers, Protein Homology/analogY Recognition Engine V2.0 (phyre2) [[287]32] and full automatic modeling systems (FAMS) [[288]33]. Among the 26 genes investigated, 14 genes were included in the PDB [[289]34]. Tertiary protein structures of the remaining genes were predicted by either phyre2 or FAMS. The complete list of template proteins and amino acid sequences of investigated genes in fasta format is available in Additional file [290]11. In silico drug discovery for FABP3 ChooseLD [[291]35] was used for FABP3 drug discovery. Using the PDB structure of chain A in 1HMR as the template protein, and four ligands (9-OCTADECENOIC ACID in 1HMR, OLEIC ACID in 1HMS, STEARIC ACID in 1HMT, and PALMITIC ACID in 2HMB) as template ligands (acids not provided in 1HMR were mapped to 1HMR prior to chooseLD execution), with four additional FABP3 inhibitors from ChEMBL [[292]36] as template ligands (see Additional file [293]2 for more detail), drug candidate compounds were selected from DrugBank [[294]37]. ChooseLD was applied to 1450 compounds with a Tanimoto index >0.20 from at least one of the eight template ligands. The 1450 compounds were selected from 6510 compounds with tertiary structures computed by Babel software [[295]38], from among the 6583 compounds listed in DrugBank [[296]37]. The 1450 compounds were ranked based on Finger Print Alignment Scores (FPAScores) averaged over three independent runs. Generation of a test PTSD data set In order to test equivalence between CPCAFE and VBPCAFE on PTSD data, a test data set was generated. In addition to the 100 probes selected by CPCAFE, an additional 100 probes were chosen from those remaining, generating a test data set with 200 features. VBPCAFE was applied to the generated data set, and [MATH: CBi 1 :MATH] s values calculated. The top-ranked 200 features with largest [MATH: CBi 1 :MATH] s values were selected as extracted features. FE was performed over 100 independent ensembles, and the frequency (number of times each feature was selected) counted. Frequency number = 100, indicates that the feature was always selected. FE cross-validation In order to test FE stability on the PTSD data set, a 4-fold cross-validation was performed, with each experiment repeated four times. For each experimental setup, three out of four replicates were randomly selected, generating a data set with 36 samples. Since the total possible number of independent samplings was 4^12≃2×10^7, it is impossible to test all samplings. Therefore, 100 samplings were tested, which was large enough to demonstrate superior stability of CPCAFE compared with the two conventional supervised methods of categorical regression-based FE and BAHSIC. Significance of correlation coefficients Correlation coefficient significance was investigated by transforming the correlation coefficient (r) to the statistical test variable (t): [MATH: t=rM21r2< /mrow>, :MATH] obeying the t distribution of M−2 degrees of freedom. As there were 48 samples in our study, M= 48. Computed P-values were adjusted by BH criterion [[297]19]. Correlation coefficients with adjusted P-values <0.05 were regarded as significant. Acknowledgements