Abstract

Background

   Identifying disease correlated features early before large number of
   molecules are impacted by disease progression with significant
   abundance change is very advantageous to biologists for developing
   early disease diagnosis biomarkers. Disease correlated features have
   relatively low level of abundance change at early stages. Finding them
   using existing bioinformatic tools in high throughput data is a
   challenging task since the technology suffers from limited dynamic
   range and significant noise. Most existing biomarker discovery
   algorithms can only detect molecules with high abundance changes,
   frequently missing early disease diagnostic markers.

Results

   We present a new statistic called early response index (ERI) to
   prioritize disease correlated molecules as potential early biomarkers.
   Instead of classification accuracy, ERI measures the average
   classification accuracy improvement attainable by a feature when it is
   united with other counterparts for classification. ERI is more
   sensitive to abundance changes than other ranking statistics. We have
   shown that ERI significantly outperforms SAM and Localfdr in detecting
   early responding molecules in a proteomics study of a mouse model of
   multiple sclerosis. Importantly, ERI was able to detect many disease
   relevant proteins before those algorithms detect them at a later time
   point.

Conclusions

   ERI method is more sensitive for significant feature detection during
   early stage of disease development. It potentially has a higher
   specificity for biomarker discovery, and can be used to identify
   critical time frame for disease intervention.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12859-017-1712-y)
   contains supplementary material, which is available to authorized
   users.

   Keywords: Disease correlated features, Early stage of disease,
   Biomarker discovery, Feature selection, Gene/protein expression change,
   Multiple Sclerosis

Background

   Identifying disease correlated molecules at early stages, before the
   disease process induces high abundance changes in large number of
   molecules, is a challenging but important problem. The identification
   will not only lead to the discovery of early diagnostic biomarkers,
   intervening the disease early for high risk individuals will also be
   viable. However, before the disease process induces significant
   changes, most disease correlated biomarkers only have small abundance
   changes. Several contemporary data mining algorithms are reviewed in
   [[33]1–[34]3] which include multivariate analysis [[35]4], CART
   analysis [[36]5], voting panel approach [[37]6], artificial neural
   network [[38]7], genetic algorithm [[39]8], Significance Analysis of
   Microarrays (SAM) [[40]9], KTSP [[41]10], and most recently BBPC
   [[42]11]. However, few progresses has been made towards overcoming the
   challenge of early diagnostic biomarker discovery. Most feature
   selection algorithms are designed to boost classification accuracy that
   favor the detection of features (proteins or genes) with high abundance
   changes (See Fig. [43]1([44]c) for example). Typically, several hundred
   of candidate biomarkers can be identified at the onset clinical stage
   with significant abundance changes, and in order to discover a true
   biomarker for early disease diagnosis, each of them has to be tested
   using expensive low throughput methods like ELISA to confirm their
   early involvement [[45]12]. To overcome this problem, gene set analysis
   based methods [[46]13] has been proposed. However, gene set analysis
   can only be applied to features with previously known functions, and it
   does not return individual features that can be tested routinely in low
   throughput methods. Consequently, there is an urgent need in developing
   bioinformatic tools for early diagnostic markers detection based on
   high throughput data.

Fig. 1.

   Fig. 1
   [47]Open in a new tab

   Protein expression time profile across different days for (a) proteins
   identified by ERI at day 5; (b) the significant proteins identified by
   SAM and Localfdr at day 10 but not significant according to ERI at day
   5; and (c) significant proteins identified by SAM and Localfdr at day 5

   We address this problem by proposing a new statistics called Early
   Response Index (ERI) that is sensitive to changes of feature
   (protein/gene) expression at early stages. Instead of focusing on
   features that are the most differentially expressed, this index
   quantifies how much classification accuracy improvement a feature can
   provide when it is combined with other features in a classifier. The
   rational is that although a potential biomarker may not have a
   differential expression that can be regarded as significant by existing
   tools, its ability in improving classification accuracy may have been
   significant, and can be detected in high throughput data. Features with
   significant ERIs can be prioritized for testing as early diagnositic
   biomarkers. The ERI method is fundamentally different from other
   feature selection methods [[48]14] that also consider the synergy of
   features. These “synergistic” methods focus on finding subsets of
   features that maximize classification performance rather than the
   average performance improvement a feature can bring as in ERI.
   Consequently, they tend to return features with high abundance changes
   and miss potential biomarkers suitable for early diagnosis.

   We applied the developed ERI algorithm to a locally collected
   proteomics dataset of experimental autoimmune encephalomyelitis (EAE),
   and the results has been compared to SAM and Localfdr [[49]15]
   algorithm, which are typical feature selection algorithms with
   controlled false positive rates (FPR). Although the EAE dataset is a
   time series data, most existing time series data processing software
   [[50]16, [51]17] are not applicable because these algorithms require a
   controlling time series data, which is absent in the tested dataset.
   Other multivariate and synergistic feature selection algorithms such as
   KTSP [[52]10] are not designed to return a comprehensive list of
   significantly expressed features, and are not included in our
   comparison study.

   EAE is an animal model of autoimmune neuroinflammatory demyelinating
   disease and the most common pre-clinical model for studying human
   multiple sclerosis [[53]18] which is a disease in the central nervous
   system (CNS). The proteomics dataset was generated using M2-proteomics
   [[54]19, [55]20], a previously described quantitative mass spectrometry
   method which utilizes TMT labels [[56]21, [57]22]. The analyzed dataset
   was collected using 160 mice at 8 time points [[58]23]. At each time
   point, 18 to 20 mice were used to obtain the proteomic expression
   profile of the mice’s brain homogenate. Day 5 and day 7 are considered
   as the pre-clinical onset stage of the disease (early stage), when no
   obvious symptoms are observed; overt symptom appear at approximately
   day 10 [[59]23]. In this work, our goal was to identify significantly
   differentially expressed proteins at day 5 which is the pre-onset stage
   of the disease. Day 5 is considered here because a large number of
   proteins show significant expression changes after this time and hence,
   it is a critical turning point of disease development.

   As expected, existing data mining tools such as SAM and Localfdr tends
   to detect many significantly expressed proteins at later stages of
   disease development. For instance, SAM detected 191 proteins with 0%
   FPR on day 10, however, only 6.81% of these proteins were detected by
   SAM on day 5 at the same FPR. Similarly, 152 proteins were identified
   by Localfdr in day 10 out of which merely 4% proteins were determined
   as significant by the same algorithm on day 5. In contrast, 25.6% of
   the proteins identified by both SAM and Localfdr on day 10 can be
   detected by ERI on day 5. In total, ERI detected 73 significantly
   expressed proteins on day 5 at 0% FPR, while SAM and Localfdr detected
   only 35 and 18 proteins respectively in an identical experiment. SAM
   couldn’t identify 50 out of these 73 proteins that were detected by ERI
   whereas Localfdr failed to detect 59 of those proteins. Similar results
   were also obtained on day 25. Identification of a large number of
   non-overlapping proteins by ERI illustrates the efficacy of the
   proposed method in early biomarker detection.

   Testing results also show that detecting biomarkers earlier on day 5
   results in a higher specificity of CNS related pathways in a pathway
   enrichment analysis using DAVID [[60]24]. The specificity is 80% (4 of
   5), 20% (1 of 5) and 100% (1 of 1) using the features selected by ERI,
   SAM and Localfdr on day 5 (Additional file [61]1). On the other hand,
   the specificity is 46 and 21.7% using SAM and Localfdr on day 10. This
   also highlights the need of detecting biomarkers earlier. ERI method
   doesn’t enrich any pathways on day 10 because it detects only 2
   proteins as differentially expressed at that time point. This result is
   expected because on day 10, a large number of proteins have high
   expressions which indicate that day 10 is not an early stage anymore in
   terms of protein expression changes.

   To understand if the ERI algorithm is applicable to clinically
   collected datasets, we further applied ERI on 9 clinically collected
   gene expression datasets. Since these datasets are not collected from
   the early stages, ERI didn’t return as many genes as the other two
   methods under consideration. However, in 2/3 of the datasets, ERI
   returned significant number of genes, and this shows the wide
   applicability of ERI.

   The much higher sensitivity of ERI at an early stage offers a new
   statistical tool for identifying features that are involved in the
   earlier stage of the disease. Currently, the ERI algorithm is
   applicable to animal model studies in which the disease causing event
   is known. In the future, when more high quality retrospective
   longitudinal clinical data will be available, it is expected that ERI
   will have broad applications in clinical studies.

Methods

   In the process of developing the early response index (ERI) method, the
   complete work can be categorized into three sections–colletion of EAE
   dataset and pre-processing of the data, algorithm development and
   performance evaluation methodology of the developed algorithm.

EAE dataset collection

   The EAE dataset is a large scale proteomic dataset based on 6-plex TMT
   labeling and Tandem Mass Spectrometry, which has been collected using
   160 mice at 8 time points (day -1, 0, 5, 7, 10, 15, 20, and 25). TMT
   6-plex allows the simultaneous quantification of six samples in 6 TMT
   channels in one LC-MS run so that experimental variation can be reduced
   [[62]25]. At each time point, 18-20 mice were used to obtain the
   proteomic expression profile of the mice’s brain. Samples from the
   18-20 mice are analyzed using 5 runs each day, and within each run of
   the 6-plex TMT labeling experiment, channel 1 and 6 were reference
   channels, and the rests were informative channels reserved for
   measuring protein expression levels of up to 4 mice. The reference
   channels measured samples pooled from all mice across all days. For
   details of the experiments, see [[63]23].

Pre-processing of the EAE dataset

   Peptide identification and quantification were performed using Mascot.
   For details of searching parameters please see [[64]23]. The experiment
   runs total 40 times generating 5 datasets per day over 8 days. These
   datasets were arranged such that, the rows represent different peptides
   and the columns provide various information of the identified peptides
   including their abundance measurements on all 6 channels.

   Additional file [65]1: Figure S1 shows the workflow involved in
   preprocessing the EAE experimental data (Additional file [66]1). To
   process raw data, we first merged repeatedly identified peptides within
   each run by adding their abundance values. Then, to minimize the effect
   of inherent experimental and biological noise and to eliminate the bias
   due to experimental factors, we divided the abundance measurements of 4
   informative channels by the average of 2 reference channels for each
   identified peptide. As a result, the variance was reduced by 2 folds
   and the data became free from experimental bias.

   After normalizing using the reference channels, we quantile normalized
   the 4 ratios on the informative channels across all datasets to remove
   any channel effects, because the channels were randomly assigned for
   the 20 biological replicates on each day, which should generate
   identical ratio distributions across the 4 channels.

   We then grouped the quantile normalized peptide ratios by their
   associated proteins. For a particular protein, its abundance level was
   obtained by taking the median of ratios for all unique peptides of the
   protein measured in the same channel in each run.

   After protein quantification, we unionized all quantified proteins
   across all 8 time points, and obtained a 734-by-8 cell structure in
   Matlab, where 734 rows correspond to all of the proteins quantified
   across all days. The columns correspond to 8 time points. Each cell
   contains the expression measurement of 20 mice for a particular protein
   on a certain day.

   Due to the selective nature of tandem mass spectrometry, only a
   fraction of all fragmented peptides has been identified in each run as
   well as most of the proteins were not quantified in all TMT channels.
   To address the problem of missing values, we discarded proteins that
   have less than 12 measurements out of 18-20 total measurements. For
   proteins with 12 or more measurements, we replaced the missing values
   with measurements randomly sampled from the 12 or more existing
   measurements. After this step, 313 proteins remained for day 0 and day
   5.

Description of clinical datasets

   To assess if ERI can be applied generally, we downloaded and and
   applied ERI method on 9 clinically collected datasets. Table [67]1
   provides a summary of the datasets. Six of the datasets have binary
   classes, while the [68]GSE14333, [69]GSE27854 and CNS datasets are
   multi-classes. We are focused on studying the two class problem, and in
   the [70]GSE14333 dataset, patients having colorectal cancer (CRC) of
   stage I and II tumors are combined as single class representing
   non-invasive tumors, while patients with stage III tumors, which
   represent invasive tumors are treated as another class. In the
   [71]GSE27854 dataset, CRC patients with stages I and II were defined as
   one class, while stage III and IV patients are combined as other class.
   In the CNS dataset, the original study was composed of three different
   sets of samples (Dataset A,B,C) ranging from children with
   medulloblastomas to adults with malignant gliomas. We analyzed dataset
   C only, which consists of medulloblastoma survivors and non-survivors.

Table 1.

   Summary of clinical datasets used in this study
   Dataset Genes Samples class size No of features Source
   (ERI) (SAM) (Localfdr)
   [72]GSE14333 54675 138/91 0 2 0 [[73]39]
   [74]GSE27854 54675 57/58 0 0 0 [[75]40]
   CNS 7129 21/39 4 0 2 [[76]41]
   Colon Cancer 2000 40/22 3 46 8 [[77]42]
   GLI-85 22283 26/59 51 1458 1198 [[78]43]
   Lung Cancer 7129 24/62 0 0 0 [[79]44]
   Prostate Cancer 10509 50/52 11 946 769 [[80]45]
   SMK-CAN-187 19993 90/97 8 289 271 [[81]46]
   Breast Cancer 22283 138/71 5 14 2 [[82]47]
   [83]Open in a new tab

The Algorithm for calculating Early Response Index

   Early Response Index (ERI) is calculated in the following process.
   Suppose a pair of features are F [i] and F [j], and we use them as
   features of a Support-Vector-Machine (SVM) classifier [[84]26]. Suppose
   the classification accuracy rate achievable is A c c(F [i],F [j]) when
   the SVM classifier is trained using the expression levels of F [i] and
   F [j] in a training sample set and evaluated on a testing set in a
   cross-validation scheme. Suppose A c c(F [i]) and A c c(F [j])
   represent the classification accuracies if only F [i] or F [j] are used
   as the SVM feature. We define the performance improvement due to
   combining F [i] and F [j] as the improvement score (IS) of F [i] due to
   F [j],
   [MATH: <mtext
   mathvariant="italic">IS</mtext><mo>(</mo><msub><mrow><mi>F</mi></mrow><
   mrow><mi>i</mi></mrow></msub><mo>,</mo><msub><mrow><mi>F</mi></mrow><mr
   ow><mi>j</mi></mrow></msub><mo>)</mo><mo>=</mo><mtext
   mathvariant="italic">Acc</mtext><mo>(</mo><msub><mrow><mi>F</mi></mrow>
   <mrow><mi>i</mi></mrow></msub><mo>,</mo><msub><mrow><mi>F</mi></mrow><m
   row><mi>j</mi></mrow></msub><mo>)</mo><mo>−</mo><mtext
   mathvariant="italic">max</mtext><mfenced close="]" open="["
   separators=""><mrow><mtext
   mathvariant="italic">Acc</mtext><mo>(</mo><msub><mrow><mi>F</mi></mrow>
   <mrow><mi>i</mi></mrow></msub><mo>)</mo><mo>,</mo><mtext
   mathvariant="italic">Acc</mtext><mo>(</mo><msub><mrow><mi>F</mi></mrow>
   <mrow><mi>j</mi></mrow></msub><mo>)</mo></mrow></mfenced><mo>,</mo>
   :MATH]
   1

   We calculate this improvement score for all possible combinations of
   features, and the overall early response index (ERI) of a feature F [i]
   is obtained by taking the average of all improvement score of F [i]
   when it is combined with other features:
   [MATH: <mtext
   mathvariant="italic">ERI</mtext><mo>(</mo><msub><mrow><mi>F</mi></mrow>
   <mrow><mi>i</mi></mrow></msub><mo>)</mo><mo>=</mo><mfrac><mrow><munder>
   <mrow><mo>∑</mo></mrow><mrow><mi>j</mi><mo>,</mo><mi>j</mi><mo>≠</mo><m
   i>i</mi></mrow></munder><mtext
   mathvariant="italic">IS</mtext><mo>(</mo><msub><mrow><mi>F</mi></mrow><
   mrow><mi>i</mi></mrow></msub><mo>,</mo><msub><mrow><mi>F</mi></mrow><mr
   ow><mi>j</mi></mrow></msub><mo>)</mo></mrow><mrow><mi>N</mi><mo>−</mo><
   mn>1</mn></mrow></mfrac><mo>,</mo> :MATH]
   2

   where N is the total number of proteins.

   We can see that ERI only quantifies the average performance improvement
   due to a protein feature when it is combined with other features
   regardless of the maximum accuracy achievable by individual features.
   It will not favor features that is highly differentially expressed as
   in other feature selection algorithms.

   We selected SVM classifiers for calculating the accuracies because it
   gives us high sensitivity. We investigated the possibility of replacing
   SVM with Naive Bayes (NB) [[85]27] and Random Forest (RF) classifier
   [[86]28]. When using NB, only 13 proteins with significant expressions
   were detected at 0% FPR on day 5 in the EAE dataset, which is 82% less
   than that using SVM. Similarly, ERI method identifies only 10 proteins
   as differentially expressed between day 0 and 5 while using RF instead
   of SVM as classifier. Moreover, 92% (12 of 13) and 80% (8 of 10) of the
   proteins detected by NB and RF consecutively were also detected by SVM.
   Hence, there is little need to calculate ERI again using these
   classifiers after employing SVM.

Feature selection process using ERI

   To apply the ERI as a ranking criteria for feature selection, the
   processing steps were performed as shown in the flow diagram (Fig.
   [87]2).

Fig. 2.

   Fig. 2
   [88]Open in a new tab

   Feature selection flow diagram of ERI algorithm

Estimate classification accuracy of individual features

   To estimate the classification accuracy of single features (protein or
   gene), we used a 5-fold cross validation scheme [[89]29] to avoid the
   problem of overfitting. We trained SVM classifiers using 80% of samples
   from both classes, and evaluated the performance using the rest 20% of
   samples. In order to mitigate any effect of selection bias [[90]30],
   these steps were repeated 5 times by randomly dividing the samples into
   5 folds each time. Finally, the average accuracy over the 5 repetitions
   of 5-folds cross validation was estimated as the classification
   accuracy achievable (see Additional file [91]1).

Estimate classification accuracy of all pairs of features

   If there is a large number of features in a dataset, estimation of the
   accuracy of all feature pairs poses a significant challenge on
   computational complexity. To reduce the complexity, t-test based
   pre-filtering scheme can be used to reduce the number of features as
   described in [[92]3]. We found that when we ranked features based on
   t-test statistics, using more than 300 features does not increase the
   sensitivity of ERI feature selection (Fig. [93]3). In the EAE dataset,
   since there were only 313 proteins measured on both day 0 and day 5,
   all of them were used. In other datasets, the top 300 proteins are
   kept.

Fig. 3.

   Fig. 3
   [94]Open in a new tab

   Comparison of overlap of discovered significant proteins between
   selecting (a) 200 and 300 features; (b) 300 and 400 features in
   pre-filtering step

   After pre-filtering, we considered all possible combinations of pairs
   of the remaining features, and estimated the accuracy achievable using
   pairs of features in the same procedure as that for individual
   features.

Calculation of improvement score (IS) and early response index (ERI)

   The improvement score (IS) of a feature was calculated according to Eq.
   [95]1. To determine the ERI score of the feature, the average of all
   its IS score was calculated (Eq. [96]2). This process was repeated for
   all features. ERI stands for the average performance improvement a
   feature can bring when it is combined with other features for
   classification.

ERI cutoff threshold estimation based on FPR.

   To determine the cutoff threshold on ERI for selecting statistically
   significant features at 0% false positive rate (FPR), we employed a
   permutation scheme as that in SAM, in which the class labels of samples
   were randomly permutated. Subsequently, the ERI scores were calculated
   for all features using the procedure as described in the flow chart in
   Fig. [97]2. Cutoff threshold was estimated at 0% FPR because we
   intended to show the usefulness of our algorithm under the most
   stringent criteria. Since the class labels were randomly permuted, it
   is expected that none of the features would have a high ERI score. The
   procedure was repeated 10 times, and the overall maximum ERI score
   achieved in this repetitive experiment was selected as cutoff
   threshold. A higher number of permutation trials guarantees more robust
   threshold estimation. However, the time complexity increases with the
   increasing number of permutation trials.

   Selection of maximum ERI score as cutoff threshold ensures the 0% FPR
   in true dataset. For example, in our test between day 0 and 5 in the
   EAE dataset, the maximum ERI score achieved by any protein out of the
   10 random permutation trials was 0.0258. Hence, when we set the ERI
   cutoff threshold as 0.026, it ensures a zero FPR for proteins with ERIs
   above 0.026. The ERI threshold was calculated independently for each of
   the tested dataset.

Early response feature selection algorithm evaluation

   The goal of the ERI algorithm is to prioritize and confirm the early
   involvement of disease correlated features as potential early diagnosis
   biomarkers. Thus, the number of statistically significant features
   detected at an early stage of diseases can be used to evaluate the
   performance of various algorithms.

   In this study, we compared the number of significant proteins detected
   by ERI, SAM and Localfdr at different days using day 0 as reference.
   SAM had been chosen as one of the comparable algorithms because this
   method ranks features with a significance score at controlled FPRs.
   According to [[98]31], in which the authors compared six methods for
   identifying differentially expressed genes across multiple conditions,
   SAM is one of the best-performing methods when the sample size is
   greater than 6 (we have 40+ samples in our tests). SAM has been also
   used widely for analyzing differentially express genes for various
   disease conditions until very recently [[99]32, [100]33]. In addition,
   Localfdr has been shown to perform competitively in presence of large
   noise variance [[101]31]. Since, ERI is focused on identifying disease
   correlated features at early stage when the expression changes in real
   biomarkers are minimal and vulnerable to be subdued by other noisy
   features, we also included Localfdr for performance comparison. There
   are existing feature selection algorithms such as KTSP [[102]10] and
   MRMR [[103]34] that strive to improve class prediction performance but
   these algorithms are designed to return a minimal number of features,
   which makes them inappropriate to be compared with ERI for the purpose
   of sensitive detection.

   Under 0% FPR, we used the R-package of SAM called “samr” [[104]35] to
   select a list of significant differentially expressed proteins at
   various time points.

Results and discussion

Detecting early responding features

Number of significant features identified at day 5 and 25

   In the EAE dataset, day 5 is the pre-onset clinical stage when the
   disease symptoms are not obvious. As our experimental data have shown,
   hundreds of proteins will have high abundance changes after this time
   point, and identifying biomarkers that have responded to the disease on
   day 5 will offer opportunities for disease intervention before a large
   number of proteins are affected.

   We first applied the comparable methods between day 5 and day 0. The
   ERI algorithm was applied at 0% FPR with a cutoff score of 0.026 which
   identified 73 proteins as significantly expressed (ERI score ≥0.026).
   In contrast SAM and Localfdr detected 35 and 18 proteins respectively
   at 0% FPR. Out of the 73 ERI identified proteins, 50 proteins were not
   identified by SAM (Fig. [105]4 [106]a) while Localfdr missed 59 of
   those proteins detected by ERI (Fig. [107]4 [108]b). The results
   between day 0 and day 25 are similar. Day 25 marks the initiation of
   the remission process which is characterized by clinical attacks
   (relapses) with diverse neurological dysfunctions, followed by
   functional recovery (remission) [[109]21]. Though, ERI detected 38
   significant proteins at this stage, SAM and Localfdr detected 18 and 7
   proteins respectively. Among the 38 ERI proteins, only 26.3 and 5.2%
   can be detected consecutively by these methods. Interestingly, we also
   discovered that 50% of the significant proteins identified by ERI
   during the remission stage were also identified by ERI during the
   initial stage of the disease.

Fig. 4.

   Fig. 4
   [110]Open in a new tab

   Comparison of overlap of discovered significant proteins (a) between
   ERI and SAM at day 5; (b) between ERI and Localfdr at day 5; (c)
   between ERI at day 5 and SAM, Localfdr at day 10; (d) between SAM at
   day 5 and SAM at day 10 (e) between Localfdr at day 5 and Localfdr at
   day 10 and (f) between day 5 and day 10 by ERI, SAM and Localfdr
   combined

   This shows that ERI is capable of detecting more significant proteins
   than both competitive methods before critical turning points of disease
   development.

Repeatability of significant feature detection

   We investigated how many proteins that were detectable on day 10 could
   be detected earlier on day 5. While SAM detected 191 significant
   proteins on day 10, only 13 (6.8%) of the 191 proteins can be detected
   on day 5 by SAM alone (Fig. [111]4 [112]c). Similarly, out of the 152
   proteins identified by Localfdr as differentially expressed on day 10,
   only 6 proteins (4%) were detected on day 5 too by the same method
   (Fig. [113]4 [114]d). The proteins detected by Localfdr on day 10
   completely overlaps with those detected by SAM on the same day (Fig.
   [115]4 [116]e) which shows that both the methods perform identically
   during disease onset and the proteins identified are truly
   differentially expressed. When we applied ERI, 39 of the 152 proteins
   (25.6%) that were detected by both methods on day 10, can be detected
   on day 5 (Fig. [117]4 [118]e). Combining ERI with SAM and Localfdr
   together can significantly increase the number of repeatedly identified
   proteins. Altogether, the three methods identified 193 unique proteins
   at day 10 and 84 proteins at day 5, out of which 48 are overlapping
   proteins (Fig. [119]4 [120]f). We also noticed that 60.27% of the total
   proteins identified by ERI at day 5 were detected by SAM and Localfdr
   at day 10. The result is similar when we examine the overlap between
   other days.

   This shows that ERI can verify the involvement of more disease
   correlated proteins at an earlier stage than SAM and Localfdr. In
   addition when ERI is combined with those methods, it is possible to
   identify more features that are responding to the disease process at
   multiple stages of the disease.

Expression profile over time for significant proteins

   In Fig. [121]1([122]a), we plot the average expression time profile of
   12 arbitrarily picked proteins from the 73 proteins identified by ERI
   on day 5. Only 12 proteins are randomly picked in the expression
   profile plots because showing more profiles clogs the view. We can see
   that most of these proteins have shown sustained and monotonically
   increasing patterns between day 0 and day 10. In contrast, in Fig.
   [123]1([124]b), we plot the expression profiles of 12 randomly selected
   proteins from the list identified by both SAM and Localfdr on day 10,
   but were not called as significant by ERI on day 5. Even though these
   proteins also show the monotonically increasing pattern as in Fig.
   [125]1([126]a), but they have minimal changes between day 0 and day 5.
   This explains why these proteins were identified by other methods at
   day 10 but not by ERI at day 5. For comparison, the average expression
   time profiles of 12 randomly picked proteins from those identified by
   SAM and Localfdr at day 5 are shown in Fig. [127]1([128]c). It can be
   seen that these proteins generally have little correlation with disease
   development from day 0 to day 10, and their expression levels cannot
   track disease development.

   A comprehensive comparison of average ratio change of ERI picked
   proteins (73) and those detected by other algorithms on day 10 but not
   by ERI (146) on day 5 is shown in the boxplot of Fig. [129]5. The
   protein abundance ratios were calculated as the median expression value
   of proteins across 18-20 samples at day 0 over the the median
   expression value of these proteins at day 5. The absolute value of all
   logarithmic protein ratios were used to create the box plot. The box
   plot shows that ERI identified proteins have higher differential
   expression levels than those not identified by ERI. Therefore, they
   should be prioritized as potential early diagnostic biomarkers.

Fig. 5.

   Fig. 5
   [130]Open in a new tab

   Relative protein abundance change between day 0 and 5 for proteins
   having significant ERI score and those without significant ERI but
   picked up by SAM and Localfdr on day 10

Performance of ERI after day 5

   It shall be noted that ERI is very sensitive to early changes of
   protein expression, but not very sensitive to later stage changes. We
   applied the ERI algorithm between day 0 and all other time points, and
   the number of significant detection is listed in Table [131]2. These
   numbers are compared with that of SAM and Localfdr. We noticed that
   from day 7 to day 20, SAM detects more proteins than ERI when large
   number of proteins showed significant abundance changes, whereas ERI
   detects more on day 5 before significant expression changes occurred in
   large number of proteins (see Fig. [132]1 [133]a), which indicates that
   ERI is more sensitive to the changes in feature expression due to the
   disease. Similar pattern is evident between ERI and Localfdr as well
   except day 15 and 20 when the number of detected significant proteins
   are close to each other. These results imply that ERI method should be
   applied for early stage biomarker discovery.

Table 2.

   Number of significant proteins identified by three methods across
   different days of EAE dataset
   Method   Day 5 Day 7 Day 10 Day 15 Day 20 Day 25
   ERI      73    5     2      13     23     38
   SAM      35    106   191    17     27     18
   Localfdr 18    219   152    5      21     7
   [134]Open in a new tab

   From Table [135]2, it becomes clear that whether ERI can detect a
   significant amount of features depends on when the samples are
   collected. So, the relative number of significant features identified
   by ERI and other methods can be utilized to infer the time frame of
   disease development roughly. If SAM and Localfdr detects more than ERI,
   then it is likely that the disease development has passed its initial
   stage. ERI and SAM can be applied jointly to detect potential disease
   biomarkers through the whole course of a disease.

Biological relevance of identified early stage proteins

   To understand the biological relevance of proteins detected by ERI, SAM
   and Localfdr on day 5, we performed a pathway enrichment analysis
   through DAVID [[136]24]. It turns out that the 73 proteins identified
   by ERI on day 5 enriched five KEGG pathways with p-value less than
   0.05. The pathways are Glycolysis/Gluconeogenesis, gap junction,
   Parkinson’s, Huntington’s and Amyotrophic Lateral Sclerosis (ALS)
   diseases, 4 of which are similar to Multiple Sclerosis disease pathways
   that affect the central nervous system (CNS). In contrast, feeding
   DAVID separately with proteins detected by SAM and Localfdr on day 5
   resulted into only one CNS specific pathways each with p-value <0.05
   (see Additional file [137]1). Considering the fact that multiple
   sclerosis is a CNS disease, it signifies the ERI can return more
   disease related pathways.

   We have also examine the pathways detected by SAM and Localfdr at day
   10, and compared with those detected at day 5 using ERI. SAM returned
   19 enriched KEGG pathways having p-value less than 0.05, and 7 of them
   were CNS specific. On the other hand, 23 pathways were enriched with
   p-value <0.05 by Localfdr day 10 proteins out of which five were CNS
   related. By dividing the number of detected CNS pathways by the number
   of total number of significant pathways, ERI’s specificity in finding
   CNS related pathways is 80% on day 5, while SAM and Localfdr are only
   37 and 21.7% specific on day 10. These results show that performing
   biomarker discovery at an earlier stage using ERI could potentially
   increase the specificity of biomarker discovery. The list of detected
   pathways are provided in Additional file [138]1.

Wide applicability of ERI

   To assess the applicability of ERI method on human dataset, we have
   applied the algorithm at 0% FPR on 9 datasets collected from the
   literature (Table [139]1). These datasets were collected in clinical
   settings after the onset of the disease. Thus, they cannot be used to
   test the efficacy of ERI in discovering early diagnostic biomarkers.
   However, some of the data are still expected to return significant
   features as in the case of EAE dataset on day 25.

   Since there are thousands of measured genes in these clinical datasets,
   we used a pre-filtering step to reduce the number of features to 300
   based on their t-test P values. To test if 300 is an appropriate
   number, we applied the procedure of calculating ERI scores for the
   GLI-85 [[140]43] dataset by keeping 200, 300 and 400 genes after the
   pre-filtering step. We found that when we increase the number of genes
   from 200 to 300, the number of significantly detected genes by ERI
   increases significantly as shown in Fig. [141]3([142]a). However, when
   we increased the number of features to 400, the total number of
   detected features does not increases further (Fig. [143]3 [144]b).
   There is also a significant overlap when selecting different number of
   pre-filtered features. Twelve of the 14 proteins detected when keeping
   200 features are found again with 300 features. The overlap between
   using 300 and 400 features is 68.6%.

   The number of detected significant proteins for the 9 datasets by the
   methods under consideration are listed in Table [145]1. It can be seen
   that at least in 2/3 of the datasets, there are a significant number of
   genes identified by ERI indicating that the method can be applied
   broadly. For most of the dataset, SAM and Localfdr clearly outperforms
   the ERI algorithm in terms of the number of significant genes
   identified. These results are not beyond expectation because in
   clinical settings, the datasets are usually collected after the onset
   stage of diseases. When there exists a significant number of ERI
   detected genes, it is indicative that the disease is progressing to
   another stage, since ERI only detects early responders to a disease
   condition.

   Notably, ERI identified 4 significant genes for the CNS dataset whereas
   Localfdr detected 2 significant genes and SAM failed to detect any. The
   CNS dataset has 60 samples, which includes 39 cases of medulloblastoma
   survivors and 21 treatment failures. All tumor samples were obtained at
   the time of initial surgery prior to treatment. Failure of SAM in
   discovering any significant genes is attributed to the strict false
   positive rate (0% FPR) that we have applied. This is a more stringent
   criteria for significance detection than in the original work
   [[146]41], in which the author discovered a list of 50 markers using a
   signal-to-noise statistic. Interestingly, B-50 neural phosphoprotein
   (GAP43), one of the 4 genes identified by ERI method, has been
   consistently proven to be down-regulated in an independent work by the
   deficiency of nitric oxide synthase (Nos2) which is indirectly involved
   in controlling proliferation and differentiation of medulloblastoma
   developmental process cells [[147]36]. ERI also detected LTC4S, which
   has also been identified in the list of common cancer signature genes
   [[148]37]. Another significant gene identified by ERI in this dataset
   was Endothelial-3 (END3). Though, there is no reference of END3 to be
   directly involved with medulloblastoma patients but this gene has been
   reported to be highly produced by glioblastoma stem cells [[149]38],
   which is a subtype of malignant brain tumors (Glioma) along with
   medulloblastoma. These results show that ERI can return disease
   correlated genes even when SAM failed to return any.

Conclusion

   In this paper, a novel statistics, Early Response Index (ERI) is
   proposed for the detection of disease correlated features for early
   diagnostic biomarker discovery. The proposed method is shown to have
   significantly higher sensitivity in biomarker detection compared to SAM
   and Localfdr before critical turning points of the disease process,
   after which large number of molecules will be impacted by the disease
   with significant abundance changes. Furthermore, ERI is sensitive to
   the time of sample collection. ERI only returns significant number of
   features before critical turning points when the disease is about to
   impact a large number of molecules. Consequently, ERI can be used for
   identifying the best time frame for disease intervention.

   Besides increased sensitivity at early stage, ERI has also been shown
   to have a higher specificity in returning central nervous system (CNS)
   related pathways when it is used early during the disease process than
   using SAM at a later stage. This further illustrates the need for using
   ERI at an earlier stage for more specific and more sensitive biomarker
   discovery.

   In summary, the ERI method has been shown to be very sensitive for
   significant feature detection before critical turning points of disease
   development within the scope of Multiple Sclerosis and some clinical
   cancer dataset used in this study. It potentially has a higher
   specificity for biomarker discovery, and can be used to identify
   critical time frame for disease intervention.

Acknowledgements