Abstract

   Pancreatic ductal adenocarcinoma (PDAC) is generally incurable due to
   the late diagnosis and absence of markers that are concordant with
   expression in several sample sources (i.e., tissue, blood, plasma) and
   platforms (i.e., Microarray, sequencing). We optimized meta-analysis of
   19 PDAC (tissue and blood) transcriptome studies from multiple
   platforms. The key biomarkers for PDAC diagnosis with secretory
   potential were identified and validated in different cohorts. Machine
   learning approach i.e., support vector machine supported by
   leave-one-out cross-validation was used to build and test the
   classifier. We identified a 9-gene panel (IFI27, ITGB5, CTSD, EFNA4,
   GGH, PLBD1, HTATIP2, IL1R2, CTSA) that achieved ∼0.92 average
   sensitivity and ∼0.90 average specificity in distinguishing PDAC from
   healthy samples in five training sets using cross-validation. These
   markers were also validated in proteomics and single-cell
   transcriptomics studies suggesting their prognostic role in the
   diagnosis of PDAC. Our 9-gene classifier can not only clearly
   discriminate between better and poor survivors but can also precisely
   discriminate PDAC from chronic pancreatitis (AUC = 0.95), early stages
   of progression [Stage I and II (AUC = 0.82), IPMA and IPMN (AUC = 1),
   and IPMC (AUC = 0.81)]. The 9-gene marker outperformed the previously
   known markers in blood studies particularly (AUC = 0.84). The
   discrimination of PDAC from early precursor lesions in non-malignant
   tissue (AUC > 0.81) and peripheral blood (AUC > 0.80) may assist in an
   early diagnosis of PDAC in blood samples and thus will also facilitate
   risk stratification upon validation in clinical trials.

   Keywords: biomarker, pancreatic cancer, secretory, transcriptome,
   validation

Introduction

   Pancreatic ductal adenocarcinoma (PDAC) is the most common type of
   pancreatic cancer (PC), which is one of the fatal cancers in the world
   with 5-year survival rate of <5% due to the lack of early diagnosis
   ([25]Fesinmeyer et al., 2005). One of the challenges associated with an
   early diagnosis is distinguishing PDAC from other non-malignant benign
   gastrointestinal diseases such as chronic pancreatitis (CP) due to the
   histopathological and imaging limitations ([26]Brand and Matamoros,
   1998). Although imaging techniques such as endoscopic ultrasound and
   FDG-PET have improved the sensitivity of PDAC detection but have failed
   to distinguish PC from focal mass-forming pancreatitis in >50% cases.
   Dismal prognosis of PC yields from asymptomatic early stages, speedy
   metastatic progression, lack of effective treatment protocols, early
   loco regional recurrence, and absence of clinically useful biomarker(s)
   that can detect PC in its precursor form(s) ([27]Ballehaninna and
   Chamberlain, 2012). Studies have indicated a promising 70% 5-year
   survival for cases where incidental detections happened for stage I
   pancreatic tumors that were still confined to pancreas ([28]Frena,
   2001; [29]Schneider and Schulze, 2003). Therefore, it only seems
   rational to aggressively screen for early detection of PDAC. CA19-9 is
   the most common and the only FDA approved blood-based biomarker for
   diagnosis, prognosis, and management of PC but it has several
   limitations such as poor specificity, lack of expression in the Lewis
   negative phenotype, and higher false positive elevation in the presence
   of obstructive jaundice ([30]Ballehaninna and Chamberlain, 2012). A
   large number of carbohydrate antigens, cytokeratin, glycoprotein, and
   Mucinic markers and hepatocarcinoma–intestine–pancreas protein, and
   PC-associated protein markers have been discovered as a putative
   biomarkers for management of PC ([31]Ballehaninna and Chamberlain,
   2013). However, none of these have demonstrated superiority to CA19-9
   in the validation cohorts. Previously, our group discovered a novel
   five-genes-based tissue biomarker for the diagnosis of PDAC using
   innovative meta-analysis approach on multiple transcriptome studies.
   This biomarker panel could distinguish PDAC from healthy controls with
   94% sensitivity and 89% specificity and was also able to distinguish
   PDAC from CP, other cancers, and non-tumor from PDAC precursors at
   tissue level ([32]Bhasin et al., 2016). The relevance of tissue-based
   diagnostic markers remains unclear owing to the limitations of
   obtaining biopsy samples. Additionally, most current studies are based
   on small sample sizes with limited power to identify robust biomarkers.
   Provided the erratic nature of PC, the major unmet requirement is to
   have reliable blood-based biomarkers for early diagnosis of PDAC.

   The crucial requisite for better PDAC diagnosis has driven a large
   number of genome-level studies defining the molecular landscape of PDAC
   to identify early diagnosis biomarkers and potential therapeutic
   targets. Despite many genomics studies, we do not have a reliable
   biomarker that is able to surpass the sensitivity and specificity of
   CA19-9. The independent studies suffer from inherent statistical
   limitations where the datasets derived from different batches,
   techniques and platforms and analytic methods result in the lack of
   concordance ([33]Ramasamy et al., 2008). The published gene signatures
   of individual microarray studies are not concordant with comparative
   analysis and meta-analysis studies when standard approaches are used
   due to variability in analytical strategies ([34]Ramasamy et al.,
   2008).

   In our work, we have included all the available gene expression
   datasets for PDAC versus healthy subjects from GEO^[35]1 and
   ArrayExpress database^[36]2 measured via microarray or sequencing
   platforms. We have included the datasets derived from blood and tissue
   sources excluding cell lines in our analysis, which was not included
   previously. The cell lines were excluded for they do not depict normal
   cell morphology and do not maintain markers and functions seen in vivo.

   The approach of combining multiple studies has previously been stated
   to reveal biological insight by increasing the reproducibility and
   sensitivity which is generally not evident in the independent original
   datasets ([37]Wang et al., 2004). Using the uniform pre-processing,
   normalization and batch correction approaches in the meta-analysis can
   assist in eliminating false-positive results. Therefore, we used
   multiple datasets in combinations and further divided them in training,
   testing and validation sets to identify and validate the markers with
   secretory signal peptides. We hypothesize that proteins with secretory
   potential will be secreted out of the tissue into the blood and these
   markers can be used as prognostic markers in a non-invasive manner.
   There were no previous studies on identification of marker genes that
   could be used with least-invasive methods. Also, a set of multiple
   genes targeting different pathways and biological processes are more
   reliable and sensitive than single gene-based marker for complex
   diseases like cancer ([38]Ramasamy et al., 2008). We also corroborated
   the protein expression of our markers in proteomics datasets obtained
   from human protein atlas (HPA)^[39]3.

Materials and Methods

Dataset Identification

   The publicly available microarray repositories i.e., ArrayExpress (see
   text footnote 2) and GEO (see text footnote 1) were searched for gene
   expression studies of human pancreatic specimens. The selected datasets
   were divided into five training sets and fourteen independent
   validation sets for initial development and validation of biomarkers.
   To avoid the representation of the datasets only from tissues the few
   blood studies available were divided across all training and validation
   phase of this study.

   Each training dataset ([40]GSE18670, E-MEXP-950, [41]GSE32676,
   [42]GSE74629, and [43]GSE49641) included a minimum of four samples of
   normal pancreas and a minimum of four samples of PDAC. In training set
   we included minimum two datasets with source pancreatic tissue and
   peripheral blood. This was done to identify a predictor based on genes
   that are detectable in both pancreatic tissue and blood. Datasets
   [44]GSE18670 (Set1: 6 normal, 5 PDAC), [45]GSE32676 (Set6: 6 normal, 24
   PDAC) and E-MEXP-950 (Set3: 10 normal, 12 PDAC) was derived from
   pancreatic tissue, whereas [46]GSE74629 (Set4: 14 normal, 32 PDAC) and
   [47]GSE49641 (Set5: 18 normal, 18 PDAC) contain transcriptome profile
   of peripheral blood PDAC patients.

   Further, 14 validation sets were also divided into three groups, one
   “Test sets” ([48]Table 1A); second “Validation Sets” ([49]Table 1A) and
   third “Prospective Validation Sets” ([50]Table 1B). Five Tissue studies
   were included: one from microdissected tissue samples (Set6: 6 normal,
   6 PDAC) and four from whole tissues (Set7: 45 normal, 40 PDAC; Set8: 6
   normal, 6 PDAC; Set9: 8 normal and 12 PDAC and Set10: 15 normal, 33
   PDAC). One blood study from peripheral blood was also validated using
   the biomarker (E-Set11: 14 normal, 12 PDAC).

TABLE 1A.

   Datasets used for development and validation of secretory genes based
   PDAC classifier.
   Groups Dataset Normal Tumor Sample type Platform Accession
   Training Sets Set 1 6 5 Enriched U133 Plus 2.0 E-GEOD-18670
   Set 2 6 24 Whole Tissue U133 Plus 2.0 E-GEOD-32676
   Set 3 10 12 Microdissected U133A E-MEXP-950
   Set 4 14 32 Peripheral Blood HumanHT-12 V4.0 [51]GSE74629
   Set 5 18 18 Peripheral Blood Gene St 1.0 [52]GSE49641
   Test sets Set 6 6 6 Microdissected U133A E-MEXP-1121
   Set 7 45 40 Whole Tissue Gene St 1.0 [53]GSE28735
   Set 8 6 6 Whole Tissue Gene St 1.0 [54]GSE41368
   Set 9 8 12 Whole Tissue U133 Plus 2.0 E-GEOD-71989
   Set 10 15 33 Whole Tissue U133 Plus 2.0 E-GEOD-16515
   Set 11 14 12 Peripheral Blood U133 Plus 2.0 E-GEOD-15932
   Validation Sets V1 61 69 Whole Tissue Gene St 1.0 E-GEOD-62452
   V2 20 36 Whole Tissue U133 Plus 2.0 E-GEOD-15471
   V3 9 45 Whole Tissue Agilent-028004 [55]GSE60979
   V4 12 118 Whole Tissue U219 [56]GSE62165
   V5 50 33 Blood Platelet HiSeq-2500 [57]GSE68086
   [58]Open in a new tab

TABLE 1B.

   Datasets used for prospective validation of secretory genes based PDAC
   classifier.
   Group Dataset Group Pancreatic tumor Sample type Platform Accession
   Prospective Validation Sets PV1 4 Normal 150 PDAC Tissue RNA-Seq TCGA
   PV2 61 Normal 69 PDAC (Stage I and II) Whole Tissue Gene St 1.0
   E-GEOD-62452
   PV3 9 (Pancreatitis) 9 (PDAC) Whole Tissue U95Av2 E-EMBL-6
   PV4 7 (Normal) 15 (IPMA, IPMC, IPMN) Microdissected U133 Plus 2.0
   [59]GSE19650
   [60]Open in a new tab

   For Phase I Validation we selected five datasets from different
   platforms from whole tissues and blood platelets, including comparison
   of normal versus PDAC samples similar to training and test sets. Four
   whole tissue datasets (V1: 61 normal, 69 PDAC; V2: 20 normal, 36 PDAC;
   V3: 9 normal, 45 PDAC; and V4: 12 normal, 118 tumor) and one dataset
   from blood with samples from blood platelets (V5: 50 normal, 33 PDAC)
   were included.

   In Prospective Validation, the performance of the developed PDAC
   biomarker panel was tested on four additional independent datasets
   i.e.,: (i) PDAC versus normal (pancreatic) tissue from TCGA database
   (PV1: 4 normal, 150 PDAC), (ii) PDAC versus normal pancreatic tissues
   in early stages [PV2: 61 normal, 69 PDAC (Stage I and II)], (iii) PDAC
   versus CP (PV3: 9 pancreatitis, 9 PDAC), and (iv) PDAC precursor
   lesions (IPMA, IPMC, and IPMN) with associated invasive carcinoma [PV4:
   6 normal, 15 PDAC precursors (5 IPMA, 5 IPMC, 5 IPMN)] versus normal
   pancreas tissues ([61]Table 1B). Three datasets utilized
   oligonucleotide- based microarray platforms (two versions of Affymetrix
   GeneChips and Gene St 1.0 microarrays in one dataset) whereas the
   cancer genome atlas (TCGA) data is the sequencing data obtained using
   RNA-sequencing technology.

Quality Control and Outlier Analysis

   Stringent quality control and outlier analysis was performed on all
   datasets used for training and validation to remove low quality arrays
   from the analysis. The technical quality of arrays was determined on
   the basis of background values, percent present calls and scaling
   factors using various bioconductor packages ([62]Wilson and Miller,
   2005; [63]Kauffmann et al., 2009). The arrays with high quality were
   subjected to outlier analysis using array intensity distribution,
   principal component analysis, array-to-array correlation and
   unsupervised clustering. The samples that were identified to be of low
   quality or identified as outliers were eliminated from the analysis.

Mapping of Platform Specific Identifiers to Universal Identifier

   To facilitate the collation of the differentially expressed (DE) genes
   identified by analysis of individual datasets, the platform specific
   identifiers associated with each dataset were annotated to
   corresponding universal gene symbol identifiers. Gene symbols were used
   in subsequent analyses including comparative analysis of different
   datasets as well as predictor development. Briefly Affymetrix data was
   annotated using the custom CDF from brainarray^[64]4. Affymetrix probe
   set IDs that could not be mapped to an Entrez gene identifiers were
   removed from the gene lists. For Agilent- 028004, HumanHT-12 V4.0 and
   Gene St 1.0 studies the raw matrix was directly retrieved from the GEO
   interactive web tool, GEO2R^[65]5, which were further processed and
   normalized. The normalized and annotated genes for TCGA was obtained
   from Broad GDAC Firehose database^[66]6. We have removed 29 non-PDAC
   samples from TCGA during validation as our classifier was trained using
   PDAC samples ([67]Peran et al., 2018).

Pre-processing and Normalization of Microarray Datasets

   Potential bias introduced by the range of methodologies used in the
   original microarray studies, including various experimental platforms
   and analytic methods, was controlled by applying a uniform
   normalization, preprocessing and statistical analysis strategy to each
   dataset. Raw microarray dataset were normalized using vooma ([68]Law,
   2013) algorithm which estimates the mean-variance relationship and use
   the relationship to compute appropriate gene expression level weights.
   Similarly, RNA-sequencing datasets were normalized using voom algorithm
   ([69]Law et al., 2014). The normalized datasets were used for
   performing meta-analysis as well as predictor development.

Differential Gene Expression Analysis for Generating Meta-Signature

   To generate PDAC meta-signature, we performed differential expression
   analysis on individual datasets from training sets by comparing normal
   versus cancer samples. To identify DE genes, a linear model was
   implemented using the linear model microarray analysis software package
   (LIMMA) ([70]Ritchie et al., 2015). LIMMA estimates the differences
   between normal and cancer samples by fitting a linear model and using
   an empirical Bayes method to moderate standard errors of the estimated
   log-fold changes for expression values from each probe set. In LIMMA,
   all genes were ranked by t-statistics using a pooled variance, a
   technique particularly suited to small numbers of samples per
   phenotype. The DE probes were identified on the basis of absolute fold
   change and Benjamini and Hochberg corrected P-value ([71]Benjamini and
   Hochberg, 1995). The genes with multiple test corrected P-value < 0.05
   were considered as DE. Comparative analyses were performed to identify
   those genes that are significantly DE across multiple PDAC datasets.
   Genes that are concordantly over or under expressed in three PDAC
   datasets (two tissues and one blood study) were included in PDAC
   meta-signature.

Secretory Gene Set Identification

   To identify a non-invasive predictor based on genes with secretory
   potential, we selected genes that had signal peptide for secretory
   proteins with no transmembrane segments (noTM). The Biomart package in
   R ([72]Durinck et al., 2005) with quering the gene symbols to SignalP
   database facilitated the analysis. The Ensembl Biomart database enables
   users to retrieve a vast diversity of annotation data for specific
   organisms. After loading the library, one can connect to either public
   BioMart databases (Ensembl, COSMIC, Uniprot, HGNC, Gramene, Wormbase
   and dbSNP mapped to Ensembl) or local installations of these. One set
   of functions can be used to annotate identifiers such as Affymetrix,
   RefSeq and Entrez-Gene, with information such as gene symbol,
   chromosomal coordinates, OMIM and Gene Ontology or vice-versa.

Training and Independent Validation of PDAC Classifier Using Support Vector
Machine

   The upregulated secretory genes DE from PDAC meta-signature was used
   for training of PDAC classifier. Classifier was generated by
   implementing the random forest (RF) using caret{R} and support vector
   machines (SVM) approach using e1071{R}. Polynomial kernel was used to
   develop the classifier. RF and SVM was first tuned using 10-fold
   cross-validation at different costs and the best cost and gamma
   functions were later used to perform classification on testing and
   validation sets. Classifiers were trained using normalized,
   preprocessed gene expression values from each of the five training
   datasets independently. To independently validate our model in each
   dataset, performance of classifiers in the training sets was evaluated
   using internal LOOCV. We assessed the classifier of five to ten genes
   selected from the set of upregulated genes to identify the biomarker
   panel that works best in both tissue and blood-based studies. The
   complete sets of possible combinations of five to ten genes were drawn
   from the upregulated genes and the accuracy of each classifier was
   assessed. The performance of classifiers was measured using
   threshold-dependent (e.g., sensitivity, specificity, accuracy) and
   threshold-independent ROC analysis. In ROC analysis, the AUC provides a
   single measure of overall prediction accuracy. The biomarker panel with
   the highest performance in the training sets (both tissue and
   blood-based studies) was chosen for assessment of predictive power in
   six independent test datasets using threshold-dependent and
   -independent measures i.e., AUC. SVM outperformed the RF models in the
   training datasets.

Survival Analysis

   To determine the association of key genes with survival in PC, we
   performed survival analysis using the TCGA database^[73]7. The survival
   analysis was performed on PDAC mRNA of 150 patients [excluding samples
   related to normal tissues and non-PDAC tissues ([74]Peran et al.,
   2018)]. Survival analysis was performed on the basis of individual mRNA
   expression using the Kaplan-Meier (K-M) approach ([75]Kaplan and Meier,
   1958). The normalized expression data for each gene was divided into
   high and low median groups. The survival analysis was performed using
   K-M analysis from survival package in R. The results of the survival
   analysis were visualized using K-M survival curves with log rank
   testing. The results were considered significant if the P-values from
   the log rank test were below 0.05. The effects of mRNA on the event
   were calculated using univariate Cox proportional hazard model without
   any adjustments.

Pathways Analysis

   The biological pathways for the genes was performed using ToppFun
   software of ToppGene suite ([76]Chen et al., 2009). ToppGene is a
   one-stop portal for gene list enrichment analysis and candidate gene
   prioritization based on functional annotations and protein interactions
   network. ToppFun detects functional enrichment of the provided gene
   list based on transcriptome, proteome, regulome (TFBS and miRNA),
   ontologies (GO, Pathway), phenotype (human disease and mouse
   phenotype), pharmacome (Drug-Gene associations), literature
   co-citation, and other features. The biological pathways with FDR <
   0.05 were considered significantly affected.

Results

PDAC Differential Expression Analysis and Meta-Signature Development

   To develop a gene based minimally invasive biomarker for
   differentiating PDAC from normal/pancreatitis, we identified 19
   microarray and RNA sequencing studies containing PDAC and normal
   samples. These datasets based on their origin i.e., blood or tissue
   were divided into training sets, independent test sets, validation sets
   and prospective validation sets ([77]Figure 1; Overview of
   meta-analysis strategy). For classifier training, we performed
   meta-analysis on 3-tissue and 2-blood-based PDAC studies to identify
   meta-signature that are DE in blood and tissue during PC. To account
   for the differences in microarray/sequencing platform used in studies,
   we processed and normalized studies according to their platforms and
   selected the genes that are common across multiple studies. The number
   of DE secretory genes ranged from 480 to 810 genes, totaling 2,010
   significantly DE genes in the five training datasets. We identified 74
   genes (35 downregulated and 39 upregulated) with concordant
   directionality in at least two of the three tissue datasets and one of
   the two blood datasets ([78]Figure 2A, shown in red color and
   [79]Supplementary Table S1).

FIGURE 1.

   [80]FIGURE 1
   [81]Open in a new tab

   Overview of the meta-analysis approach for development and validation
   of PDAC biomarker panel. Predictor was developed using the data from
   Set1-Set5 (S1-S5 in Step 4) and was further tested on Set5-Set10 and
   validated on V1-V5 and PV1-PV4 datasets.

FIGURE 2.

   [82]FIGURE 2
   [83]Open in a new tab

   Meta-signature of genes that are consistently DE in multiple datasets
   and candidate PDAC diagnostic biomarker panel. (A) Venn diagram of the
   five training datasets for the DE genes. 74 genes (marked in red) with
   concordant directionality are common to at least 2 of the 3 tissue
   datasets (Set 1 to Set 3) and one of the 2 blood datasets (Set 4 and
   Set 5). (B) Heatmap of the 74 meta-signature genes DE in PDAC from five
   training datasets. Red = upregulated, Green = downregulated. (C)
   Heatmap of the 9-upregulated marker genes in training sets for PDAC
   biomarker panel. (D) Description of the genes from the 9-gene based
   PDAC biomarker panels. (E) AUC plot [CI: 95%] for 9-gene PDAC
   classifier across the five training sets using leave one out
   cross-validation (LOOCV). Set1 and Set 2 are matched normal samples
   i.e., obtained from same individual. Set 3 normal samples are not
   matched, Normal samples are obtained from the patients undergoing
   surgery with other pancreatic diseases. Set 4 and Set 5 are blood
   sourced studies therefore the normal subjects were matched for gender,
   age and habits.

   The 74 genes showed consistent expression across the PDAC samples in
   the selected five datasets (3 tissue and 2 blood datasets) as compared
   to the normal pancreas ([84]Figure 2B), with the extent of
   over-expression or under-expression denoted by red or green shading,
   respectively. Pathway analysis of these 74 common PDAC genes depicted
   significant enrichment (P-value < 0.05) in multiple extracellular
   matrix-associated pathways (e.g., Ensemble of genes encoding
   extracellular matrix and extracellular matrix-associated proteins,
   remodeling of the extracellular matrix, structural ECM glycoproteins,
   Cell adhesion molecules) ([85]Supplementary Figure S1). These pathways
   play important roles in the adhesion of cells that is a key process in
   progression of PDAC.

Variables Selection and Class Prediction Analysis in Training Sets

   The 39 upregulated genes from the 74 common genes were selected for
   predictor development. We have specifically targeted upregulated genes
   for their therapeutics and diagnostic applications. We plotted boxplots
   of these 39 genes across all the five training sets and removed the
   genes with opposite direction in any of these five sets. The 27
   concordantly upregulated genes ([86]Supplementary Table S2) were
   selected after the boxplot analysis. These combined gene set clearly
   discriminated between PDAC and normal pancreas samples in all the
   datasets of training set, as depicted in the heatmap for 27 genes
   ([87]Supplementary Figure S2A) and principal component analysis (PCA)
   plots ([88]Supplementary Figure S2B). The predictors based on 5 to 10
   genes were developed and assessed by LOOCV implementing with a
   polynomial kernel based SVM classifier. All the possible combination of
   five to ten genes were tested from 27 upregulated genes. The
   classifiers containing the selected 9 genes i.e., IFI27, ITGB5, CTSD,
   EFNA4, GGH, PLBD1, HTATIP2, IL1R2, and CTSA performed with highest
   accuracy. These 9 genes were upregulated in PDAC as compared to the
   normal pancreas in all the five training sets ([89]Figures 2C,D).

   We performed LOOCV cross-validation analysis of the 9-gene PDAC
   classifier across the five training datasets to determine its
   predictive performance. For each of the five training datasets
   individually, sensitivity ranged from 0.83 to 1.0 and specificity 0.71
   to 1.00 for the predictor ([90]Supplementary Figure S3A, [91]Table 2).
   Comparison of the 9-gene PDAC classifier performance in tissues
   (Set1-Set3) and blood datasets (Set 4 and Set 5) showed approximately
   0.94 sensitivity and 0.97 specificity for the tissue datasets, in
   contrast to 0.88 sensitivity and 0.80 specificity for the blood
   datasets ([92]Supplementary Figure S3B, [93]Table 2). AUC for the three
   tissue datasets ranged from 0.89 to 1.00 with median = 0.96
   ([94]Supplementary Figure S3B) and for two blood datasets from 0.92 to
   0.96 with median = 0.94 ([95]Table 2, [96]Supplementary Figure S3C and
   [97]Figure 2E), demonstrated threshold independent performance). The
   average gene expression plots with all the samples combined from the
   five training sets ([98]Supplementary Figure S4A) and the PCA plots of
   training sets ([99]Supplementary Figure S4B) from 9 genes supported the
   discriminatory power of the marker combinations in identification of
   PDAC subjects from normal.

TABLE 2.

   The performance matrix of the 9-gene PDAC classifier on the training,
   testing, validation and prospective validation sets.
   Groups                      Datasets Accuracy Sensitivity Specificity AUC
   Training Sets                Set 1     1.00      1.00        1.00     1.00
                                Set 2     1.00      1.00        1.00     1.00
                                Set 3     0.87      0.83        0.90     0.89
                                Set 4     0.82      0.93        0.71     0.93
                                Set 5     0.86      0.83        0.89     0.97
   Test Sets                    Set 6     1.00      1.00        1.00     1.00
                                Set 7     0.92      0.90        0.93     0.94
                                Set 8     1.00      1.00        1.00     1.00
                                Set 9     0.95      0.91        1.00     1.00
                                Set 10    0.96      0.93        1.00     0.94
                                Set 11    0.73      0.75        0.71     0.80
   Validation Sets                V1      0.79      0.76        0.83     0.83
                                  V2      0.98      0.97        1.00     1.00
                                  V3      0.94      1.00        0.89     0.98
                                  V4      0.95      1.00        0.91     0.99
                                  V5      0.83      0.84        0.82     0.89
   Prospective Validation Sets   PV1      0.82      0.94        0.72     0.93
                                 PV2      0.74      0.74        0.75     0.82
                                 PV3      0.83      0.78        0.89     0.95
                               PV4-IPMA   1.00      1.00        1.00     1.00
                               PV4-IPMC   0.84      0.83        0.86     0.81
                               PV4-IPMN   1.00      1.00        1.00     1.00
   [100]Open in a new tab

Biological Significance of Selected Genes

   CTSA and CTSD are involved in extracellular matrix associated proteins;
   IFI27 and IL1R2 in cytokine signaling in immune system; ITGB5 and
   HTATIP2 in apoptotic pathway and EFNA4, GGH and PLBD1 are involved in
   Ephrin signaling, fluoropyrimidine activity and glycerophospholipid
   biosynthesis, respectively. The genes selected based on the presence of
   signal peptide for secretion are supposed to be secretory; however, the
   signal peptide is also present in several membrane proteins ([101]Uhlen
   et al., 2015). In the selected classifier genes, CTSD, EFNA4 and IL1R2
   are predicted to be secretory proteins whereas CTSA, GGH, PLBD1, IFI27,
   ITGB5 and HTATIP2 are predicted to be intracellular or membrane bound
   proteins in HPA. Furthermore, CTSA and PLBD1 are also localized in
   Lysosomes and GGH is secretory protein as per UniProtKB^[102]8
   predictions. Since our 9 gene markers could be detected with a
   detectable expression in both tissues and blood samples from PDAC
   patients, we further validated the performance of these genes for PDAC
   Diagnosis.

Independent Performance of Classifier in Differentiating PDAC From Normal

   The biomarker set designed above was further tested in six independent
   sets with five tissue and one blood based PDAC studies. The classifier
   genes depicted an upregulation pattern in most of independent
   validation sets [103]Supplementary Figure S5. The boxplot revealed
   higher expression of all the 9 genes, averaged over test sets, in the
   tumor samples as compared to the healthy ([104]Figure 3A). For each of
   the six datasets individually, sensitivity ranged from 0.75 to 1.00 and
   specificity from 0.71 to 1.00 for the predictor ([105]Figure 3B,
   [106]Table 2). Comparison of the 9-gene PDAC classifier performance in
   tissue and blood showed an average 0.94 sensitivity and 0.97
   specificity for the tissue datasets, in contrast to 0.75 sensitivity
   and 0.71 specificity for the blood dataset. AUC for the five tissue
   datasets ranged from 0.94 to 1.00 and for one blood datasets AUC was
   0.80 ([107]Figure 3C, [108]Table 2).

FIGURE 3.

   [109]FIGURE 3
   [110]Open in a new tab

   Performance of 9-gene PDAC Classifier on test sets using leave one out
   cross-validation (LOOCV). (A) The boxplot of the averaged expression of
   the genes across all the six test datasets. The P-values as calculated
   by t-test between the groups are on the individual genes. (B)
   Diagnostic performance of the 9-gene PDAC classifier on the six test
   sets of PDAC vs. normal pancreas. Sensitivity (Sens.) and specificity
   (Spec.) indicated besides each set. (C) AUC plot for 9-gene [CI:
   0.95–0.99] PDAC classifier across the six test datasets.

High Accuracy of Our 9-Gene PDAC Classifier in Predicting PDAC in 5
Independent Validation Sets

   In five validation sets, the 9-gene PDAC classifier accurately
   predicted the class of PDAC compared to normal with maximum AUC of 1.00
   in the independent validation tissue (V2) set that contained 20 normal
   and 36 PDAC samples. More than 0.95 AUC was observed in three
   independent validation tissue sets (V2, V3 and V4) that contained 36,
   45 and 118 PDAC and 20, 9 and 12 normal pancreas samples, respectively,
   ([111]Figure 4A and [112]Table 1B). The boxplot revealed higher
   expression of all the 9 genes, averaged over validation sets, in the
   tumor samples as compared to the healthy samples ([113]Figure 4B). In a
   tissue dataset (V1) containing 61 normal and 69 tumor samples a
   specificity of 0.83 and sensitivity of 0.76 was determined. In 50
   normal and 33 PDAC blood platelet sample (V5) 0.84 sensitivity, 0.82
   specificity and 0.88 AUC was achieved. The prediction of the PDAC class
   in comparison to normal was accurate with a sensitivity ranging
   0.76–1.00 and specificity ranging between 0.82 and 1.00 ([114]Figure 4C
   panel II, [115]Table 2). [116]Supplementary Figure S6 presents the
   heatmap of the nine genes in individual validation datasets and the PCA
   plots depicting the discrimination of PDAC from normal samples.

FIGURE 4.

   [117]FIGURE 4
   [118]Open in a new tab

   Performance of 9-gene PDAC Classifier on validation sets using leave
   one out cross-validation (LOOCV). (A) The boxplot of the averaged
   expression of the genes across all the five validation datasets. The
   P-values as calculated by t-test between the groups are mentioned on
   the individual genes. (B) Diagnostic performance of the 9-gene PDAC
   classifier on the five validation sets of PDAC vs. normal pancreas.
   Sensitivity (Sens.) and specificity (Spec.) indicated besides each set.
   (C) AUC plot [CI: 0.95–0.99] for 9-gene PDAC classifier across the five
   validation datasets.

Cross-Platform Performance of Classifier on TCGA Pancreatic Samples

   We further estimated the cross-platform performance of classifiers on
   the most widely used PC sample resource namely TCGA. TCGA dataset
   contains 150 PDAC samples and 4 normal samples and gene expression
   pattern analysis is not in consistence with other studies
   ([119]Supplementary Figure S7C). The cross-platform validation of
   classifier on TCGA data also achieved high sensitivity (0.94) and
   specificity (0.72) indicating the stability of the classifier in
   handling the cross-platform variation in absolute gene expression
   signal ([120]Figure 5 PV1). The classifier achieved an excellent AUC of
   0.93 ([121]Table 2). The lower specificity of TCGA datasets might be
   due to the limited number of normal samples in the dataset. Heatmap of
   the 9 genes and PCA plots depicts the discrimination of two classes
   with the nine genes in the TCGA samples ([122]Supplementary Figure S7
   PV1).

FIGURE 5.

   [123]FIGURE 5
   [124]Open in a new tab

   Performance of 9-gene PDAC Classifier on prospective validation sets
   using leave one out cross-validation (LOOCV). AUC plot [CI: 0.95–0.99]
   for 9-gene PDAC classifier and the diagnostic performance of (A) the
   classifier for PV1 dataset, (B) the classifier for PV2 dataset. (C) The
   classifier for IPMA, IPMC and IPMN subjects in PV4 dataset and (D) the
   classifier for PV3 dataset.

   The markers did not show concordance in the TCGA dataset; however, the
   significance of these genes in the survival analysis can be very well
   established using the TCGA database. The samples were partitioned at
   median for selected nine-genes and survival analysis was performed on
   two clusters ([125]Supplementary Figure S8). The results showed the
   combined survival of genes was able to clearly discriminate between
   better and poor survivors (P-value significance of 0.05 and hazard
   Ratio of 0.85), indicating their prognostic role in PDAC. High CTSD,
   EFNA4, HTATIP2, IFI27, ITGB5 and PLBD1 expression is associated with
   shortened survival time. Also, the survival analysis of these genes
   with a hazard ratio of >1 at significant P-value indicates their
   prognostic importance.

Performance of Classifier in Identifying Early Stage PDAC

   As it is well established in literature that lack of established
   strategies for early detection of PDAC result in poor prognosis and
   mortality, we therefore tested performance of our classifiers on stage
   I and II PDAC. The predictor could distinguish stage I and II PDACs
   from normals with 0.74 sensitivity and 0.75 specificity and an AUC 0.82
   ([126]Figure 5 PV2, [127]Table 2). Heatmap of the nine genes and PCA
   plots depicts the discrimination of two classes with the nine genes in
   early stages PDAC samples ([128]Supplementary Figure S7 PV2).

Performance of Classifier in Discriminating PDAC From Pancreatitis

   Identification of CP and discriminating it from PDAC is a key
   challenge. As our 9-gene PDAC classifier accurately established the
   differences between PDAC and CP, it is important to include further
   validation steps for the biomarker panel. The array U95Av2 have the
   recorded signal intensity values for all the genes except PLBD1, hence
   only 8 genes were tested as a classifier for the discrimination of CP
   from PDAC. We tested the biomarker on the PV3 dataset wherein there
   were nine samples each for CP and PDAC. The classifier genes on PV3
   dataset depicted significantly altered expression pattern between PDAC
   from CP ([129]Supplementary Figure S7 PV3). The classifier achieved a
   specificity of 0.89 and sensitivity of 0.78 with an overall accuracy of
   0.83 and an AUC of 0.95 in discriminating PDAC from CP ([130]Figure 5
   PV3, [131]Table 2).

Classifier Discriminated Pre-cancerous Lesions From Normal Pancreas With Good
Accuracy

   To estimate the ability of the biomarker panel in discriminating
   precancerous lesions from a normal pancreas, we tested its performance
   on independent dataset containing normal main pancreatic duct
   epithelial cells microdissected by lasers and neoplastic epithelial
   cells from potential PDAC precursor lesions i.e., IPMA, IPMC and IPMN
   [15]. Classifier genes were consistently overexpressed in the PDAC
   samples, GGH was under-expressed in IPMA samples whereas it was
   overexpressed across the other PDAC precursors, IPMC and IPMN
   ([132]Supplementary Figure S9). The 9-gene PDAC classifier separated
   all potential PDAC precursor (IPMA, IPMC, IPMN) samples from the normal
   pancreatic duct samples except for one normal sample and one IPMC
   sample ([133]Figure 5 PV4). The biomarker panel differed IPMA and IPMN
   from normal pancreas with 1.00 sensitivity and 1.00 specificity,
   achieving an AUC of 1.00 ([134]Figure 5 PV4). The predictor separated
   IPMC from healthy pancreas with 0.83 sensitivity and 0.86 specificity,
   achieving an AUC of 0.81 ([135]Table 2).

Classifier Performed Better Than Previously Known Markers

   To estimate the performance of our current marker as compared to the
   previously established markers we compared the performance of our
   marker with each study [[136]Bhasin et al. (2016), [137]Balasenthil et
   al. (2017), [138]Kisiel et al. (2015), and Immunovia ([139]Mellby et
   al., 2018)]. We used polynomial kernel for each set of markers and
   selected best model to record the performance on all the training, test
   and validation datasets ([140]Supplementary Figure S10 and
   [141]Supplementary Table S3). We found that all the methods performed
   well in tissue biopsies samples whereas when applied to the blood
   studies the performance of our marker set is the best ([142]Figure 6).
   Our set of markers has performed well in tissues as well as blood
   studies and will be an ideal minimally invasive biomarker for studying
   in future studies and clinical trials.

FIGURE 6.

   [143]FIGURE 6
   [144]Open in a new tab

   Comparative performance of 9-gene PDAC Classifier with different
   previously established biomarkers. AUC plot [CI: 0.95–0.99] for 9-gene
   PDAC classifier across the three tissue and three blood datasets. The
   boxes colored in mustard color have greater than 0.80 AUC.

Validation of the Markers in Single-Cell Transcriptomics Studies

   Furthermore, as the markers are derived from bulk sequencing protocols
   it is important to know if the markers discovery is not influenced by
   different cell-types in normal and cancerous pancreas. Therefore, we
   used single-cell RNA-sequencing data published by [145]Peng et al.
   (2019) suggesting heterogeneity in PDAC tumor to plot expression of our
   markers on different cell-types. Using standard Seurat single-cell
   analysis methodology ([146]Butler et al., 2018; [147]Stuart et al.,
   2019), we identified that our markers are not associated with any
   cell-types and are expressed across major cell types in pancreatic
   cancer ([148]Supplementary Figure S11). All our markers depicted
   upregulation in various tumor microenvironment cells including immune
   cells and endothelial cells.

Validation of Markers in Blood-Based Proteomics Study

   The nine-gene markers in the classifier were discovered and validated
   from the transcriptomics studies, hence the validation of their
   expression at the protein level is necessary. Therefore, we confirmed
   the expression of the nine genes at the protein level in publicly
   available proteomics studies and HPA. The immunolabeling of the
   proteins of the respective genes in HPA ([149]Supplementary Figure S12)
   suggest higher staining of the proteins in tumors as compared to the
   normal samples except IFI27 where the expression of the protein cannot
   be detected. To further validate the protein expression of our markers
   we searched for the corresponding proteins in multiple pancreatic
   cancer proteomics studies ([150]Chen et al., 2005;
   [151]Crnogorac-Jurcevic et al., 2005; [152]Cui et al., 2009;
   [153]McKinney et al., 2011; [154]Kosanam et al., 2013; [155]Wang et
   al., 2013; [156]Iuga et al., 2014). CTSD, a cathepsin family protein,
   and Ephrin and Interferon gamma family markers are found to be highly
   expressed in multiple proteomics studies ([157]Chen et al., 2005;
   [158]Cui et al., 2009; [159]McKinney et al., 2011).

Discussion

   We applied a data mining approach to a large number of publicly
   available transcriptome datasets derived from pancreatic cancer and
   healthy blood and tissues, followed by class prediction analysis using
   machine learning and validation of the classifier in the independent
   datasets to discover candidate PDAC biomarkers ([160]Harsha et al.,
   2009; [161]Ranganathan et al., 2009). We explored the genes with
   secretory peptide DE in the PDAC as compared to normal pancreas/blood,
   for the first time to investigate an accurate secretory/non-invasive
   biomarker panel for the PDAC diagnosis. We report here a 9-gene PDAC
   classifier that differentiates PDAC as well as the precursor lesions
   from the normal with high accuracy. This 9-gene PDAC classifier was
   validated independently in 12 different blood and tissue studies. The
   9-gene PDAC classifier encodes proteins with secretory potential in
   pancreas and few other tissues.

   Approximately 2500 candidate biomarkers have been associated with PDAC
   previously while some of these biomarkers are in various evaluation
   stages only CA19-9 is approved by FDA ([162]Koprowski et al., 1979,
   [163]1981; [164]Hyöty et al., 1992). However, accuracy of CA19-9 is not
   accurate enough for screening, especially for an early detection of
   PDAC. Presently, the extensive validation of diagnostic or predictive
   gene/protein expression biomarkers for accurate discrimination between
   healthy patients, benign, premalignant and malignant disease are still
   lacking. Therefore, we aimed to identify a biomarker panel with greater
   sensitivity and specificity and identified a 9-gene marker panel that
   performs with high accuracy in discriminating PDAC with normal pancreas
   across multiple platforms, using either whole/microdissected tissue or
   peripheral blood.

   To determine whether the genes in our classifier reflect key
   pathophysiological pathways associated with the development of PDAC, we
   reviewed available information for the role of these genes. Most of our
   9-gene classifier genes have been linked to tumorigenesis, indicating a
   causal role in the development and progression of PDAC. HTATIP2 is
   involved in apoptosis function in liver metastasis related genes
   ([165]Shi et al., 2009), gastric cancer ([166]Xu et al., 2010) and
   pancreatic cancer ([167]Ouyang et al., 2014). IFI27, functioning in
   immune system, has been suggested as a marker of epithelial
   proliferation and cancer ([168]Grutzmann et al., 2003; [169]López-Casas
   and López-Fernández, 2010). ITGB5 involved in integrin signaling have
   been found to be upregulated in several analysis studies ([170]Van den
   Broeck et al., 2012). The Integrin and ephrin pathways have been
   proposed to play a crucial role in pancreatic carcinogenesis and
   progression, including ITGB1, a paralog of ITGB5, and EPHA2 as most
   important regulators ([171]Van den Broeck et al., 2012). EPHA2 belongs
   to ephrin receptor subfamily and is involved in developmental events,
   especially in the nervous system and in erythropoiesis. To this family
   belongs one of our genes EFNA4 which activates another ephrin receptor
   EPHA5. IL1R2 was identified as possible candidate gene in PDAC that can
   lead to defects of the apoptosis pathway ([172]Rückert et al., 2010).
   Moreover, Il1, the ligand of IL1R2, is secreted by the pancreatic cells
   ([173]Arlt et al., 2002) and has an important function in inflammation
   and proliferation that can also trigger the apoptosis ([174]Dupraz et
   al., 2000; [175]Ruckdeschel et al., 2002; [176]Yoshida et al., 2004).
   CTSD have been shown to be upregulated in the PDAC cancer
   ([177]Iacobuzio-Donahue et al., 2003). AGR2, a surface antigen, has
   been shown to promote the progression of PDAC cells through regulation
   of Cathepsins B and D genes ([178]Dumartin et al., 2011). CTSA was
   identified as one of the 76 deregulated genes in a study aiming for the
   development of early diagnostic markers as well as potential novel
   therapeutic targets for both familial and sporadic PDAC
   ([179]Crnogorac-Jurcevic et al., 2013). PLBD1 has been found to be
   upregulated in various studies with five-fold increase in cell lines
   ([180]Makawita et al., 2011) and in study where the effect of
   pancreatic β-cells inducing immune-mediated diabetes was studied
   ([181]Salem et al., 2014). Metabolism-related GGH has been found to be
   relevant and upregulated in gallbladder carcinomas ([182]Washiro et
   al., 2008).

   Most of the genes in the 9-gene classifier (ITGB1, EPHA2, IL1R2) are
   involved in the migration, immune pathways, adhesion and metastasis of
   PDAC or other cancers, that are specifically associated with the
   developmental events and signaling in the progression of cancer. To
   corroborate the involvement of these genes in PDAC progression and
   early stages of PDAC development, we evaluated the expression levels of
   these genes in the early lesions of PDAC precursors i.e., LIGD-IPMN,
   HGD-IPMN and InvCa-IPMN ([183]Figure 5) [15]. Eight genes except GGH
   are upregulated in IPMA, IPMN, and IPMC as well as in PanINs, as
   compared to a normal pancreas, demonstrating their enhanced expression
   is linked with the progression of PDAC that occurs early during
   development of malignancy. The outcomes of our study clearly show that
   our 9-gene classifier reflect drivers of early defects during
   progression and development of PDAC. This argument is further
   strengthened by the survival analysis of the genes where five of the
   nine genes (CTSA, CTSD, EFNA4, IFI27 and IL1R2) are strongly related to
   discriminating better and poor survivors.

   Since individuals with CP are at increased risk of developing PDAC and
   pathological discrimination is challenging between CP and PDAC which
   makes it important for a classifier to discriminate between these two
   disease stages. While other studies have performed meta-analysis of
   transcriptome data for PDAC to identify the genes that are
   overexpressed in PDAC ([184]Iacobuzio-Donahue et al., 2003;
   [185]López-Casas and López-Fernández, 2010; [186]Munding et al., 2012),
   they are irrelevant in identifying the markers for prognosis of PDAC.
   Our 9-gene biomarker classifier accurately distinguished premalignant
   and malignant pancreatic lesions such as PanIN, IPMA, IPMN and IPMC
   from healthy pancreas. As all 9 genes of our classifier are upregulated
   in PanIN (as compare to normal pancreas) already, it indicates that
   these 9 genes are dysregulated in early lesions during the process of
   PDAC development and therefore could assist in an early detection of
   PDAC.

   Further, to analyze the potential of the 9-gene biomarker in accurate
   classification of PDAC subjects versus healthy subjects we compared our
   biomarker combination with previously known and established biomarkers.
   Our analysis also indicates that the 9-gene biomarker panel including
   multiple genes, rather than a single biomarker, is more powerful and
   had possibility to improve the specificity and selectivity for an
   accurate detection of PDAC. The idea behind generation of biomarker
   panel with the better identification in blood sample, in corroboration
   with the tissue studies, is fulfilled here. The previously established
   markers worked well in the tissue studies but could not show their
   similar potential in blood studies.

   Further, the protein expression of selected biomarker genes was also
   examined to determine their association with PDAC at protein levels.
   The analysis depicted that multiple gene product/proteins corresponding
   to biomarkers genes depicted higher expression in pancreatic cancer
   tissues. Interestingly some marker (e.g., EFNA4, GGH) also depicted
   over-expression in other cancers indicating their association with
   tumor development and progression related hallmark processes. In recent
   years, multiple proteomics studies were performed to understand the
   proteome landscape of the PDAC but still lack in generating
   comprehensive picture due to technological limitations. Most of the
   proteomics technique can measure the expression of 2,000-3,000 proteins
   that is far from generating the global overview of proteome. High
   expression of Cathepsin family proteins specifically CTSD is noted in
   several proteomics studies which was also the case for Ephrin and
   Interferon gamma family markers ([187]Chen et al., 2005; [188]Cui et
   al., 2009; [189]McKinney et al., 2011). Also, the expression of these
   genes is not found to be related to a particular cell-type in
   pancreatic cancer cell lineage. However, the fact that the overall
   study is based on bulk sequencing data cannot be overlooked and these
   cells may comprise of multiple cell-types which may or may not
   influence the overall methodology of marker selection. Overall, the
   protein-expression of the selected genes and their expression in
   multiple cell-types of pancreatic cancer is established. However, the
   aforementioned limitations have to be challenged before designing the
   diagnostic panel. The 9-gene markers identified here still needs
   validation in a bigger cohort for its potential in identifying
   accurately the early stages but this marker combination potentially has
   shown its discriminatory power across various blood and tissue datasets
   obtained from different sources and different platforms.

Data Availability Statement

   The datasets presented in this study can be found in online
   repositories. The names of the repository/repositories and accession
   number(s) can be found in the article/[190]Supplementary Material.

Author Contributions

   IK performed the bioinformatics analysis and wrote the manuscript. MB
   supervised the bioinformatics analysis and edited the manuscript. Both
   the authors read and approved the final manuscript.

Conflict of Interest

   BIDMC will be filling patent on behalf of MB and IK on the use of
   biomarker panel for early PDAC diagnosis. MB is an equity holder at
   BiomaRx and Canomiks.

Acknowledgments