Abstract

Background
Pancreatic carcinoma (PC) is one of the most aggressive cancers affecting human health. It is essential to identify candidate biomarkers for the diagnosis and prognosis of PC. The present study aimed to identify diagnostic and prognostic biomarkers of PC.

Methods
Differentially expressed genes (DEGs) were identified from the mRNA expression profiles of GSE62452, GSE28735 and GSE16515. Functional analysis and protein-protein interaction network analysis were performed to explore the biological functions of the identified DEGs. Diagnostic markers for PC were identified using ROC curve analysis. Prognostic markers were identified via survival analysis of TCGA data. The protein expression pattern of the identified genes was verified in clinical tissue samples. A retrospective clinical study was performed to evaluate the correlation between the expression of candidate proteins and the survival time of patients. Moreover, a comprehensive analysis of combinations of multiple genes/proteins for predicting the prognosis of PC was performed using both TCGA data and clinical data. In vitro studies were undertaken to elucidate the potential roles of these biomarkers in the clonability and invasion of PC cells.

Findings
In total, 389 DEGs were identified. These genes were mainly associated with pancreatic secretion, protein digestion and absorption, cytochrome P450 drug metabolism, and energy metabolism pathways. The top 10 genes were selected using Fisher's exact test. ROC curve analysis demonstrated that TMPRSS4, SERPINB5, SLC6A14, SCEL, and TNS4 could be used as biomarkers for the diagnosis of PC. Survival analysis of TCGA data and clinical data suggested that TMC7, TMPRSS4, SCEL, SLC2A1, CENPF, SERPINB5 and SLC6A14 are potential biomarkers for the prognosis of PC. Comprehensive analysis showed that a combination of the identified genes/proteins can predict the prognosis of PC.
Mechanistically, the identified genes contribute to the clonability and invasiveness of PC cells.

Interpretation
We synthesized several sets of public data and preliminarily clarified the pathways and functions involved in PC. Candidate molecular markers, including a novel gene, TMC7, were identified for the diagnosis and prognosis prediction of PC. Moreover, we found that the combination of TMC7, TMPRSS4, SCEL, SLC2A1, CENPF, SERPINB5 and SLC6A14 can serve as a promising indicator of the prognosis of PC patients. The candidate proteins may contribute to the clonability and invasiveness of PC cells. This research provides novel insight into the molecular mechanisms of PC as well as its diagnostic and prognostic markers.

Fund
National Natural Science Foundation of China [No. 81602646 & 81802339], Natural Science Foundation of Guangdong Province [No. 2016A030310254] and China Postdoctoral Science Foundation [No. 2016M600648].

Keywords: Pancreatic carcinoma, Diagnosis, Prognosis, Biomarker, Function

Highlights
* We synthesized public data and clinical data and identified key differentially expressed genes in pancreatic carcinoma.
* Candidate molecular markers were identified for diagnosis and prognosis prediction of pancreatic carcinoma.
* The combination of the markers can serve as a promising prognostic indicator in patients with pancreatic carcinoma.
* A novel gene, TMC7, that is overexpressed in pancreatic carcinoma was identified. The role of TMC7 was preliminarily explored.

Research in context

Evidence before this study
Early detection of pancreatic carcinoma (PC) is essential in order to provide patients with an optimal therapeutic approach. Carbohydrate antigen 19-9 is the only diagnostic marker approved by the FDA, but its diagnostic potential is limited by its restricted sensitivity and specificity. Integrated analysis is also required to find accurate prognostic biomarkers that help to guide patients' therapy.
Currently, a variety of potential biomarkers in blood and tumors have been reported. However, many studies focused only on individual, unrelated genes, which are not sufficient for the diagnosis and prognosis of PC. Moreover, few of these biomarkers have been validated for clinical use. Thus, the combination of different markers as diagnostic or prognostic indices appears promising. Microarrays run on high-throughput platforms have emerged as an efficient tool for screening differentially expressed genes (DEGs) in cancers and identifying promising biomarkers for the diagnosis and prognosis of cancers. Many gene expression profiling microarray studies have been conducted to find DEGs in various cancers. However, inconsistent results are commonly obtained, due either to sample heterogeneity across independent studies or to the use of only a single cohort.

Added value of this study
In this study, we synthesized several sets of public data and clinical data. Candidate molecular markers were identified for diagnosis and prognosis prediction of PC. Moreover, we found that the combination of the markers can serve as a promising prognostic indicator in patients with PC. A novel gene, TMC7, that is overexpressed in PC was identified. The role of the candidate differentially expressed proteins, including TMC7, in PC cells was preliminarily explored. This research provides novel insight into diagnostic and prognostic markers as well as the molecular mechanisms of PC.

Implications of all the available evidence
Our study suggests that the seven identified genes are potential target genes in PC. Inhibition of these genes may offer a potential therapeutic strategy for the treatment of PC. We plan to explore whether the expression pattern of the potential diagnostic markers exists in circulating tumor cells from PC patients.
We hope that these markers can support early diagnosis and help to predict prognosis, which might provide essential information for personalized treatment decisions for individual patients.

1. Introduction
Pancreatic carcinoma (PC) is a highly malignant tumor that accounts for 216,000 new cancer cases annually and causes >200,000 deaths a year worldwide [1,2]. The 5-year survival rate for patients with PC is <5% due to early metastasis to regional lymph nodes and hematogenous spread to distant organs. The only potentially curative treatment of PC is surgical resection. However, approximately 80% of tumors are unresectable at the time of diagnosis [3,4]. For patients with advanced-stage PC, chemotherapy is the treatment of choice, although the regimens have extensive side effects, making them unsuitable for patients with a low performance status. For this reason, early detection of PC is essential in order to provide patients with an optimal therapeutic approach. In addition, integrated analysis is required to find accurate prognostic biomarkers that help to guide patients' therapy.

Recent advances in genome-wide studies have revealed the diverse and complex genetic alterations of PC patients, which may explain the diverse disease behavior seen in clinical settings. Currently, a variety of potential biomarkers in blood and tumors have been reported. However, many studies focused only on individual, unrelated genes, which are not sufficient for the diagnosis and prognosis of PC. Moreover, few of these biomarkers have been validated for clinical use [5,6]. Thus, the combination of different markers as diagnostic or prognostic indices appears promising. Recently, microarrays run on high-throughput platforms have emerged as an efficient tool for screening differentially expressed genes (DEGs) in cancers and identifying promising biomarkers for the diagnosis and prognosis of cancers [7].
Many gene expression profiling microarray studies have been conducted to find DEGs in various cancers [8,9,10]. However, inconsistent results are commonly obtained, due either to sample heterogeneity across independent studies or to the use of only a single cohort.

The aim of the present study was to explore possible molecular mechanisms and potential diagnostic and prognostic biomarkers of PC. First, three datasets from the Gene Expression Omnibus (GEO) database were combined, and key genes involved in PC were selected using bioinformatic methods. The diagnostic value of the genes was evaluated using ROC curve analysis, and their prognostic value was evaluated using survival analysis of TCGA data. Next, the protein expression pattern of the candidate genes was examined in clinical tissue samples. We further performed a retrospective clinical study of the relationship between the expression of candidate proteins and the survival time of patients. Moreover, the combination of multiple proteins for predicting the prognosis of PC was evaluated. In vitro experiments were conducted to elucidate the potential roles of these biomarkers in the clonability and invasion of PC cells.

2. Materials and methods

2.1. Differential expression analysis
The mRNA expression profiles of GSE62452 [11], GSE28735 [12] and GSE16515 [13] were downloaded from the GEO data repository (http://www.ncbi.nlm.nih.gov/geo/). GSE62452 includes 69 PC samples and 61 normal pancreatic tissue samples. GSE28735 includes 45 PC samples and 45 normal pancreatic tissue samples. GSE16515 consists of 36 PC samples and 16 normal samples. In total, 150 PC and 122 nonmalignant pancreatic tissue samples were included. The original annotation files were downloaded, and quality control and normalization were then performed using the robust multi-array average (RMA) method [14].
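RMA combines background correction, quantile normalization, and median-polish probe summarization. As a rough illustration of the quantile normalization step only, the following minimal sketch forces every array onto a shared intensity distribution. The function name `quantile_normalize`, the list-of-columns input layout, and the tie handling are assumptions made for illustration; this is not the authors' code, and a real analysis would rely on an established RMA implementation.

```python
from statistics import mean


def quantile_normalize(columns):
    """Quantile-normalize equally long columns (one column per array).

    Hypothetical helper for illustration: ties are broken by input order
    rather than averaged, which established implementations handle more carefully.
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all arrays must measure the same number of probes")

    # Reference distribution: mean of the k-th smallest value across arrays.
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean(col[k] for col in sorted_cols) for k in range(n)]

    normalized = []
    for col in columns:
        # Probe indices ordered by intensity within this array.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, probe in enumerate(order):
            # Replace each value with the reference value of the same rank.
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    # Toy intensities: three arrays, four probes each (made-up numbers).
    arrays = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0], [3.0, 4.0, 6.0, 8.0]]
    for column in quantile_normalize(arrays):
        print(column)
```

After normalization every array contains the same set of values, so differences between arrays reflect only the rank of each probe within its array, which is what makes samples from one platform comparable before differential expression testing.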
A t-test followed by Benjamini-Hochberg (BH) adjustment was applied to identify DEGs between tumor and normal tissues. Genes that met the cutoff criteria of a fold change > 2 and an adjusted P-value < 0.05 were considered DEGs. Integrative analysis of the three sets of DEGs was performed using RobustRankAggreg [15], and genes with a score < 0.01 were selected. In total, 389 DEGs were selected. Clustering analyses were performed to show the expression patterns of the DEGs in tumor and normal tissues.

2.2. GO and pathway enrichment analyses
Online biological tools were used to investigate the functions and pathways of the candidate DEGs. Gene Ontology (GO) enrichment analysis (http://www.geneontology.org/) was used to explore the biological functions associated with the DEGs [16]. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis (http://www.genome.jp/kegg/pathway.html) was applied to further characterize the pathways of the DEGs [17]. The Database for Annotation, Visualization and Integrated Discovery (DAVID) (https://david.abcc.Ncifcrf.gov/) provides biological interpretation of gene lists. GO and KEGG pathway enrichment analysis of the identified DEGs was performed using DAVID [18,19]. P < 0.05 was set as the threshold. The false discovery rate (FDR) was controlled at the 0.01 threshold.

2.3. Protein-protein interaction (PPI) network construction
A PPI network of the DEGs identified in all three datasets was constructed using the STRING online database (http://string-db.org) [20]. Cytoscape software (http://www.cytoscape.org/) was used to analyze the interactions among the proteins encoded by the candidate DEGs in PC [21]. Functional modules in the interaction network were identified using the Markov clustering algorithm. The most stringent protein interaction screening criterion (confidence > 0.9) was set as the threshold.

2.4. Data standardization
To identify strong candidate genes that can accurately recognize and represent changes in PC, a standardized and comprehensive analysis of the three datasets from different platforms was performed. The YuGene transform was applied to eliminate the impact of the different platforms [22]. YuGene is implemented as an R package available from CRAN.

2.5. ROC analysis
Receiver operating characteristic (ROC) curve analysis was performed in R to evaluate the sensitivity and specificity of the DEGs for PC diagnosis. The area under the curve (AUC) was calculated and used to quantify diagnostic performance.

2.6. Immunohistochemistry
Tissue microarray chips containing 99 samples of pancreatic cancer and 71 samples of paired normal pancreatic tissue were purchased from Outdo Biotech (Shanghai, China). The characteristics of the patients are shown in Supplementary File S1. Immunohistochemistry (IHC) staining was performed as previously described [23]. Microarray chips were stained with anti-CENPF (Abcam), anti-SCEL, anti-GLUT-1, anti-SLC6A14, anti-TMC7 (Thermo Fisher, Waltham, MA, USA), anti-TMPRSS4 and anti-Maspin (GeneTex, Irvine, CA, USA) antibodies. Control staining with only secondary antibodies was performed to ensure specificity. Rabbit IgG polyclonal isotype control (Abcam) was used as the negative control. Staining was scored independently by two experienced pathologists based on the integrated staining intensity and the proportion of positive cells. The final score was determined by adding the staining intensity score and the average proportion-of-positive-cells score; the final score ranged from 0 to 7. For further analysis, samples with a score of 0-3 were defined as low expression, and samples with a score of 4-7 were defined as high expression.

2.7. The Cox regression model
A multivariate Cox regression model was constructed on the TCGA dataset using the seven DEGs (CENPF, SCEL, SLC2A1, SLC6A14, TMC7, TMPRSS4 and SERPINB5). The estimated hazard was calculated using the following formula:

$\hat{h}(t) = \hat{h}_0(t) \exp(x_i' \hat{\beta})$

where $\hat{h}_0(t)$ is the estimated baseline hazard, $x_i$ is the vector of relative expression values of the seven genes for patient $i$, and $\hat{\beta}$ is the vector of estimated regression coefficients of the corresponding genes.

2.8. Survival analysis
Kaplan-Meier survival analysis was performed for patients grouped by gene/protein expression or by high-/low-risk group. For survival analysis of a single gene/protein, patients were assigned to high or low expression groups according to the median mRNA/protein expression level. For survival analysis of the seven genes combined, patients from the TCGA dataset were divided into high- and low-risk groups according to the Cox regression model. For survival analysis of the seven proteins combined, clinical patients with high expression of more than four proteins were assigned to the high-risk group and the remaining patients were assigned to the low-risk group. Statistical significance was assessed using the log-rank test (P < 0.05).

2.9. Cell culture and transient transfection
The human pancreatic cancer cell lines PANC-1 and BxPC3 were obtained from Shanghai Advanced Research Institute, Chinese Academy of Sciences. Cells were cultured in DMEM supplemented with 10% fetal bovine serum. For transient transfection, cells were seeded in 6-well plates and transfected with plasmids using Lipofectamine 2000 (Invitrogen, USA) according to the manufacturer's instructions. Target sequences for shRNAs are summarized in Supplementary Table S1.

2.10. Soft agar assay
A total of 1 × 10^4 pre-treated PANC-1 and BxPC3 cells were used for the soft agar assay as previously described [23]. Cells were cultured for 14 days. Viable cell colonies larger than 0.1 mm were counted using a dissecting microscope.

2.11. Transwell assay
PANC-1 and BxPC3 cells were treated as described above and seeded onto Matrigel-coated Transwell chambers for 24 h. The Transwell assay was performed as previously described [24]. Cells that had passed through the membrane were counted.

3. Results

3.1. Identification of differentially expressed genes in PC and data integration
Three PC datasets were downloaded from the GEO database. Gene expression profiles from GSE16515 identified 1118 differentially expressed genes, with 851 genes upregulated and 267 genes downregulated in PC samples compared with normal pancreatic tissues. Gene expression profiles from GSE62452 identified 325 differentially expressed genes, with 204 genes upregulated and 121 genes downregulated in PC. Gene expression profiles from GSE28735 identified 495 differentially expressed genes, with 253 genes upregulated and 206 genes downregulated in PC (Fig. 1a). Heatmap analysis showed that these genes presented differential expression profiles between normal tissues and cancer tissues (Fig. 1b). For integrated analysis of the three GEO datasets, RobustRankAggreg was applied, and genes with a score < 0.01 were selected. A total of 389 consistently expressed DEGs were identified from the three datasets, and the cluster heatmap is shown in Fig. 1c.

Fig. 1. Identification of differentially expressed genes in PC and data integration. (A) Volcano plots of genome-wide gene expression profiles in PC and adjacent normal tissues from GSE16515, GSE62452 and GSE28735. Red points represent aberrantly expressed mRNAs with P < 0.05 and absolute log2FC > 1. Black points represent normally expressed mRNAs. Green points represent aberrantly expressed mRNAs with P < 0.05 and log2FC < −1. The abscissa shows the fold change in gene expression between tumors and normal tissues. The ordinate shows the −log10 of the adjusted P value for each gene.
(B) Heatmap analysis of differential expression profiles between normal tissues and cancer tissues from the three GEO datasets. DEGs were defined by P < 0.05 and |log2FC| > 1. (C) Cluster heatmap of the 389 consistently expressed DEGs from the integrated analysis of the three GEO datasets. The normalized expression values are represented in shades of red and green, indicating expression above and below the median expression value across all tissues, respectively. (For interpretation of the references to colour in this figure legend, the reader is