Abstract Abstract Background Preventive and therapeutic options for renal failure are still limited. Gene expression profile analysis is powerful in the identification of biological differences between end stage renal failure patients and healthy controls. Previous studies mainly used variance/regression analysis without considering various biological, environmental factors. The purpose of this study is to investigate the gene expression difference between end stage renal failure patients and healthy controls with partial least squares (PLS) based analysis. Methods With gene expression data from the Gene Expression Omnibus database, we performed PLS analysis to identify differentially expressed genes. Enrichment and network analyses were also carried out to capture the molecular signatures of renal failure. Results We acquired 573 differentially expressed genes. Pathway and Gene Ontology items enrichment analysis revealed over-representation of dysregulated genes in various biological processes. Network analysis identified seven hub genes with degrees higher than 10, including CAND1, CDK2, TP53, SMURF1, YWHAE, SRSF1, and RELA. Proteins encoded by CDK2, TP53, and RELA have been associated with the progression of renal failure in previous studies. Conclusions Our findings shed light on expression character of renal failure patients with the hope to offer potential targets for future therapeutic studies. Virtual Slides The virtual slide(s) for this article can be found here: [31]http://www.diagnosticpathology.diagnomx.eu/vs/1450799302127207 Keywords: Renal failure, Partial least squares, Gene expression, Network Background Renal failure refers to the medical condition that kidneys fail to adequately filter waste products from blood. It is usually not reversible and patients with end stage renal failure have to be treated with long term dialysis or organ transplant [[32]1,[33]2]. Preventive and therapeutic options for this disease are still limited [[34]3]. Capture the gene expression signature of end stage renal failure patients may enhance the development of novel therapeutic strategies. High throughput microarray analysis is powerful to characterize the underlying pathogenesis of various diseases. Several studies have investigated the gene expression difference between renal failure patients and controls using this strategy [[35]4-[36]6]. These studies generally carried out variance or regression analysis to detect dysregulated genes. This statistical procedure ignored unaccounted array specific factors, including various biological, environmental factors. Previous studies [[37]7,[38]8] have suggested that partial least squares (PLS) based expression profile analysis is efficient in dealing with large amount of genes and fairly small samples. Compared with variance and regression analysis, PLS based analysis is more sensitive while maintaining reasonable high specificity, small false discovery rate and false non-discovery rate. Previous study using PLS analysis on other complex disease such as breast cancer has proved its feasibility [[39]9]. Therefore, capturing the gene expression signature in renal failure patients by using PLS based analysis may provide new understanding of the pathogenesis and offer potential therapeutic targets. In the current study, to investigate the gene expression difference between end stage renal failure patients and healthy controls, we performed PLS-based analysis by using gene expression data from the gene expression omnibus (GEO) database. Pathways or Gene Ontology items significantly over-represented with dysregulated genes were also acquired by using enrichment analysis. In addition, we constructed a protein-protein interaction (PPI) network with the proteins encoded by dysregulated genes to identify hub genes that may be related with disease progression. Methods Microarray data The whole data set of gene expression profile [40]GSE37171 from the GEO database was downloaded. This series represents transcription profile of 63 end-stage renal failure patients and 20 healthy controls. All samples were taken from peripheral blood. The dataset was based on the [41]GPL570 platform ([HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array). This study is approved by the institutional review board of the affiliated hospital of Xuzhou medical college (NO. 131081). Identification of differentially expressed genes Normalization of raw intensity values was performed by using Robust Multi-array Analysis (RMA) [[42]10]. The resulting log2-transformed expression value of each probe was used in subsequent analysis. A multivariate linear model was used to describe the relationship between gene expression values and the disease status. For each sample, the model is expressed as: [MATH: y=i=1pαixi+b :MATH] (1) where y is the binary variable of disease status, 0 coded as control and 1 coded as renal failure; p is the total number of genes in the array. PLS analysis was then carried out to estimate the effects of each gene. The main purpose of PLS regression was to build orthogonal components (called ‘latent variables’ here). It is: [MATH: COVtk,ukmax :MATH] (2) [MATH: Subjecttotk=1anduk=1 :MATH] (3) where t [k ]is the kth latent variable decomposes from all individuals’ genes expression data X (the matrix of n × p, n refers to the number of individuals and p refers to the number of genes), u [k ]is the kth latent variable decomposes from the phenotype data Y (n × 1) [[43]11]. The non-linear iterative partial least squares (NIPALS) algorithm [[44]12] was used to calculate the PLS latent variables derived from the expression profile on the target trait, as follows: 1) Randomly initialize u [0] =  Y 2) w = X ^T u [0 ], w = w/||w|| 3) t = Xw 4) c = Y ^T t, c = c/||c|| 5) u = Yc 6) if u - u [0 ] < 10E-8, go to step 7), else u [0 ] =  u , repeat step 2)-5) 7) X  =  X - tt ^T X , Y  =  Y - tt ^T Y Then go back to 2) to calculate the next latent variable. To evaluate the importance of the expressed genes on disease, the statistics of variable importance on the projection (VIP) [[45]13] was calculated as: [MATH: VIPj=pk=1hCor2Y,tkwkj2k=1hCor2Y,tk :MATH] (4) where, the Cor operator is the Pearson correlation coefficient, and for each w [k ], it should be normalized by dividing ||w[ k ]||, and h is the number of latent variables used in the model. To avoid the model over fitting, the best number of latent variables (h above) was determined by the prediction accuracy based on three folds cross validation. The VIP for each gene was then calculated with the h latent variables to obtain genes associated with renal failure. In addition, the false discovered rate (FDR) procedures were used to control the expected proportion of incorrectly rejected null hypotheses. The permutation procedure (N = 10000 times) was used to obtain the empirical distribution of PLS-based VIP in each replicate. The FDR for each gene was then calculated as: [MATH: FDRi=j10000ipBoolVIPi,j>VIPi/10000p :MATH] (5) where Bool represents the logical value of expression: “True” codes as 1 and “False” codes as 0. Significant genes were selected with a threshold of FDR < 0.01. Enrichment analysis Annotation of all probes was carried out by using the simple omnibus format in text (SOFT) files. To capture biologically relevant character of differentially expressed genes, enrichment analysis was implemented. All genes were firstly mapped to the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways ([46]http://www.genome.jp/kegg/) [[47]14] and Gene Ontology database [[48]15]. Biological processes significantly overrepresented with differentially expressed genes were identified by using the hyper geometric distribution test. Network analysis PPI is important for all biological processes since most protein function through its interaction with other proteins [[49]16]. Among the proteins encoded by differentially expressed genes, those with more interactions with other proteins may play more important roles in the progression of renal failure. To visualize the interaction among these proteins and identify key molecules, a network was constructed by using the software Cytoscape (V 2.8.3, [50]http://www.cytoscape.org/)[[51]17]. The database ([52]http://ftp.ncbi.nlm.nih.gov/gene/GeneRIF/) of NCBI was used to get the interaction information of all proteins. For each protein, the number of links (interactions) was defined as its degree. Proteins with degrees over 10 were selected as hub molecules in this study. Results According to the prediction accuracy based on cross validation, six latent variables were used in the detection of differentially expressed genes (Figure [53]1). The results revealed that 573 genes were differentially expressed between end-stage renal failure patients and healthy controls, including 141 downregulated genes in patients and 432 upregulated ones. For all genes in the array, 6084 genes were mapped to various pathways, including 203 differentially expressed genes. The pathways enriched with differentially expressed genes are listed in Table [54]1. These pathways are involved in several systems, including nervous system, digestive system, and endocrine system. In addition, three cancer pathways, transcriptional misregulation in cancers (hsa05202), chronic myeloid leukemia (hsa05220) and small cell lung cancer (hsa05222) were also enriched with differentially expressed genes. A total of 16517 genes in the array were annotated based on the GO database, including 518 differentially expressed genes. Table [55]2 represents the five GO items enriched with selected genes. Protein binding (GO: 0005515) was the most significant GO item with over represented selected genes. In consistent with the pathway analysis, a transcription related GO item: transcription, DNA-dependent (GO: 0006351) was also identified to be overrepresented with dysregulated genes.Figure [56]2 illustrates the interaction network of proteins encoded by differentially expressed genes. Seven proteins, CAND1, CDK2, TP53, SMURF1, YWHAE, SRSF1, and RELA were identified to be hub molecules, with degrees of 31, 29, 22, 19, 15, 12, and 10 respectively. Figure 1. Figure 1 [57]Open in a new tab The distribution prediction accuracy as the number of latent variable number increases. The prediction accuracy achieves 100% when the latent variable number is six. Table 1. Pathways enriched with differentially expressed gene KEGG id Pathway description Pathway class P-value hsa04722 __________________________________________________________________ Neurotrophin signaling pathway __________________________________________________________________ Nervous system __________________________________________________________________ 2.09E-03 __________________________________________________________________ hsa05202 __________________________________________________________________ Transcriptional misregulation in cancers __________________________________________________________________ Cancers __________________________________________________________________ 5.20E-03 __________________________________________________________________ hsa04120 __________________________________________________________________ Ubiquitin mediated proteolysis __________________________________________________________________ Folding, sorting and degradation __________________________________________________________________ 5.22E-03 __________________________________________________________________ hsa05220 __________________________________________________________________ Chronic myeloid leukemia __________________________________________________________________ Cancers __________________________________________________________________ 1.05E-02 __________________________________________________________________ hsa05222 __________________________________________________________________ Small cell lung cancer __________________________________________________________________ Cancers __________________________________________________________________ 2.28E-02 __________________________________________________________________ hsa05010 __________________________________________________________________ Alzheimer's disease __________________________________________________________________ Neurodegenerative diseases __________________________________________________________________ 2.43E-02 __________________________________________________________________ hsa04970 __________________________________________________________________ Salivary secretion __________________________________________________________________ Digestive system __________________________________________________________________ 2.85E-02 __________________________________________________________________ hsa04130 __________________________________________________________________ SNARE interactions in vesicular transport __________________________________________________________________ Folding, sorting and degradation __________________________________________________________________ 3.08E-02 __________________________________________________________________ hsa04912 __________________________________________________________________ GnRH signaling pathway __________________________________________________________________ Endocrine system __________________________________________________________________ 3.17E-02 __________________________________________________________________ hsa04972 __________________________________________________________________ Pancreatic secretion __________________________________________________________________ Digestive system __________________________________________________________________ 3.87E-02 __________________________________________________________________ hsa05100 __________________________________________________________________ Bacterial invasion of epithelial cells __________________________________________________________________ Infectious diseases __________________________________________________________________ 4.05E-02 __________________________________________________________________ hsa04730 Long-term depression Nervous system 4.60E-02 [58]Open in a new tab Table 2. GO items enriched with differentially expressed gene #GO id GO description GO class FDR GO:0005515 __________________________________________________________________ protein binding __________________________________________________________________ Function __________________________________________________________________ 2.92E-06 __________________________________________________________________ GO:0005730 __________________________________________________________________ nucleolus __________________________________________________________________ Component __________________________________________________________________ 4.54E-05 __________________________________________________________________ GO:0005634 __________________________________________________________________ nucleus __________________________________________________________________ Component __________________________________________________________________ 6.16E-05 __________________________________________________________________ GO:0006351 __________________________________________________________________ transcription, DNA-dependent __________________________________________________________________ Process __________________________________________________________________ 4.19E-03 __________________________________________________________________ GO:0002326 B cell lineage commitment Process 3.72E-02 [59]Open in a new tab Figure 2. Figure 2 [60]Open in a new tab Interaction network constructed by proteins encoded by differentially expressed genes. Proteins with more interactions are shown in bigger size. Proteins in red are encoded by downregulated genes in patients while those in blue are encoded by upregulated genes in patients. Discussion Renal failure is a complex medical condition which may result from kidney injury or chronic diseases [[61]18,[62]19]. Microarray is a powerful technology for investigating the gene expression difference between end-stage renal failure patients and healthy controls. However, it is challenging to develop a suitable statistical model to deal with the small sample number and fairly large amount of genes. Previous studies on renal failure mainly used variance or regression analysis, without considering unaccounted array specific factors. Here we used PLS based analysis to identify dysregulated genes in end-stage renal failure patients. Pathway enrichment analysis revealed that overrepresentation of dysregulated genes in various systems. Dysfunction of various systems may be complications of renal failure since kidneys are essential in the maintenance of homeostatic status. In addition, we also detected cancer-related pathways and GO items to be enriched with differentially expressed genes. The correlation between renal failure and cancer related biological processes may due to the dysfunction of cell cycle and DNA repair process in patients. Previous studies have demonstrated the enhanced expression of DNA repair-related proteins and induced cell cycle arrest at G1/S and G2/M in renal failure rats [[63]20-[64]22]. Overrepresentation of dysregulated genes in the chronic myeloid leukemia (hsa05220) pathway revealed the similar gene expression of these two diseases which may explain the causative effect of lymphocytic leukemia on renal failure [[65]19]. These identified biological processes revealed the molecular signatures of renal failure. To detect hub molecules, we constructed a network with proteins encoded by identified differentially expressed genes (Figure [66]2). Several hub molecules have been identified to play important roles in the progression of renal failure before. Take RELA for example, protein encoded by this gene is NF-kappaB p65. In consistent with our results, detection of NF-kappaB p65 based on immunohistochemical staining and ELISA suggested that NF-kappaB p65 in rat glomeruli of multiple organ failure was significantly higher than that of control group [[67]23]. Attenuation of NF-kappaB p65 activation is effective in reducing endotoxic kidney injury [[68]24]. Inhibition of inflammation through NF-κB also reduced renal dysfunction caused by sepsis in mice [[69]25]. The involvement of NF-kappaB p65 in renal failure may be due to its interaction with inflammatory chemokines [[70]26], such as CXCL16, which was increased in active nephrotic syndrome patients and correlated with blood lipids, urine protein and inflammation responses [[71]27]. Genes involved in regulation of cell cycle, TP53 and CDK2, were also identified as hub genes. Their involvements in renal failure through regulation of G1 cell cycle arrest were reported before [[72]28]. Moreover, paricalcitol could prevent cisplatin-induced renal injury by suppressing the up regulation of TP53 and CDK2[[73]29]. Therefore, our study confirmed that these three genes may serve as potential targets for renal failure treatments. For the rest four hub genes, SRSF1, CAND1, SMURF1, and YWHAE, no previous report of their association with renal failure has been proposed before. Protein encoded by SRSF1 is a member of the arginine/serine-rich splicing factor protein family. Up regulation of SRSF1 could increases the cellular pool of active p53 [[74]30], suggesting the implication of SRSF1 in renal failure through its regulation of the p53. For SMURF1, protein encoded by this gene is an ubiquitin ligase that is specific for receptor-regulated SMAD proteins. It is reported that reduction of Smad7 due to the overexpression of Smurf1 in unilateral ureteral obstruction kidneys plays an important role in the progression of tubulointerstitial fibrosis [[75]31], which a harmful process leading inevitably to renal function deterioration. Consistently, our analysis detected the up regulation of SMURF1, suggesting it may contribute to the progression of renal failure through its ubiquitination of SMAD7. Protein encoded by YWHAE belongs to the 14-3-3 family of proteins which mediate signal transduction by binding to phosphoserine-containing proteins. Quantitative protein expression profiling revealed that overexpression of YWHAE prompt the proliferation of renal cancer cells [[76]32]. CAND1 may also promote the progression of renal cell carcinoma through its interaction with carbonic anhydrase IX [[77]33]. Whether the up regulation contributes to the pathogenesis of renal failure needs further investigation. Conclusions In summary, with gene expression profile downloaded from the GEO database, we carried out PLS based analysis to identify differentially expressed genes in end-stage renal failure patients and healthy controls. Pathway and GO enrichment analyses were also implemented to capture biological relevant characters. A network of proteins encoded by differentially expressed genes was constructed to identify key molecules. Our results facilitate the disclosure of the molecular mechanism underlying renal failure progression. Consent Written informed consent was obtained from the patients for the publication of this report and any accompanying images. Competing interest The authors declare that they have no competing interests. Authors’ contributions PM designed the research and revised the manuscript. SD drafted the manuscript. SD, YX and TH carried out data analysis. All authors read and approved the final manuscript. Contributor Information Shuang Ding, Email: shuangding98@163.com. Yinhai Xu, Email: yinhaixu14@163.com. Tingting Hao, Email: tingtinghao14@163.com. Ping Ma, Email: pingm62@aliyun.com. References