Abstract Background Ulcerative colitis is a type of inflammatory bowel disease posing a great threat to the public health worldwide. Previously, gene expression studies of mucosal colonic biopsies have provided some insight into the pathophysiological mechanisms in ulcerative colitis; however, the exact pathogenesis is unclear. The purpose of this study is to identify the most related genes and pathways of UC by bioinformatics, so as to reveal the core of the pathogenesis. Methods Genome-wide gene expression datasets involving ulcerative colitis patients were collected from gene expression omnibus database. To identify most close genes, an integrated analysis of gene expression signature was performed by employing robust rank aggregation method. We used weighted gene co-expression network analysis to explore the functional modules involved in ulcerative colitis pathogenesis. Besides, biological process and pathways analysis of co-expression modules were figured out by gene ontology enrichment analysis using Metascape. Results A total of 328 ulcerative colitis patients and 138 healthy controls were from 14 datasets. The 150 most significant differentially expressed genes are likely to include causative genes of disease, and further studies are needed to demonstrate this. Seven main functional modules were identified, which pathway enrichment analysis indicated were associated with many biological processes. Pathways such as ‘extracellular matrix, immune inflammatory response, cell cycle, material metabolism’ are consistent with the core mechanism of ulcerative colitis. However, ‘defense response to virus’ and ‘herpes simplex infection’ suggest that viral infection is one of the aetiological agents. Besides, ‘Signaling by Receptor Tyrosine Kinases’ and ‘pathway in cancer’ provide new clues for the study of the risk and process of ulcerative colitis cancerization. Keywords: Ulcerative colitis, Bioinformatics analysis, Genes and pathways Introduction Ulcerative colitis (UC) is a subtype of inflammatory bowel disease (IBD), which is a kind of idiopathic, chronic, recurrent, debilitating and nonspecific inflammatory condition, and its characteristic is the alternate periods of remission and active disease ([30]Planell et al., 2013; [31]Strober, Fuss & Mannon, 2007). Worldwide, UC is more common than Crohn’s disease (CD). Both diseases are more common in industrialized countries, particularly in North America and Western Europe, although their incidence is rising in Asia. The whole morbidity reported is between 1.2 and 20.3 cases per 100,000 persons per year, and the prevalence is between 7.6 and 245 cases per 100,000 persons per year ([32]Danese & Fiocchi, 2011; [33]Loftus, 2004). No sex preponderance exists in UC ([34]Bernstein et al., 2006). The peak age at onset of the disease was 30–40 years ([35]Cosnes et al., 2011). A total of 8–14% of patients have a family history of IBD and first-degree relatives to patients with UC have four times the chance of developing the disease ([36]Childers et al., 2014). Studies have confirmed that genes, environment, intestinal microorganisms and autoimmune factors are involved in the etiology of UC ([37]Chu et al., 2016; [38]Dignass et al., 2012). However, the exact pathogenesis of UC is not clear. With the progress of genome-wide research, more and more genes closely related to UC have been discovered. The research of DNA microarrays by [39]Lawrance, Fiocchi & Chakravarti (2001) discovered that the differentially expressed genes (DEGs) in UC inflammatory sites, in addition to the expected variety of cytokine, chemokine related genes, and inflammation-related HNL, NGAL, proliferation-related GRO, as well as the tumor-related DD96, DRAL, MXI1, and immune-related IGHG3, IGLL2, CD74. An RNA Microarray study of IBD, including six UC patients, found that genes related to functions of biosynthetic and metabolic processes, electrolyte transport, such as HNF4G, KLF5, AQP8, ATP2B1, and SLC16A, were significantly down-regulated in UC samples. Nevertheless, the over-expressed genes are mainly involved in such biological processes as Cell motility, Immune and inflammatory response, Antimicrobial response, Regulation of Cell growth and proliferation, and cytokine chemotaxis. For example, CORO1A, MMP12, TIMP1, PTGDS, CD79A, POU2AF1, TNFRSF7, IGFBP5, FSCN1, CCL11, etc ([40]Wu et al., 2007). More recently, a similar study involving 67 UC patients showed significantly up-regulation of genes including SAA1, DEFA5&6, MMP3&7, S100A8&9 ([41]Noble et al., 2008). A meta-analysis of 2,693 UC patients reported about 30 gene loci closely related to UC, including not only TNFSF15, NKX2-3, IL12B, MST1, IL18RAP, HLA, IBD5, RNF186/OTUD3/PLA2G2E, DLD/LAMB1, IL10, CARD9, IFNG/IL26, JAK2, IL23R, but also novel FCGR2A, 5p15, 2p16, CARD9 and ORMDL3 ([42]McGovern et al., 2010). However, genetics only explains 7.5% of the disease variation, with small predictive ability for phenotypes, and are currently limited in clinical practice ([43]UK IBD Genetics Consortium & The Wellcome Trust Case Control Consortium 2, 2009). The aim of this article is to further explore the interaction of genes related to the pathogenesis of UC and the interaction of the enriched signal pathways, elucidating underlying pathogenic events that may contribute to find new and valuable therapeutic targets of the disease. Gene Expression Omnibus is a public database, and dozens of gene expression datasets about UC patients are freely available, which provide very valuable information, and it could be reused to provide new insights into the molecular pathogenesis of UC. In addition, due to the small sample size in single dataset and discrepancies of the characteristics among multiple heterogeneous datasets, individual genome-wide gene expression datasets could have restricted capability in forecasting the functional gene networks. Thus, it is necessary to gather those datasets and synthetically integrate those massive data through systems biology tools, and finally receive the stable and credible results ([44]Marques et al., 2010; [45]Rung & Brazma, 2017; [46]Seifuddin et al., 2013). The robust rank aggregation (RRA) analysis is a strict tool of systems biology, which can be adopted to the comparison of multiple gene ranking lists obtained from experiments on different platforms greatly expanded the sample size, making the identification of genes related to diseases more reliable and valuable ([47]Kolde et al., 2012). The theory of RRA is that by looking at the location of genes respectively in each ranked list and comparing it with a randomly shuffled baseline list, each gene will be assigned a p-value, and the better the location in these ranked lists, the smaller the p-value will be. The final ranking of genes is based on the P value, and logarithmic fold changes (logFC) can be calculated as needed to determine the importance of genes together with the P-value. In the current, systematic review and comprehensive integration of genome-wide gene expression datasets in UC is still missing. Therefore, we performed the systematic review and comprehensively integrated those genome-wide gene expression datasets through RRA to identify the most probable causative genes of UC. We hope to mark out some deepening insights into UC pathogenesis and provide some molecular target for therapeutic. Moreover, we would use weighted gene co-expression network analysis(WGCNA) to categorise those important and aberrantly expressed genes into several biologically functional modules ([48]Langfelder & Horvath, 2008; [49]Prom-On et al., 2010), which could be biologically meaningful gene clusters and play important roles in UC pathogenesis. Materials & Methods Datasets search and eligibility criteria On the Gene Expression Omnibus (GEO) home page ([50]http://www.ncbi.nlm.nih.gov/geo/), “UC biopsy” was used as the search term, and the datasets in the search results were filtered according to the following criteria: (1) the gene expression profile measured by microarray chip technology; (2) the dataset was a comparison between active UC patients’ tissue and non-UC patients’ healthy tissue; (3) Sample size should be at least 5; (4) The database provided raw data or gene expression. Fragments Per Kilobase of transcript per Million fragments mapped (FPKM) matrix files for these datasets and can be used for reanalysis. The raw data is the direct information measured by instrument, in CEL format, which can be processed by R and converted into TXT format of gene expression FPKM matrix. The gene expression FPKM matrix files provided by the website should not have been normalized. Datasets that did not meet the above criteria are excluded. Robust Rank Aggregation (RRA) analysis The data set of a single platform is difficult to reach a large sample size, and the result is of low credibility. We used the RRA analysis method to comprehensively compare and analyze the results obtained from the genetic difference analysis of each platform, and selected the genes with strong consistency and difference, so as to make the final differentially expressed genes (DEGs) more convincing. Multiple packages of R software were applied for data processing and statistical analysis ([51]R Core Team, 2018; [52]Gentleman et al., 2004). Affy package for data preprocessing read.AnnotatedDataFrame(), read in the grouping information file for the samples(UC patients and controls); read.csv(), read in the annotations files of gene expression omnibus platform (GPL), including the conversion of probes to gene symbols; eset.rma <- justRMA(), datExpr=exprs (eset.rma), these two-step functions apply the RMA method to normalize original files, with the purpose of adjusting the overall characteristics of a single sample to make it more suitable for comparison. Surrogate variable analysis (SVA) package for batch effect removing Batch effect is caused by different samples under different conditions such as experiment time, experiment environment, instrument, etc., and merely data normalization cannot remove batch effect. SVA package were used to remove the batch effects from different samples of the same platform ([53]Chen et al., 2011; [54]Leek et al., 2012). This step is performed using Empirical Bayes method, whose core function is ComBat Finally, gene expression value matrix files with row name as gene symbol and column name as sample number were obtained for each platform for further analysis. Limma package for differential genes analysis The limma package is a comprehensive package with many options for loading data, data pre-processing (background correction, intra-group normalization and inter-group normalization), and differential genetic analysis. The function of empirical Bayes linear regression method for finding differential genes is very popular. At the same time, limma package is very scalable. Both one channel and tow channel data can be analyzed for differential genes, even including quantitative PCR and RNA-seq data types ([55]Ritchie et al., 2015). The gene expression matrix files obtained in the last step were used for differential gene analysis between UC and Control groups by Limma package respectively, so as to acquire the DEGs of each platform ([56]Wettenhall & Smyth, 2004). MakeContrasts() as the key function and gene rank lists of different platforms were generated. In the process, the False Discovery Rate (FDR) is calculated by benjamini-hochberg correction method, which means a adjusted P-Value, but the P-Value is still used as the basis for the significance judgment of the result. RobustRankAggreg package for RRA analysis The RobustRankAggreg package was used to implement RRA analysis for Gene rank lists of different platforms to generate the most valuable DEGs ([57]Kolde et al., 2012). Core functions: list(), rankMatrix(), aggregateRanks Genes with P value < 0.05 and —logFC—>1 were selected, and the smaller the P value, the higher the ranking, often small P value of the gene corresponds to a large —logFC—. The final result was visualized by pheatmap package. WGCNA In order to clarify the main role of DEGs in the pathogenesis of UC, this method is used to cluster genes with close relationship in the same module. The weighted gene co-expression network was constructed by the WGCNA package of R. First, an appropriate gene expression FPKM matrix file is required. A number of genes and suitable samples were extracted from the raw data, and the matrix file is the FPKM of these genes for each sample. The DEGs generated in the RRA analysis were only the most important genes, and could not present the overall picture of the co-expression network. In order to cover most valuable difference genes, we adjusted the cut off value to p < 0.05 and —logFC—>0.14. In other studies, —logFC— values are often different in order to select sufficient and relatively high value genes for WGCNA. For example, Yan et al. selected —logFC—>0.26 ([58]Yan et al., 2018), while [59]Lu et al. (2014) set —logFC—>0.585 in order to get more differentiated genes . Besides, only samples from the same platform can be combined for WGCNA. To make the results more convincing, we selected [60]GPL570 with the largest sample size, including 143 UC patients and 79 controls from eight datasets. Then, hclust() was used to hierarchical clustering of samples by average method and results in the initial sampletree. The following we defined sample clustering height = 80 to remove the isolated samples from the group, so as to obtain a more hierarchical sampletree for further analysis. The core process of WGCNA is to build a scale-free distributed topological network, making the functional modules developed more cohesive ([61]Langfelder & Horvath, 2008). In the view of many relevant references prove that when the scale-free fit index is greater than