Abstract

Background

   Ulcerative colitis is a type of inflammatory bowel disease posing a
   great threat to the public health worldwide. Previously, gene
   expression studies of mucosal colonic biopsies have provided some
   insight into the pathophysiological mechanisms in ulcerative colitis;
   however, the exact pathogenesis is unclear. The purpose of this study
   is to identify the most related genes and pathways of UC by
   bioinformatics, so as to reveal the core of the pathogenesis.

Methods

   Genome-wide gene expression datasets involving ulcerative colitis
   patients were collected from gene expression omnibus database. To
   identify most close genes, an integrated analysis of gene expression
   signature was performed by employing robust rank aggregation method. We
   used weighted gene co-expression network analysis to explore the
   functional modules involved in ulcerative colitis pathogenesis.
   Besides, biological process and pathways analysis of co-expression
   modules were figured out by gene ontology enrichment analysis using
   Metascape.

Results

   A total of 328 ulcerative colitis patients and 138 healthy controls
   were from 14 datasets. The 150 most significant differentially
   expressed genes are likely to include causative genes of disease, and
   further studies are needed to demonstrate this. Seven main functional
   modules were identified, which pathway enrichment analysis indicated
   were associated with many biological processes. Pathways such as
   ‘extracellular matrix, immune inflammatory response, cell cycle,
   material metabolism’ are consistent with the core mechanism of
   ulcerative colitis. However, ‘defense response to virus’ and ‘herpes
   simplex infection’ suggest that viral infection is one of the
   aetiological agents. Besides, ‘Signaling by Receptor Tyrosine Kinases’
   and ‘pathway in cancer’ provide new clues for the study of the risk and
   process of ulcerative colitis cancerization.

   Keywords: Ulcerative colitis, Bioinformatics analysis, Genes and
   pathways

Introduction

   Ulcerative colitis (UC) is a subtype of inflammatory bowel disease
   (IBD), which is a kind of idiopathic, chronic, recurrent, debilitating
   and nonspecific inflammatory condition, and its characteristic is the
   alternate periods of remission and active disease ([30]Planell et al.,
   2013; [31]Strober, Fuss & Mannon, 2007). Worldwide, UC is more common
   than Crohn’s disease (CD). Both diseases are more common in
   industrialized countries, particularly in North America and Western
   Europe, although their incidence is rising in Asia. The
   whole morbidity reported is between 1.2 and 20.3 cases per 100,000
   persons per year, and the prevalence is between 7.6 and 245 cases per
   100,000 persons per year ([32]Danese & Fiocchi, 2011; [33]Loftus,
   2004). No sex preponderance exists in UC ([34]Bernstein et al., 2006).
   The peak age at onset of the disease was 30–40 years ([35]Cosnes et
   al., 2011). A total of 8–14% of patients have a family history of IBD
   and first-degree relatives to patients with UC have four times the
   chance of developing the disease ([36]Childers et al., 2014). Studies
   have confirmed that genes, environment, intestinal microorganisms and
   autoimmune factors are involved in the etiology of UC ([37]Chu et al.,
   2016; [38]Dignass et al., 2012). However, the exact pathogenesis of UC
   is not clear.

   With the progress of genome-wide research, more and more genes closely
   related to UC have been discovered. The research of DNA microarrays by
   [39]Lawrance, Fiocchi & Chakravarti (2001) discovered that the
   differentially expressed genes (DEGs) in UC inflammatory sites, in
   addition to the expected variety of cytokine, chemokine related genes,
   and inflammation-related HNL, NGAL, proliferation-related GRO, as well
   as the tumor-related DD96, DRAL, MXI1, and immune-related IGHG3, IGLL2,
   CD74. An RNA Microarray study of IBD, including six UC patients, found
   that genes related to functions of biosynthetic and metabolic
   processes, electrolyte transport, such as HNF4G, KLF5, AQP8, ATP2B1,
   and SLC16A, were significantly down-regulated in UC samples.
   Nevertheless, the over-expressed genes are mainly involved in such
   biological processes as Cell motility, Immune and inflammatory
   response, Antimicrobial response, Regulation of Cell growth and
   proliferation, and cytokine chemotaxis. For example, CORO1A, MMP12,
   TIMP1, PTGDS, CD79A, POU2AF1, TNFRSF7, IGFBP5, FSCN1, CCL11, etc
   ([40]Wu et al., 2007). More recently, a similar study involving 67 UC
   patients showed significantly up-regulation of genes including SAA1,
   DEFA5&6, MMP3&7, S100A8&9 ([41]Noble et al., 2008). A meta-analysis of
   2,693 UC patients reported about 30 gene loci closely related to UC,
   including not only TNFSF15, NKX2-3, IL12B, MST1, IL18RAP, HLA, IBD5,
   RNF186/OTUD3/PLA2G2E, DLD/LAMB1, IL10, CARD9, IFNG/IL26, JAK2, IL23R,
   but also novel FCGR2A, 5p15, 2p16, CARD9 and ORMDL3 ([42]McGovern et
   al., 2010). However, genetics only explains 7.5% of the disease
   variation, with small predictive ability for phenotypes, and are
   currently limited in clinical practice ([43]UK IBD Genetics Consortium
   & The Wellcome Trust Case Control Consortium 2, 2009).

   The aim of this article is to further explore the interaction of genes
   related to the pathogenesis of UC and the interaction of the enriched
   signal pathways, elucidating underlying pathogenic events that may
   contribute to find new and valuable therapeutic targets of the disease.

   Gene Expression Omnibus is a public database, and dozens of gene
   expression datasets about UC patients are freely available, which
   provide very valuable information, and it could be reused to provide
   new insights into the molecular pathogenesis of UC. In addition, due to
   the small sample size in single dataset and discrepancies of the
   characteristics among multiple heterogeneous datasets, individual
   genome-wide gene expression datasets could have restricted capability
   in forecasting the functional gene networks. Thus, it is necessary to
   gather those datasets and synthetically integrate those massive data
   through systems biology tools, and finally receive the stable and
   credible results ([44]Marques et al., 2010; [45]Rung & Brazma, 2017;
   [46]Seifuddin et al., 2013).

   The robust rank aggregation (RRA) analysis is a strict tool of systems
   biology, which can be adopted to the comparison of multiple gene
   ranking lists obtained from experiments on different platforms greatly
   expanded the sample size, making the identification of genes related to
   diseases more reliable and valuable ([47]Kolde et al., 2012). The
   theory of RRA is that by looking at the location of genes respectively
   in each ranked list and comparing it with a randomly shuffled baseline
   list, each gene will be assigned a p-value, and the better the location
   in these ranked lists, the smaller the p-value will be. The final
   ranking of genes is based on the P value, and logarithmic fold changes
   (logFC) can be calculated as needed to determine the importance of
   genes together with the P-value.

   In the current, systematic review and comprehensive integration of
   genome-wide gene expression datasets in UC is still missing. Therefore,
   we performed the systematic review and comprehensively integrated those
   genome-wide gene expression datasets through RRA to identify the
   most probable causative genes of UC. We hope to mark out some deepening
   insights into UC pathogenesis and provide some molecular target for
   therapeutic.

   Moreover, we would use weighted gene co-expression network
   analysis(WGCNA) to categorise those important and aberrantly expressed
   genes into several biologically functional modules ([48]Langfelder &
   Horvath, 2008; [49]Prom-On et al., 2010), which could be biologically
   meaningful gene clusters and play important roles in UC pathogenesis.

Materials & Methods

Datasets search and eligibility criteria

   On the Gene Expression Omnibus (GEO) home page
   ([50]http://www.ncbi.nlm.nih.gov/geo/), “UC biopsy” was used as the
   search term, and the datasets in the search results were filtered
   according to the following criteria: (1) the gene expression profile
   measured by microarray chip technology; (2) the dataset was a
   comparison between active UC patients’ tissue and non-UC patients’
   healthy tissue; (3) Sample size should be at least 5; (4) The database
   provided raw data or gene expression. Fragments Per Kilobase of
   transcript per Million fragments mapped (FPKM) matrix files for these
   datasets and can be used for reanalysis. The raw data is the direct
   information measured by instrument, in CEL format, which can be
   processed by R and converted into TXT format of gene expression FPKM
   matrix. The gene expression FPKM matrix files provided by the website
   should not have been normalized. Datasets that did not meet the above
   criteria are excluded.

Robust Rank Aggregation (RRA) analysis

   The data set of a single platform is difficult to reach a large sample
   size, and the result is of low credibility. We used the RRA analysis
   method to comprehensively compare and analyze the results obtained from
   the genetic difference analysis of each platform, and selected the
   genes with strong consistency and difference, so as to make the final
   differentially expressed genes (DEGs) more convincing. Multiple
   packages of R software were applied for data processing and statistical
   analysis ([51]R Core Team, 2018; [52]Gentleman et al., 2004).

Affy package for data preprocessing

   read.AnnotatedDataFrame(), read in the grouping information file for
   the samples(UC patients and controls); read.csv(), read in the
   annotations files of gene expression omnibus platform (GPL), including
   the conversion of probes to gene symbols; eset.rma <- justRMA(),
   datExpr=exprs (eset.rma), these two-step functions apply the RMA method
   to normalize original files, with the purpose of adjusting the overall
   characteristics of a single sample to make it more suitable for
   comparison.

Surrogate variable analysis (SVA) package for batch effect removing

   Batch effect is caused by different samples under different conditions
   such as experiment time, experiment environment, instrument, etc., and
   merely data normalization cannot remove batch effect. SVA package were
   used to remove the batch effects from different samples of the same
   platform ([53]Chen et al., 2011; [54]Leek et al., 2012). This step is
   performed using Empirical Bayes method, whose core function is ComBat
   Finally, gene expression value matrix files with row name as gene
   symbol and column name as sample number were obtained for each platform
   for further analysis.

Limma package for differential genes analysis

   The limma package is a comprehensive package with many options for
   loading data, data pre-processing (background correction, intra-group
   normalization and inter-group normalization), and differential genetic
   analysis. The function of empirical Bayes linear regression method for
   finding differential genes is very popular. At the same time, limma
   package is very scalable. Both one channel and tow channel data can be
   analyzed for differential genes, even including quantitative PCR and
   RNA-seq data types ([55]Ritchie et al., 2015).

   The gene expression matrix files obtained in the last step were used
   for differential gene analysis between UC and Control groups by Limma
   package respectively, so as to acquire the DEGs of each platform
   ([56]Wettenhall & Smyth, 2004). MakeContrasts() as the key function and
   gene rank lists of different platforms were generated. In the process,
   the False Discovery Rate (FDR) is calculated by benjamini-hochberg
   correction method, which means a adjusted P-Value, but the P-Value is
   still used as the basis for the significance judgment of the result.

RobustRankAggreg package for RRA analysis

   The RobustRankAggreg package was used to implement RRA analysis for
   Gene rank lists of different platforms to generate the most valuable
   DEGs ([57]Kolde et al., 2012). Core functions: list(), rankMatrix(),
   aggregateRanks Genes with P value < 0.05 and —logFC—>1 were selected,
   and the smaller the P value, the higher the ranking, often small P
   value of the gene corresponds to a large —logFC—. The final result was
   visualized by pheatmap package.

WGCNA

   In order to clarify the main role of DEGs in the pathogenesis of UC,
   this method is used to cluster genes with close relationship in the
   same module. The weighted gene co-expression network was constructed by
   the WGCNA package of R.

   First, an appropriate gene expression FPKM matrix file is required. A
   number of genes and suitable samples were extracted from the raw data,
   and the matrix file is the FPKM of these genes for each sample. The
   DEGs generated in the RRA analysis were only the most important genes,
   and could not present the overall picture of the co-expression network.
   In order to cover most valuable difference genes, we adjusted the cut
   off value to p < 0.05 and —logFC—>0.14. In other studies, —logFC—
   values are often different in order to select sufficient and relatively
   high value genes for WGCNA. For example, Yan et al. selected
   —logFC—>0.26 ([58]Yan et al., 2018), while [59]Lu et al. (2014) set
   —logFC—>0.585 in order to get more differentiated genes . Besides, only
   samples from the same platform can be combined for WGCNA. To make the
   results more convincing, we selected [60]GPL570 with the largest sample
   size, including 143 UC patients and 79 controls from eight datasets.

   Then, hclust() was used to hierarchical clustering of samples by
   average method and results in the initial sampletree. The following we
   defined sample clustering height = 80 to remove the isolated samples
   from the group, so as to obtain a more hierarchical sampletree for
   further analysis.

   The core process of WGCNA is to build a scale-free distributed
   topological network, making the functional modules developed more
   cohesive ([61]Langfelder & Horvath, 2008). In the view of many relevant
   references prove that when the scale-free fit index is greater than