Abstract

   Chronic Obstructive Pulmonary Disease (COPD) and Idiopathic Pulmonary
   Fibrosis (IPF) have contrasting clinical and pathological
   characteristics and interesting whole-genome transcriptomic profiles.
   However, data from public repositories are difficult to reprocess and
   reanalyze. Here, we present PulmonDB, a web-based database
   ([50]http://pulmondb.liigh.unam.mx/) and R library that facilitates
   exploration of gene expression profiles for these diseases by
   integrating transcriptomic data and curated annotation from different
   sources. We demonstrated the value of this resource by presenting the
   expression of already well-known genes of COPD and IPF across multiple
   experiments and the results of two differential expression analyses in
   which we successfully identified differences and similarities. With
   this first version of PulmonDB, we create a new hypothesis and compare
   the two diseases from a transcriptomics perspective.

   Subject terms: Genetic databases, Chronic obstructive pulmonary
   disease, Cystic fibrosis

Introduction

   A common way to study diseases is by using transcriptomic analysis,
   which can reveal components of the genome that are active and help us
   understand which biological processes are affected^[51]1. Over the
   years, transcriptomic profiles have been compiled and published in
   public repositories such as Gene Expression Omnibus (GEO)^[52]2,[53]3
   and ArrayExpress^[54]4. Having a way to compare transcriptomic data
   from Chronic Obstructive Pulmonary Disease (COPD) and Idiopathic
   Pulmonary Fibrosis (IPF) will help to identify common and distinct
   molecular mechanisms for these two diseases. However, an overwhelming
   task is to integrate high-throughput data from public repositories,
   because of platform differences (resulting in batch effects),
   heterogeneous experimental conditions, and the lack of uniformity in
   experimental annotations. Wang et al. reviewed different approaches and
   discussed tools such as GEO2R^[55]5, ScanGEO^[56]6, ImaGEO^[57]7, and
   BioJupies^[58]8. These tools reuse public data, reanalyze it
   consistently, and integrate additional data. Even with these available
   tools, performing meta-analyses is still challenging^[59]9. This is
   particularly true for COPD and IPF: because information from only a few
   experiments is available in these resources, such an analysis requires
   manual annotation by the user or the inclusion of only curated GEO
   Datasets (also referred to as GDS). To our knowledge, none of these
   tools integrates microarray and RNA-Seq data.

   Therefore, we created a curated gene expression lung disease database,
   PulmonDB, to organize the large amount of expression data currently
   available for both COPD and IPF. To accomplish this task, we used
   COMMAND > _, a web application previously used to create two successful
   transcriptomic compendia: one for bacterial genomes,
   COLOMBOS^[60]10,[61]11, and one for grapevine, VESPUCCI^[62]12. While
   there are other chronic respiratory diseases, such as asthma, cystic
   fibrosis, and pulmonary hypertension, among others, given the
   biological similarities between COPD and IPF, we decided to focus the
   first version of PulmonDB on these two diseases. We integrated
   transcriptomic experiments from different sources and their curated
   annotations, and built an online web resource to facilitate the
   exploration of gene expression profiles for COPD and IPF, to support
   the creation of new hypotheses, and to allow for the identification of
   co-expression patterns in further analyses.

Results

   PulmonDB is a relational database implemented in MySQL with lung
   disease transcriptome measurements, re-annotated platform probes, and
   manually curated data with a controlled vocabulary designed for lung
   diseases (Fig. [63]1). Tables were created to describe each feature and
   to connect the information across experiments, samples, measurements,
   platforms, genes, and annotated information. The full database schema
   is provided in Supplementary Fig. [64]1.
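
   The relational layout described above can be illustrated with a small
   sketch. It is not the PulmonDB schema: every table and column name
   below is hypothetical, the values are toy numbers, the experiment and
   platform identifiers are placeholders, and SQLite is used instead of
   MySQL only so that the example runs without a server. It shows how
   experiments, platforms, contrasts, genes, and measurements could be
   linked so that the values of one gene can be retrieved across
   contrasts.

```python
import sqlite3

# Hypothetical, simplified schema. Table and column names are illustrative
# assumptions and are NOT taken from the actual PulmonDB MySQL schema.
SCHEMA = """
CREATE TABLE experiment (id INTEGER PRIMARY KEY, gse TEXT UNIQUE);
CREATE TABLE platform   (id INTEGER PRIMARY KEY, gpl TEXT UNIQUE);
CREATE TABLE gene       (id INTEGER PRIMARY KEY, symbol TEXT UNIQUE);
CREATE TABLE contrast (
    id              INTEGER PRIMARY KEY,
    experiment_id   INTEGER REFERENCES experiment(id),
    platform_id     INTEGER REFERENCES platform(id),
    sample_type     TEXT,
    test_state      TEXT,
    reference_state TEXT
);
CREATE TABLE measurement (
    contrast_id INTEGER REFERENCES contrast(id),
    gene_id     INTEGER REFERENCES gene(id),
    value       REAL,
    PRIMARY KEY (contrast_id, gene_id)
);
"""


def build_demo_db():
    """Create an in-memory database filled with toy rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO experiment VALUES (1, 'GSE_DEMO')")
    conn.execute("INSERT INTO platform VALUES (1, 'GPL_DEMO')")
    conn.executemany("INSERT INTO gene VALUES (?, ?)",
                     [(1, "MMP7"), (2, "HHIP")])
    conn.executemany(
        "INSERT INTO contrast VALUES (?, ?, ?, ?, ?, ?)",
        [(1, 1, 1, "LUNG_BIOPSY", "IPF", "HEALTHY/CONTROL"),
         (2, 1, 1, "LUNG_BIOPSY", "COPD", "HEALTHY/CONTROL")])
    conn.executemany(
        "INSERT INTO measurement VALUES (?, ?, ?)",
        [(1, 1, 2.5), (1, 2, -0.4), (2, 1, 0.1), (2, 2, -1.2)])
    return conn


def expression_for_gene(conn, symbol):
    """Return (GSE, test state, value) for one gene across all contrasts."""
    cursor = conn.execute(
        """SELECT e.gse, c.test_state, m.value
           FROM measurement m
           JOIN gene g       ON g.id = m.gene_id
           JOIN contrast c   ON c.id = m.contrast_id
           JOIN experiment e ON e.id = c.experiment_id
           WHERE g.symbol = ?
           ORDER BY c.id""",
        (symbol,))
    return cursor.fetchall()


if __name__ == "__main__":
    for row in expression_for_gene(build_demo_db(), "MMP7"):
        print(row)
```

   Keeping measurements in a separate table keyed by contrast and gene is
   one common way to let a single query span several experiments and
   platforms; the actual design is the one shown in the supplementary
   figure.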

Figure 1.


   Flow chart of PulmonDB. PulmonDB was created using COMMAND by
   downloading, parsing and storing COPD and IPF public transcriptomic
   data into a MySQL database. Then, we remapped microarray probes to
   establish a uniform gene annotation, and we also created a controlled
   vocabulary for clinical and biological annotations for each sample. We
   created contrasts based on the original hypothesis, selecting a sample
   as the reference. Finally, the data were homogenized and subjected to a
   quality check.
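
   The idea of a contrast, a test sample paired with a reference sample,
   can be sketched as follows. The sketch assumes that a contrast is
   expressed as a per-gene log2 ratio of test to reference intensity,
   which this section does not state. The function name, the gene labels,
   and the intensities are hypothetical.

```python
import math


def log2_contrast(test, reference):
    """Return per-gene log2(test / reference) for genes found in both samples.

    Assumption: a contrast is a log2 ratio. Genes that are missing from the
    reference, or that have non-positive intensities, are skipped.
    """
    contrast = {}
    for gene, test_value in test.items():
        ref_value = reference.get(gene)
        if ref_value is None or test_value <= 0 or ref_value <= 0:
            continue  # cannot form a ratio for this gene
        contrast[gene] = math.log2(test_value / ref_value)
    return contrast


# Toy intensities with placeholder gene labels (not real measurements).
test_sample = {"GENE_A": 800.0, "GENE_B": 50.0, "GENE_C": 120.0}
reference_sample = {"GENE_A": 200.0, "GENE_B": 100.0}

demo_contrast = log2_contrast(test_sample, reference_sample)

if __name__ == "__main__":
    print(demo_contrast)
```

   Under this assumption, positive values mean higher expression in the
   test sample than in its reference, and negative values mean lower
   expression.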

PulmonDB, a curated gene expression lung disease database

   PulmonDB is a curated gene expression database of human lung diseases,
   with RNA-seq and microarray data from different platforms that have
   been uniformly preprocessed and manually curated to add sample and
   experiment information. In addition, we developed a website to access
   and visualize homogenized data ([67]http://pulmondb.liigh.unam.mx/),
   and we also developed an R package
   ([68]https://github.com/AnaBVA/pulmondb) to download curated annotation
   and preprocessed data that can be used for further analysis in the R
   environment.

   Our database has a total of 76 GSEs, corresponding to 4481 unique
   preprocessed GSM contrasts that used 26 different platforms or GPLs
   (platform ID from GEO) (Fig. [69]2C). PulmonDB contains different
   sample types because we searched for human gene expression experiments
   related to COPD and IPF without any restriction. Lung biopsies account
   for 37.8% of samples, and 33.2% are blood samples. Other cell types can
   also be found in PulmonDB: some of them are primary cells (e.g.,
   alveolar macrophages, fibroblasts, alveolar epithelial cells), and
   others are cell lines (e.g., A549) (Fig. [70]2A). Of the samples,
   34.9% correspond to COPD, 40.5% to control samples (30.9% healthy plus
   9.6% match tissue), 17.2% to IPF, and 1.5% to other diseases
   (Fig. [71]2B and Supplementary Table [72]2). We separated control
   tissues into two groups: “healthy” individuals, as far as the authors
   are aware, and “match_tissue_controls”, which refers to tissue samples
   from a phenotypically healthy region of a patient who had a tumor
   removed (i.e., non-tumor tissue from a cancer patient).

Figure 2.


   Summary of PulmonDB. (A) The number of contrast samples in PulmonDB per
   biological sample type. (B) The number of sample states found in
   PulmonDB. The color key below the bar chart shows the sectors for COPD
   patients, healthy/controls, IPF patients, match_tissue_controls
   (non-cancerous sample from a cancer patient), and other diseases (such
   as asthma). (C) The number of contrast samples measured using each
   platform (clustered by using Affymetrix, Agilent, Illumina, and other
   platforms with fewer samples).

   Although other resources reuse and reanalyze GEO data using web
   interfaces^[75]9, those tools are not specialized for lung diseases.
   Their limitations include the need for prior manual curation in each
   analysis and the small number of COPD and IPF experiments they cover,
   because only curated GEO data are used. We designed a web interface
   that enables data exploration and visualization to facilitate lung
   disease analysis. This interface uses Clustergrammer^[76]13 to
   visualize gene expression values as interactive heatmaps that allow
   data exploration. A valuable feature of Clustergrammer is its
   connection to EnrichR^[77]14, which provides pathway enrichment
   analysis. Together, these features should help users to generate new
   hypotheses about the pathologies of lung diseases, to perform
   exploratory analyses, to visualize the expression of specific genes
   across public experiments for comparing results, and to generate new
   insights based on different data sets.

PulmonDB can recapitulate gene expression patterns expected in COPD and IPF

   To show that PulmonDB can be used to recapitulate previously reported
   knowledge regarding COPD and IPF biology, we performed a literature
   search and manually selected relevant genes for each disease. We
   selected 19 genes related to IPF (not necessarily associated with gene
   expression in lung tissues) to visualize their gene expression:
   CCL18^[78]15, CXCL12^[79]16, CXCL13^[80]17, collagens (COL1A1, COL1A2,
   COL3A1, COL5A2, COL14A1)^[81]18, DSP^[82]19, FAS^[83]20, IL-8^[84]21,
   MMP1^[85]22, MMP2^[86]23, MMP7^[87]22, MUC5B^[88]19, SPP1^[89]24,
   PTGS2^[90]25, TGFB1^[91]26 and THY1^[92]27. Then, we selected seven IPF
   experiments performed with lung tissue biopsy samples ([93]GSE32537,
   [94]GSE21369, [95]GSE24206, [96]GSE94060, [97]GSE72073, [98]GSE35145,
   [99]GSE31934), and using the PulmonDB website, we created a heatmap
   with the gene expression patterns and observed that the hierarchical
   clustering of these data separates IPF and control data sets
   (Fig. [100]3A, green and gray clusters at the bottom). For COPD, we
   curated 16 genes from the literature that were deemed relevant to this
   disease: HHIP^[101]28,[102]29, CFTR^[103]30,[104]31, PPARG^[105]32,
   SERPINA1^[106]33,[107]34, JUN^[108]35, FAM13A^[109]36, MYH10^[110]35,
   CHRNA5^[111]37, JUND^[112]35, JUNB^[113]35, TNF^[114]34, MMP9^[115]34,
   MMP12^[116]34, CHRNA3^[117]37, TGFBR3^[118]32, and GATA2^[119]32. We
   selected five experiments ([120]GSE27597, [121]GSE37768, [122]GSE57148,
   [123]GSE8581, [124]GSE1122) performed on lung tissue biopsy samples
   from COPD patients and controls. Our hierarchical clustering analysis
   of the expression profiles using the PulmonDB interface allowed us to
   cluster patients and controls into two different groups (Fig. [125]3B),
   similar to the case of IPF. In conclusion, PulmonDB not only helps to
   recapitulate previously published work (Supplementary Fig. [126]3) but
   also helps to verify gene expression stability across experiments. This
   may help to analyze concordance among experiments, to contrast study
   results, and to show the implications of using different control
   groups. We believe this resource can be used to drive and support new
   hypotheses and to inform decisions in experimental laboratories
   studying molecular or cellular disease mechanisms.
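
   The hierarchical clustering step described above can be illustrated
   with a minimal sketch. It does not reproduce the website's
   implementation: Euclidean distance and average linkage are assumptions
   (this section does not say which metric or linkage the interface
   uses), and the contrast names and values are toy data.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering of named profiles down to n_clusters groups.

    profiles maps a contrast name to its list of per-gene values. At each
    step the two clusters with the smallest average pairwise distance merge.
    """
    clusters = [[name] for name in profiles]

    def distance(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(euclidean(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters (i < j).
        i, j = min(
            ((i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because j > i
    return [sorted(cluster) for cluster in clusters]


# Toy contrasts over three placeholder genes (not real PulmonDB values).
toy_profiles = {
    "ipf_1": [2.1, 1.8, -0.2],
    "ipf_2": [1.9, 2.2, 0.1],
    "control_1": [0.1, -0.1, 0.0],
    "control_2": [-0.2, 0.2, 0.1],
}

toy_clusters = average_linkage(toy_profiles, n_clusters=2)

if __name__ == "__main__":
    print(toy_clusters)
```

   In this toy example the disease and control contrasts fall into
   separate groups because their profiles differ strongly; real data are
   noisier and may not separate as cleanly.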

Figure 3.


   IPF and COPD well-known disease-associated genes. In both heatmaps,
   rows are genes, and columns are sample contrasts. Both were
   hierarchically clustered. The first annotation row represents their GSE
   IDs. The second annotation row is the sample type (LUNG_BIOPSY samples,
   in light brown). The third and fourth annotation rows are sample
   states: the third represents the test state, and the fourth represents
   the reference state. (A) IPF genes reported as relevant in the
   literature (CCL18^[129]15, CXCL12^[130]16,
   CXCL13^[131]17, COL1A1, COL1A2, COL3A1, COL5A2, COL14A1^[132]18,
   DSP^[133]19, FAS^[134]20, IL-8^[135]21, MMP1^[136]22, MMP2^[137]23,
   MMP7^[138]22, MUC5B^[139]19, SPP1^[140]24, PTGS2^[141]25, TGFB1^[142]26
   and THY1^[143]27). The IPF experiments selected were [144]GSE32537
   (pink), [145]GSE21369 (purple), [146]GSE24206 (blue), [147]GSE94060
   (grass-green), [148]GSE72073 (lemon yellow), [149]GSE35145 (green), and
   [150]GSE31934 (yellow). The third and the fourth annotation rows are
   sample states: light blue, MATCH_TISSUE_CONTROL; dark blue,
   HEALTHY/CONTROL; turquoise, IPF samples; and grey, NON_IPF_ILD. (B)
   COPD genes reported as relevant in the literature
   (HHIP^[151]28,[152]29, CFTR^[153]30,[154]31, PPARG^[155]32,
   SERPINA1^[156]33,[157]34, JUN^[158]35, FAM13A^[159]36, MYH10^35,
   CHRNA5^[160]37, JUND^[161]35, JUNB^[162]35, TNF^[163]34, MMP9^[164]34,
   MMP12^[165]34, CHRNA3^[166]37, TGFBR3^[167]32, and GATA2^[168]32). The
   COPD experiments selected were [169]GSE27597, [170]GSE37768,
   [171]GSE57148, [172]GSE8581, and [173]GSE1122. The third and the fourth
   annotation rows are sample states: light blue, MATCH_TISSUE_CONTROL;
   dark blue, HEALTHY/CONTROL; red, COPD samples.

Differences and similarities in COPD and IPF

   PulmonDB can be used not only to replicate previous knowledge but also
   to provide a framework to test new hypotheses. In this context, we set
   out to investigate the differences and similarities between COPD and
   IPF in lung tissue when compared to samples from healthy individuals
   (Fig. [174]4A). Using PulmonDB in the R environment, we selected
   contrasts where the sample was annotated as lung biopsy and the
   reference status as HEALTHY/CONTROLs ([175]GSE52463, [176]GSE63073,
   [177]GSE1122, [178]GSE72073, [179]GSE24206, [180]GSE27597,
   [181]GSE29133, [182]GSE31934, [183]GSE37768) (Fig. [184]4B), and then
   using limma^[185]38 we assessed differential gene expression between
   COPD and IPF. We identified 1781 differentially expressed genes
   (Supplementary Fig. [186]4). To have a visual representation of the
   differences between COPD and IPF, we selected the top 20 differentially
   expressed genes and visualized their expression using the PulmonDB
   website tool (Fig. [187]4C). We observed that data sets tend to cluster
   by test status; Fig. [188]4C shows IPF contrasts on the left
   (turquoise), control contrasts in the middle (blue), and COPD contrasts
   on the right (red). Genes are clustered in two groups (left panel,
   y-axis); the first gene group (I) is overexpressed in IPF while it is
   barely expressed or underexpressed in COPD contrasts. By comparison,
   the second gene cluster (group II) is overexpressed in COPD contrasts
   and underexpressed in IPF. To assess similarities among samples, we
   computed correlations using the top 20 differentially expressed genes
   (Fig. [189]4C, right panel); samples from the same disease group showed
   higher correlations and tended to have a null or negative correlation
   with the HEALTHY/CONTROL group and the opposite disease (Fig. [190]4C).
   For example, FOSB and CXCL2 behave in opposite ways in the two
   diseases: both genes are overexpressed in COPD and underexpressed in
   IPF. FOSB is part of the family of Fos genes, whose products can
   dimerize with JUN family proteins to form the transcription factor
   complex AP-1, which is related to COPD^[191]39. CXCL2 is a chemokine
   secreted in inflammation that induces chemotaxis in
   neutrophils^[192]40,[193]41; these cells are predominant in COPD, and
   they are key mediators in tissue damage^[194]42. While neutrophils are
   also important in IPF, we observed CXCL2 underexpression in this
   disease.
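
   The logic of ranking genes and then correlating samples over the top
   genes can be sketched in a few lines. This is a deliberately crude
   stand-in, not the analysis reported here: limma fits linear models
   with moderated statistics, whereas the sketch only ranks genes by the
   absolute difference of group means. All names and values are
   hypothetical.

```python
import math
from statistics import mean


def rank_by_mean_difference(copd, ipf, top_n):
    """Rank genes by |mean(COPD) - mean(IPF)|.

    A simplification used for illustration only; it is NOT limma's
    moderated t-statistic and applies no multiple-testing correction.
    """
    diffs = {g: mean(copd[g]) - mean(ipf[g]) for g in copd if g in ipf}
    return sorted(diffs, key=lambda g: abs(diffs[g]), reverse=True)[:top_n]


def sample_profile(data, genes, index):
    """Values of one contrast (column index) over the selected genes."""
    return [data[g][index] for g in genes]


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy contrast values per gene: two COPD and two IPF contrasts.
# Gene labels are placeholders and the numbers are not real data.
copd = {"GENE_A": [1.5, 1.7], "GENE_B": [0.1, -0.1], "GENE_C": [-1.0, -1.2]}
ipf = {"GENE_A": [-1.4, -1.6], "GENE_B": [0.0, 0.2], "GENE_C": [1.1, 0.9]}

top_two = rank_by_mean_difference(copd, ipf, top_n=2)
all_genes = rank_by_mean_difference(copd, ipf, top_n=3)

same_disease = pearson(sample_profile(copd, all_genes, 0),
                       sample_profile(copd, all_genes, 1))
opposite_disease = pearson(sample_profile(copd, all_genes, 0),
                           sample_profile(ipf, all_genes, 0))

if __name__ == "__main__":
    print(top_two, same_disease, opposite_disease)
```

   In the toy data, contrasts from the same disease correlate positively
   and contrasts from opposite diseases correlate negatively, which is
   the pattern described for the right panel of the figure.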

Figure 4.


   IPF and COPD differentially expressed and similarly expressed genes.
   (A) Flow chart of steps used for COPD and IPF differential expression
   analysis to evaluate transcriptomic differences and similarities.
   (B) Experiments selected for the analysis, following the criteria of
   being lung biopsy samples and contrasted with HEALTHY/CONTROL
   references. The colors represent the sample state: COPD, red;