Abstract

Background

   The increased multi-omics information on carefully phenotyped patients
   in studies of complex diseases requires novel methods for data
   integration. Unlike continuous intensity measurements from most omics
   data sets, phenome data contain clinical variables that are binary,
   ordinal and categorical.

Results

   In this paper we introduce an integrative phenotyping framework (iPF)
   for disease subtype discovery. A feature topology plot was developed
   for effective dimension reduction and visualization of multi-omics
   data. The approach is free of model assumption and robust to data
   noises or missingness. We developed a workflow to integrate homogeneous
   patient clustering from different omics data in an agglomerative manner
   and then visualized heterogeneous clustering of pairwise omics sources.
   We applied the framework to two batches of lung samples obtained from
   patients diagnosed with chronic obstructive lung disease (COPD) or
   interstitial lung disease (ILD) with well-characterized clinical
   (phenomic) data, mRNA and microRNA expression profiles. Application of
   iPF to the first training batch identified clusters of patients
   consisting of homogenous disease phenotypes as well as clusters with
   intermediate disease characteristics. Analysis of the second batch
   revealed a similar data structure, confirming the presence of
   intermediate clusters. Genes in the intermediate clusters were enriched
   with inflammatory and immune functional annotations, suggesting that
   they represent mechanistically distinct disease subphenotypes that may
   response to immunomodulatory therapies. The iPF software package and
   all source codes are publicly available.

Conclusions

   Identification of subclusters with distinct clinical and biomolecular
   characteristics suggests that integration of phenomic and other omics
   information could lead to identification of novel mechanism-based
   disease sub-phenotypes.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12864-015-2170-4)
   contains supplementary material, which is available to authorized
   users.

   Keywords: Cluster analysis, Genomics, Chronic lung disease, Integrative
   clustering

Background

   Disease phenotyping refers to a procedure that specifies disease
   definition or diagnosis in terms of observable abnormal phenotypic
   characteristics that occur due to the interaction between genotypes and
   environmental effects. Traditionally, the two largest pulmonary disease
   phenotypes—obstructive and restrictive lung diseases—have been
   determined using physiological, radiological, or histopathological
   features. Our study focuses on the most common diseases representing
   these aforementioned phenotypes—COPD and ILD. COPD is a lung disease
   caused by the repeated exposure to a noxious agent resulting in
   irreversible airflow limitation. COPD is classified by the Global
   Initiative for Chronic Obstructive Lung Disease criteria in four major
   categories based on symptoms, airflow obstruction, and exacerbation
   history [[41]1]. Similarly, the term Interstitial Lung Disease
   designates a loosely defined group of patients characterized by changes
   in the interstitium of the lung, causing pulmonary restriction and
   impaired gas exchange. This group includes: Idiopathic Pulmonary
   Fibrosis (IPF), Non Specific Interstitial Pneumonia (NSIP),
   Hypersensitivity Pneumonitis (HP), Cryptogenic Organizing Pneumonia
   (COP), Respiratory Bronchiolotis-associated Interstitial Lung Disease
   (RB-ILD), Collagen Vascular Disease—associated Interstitial Lung
   Disease (CVD-ILD), Desquamative Interstitial Pneumonia (DIP) and Acute
   Interstitial Pneumonia (AIP), among others.

   Despite the advancement in phenotyping these two broad lung disease
   categories based on traditional methods, current clinical definitions
   and classifications of COPD or ILD often fail to accommodate the large
   number of patients with atypical features who typically fall into
   undefined categories [[42]2]. Moreover, existing classifications do not
   reflect advances in high-throughput mRNA and miRNA expression
   techniques that may improve our understanding of the complexity of a
   given individual’s phenotype. In this paper, we refer to “phenome” as
   the collection of traditional disease phenotypes described above, which
   is in contrast to measurements from rapidly developing high-throughput
   omics techniques. The purpose of this paper is to provide a
   generalizable phenotyping procedure by combining phenome and other
   omics data (mRNA and miRNA expression in our example) for novel disease
   subtype discovery.

   As the array and massively parallel sequencing costs keep dropping,
   omics data generation has increased at an unprecedented rate.
   Meaningful integration and presentation of the abundant information has
   led to new computational and statistical challenges. According to Tseng
   et al. (2012) [[43]3], omics data integration contains two major
   categories: horizontal meta-analysis and vertical integrative analysis.
   In the former type of data integration, the same type of omics data
   sets (e.g. gene expression, GWAS or eQTL) are collected from different
   labs and aligned horizontally with gene features matched on the rows.
   The major purpose of the analysis is similar to traditional
   meta-analysis, repeated over all features on the genome for candidate
   biomarker or pathway detection. In the latter analysis, multi-layers of
   omics data (such as genotyping, gene expression, miRNA expression,
   methylation and mutations) are measured in a given patient cohort and
   integrative analyses are performed to understand the inter-omics
   disease mechanism and relationship. As an example, the Cancer Genome
   Atlas (TCGA) [[44]4] contains multi-omics data for more than 10,000
   patients, and spans more than 20 cancers. Vertical information
   integration of multi-layer omics data has gained increasing attention
   in the past few years in biomedical research [[45]5–[46]8]. Depending
   on biological purposes, many tools have been developed. A large
   collection of existing integrative applications utilize relatively
   naïve summary/comparative scores (e.g. correlations and signal-to-noise
   ratios) and visualization tools (e.g. heatmaps, scatter plots, volcano
   plots, box plots and survival curves) with minimal statistical
   information integration [[47]9, [48]10] and several convenient packages
   or pipelines along this approach are available [[49]11, [50]12]. Other
   advanced statistical and computational methods have been rapidly
   developed. For example, methods have been developed to integrate copy
   number variation or methylation with gene expression profiles [[51]13,
   [52]14]. Dimension reduction methods including principal component
   analysis [[53]15], partial least squares [[54]16], and nonnegative
   matrix factorization [[55]17] have been applied to identify homogeneous
   and heterogeneous patterns across multi-omics data. To identify novel
   disease subtypes, the Bayesian consensus clustering [[56]18] and
   iCluster [[57]19] are two recent powerful methods to combine
   multi-omics data. They are, however, limited by a few drawbacks that
   motivated the development of our new integrative phenotyping framework
   (iPF). First, both Bayesian consensus clustering and iCluster assume
   numerical (continuous) measurements in the multi-omics data but the
   complexity of phenome data that contains binary, ordinal and
   multi-class categorical data types is not addressed. Second, both
   methods are model-based and poor performance is expected if the
   distribution assumptions are violated. Finally, both methods lack
   visualization tools for further exploratory analysis. The proposed iPF
   described in this paper aims to fill these gaps. iPF incorporates
   automatic feature selection, dimension reduction, data smoothing and
   pattern visualization in the feature topology plot. Single-omic cluster
   analysis is applied to each omics data set and a workflow to encompass
   homogeneous and heterogeneous clustering information across omics data
   is performed. In our application to the COPD and ILD data, novel
   disease subphenotypes were characterized and validated by training and
   testing batches. Post-hoc functional analysis revealed important
   biological processes related to the disease subtypes that might lead to
   novel diagnosis or treatment strategy. All data sets and source code
   used in this paper are publicly available as an iPF R package
   ([58]http://tsenglab.biostat.pitt.edu/software.htm).

Results

Overview of Integrative phenotyping framework (iPF)

   The integrative phenotyping framework (iPF) consists of the following
   four steps in the flowchart (Fig. [59]1): (1) Data pre-processing: Each
   omics data set is adequately pre-processed and normalized. Redundant
   (e.g. non-expressed and/or non-informative) features are separately
   eliminated in each omics data set; (2) Feature concatenation: Omics
   data sets are vertically combined as in Fig. [60]2a. A distance
   (dissimilarity) matrix between any two features within and across omics
   data sets is defined (Fig. [61]2b); (3) Dimension reduction:
   Multidimensional scaling (MDS) is applied to map all features to a
   two-dimensional Euclidean space for dimension reduction (Fig. [62]2c);
   (4) Feature smoothing: Feature intensities are smoothed in the reduced
   2D space for each patient (Fig. [63]2d); (5) Clustering for subtype
   discovery and visualization: Unsupervised clustering analysis is
   performed to identify potential disease subtypes, and feature
   intensities within each cluster are averaged to generate representative
   plots for each cluster (Fig. [64]2e). The resulting contour plots are
   referred to as “feature topology plots (FTP)”, hereafter. Details of
   the iPF framework are presented in the method section and supporting
   information.

Fig. 1.

   Fig. 1
   [65]Open in a new tab

   Flow chart of Integrative Phenotyping Framework (iPF): Integrative
   phenotyping framework (iPF) includes the following steps (1) Data
   preprocessing (2) Feature concatenation (3) Dimension reduction (4)
   Feature smoothing (5) Clustering for subtype discovery and
   visualization

Fig. 2.

   Fig. 2
   [66]Open in a new tab

   Overview of integrative clustering in integrative phenotyping framework
   (iPF): (a) Vertically combined multiple omics data sets (b) A distance
   matrix between any two features within and across omic data sets (c)
   Multidimensional scaling (MDS) mapping to a two-dimensional Euclidean
   space (d) Smoothed feature intensities in the reduced 2D space for each
   patient (e) Unsupervised clustering to identify potential disease
   subtypes and averaged feature intensities for representative plots of
   each cluster

   Different omics data may contribute to similar (homogeneous) or
   distinct (heterogeneous) disease subtype definition. For example,
   iCluster vertically aligns multi-omics data and performs latent
   variable decomposition. It implicitly assumes that different omics data
   contribute to one final and common disease subtype definition. The
   Bayesian consensus clustering method alternatively models common and
   distinct patient subtypes from different omics data. In our framework,
   we perform pairwise agglomerative merging strategy when a pair of omics
   data sets present “homogeneous” clustering results. For example, when
   three omics data sets (mRNA, microRNA (miRNA) and clinical) are
   available in Fig. [67]3, all pairs of omics data sets are compared. If
   mRNA and miRNA generate similar clustering results while the other two
   pairs do not, we merge mRNA and miRNA. Finally, we compare the
   clustering results of mRNA + miRNA versus clinical. This strategy can
   be generalized to combining three or more omics data sets and
   systematically but dynamically investigate clusters identified in
   different omics data. To determine homogeneity or heterogeneity of
   clustering results, Fig. [68]4 shows the feature topology plots for
   clusters from the first and the second omics data sources on the left
   and on the top (three clusters in each omics data source). The number
   of overlapped patients in the 3 × 3 table provides evidence of
   homogeneity (majority of patients are on the diagonal; for example,
   mRNA vs miRNA in Additional file [69]1: Figure S8(a)) or heterogeneity
   (existence of clusters off-diagonal as shown in Additional file [70]1:
   Figure S8(a), (b) and Fig. [71]4).

Fig. 3.

   Fig. 3
   [72]Open in a new tab

   Graphical illustration of heterogeneous data sets comparison scheme in
   iPF. Step 1-2: Compare and combine all possible pairs of omics data
   sets until we produce homogenous clustering results. Step 3: Pairwise
   comparison until identifying any heterogeneous data sets

Fig. 4.

   Fig. 4
   [73]Open in a new tab

   iPF clustering results for multiple omics data sets using all samples
   (Clinical, mRNA + miRNA). This figure shows three clusters are
   generated in each omics data source, and the feature topology plots for
   clusters from the first and the second omics data sources on the left
   and on the top, respectively

Discovery of disease subtypes from multi-omics data sets

   In this study, we combine three omics data sets (669 clinical
   variables, 4258 mRNA expressions and 438 miRNA expressions) from a
   cohort of 319 lung disease patients. To estimate the number of
   clusters, Additional file [74]1: Figure S9A-D presents the Gap
   statistics and the incremental differences in each omics data set for
   different comparisons. Almost all of the results clearly indicate 3
   clear clusters in each omics data sources. Additional file [75]1:
   Figure S8 shows feature topology plots of the pairwise comparison of
   clustering results from each omics data set. The pie charts of the
   clusters describe the composition of existing diagnoses from
   clinicians. The three clusters identified from mRNA data and miRNA data
   are highly consistent with only 38 off-diagonal samples
   (38/319 = 12 %), while the comparison of mRNA vs clinical and miRNA vs
   clinical show heterogeneous clustering with 121 (38 %) and 150 (47 %)
   off-diagonal samples (Additional file [76]1: Figure S8). As a result,
   we merge mRNA and miRNA data for combined cluster analysis and compare
   with the clustering result from the clinical data in Fig. [77]4. The
   result shows three consensus clusters (cluster A, 76 samples mostly
   COPD; cluster I, 80 samples mostly ILD; cluster E, 43 samples of
   intermediate subtype) and six off-diagonal differentially defined
   clusters from the two omics data sources. Noticeably, 18 samples
   (cluster G) are determined ILD-like from clinical data but are viewed
   as COPD-like from mRNA + miRNA clustering. Similarly, 11 samples
   (cluster C) are viewed as COPD-like in clinical clustering but are
   ILD-like in mRNA + miRNA clustering. COPD and ILD are considered to
   have distinct disease mechanisms because of their extremely divergent
   phenotypic patterns despite similar risk factors as well as the
   presence of a combined emphysema and pulmonary fibrosis overlapping
   syndrome [[78]20]. This makes it possible that omics measurements might
   help improve disease phenotyping. For example, the diagnoses of the
   first cluster (n = 121) by clinical data (sum of cluster A, B, and C)
   are mostly associated with COPD. With additional information from mRNA
   and miRNA expression, these 121 samples are further divided into
   clusters A, B and C. Overall, the result shows the limitation of
   current disease diagnosis that can be improved by large-scale clinical
   and transcriptomic measurements. The identified cluster E (and
   potentially also clusters B, D, F and H) may present a novel
   intermediate disease subtype with a disease mechanism different from
   existing COPD and ILD definitions.

   A common issue in using omics data sets for disease subtype discovery
   is the reproducibility and potential presence of batch effects. To
   validate the finding, our analysis initially started with a first
   training cohort of 91 patients and was then validated in a second
   testing cohort of 228 patients. The proposed comprehensive validation
   scheme is comprised of the three phases of discovery, prediction and
   validation. A first training cohort produces clustering and MDS
   coordinates results, and thereby the MDS coordinates are directly
   applied to the testing cohort for clustering as the prediction phase.
   In the validation phase, clustering results are produced within the
   testing cohort. In our analysis, clustering results in the prediction
   and the validation phase show high reproducibility with a consistently
   identified intermediate disease subtype (see Additional file [79]1:
   Figure S6). Additional file [80]1: Figure S6(a) demonstrates a workflow
   of three phases of discovery, prediction and validation. Additional
   file [81]1: Figure S6(b) demonstrates the discovery phase of clustering
   results from the training cohort. The model is then applied to the
   testing cohort for clustering as the prediction phase in Additional
   file [82]1: Figure S6(c). Additional file [83]1: Figure S6(d) shows a
   validation phase of clustering result within the testing cohort. By
   comparing Additional file [84]1: Figure S6(c) and (d), the result shows
   high reproducibility with a consistently identified intermediate
   disease subtype. To analytically measure a level of concordance between
   clusters in Fig. [85]2c and d, we employ adjusted rand index (ARI)
   [[86]21]. The estimated ARIs are 0.764 for the clinical data set and
   0.43 for the three clusters in the transcriptomic data set
   (mRNA + miRNA). Taken together, we conclude that the discovered
   clusters from the training and testing cohort preserve common
   characters, which enable the combination of all the samples (n = 319)
   from the three batches in the pooled analysis, as shown in Fig. [87]4.

Discovery of discriminant phenome features

   We focus our analysis on the three consistent clusters, hereafter
   referred to as Cluster A, E and I. Table [88]1 shows the group means of
   12 selected demographic and clinical variables for the three clusters
   and their pairwise as well as overall ANOVA p-values. Patients in
   cluster E were likely to be younger (average age 55 years) when
   compared to clusters A and I (65.7 and 66.1 years respectively) and
   there were more females (65.1 %) compared to A and I (39.5 % and 30 %
   respectively). In contrast, patients in cluster A had a more
   obstructive pattern in their pulmonary function test (average FEV1/FVC
   ratio 0.653) compared to those in clusters E and I (0.93 and 1.12
   respectively). Similarly, quantitative CT analysis revealed that
   patients in cluster A were more likely to have emphysema (CT% emphysema
   14.4 %) while those in cluster I were more likely to have high lung
   reticular volumes (662 mL). In this case, patients in cluster E seemed
   to express an intermediate phenotype although the gender and age
   distributions do no suggest cluster E as a pure intermediate phenotype
   between cluster A and I.

Table 1.

   Summary of significant features grouped in each cluster (cluster A, E,
   and I)^a
   Total Cluster A Cluster E Cluster I P-value Cluster Cluster Cluster
   (n = 199) (n = 76) (n = 43) (n = 80) ANOVA A & E A & I E & I
   Age, yrs 63.5 65.7 55 66.1 2.91E-07 7.68E–02 1.24E–07 5.05E–02
   Gender, % female 41.2 39.5 65.1 30 7.51E–04 2.44E–02 7.23E–01 7.42E–04
   Body Mass Index, BMI 28.6 28 27.5 29.8 3.29E–02 1.00E + 00 7.24E–02
   1.07E–01
   FEV1 % predicted 61.7 48 64.3 73.4 1.12E–13 4.37E–05 1.53E–13 3.61E–02
   FVC % predicted 69.2 72.4 69.7 65.8 1.85E–02 1.00E + 00 1.79E–02
   2.93E–01
   FEV1/FVC ratio 0.9 0.653 0.93 1.12 3.62E–29 1.59E–09 1.49E–25 2.75E–09
   DLCO 53.9 59.2 57.3 47 2.55E–03 1.00E + 00 8.14E–03 1.46E–02
   Total lung capacity, mean 5.28 6.55 4.87 4.19 4.69E–19 1.16E–07
   2.32E–18 4.98E–02
   CT % emphysema 7.19 14.4 1.88 1.01 4.46E–13 1.73E–07 3.77E–11 3.70E–01
   Lung reticular volume, ml 309 63.8 198 662 5.86E–17 2.22E–02 3.03E–16
   2.32E–07
   Diagnosis, % IPF 38.7 1.32 9.3 90 4.01E–33 1.16E–01 6.00E–28 6.75E–18
   Diagnosis, % Emphysema 19.6 43.4 14 0 4.29E–11 3.20E–03 1.12E–10
   1.99E–03
   [89]Open in a new tab

   ^aThe average values of 12 selected demographic and clinical variables
   in each sub–cluster groups, p–values from Kruskal–Wallis ANOVA (all
   three groups) and p-values from Kruskal-Wallis rank sum test (paired
   wise groups)

Gene co-expression modules and pathway analysis demonstrates enrichment for
inflammatory and immune related annotations in Cluster E

   We performed one-way ANOVA analysis for each mRNA and miRNA feature
   based on cluster A (COPD), E (Intermediate) and I (ILD) labels, and
   identified 1684 statistically significant (p-value adjusted by
   Bonferroni  < 1e-10) features in the gene expression (n = 1596) and
   miRNA (n = 88). We further performed gene co-expression cluster
   analysis using partition around medoids (PAM) to identify four gene and
   miRNA modules in Fig. [90]5 (gap statistics analysis in Additional file
   [91]1: Figure S10 clustered into four gene modules). Note that we
   intentionally turned the 88 miRNAs to opposite direction (by
   multiplying the expression intensities by -1) in the cluster analysis
   to account for the fact that most miRNAs have inhibitory effects on
   mRNA expression. As a result, expressions of miRNAs in a module [[92]5]
   have a clear negative correlation with the remaining mRNAs (blue) in
   Fig. [93]5.

Fig. 5.

   Fig. 5
   [94]Open in a new tab

   Heatmap for the four modules of gene expression and miRNA features
   which significantly differentiate three clusters (COPD, Intermediate,
   and ILD). We performed gene co-expression cluster analysis using
   partition around modoids (PAM) to identify four gene and miRNA modules.
   When perform clustering, the 88 miRNAs intensities are turned to the
   opposite direction (by multiplying the expression intensities by -1) to
   show that most miRNAs have inhibitory effects on mRNA expression

   In order to look more deeply into the biological pathways
   differentiating the microarray samples in cluster E from those in
   cluster I (Fig. [95]5) we focused on the genes in module two, since
   this was the most distinctive of all studied modules. By using
   Ingenuity Pathway Analysis (IPA, see [96]Methods for more detail), we
   identified enrichment for a large number of immune related annotations
   in module two genes when comparing samples in cluster E relative to
   those in cluster I. These annotations were largely related to immune
   cell trafficking, predominantly leukocyte activation, migration,
   movement and chemotaxis as well as cell movement of phagocytes,
   neutrophils and myeloid cells (Additional file [97]1: Table S18). The
   large majority of these annotations showed an increase in their
   activation score (z-score), predicting the increase activation of these
   functions in cluster E subjects.

   Given the enrichment for immune related annotations, we looked for
   potential drugs targeting overexpressed genes in module two and
   identified drugs with immunomodulatory and immunosuppressive effects
   (Additional file [98]1: Figure S12), medications that are currently not
   recommended for the treatment of IPF patients, the predominant group of
   patients in cluster E. Some of these drugs included hydroxichloroquine,
   a drug that has been largely used to treat rheumatoid arthritis and
   systemic lupus erythematous [[99]22, [100]23]. Rituximab, a humanized
   monoclonal antibody which targets B cell and is commonly used to treat
   autoimmune disorders and Efalizumab, a formerly available
   immunosuppressant that was used to treat autoimmune disorders by
   inhibiting lymphocyte activation and cell migration. Taken together,
   these results demonstrate how iPF allowed us to identify a subset of
   patients with a predominant immune related phenotype that could
   potentially respond to immunomodulatory therapy.

Simulation study to show advantages of iPF

   To demonstrate that iPF is robust and accurate in disease subtype
   discovery using omics data, we simulated data sets with various degrees
   of variances and proportion of noise features, using the
   “clusterGeneration” package. Additional file [101]1: Figure S11 shows
   that feature fusion methods used in iPF (FF, FFspK and FFmClust)
   outperform traditional clustering methods. More details are shown in
   Additional file [102]1: Text S3.

Discussion

   In this paper, we have proposed a framework to integrate multi-omics
   data sources for disease subtype identification. Compared to existing
   methods (such as iCluster and Bayesian consensus clustering), iPF
   allows different variable types (binary, continuous, ordinal and
   multi-class), which are common in clinical variables. The method is
   also model-free, manages missing values without imputation techniques
   (i.e., pair-wise feature correlations can be estimated by excluding
   only samples with missing values), robust to data noises and provides
   effective visualization tools. Compared to the latent variable
   estimation in iCluster and posterior distribution simulation in
   Bayesian consensus clustering, the computational load of iPF is also
   much more affordable. Our hierarchical integrative strategy
   (Fig. [103]3) further allows analysis of homogeneous and heterogeneous
   clustering structure across multi-omics data.

   We tested the iPF framework using two large multi-omic data sets
   obtained from lung tissues from well characterized patients with
   chronic lung diseases. One of the most relevant findings by iPF is the
   replication of the two most distinct disease subphenotypes in pulmonary
   medicine: COPD and ILD, diseases that represent the opposite spectrum
   of lung physiologic patterns, airflow obstruction in COPD and lung
   restriction in ILD. Currently, the diagnosis of obstructive and
   restrictive lung diseases relies mostly on clinical findings, pulmonary
   function test, radiologic studies and lung biopsies in selected cases
   [[104]24]. Results from genomic analyses of lung tissues have not been
   included in diagnostic algorithms in chronic lung diseases mostly
   because of the added cost of these technologies, the difficulties to
   access lung tissue samples and the concept that an accurate diagnosis
   can be achieved by using the current diagnostic armamentarium in
   clinical practice and most commonly without the need to obtain a
   biopsy. Our findings demonstrate that iPF is a valuable tool not only
   because it was able to distinguish the known disease categories, but
   also because it allowed us to identify a previously unknown disease
   subphenotype of patients that clustered in between patients with COPD
   and ILD (see Additional file [105]1: Figure S19), characterized by the
   overexpression of genes predominantly associated with immune cell
   activation and trafficking. Interestingly, a large number of patients
   in this cluster had the clinical diagnosis of Idiopathic Pulmonary
   Fibrosis (IPF).

   The identification of genes associated with immune cell activation in a
   subset of patients with IPF is relevant for various reasons. IPF is a
   fibrotic lung disease of unknown etiology, associated with high
   mortality rates [[106]25]. IPF is thought to be caused by repeated
   cycles of alveolar epithelial cell injury followed by fibroblast
   recruitment, proliferation and extracellular matrix deposition
   [[107]26]. In the past, IPF was considered to be a chronic inflammatory
   process, in part, due to the partial response observed with the use of
   immunosuppressive therapy in a small subset of patients, evidence that
   was limited to case reports [[108]27]. Subsequently, a large
   randomized, controlled clinical trial demonstrated that
   immunosuppressive drugs actually increased the mortality and number of
   hospitalizations of IPF patients [[109]28] and the use of these drugs
   was abandoned from clinical practice. Our pathway analysis revealed
   enrichment for immune related annotations when comparing samples
   between intermediate (cluster E) and ILD (cluster I) clusters
   (Fig. [110]4). These annotations were largely associated with immune
   cell trafficking, leukocyte activation, migration, movement and
   chemotaxis (For details, see Additional file [111]1: Table S18). The
   annotations mostly present high activation scores suggesting increased
   activation of these pathways in cluster E patients. The fact that some
   of the overexpressed genes in the intermediate cluster of ILD patients
   are targeted by immonompodulatory drugs such as hydroxichloroquine,
   Rituximab, and Efalizumab, suggest that a small subset of ILD patients,
   even among those with idiopathic pulmonary fibrosis (IPF), may actually
   benefit from immunomodulation. In this regard, iPF facilitates
   sub-categorizing patients on the basis of multi-level data sources
   beyond the conventional diagnostic tools, and identifying biologically
   associated functional annotations and new drugs effective to the subset
   of patients, namely, implementing personalized diagnosis and treatment
   of chronic lung diseases.

   Integrative Phenotyping Framework has a few potential limitations.
   Firstly, the method applies dimension reduction and smooth techniques.
   The performance is expected to deteriorate if the omics throughput is
   too low (e.g. data from small number of clinical variables or a small
   assay). Secondly, several procedures in iPF are not fully automated and
   need expert decision. For example, the decision of the number of
   clusters is not always easy and is an intrinsic methodological
   difficulty for almost all clustering tools in real practice. In our
   lung disease application, clusterings from mRNA and miRNA were found to
   be almost identical and it was an easy decision to merge the two omics
   data sets in the hierarchical integration. In other applications, we
   expect that deciding on the homogeneous or heterogeneous clustering
   results may not always be clear-cut. Thirdly, the hierarchical
   integration is performed pairwisely, merging and comparing two at a
   time. The number of clusters may increase exponentially if many omics
   data sets are combined and all omics data generate heterogeneous
   clustering. However, such complexity may be an intrinsic biological
   fact and cannot easily be accommodated by statistical models. Finally,
   our findings in the first multi-omics data set were only validated
   “in-silico” in the second multi-omics data set. While it would have
   been ideal to add a second validation method using an alternative gene
   expression platform (i.e. reverse transcriptase, polymerase chain
   reaction – qRT-PCR), we felt that the strength of our results made
   unnecessary the addition of an alternative validation method.

   The integration of multi-omics data sources using the iPF framework has
   immense potential to advance the field of personalized medicine by
   confirming clinical diagnoses, aid in the identification of new disease
   subphentoypes, provide biological insights, and new targets for drug
   therapy. However, despite the impressive reproducibility of our
   findings across two large data sets, additional studies focusing on
   validating our results in other large data sets will be required before
   the iPF framework can be applied to daily clinical practice.

Methods

Integrative phenotyping framework (iPF)

Data pre-processing and feature integration

   The overall workflow and diagram are demonstrated in Figs. [112]1 and
   [113]2. For pre-processing, each omics data set was normalized
   separately. Non-expressed (low mean intensities) and non-informative
   (low standard deviations) features were filtered. We consider the
   integration of M different types of high-throughput omics data sets,
   denoting these by
   [MATH: <msup><mi>X</mi><mfenced close=")"
   open="("><mi>m</mi></mfenced></msup><mspace
   width="0.5em"></mspace><mo>=</mo><mspace
   width="0.5em"></mspace><msub><mfenced close="}"
   open="{"><msubsup><mi>x</mi><mrow><mi>i</mi><mi>j</mi></mrow><mfenced
   close=")" open="("><mi>m</mi></mfenced></msubsup></mfenced><mfenced
   close="|" open="|"><mrow><msub><mi>I</mi><mi>m</mi></msub><mfenced
   close="|" open="|"><mrow><mspace
   width="0.5em"></mspace><mo>×</mo><mspace
   width="0.5em"></mspace></mrow></mfenced><mi>J</mi></mrow></mfenced></ms
   ub> :MATH]
   data set of the m^th omics, where x^(m)[ij] is the intensity of feature
   i and sample j for i ∈ I[m], j ∈ J, and m = 1, 2, … , M. Denote by
   [MATH: <msubsup><mover accent="true"><mi>X</mi><mo
   stretchy="true">⇀</mo></mover><mi>i</mi><mfenced close=")"
   open="("><mi>m</mi></mfenced></msubsup><mspace
   width="0.5em"></mspace><mo>=</mo><mspace
   width="0.5em"></mspace><mfenced close="}"
   open="{"><mrow><msubsup><mi>x</mi><mrow><mi>i</mi><mi>j</mi></mrow><mfe
   nced close=")" open="("><mi>m</mi></mfenced></msubsup><mo>,</mo><mspace
   width="0.5em"></mspace><mo>…</mo><mspace
   width="0.5em"></mspace><mo>,</mo><msubsup><mi>x</mi><mrow><mi>i</mi><mf
   enced close="|" open="|"><mi>J</mi></mfenced></mrow><mfenced close=")"
   open="("><mi>m</mi></mfenced></msubsup></mrow></mfenced> :MATH]
   the feature vector of the i^th feature in the m^th omics data set. The
   omics data are vertically concatenated such that X = {x[ij]}[|I| × |J|]
   is the combined data set of all M data sets, where I = ∪ ^M[m = 1]I[m].
   (Fig. [114]2a). All features are standardized to have zero mean and
   unit variance to avoid scaling issues or dominance of certain omics
   types.

MDS dimension reduction and feature smoothing

   Based upon the combined data set X, we calculate the feature
   dissimilarity (distance) matrix by
   [MATH: <mi>D</mi><mspace width="0.5em"></mspace><mo>=</mo><mspace
   width="0.5em"></mspace><mfrac><mfenced close=")"
   open="("><mrow><msub><mfenced close="}"
   open="{"><mn>1</mn></mfenced><mfenced close="|"
   open="|"><mrow><mi>I</mi><mfenced close="|"
   open="|"><mo>×</mo></mfenced><mi>I</mi></mrow></mfenced></msub><mo>−</m
   o><mi>R</mi></mrow></mfenced><mn>2</mn></mfrac><mspace
   width="0.5em"></mspace><mo>=</mo><mspace
   width="0.5em"></mspace><msub><mfenced close="}"
   open="{"><msub><mi>d</mi><mrow><mi>i</mi><msup><mi>i</mi><mo>′</mo></ms
   up></mrow></msub></mfenced><mfenced close="|"
   open="|"><mrow><mi>I</mi><mfenced close="|"
   open="|"><mo>×</mo></mfenced><mi>I</mi></mrow></mfenced></msub><mtext>,
   </mtext> :MATH]

   where R is the correlation matrix (defined in Additional file [115]1:
   Table S14) between different variable types such as continuous,
   ordinal, binary and multi-class categorical (Fig. [116]2b).
   Multi-dimensional scaling (non-metric MDS) is then applied to project
   all concatenated features onto a two-dimensional space:

   [MATH: <msub><mrow><mi>G</mi></mrow><mi>X</mi></msub><mspace
   width="0.5em"></mspace><mrow><mo>(</mo><msubsup><mrow><mover><mi>X</mi>
   <mo stretchy="false">⇀</mo></mover></mrow><mi>i</mi><mrow><mo
   stretchy="false">(</mo><mi>m</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>)</mo></mrow><mspace
   width="0.5em"></mspace><mo>=</mo><mspace
   width="0.5em"></mspace><mrow><mo>(</mo><msubsup><mrow><mi>u</mi></mrow>
   <mrow><mi>i</mi><mn>1</mn></mrow><mrow><mo
   stretchy="false">(</mo><mi>m</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>,</mo><mspace
   width="0.5em"></mspace><msubsup><mrow><mi>u</mi></mrow><mrow><mi>i</mi>
   <mn>2</mn></mrow><mrow><mo stretchy="false">(</mo><mi>m</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>)</mo></mrow><mo>,</mo><msp
   ace
   width="0.5em"></mspace><mi>w</mi><mi>h</mi><mi>e</mi><mi>r</mi><mi>e</m
   i><mspace
   width="0.5em"></mspace><msub><mrow><mi>G</mi></mrow><mi>X</mi></msub><m
   space width="0.5em"></mspace><mo>:</mo><mspace
   width="0.5em"></mspace><mspace
   width="0.5em"></mspace><msup><mrow><mi>R</mi></mrow><mi>J</mi></msup><m
   space width="0.5em"></mspace><mo stretchy="false">→</mo><mspace
   width="0.5em"></mspace><msup><mrow><mi>R</mi></mrow><mn>2</mn></msup><m
   space width="0.5em"></mspace><mspace
   width="0.5em"></mspace><mo>,</mo><mspace width="0.5em"></mspace><mspace
   width="0.5em"></mspace><mi>i</mi><mspace
   width="0.5em"></mspace><mo>∈</mo><mspace
   width="0.5em"></mspace><msub><mrow><mi>I</mi></mrow><mi>m</mi></msub><m
   space width="0.5em"></mspace><mrow><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">n</mi><mi
   mathvariant="normal">d</mi></mrow><mspace
   width="0.5em"></mspace><mi>m</mi><mspace
   width="0.5em"></mspace><mo>=</mo><mspace
   width="0.5em"></mspace><mn>1</mn><mo>,</mo><mspace
   width="0.5em"></mspace><mo>…</mo><mspace
   width="0.5em"></mspace><mo>,</mo><mi>M</mi><mo>.</mo> :MATH]
   Note that the MDS mapping is performed based on the combined data set
   X. MDS coordinates after feature projects are comparable across
   different omics data sets (Fig. [117]2c), an important property to
   allow clustering comparison across omics data sets in a later step.
   Feature intensities are represented by gradient color: red, yellow and
   blue represent high, intermediate and low intensities respectively.
   After dimension reduction, we apply a nonparametric smoothing technique
   to intensities at MDS coordinates using generalized additive model
   (GAM) [[118]29] and create a Feature Topology Plot (FTP) using the
   smoothed intensities in 2D space (See Additional file [119]1: Figure
   S5(b)). The smoothing method applies a thin plate spline penalty as the
   basis (TPRS) [[120]30]. It averages response values in a neighborhood
   and plays an essential role for iPF to be robust against randomly noisy
   features.

   This workflow can be seen as converting integrated feature vectors into
   a smoothed 2D image in the MDS space, and includes multiple benefits.
   First, highly correlated features are encouraged to concentrate in a
   small region in the 2D image. So when we measure the distance
   (correlation) between two subjects based on 2D space, these highly
   correlated feature will contribute less than they do at the original
   feature space. It is challenging to characterize high-dimensional
   covariance among thousands or tens of thousands features. Instead we
   project feature vectors into the 2D space (MDS), which help to
   attenuate effects of highly correlated features. Second, the smoothing
   technique reduces noise effects and thus enhances the efficiency of
   clustering analysis. Additional file [121]1: Text S3 presents various
   numerical simulations for sensitivity and robustness analyses of the
   proposed Feature Fusion technique (= Feature concatenation + Dimension
   reduction + Feature smoothing). Details of smoothing and generation of
   Feature Topology Plots are available in Additional file [122]1: Text
   S2.

Cluster analysis for subtype discovery and visualization

   In Fig. [123]2 (d), we can generate the smoothed intensities of each
   patient j in the m^th omics data set. Suppose we scale the MDS
   coordinates to the unit square (x, y coordinates between 0 and 1). We
   represent the smoothed intensity vector (of length (n + 1)^2) of
   patient j in the m^th omics data set as
   [MATH: <msubsup><mi>c</mi><mi>j</mi><mfenced close=")"
   open="("><mi>m</mi></mfenced></msubsup><mspace
   width="0.5em"></mspace><mo>=</mo><mspace
   width="0.5em"></mspace><mfenced close="}"
   open="{"><mrow><msubsup><mover accent="true"><mi>f</mi><mo
   stretchy="true">^</mo></mover><mi>j</mi><mfenced close=")"
   open="("><mi>m</mi></mfenced></msubsup><mspace
   width="0.5em"></mspace><mfenced close=")" open="("><mrow><mstyle
   scriptlevel="+1"><mfrac><mi>s</mi><mi>n</mi></mfrac></mstyle><mo>,</mo>
   <mspace width="0.5em"></mspace><mstyle
   scriptlevel="+1"><mfrac><mi>t</mi><mi>n</mi></mfrac></mstyle></mrow></m
   fenced></mrow></mfenced><mtext>,</mtext> :MATH]
   where
   [MATH: <msubsup><mover accent="true"><mi>f</mi><mo
   stretchy="true">^</mo></mover><mi>j</mi><mfenced close=")"
   open="("><mi>m</mi></mfenced></msubsup><mspace
   width="0.5em"></mspace><mfenced close=")" open="("><mrow><mstyle
   scriptlevel="+1"><mfrac><mi>s</mi><mi>n</mi></mfrac></mstyle><mo>,</mo>
   <mspace width="0.5em"></mspace><mstyle
   scriptlevel="+1"><mfrac><mi>t</mi><mi>n</mi></mfrac></mstyle></mrow></m
   fenced> :MATH]
   is the smoothed intensity at the
   [MATH: <mfenced close=")" open="("><mrow><mstyle
   scriptlevel="+1"><mfrac><mi>s</mi><mi>n</mi></mfrac></mstyle><mo>,</mo>
   <mspace width="0.5em"></mspace><mstyle
   scriptlevel="+1"><mfrac><mi>t</mi><mi>n</mi></mfrac></mstyle></mrow></m
   fenced> :MATH]
   MDS coordinate (s = 0,1,…, n and t = 0,1,…,n). To perform cluster
   analysis to assign patients into clusters based on the m^th omics data
   set, we adopt the dissimilarity measure between any two patients i and
   j as D^(m)[j,j '] = 1 − cor (c^(m)[j], c^(m)[j ']), where cor(⋅,⋅) is
   the Pearson correlation of two vectors. Partition around medoids (PAM)
   [[124]31] is then applied to cluster patients into potential disease
   subtypes. The number of clusters is determined by Gap statistics
   [[125]32] for each omics data set. Figure [126]2e (Step vii to viii)
   shows a schematic demonstration that each of the two omics and clinical
   data sets generates three clusters of patients. One can further average
   the Feature Topology Plots in each cluster to visually show the cluster
   patterns (Step ix in Fig. [127]2e). For example, Additional file
   [128]1: Figure S8(a) shows almost identical cluster pattern
   visualization from mRNA and miRNA expression data sets, which are
   further validated by the confusion matrix with few off-diagonal
   patients. This justifies combining mRNA and miRNA for clustering and
   comparison with clustering from the clinical data in Fig. [129]4.

Integrative strategy for multi-omics clustering

   One major difficulty of disease subtype discovery using multi-omics
   data is the possible heterogeneity of clustering results from different
   omics data. The above iPF framework and visualization can handle
   integration of M = 2 omics data sets. When M ≥ 3, we propose to first
   compare all possible pairs of omics data sets (Fig. [130]3 Step 1). In
   our motivating example, three pairwise comparisons are demonstrated in
   Additional file [131]1: Figure S8. It is clear that mRNA and miRNA
   gives almost the same clustering results while clustering from clinical
   data is very different from mRNA and miRNA. As a result, we combine
   mRNA and miRNA (Fig. [132]3 Step 2) and compare with clinical data
   (Fig. [133]3 Step 3) to generate the final result shown in Fig. [134]4.

Biomarker detection and functional annotation

   After patient clusters are determined from mRNA + miRNA and clinical
   data clustering in Fig. [135]4, we select the unambiguous clusters
   (cluster A: COPD; cluster I: ILD; cluster E: novel intermediate
   subtype) for further biomarker detection and functional annotation.
   One-way analysis of variance [[136]33] is applied to detect significant
   biomarkers associated with the three clusters (p-value adjusted by
   Bonferroni < 1e-10). The detected biomarkers are clustered using
   partition around medoids (PAM) to identify co-expressed gene modules
   and the number of gene modules is determined by Gap statistics
   (Additional file [137]1: Figure S10). Functional annotation is
   performed using Ingenuity Pathway Analysis (IPA) software.

Methods for pathway enrichment analysis in cluster E

   In order to identify underlying biological pathways distinguishing the
   microarray samples from patients in cluster E relative to cluster I, we
   performed IPA and the Ingenuity Downstream Effects Analysis. The
   downstream effects analysis is based on prior knowledge of expected
   causal effects between genes and biological functions stored in the
   Ingenuity® Knowledge Base. This analysis examined genes in the selected
   modules that are known to affect biological functions and compared
   their direction of change (i.e. expression in cluster E samples
   relative to cluster I) to that expected from the literature. In this
   analysis, if the observed direction of change is mostly consistent with
   a particular activation state of a biological function (“increased” or
   “decreased”), then a prediction is made about that activation state.
   For each biological function, an activation z-score is computed. The
   activation z-score is used to infer likely activation states of
   biological functions based on comparison with a model that assigns
   random regulation directions.

Conlusion

   We present an integrative analysis tool to inter-connect disease
   subphenotypes and visualize feature intensity patterns. Using a large
   dataset of lung samples with parallel genomic and phenomic data, we
   show this integrative phenotyping framework (iPF) can lead to
   successful feature discovery and integrative clustering in high
   dimensional space. Applying iPF to large data sets, we identify a
   subphenotype of patients with Chronic Obstructive Pulmonary Disease and
   Interstitial Lung Disease, characterized by overexpression of genes
   associated with inflammatory and immune responses.

Acknowledgements