Abstract

   Idiopathic pulmonary fibrosis is a progressive and debilitating lung
   disease with large unmet medical need and few treatment options. We
   describe an analysis connecting single cell gene expression with bulk
   gene expression-based subsetting of patient cohorts to identify IPF
   patient subsets with different underlying pathogenesis and cellular
   changes. We reproduced earlier findings indicating the existence of two
   major subsets in IPF and showed that these subsets display different
   alterations in cellular composition of the lung. We developed
   classifiers based on the cellular changes in disease to distinguish
   subsets. Specifically, we showed that one subset of IPF patients had
   significant increases in gene signature scores for myeloid cells versus
   a second subset that had significantly increased gene signature scores
   for ciliated epithelial cells, suggesting a differential pathogenesis
   among IPF subsets. Ligand-receptor analyses suggested there was a
   monocyte-macrophage chemoattractant axis (including potentially
   CCL2-CCR2 and CCL17-CCR4) among the myeloid-enriched IPF subset and a
   ciliated epithelium-derived chemokine axis (e.g. CCL15) among the
   ciliated epithelium-enriched IPF subset. We also found that these IPF
   subsets had differential expression of pirfenidone-responsive genes
   suggesting that our findings may provide an approach to identify
   patients with differential responses to pirfenidone and other drugs. We
   believe this work is an important step towards targeted therapies and
   biomarkers of response.

Introduction

   Idiopathic pulmonary fibrosis (IPF) is a chronic and progressive
   fibrosing disease of the lung with a median survival time of <5 years
   after diagnosis [[32]1, [33]2]. IPF is characterized histologically by
   a pattern of usual interstitial pneumonia and the appearance of
   honeycombing cysts and fibroblastic foci [[34]1, [35]2]. Although the
   disease is associated with infiltration and accumulation of
   inflammatory cells, IPF patients typically do not improve with
   anti-inflammatory therapy and the only approved IPF therapies,
   nintedanib and pirfenidone, are anti-fibrotic and not curative [[36]3,
   [37]4]. Despite recent advances in genome-wide association studies
   (GWAS) [[38]5–[39]7], the mechanisms connecting genetic susceptibility,
   environmental factors and molecular and pathological changes in IPF are
   incompletely understood.

   IPF is a heterogeneous disease with differences in clinical outcome and
   rates of disease worsening, suggesting that there are subsets of IPF
   patients with different molecular mechanisms of pathogenesis [[40]8,
   [41]9]. As such, a better understanding of IPF pathogenesis and subset
   heterogeneity is essential to advance new therapies for this
   devastating disease. A previous attempt by Yang et al. [[42]10] to
   understand the molecular basis of IPF heterogeneity identified two
   subsets of IPF patients that were primarily differentiated on the basis
   of high and low expression of genes from ciliated epithelium; the
   former was associated with greater pulmonary honeycombing. However,
   this finding has not been replicated in another study and the
   pathophysiologic correlates of both subsets have not been explored,
   including the interactions of ciliated epithelium with other cell
   types. Cell phenotype-based studies of IPF patient blood and lung
   samples have shown that increases in plasma cells and mast cells and
   decreases in T cells were respectively associated with mild versus
   severe disease [[43]11–[44]13]. Importantly, the overlap between
   subsets identified using different data sources such as gene
   expression, and cell phenotypes have not been investigated. The
   development of molecular classifiers to reliably detect and separate
   subsets of IPF patients using machine learning has not been attempted.

   In the current study, we found that the subsets described by Yang et
   al. ([45]GSE32537, referred to henceforth as ‘Schwartz-Univ of Colorado
   bulk expression cohort’) [[46]10] were replicated in our analysis of a
   new overlapping cohort of IPF patients from a study by Kaminski and
   colleagues ([47]GSE47460, referred to henceforth as ‘Kaminski-LGRC bulk
   expression cohort’) [[48]14–[49]17] and in our analysis of a
   non-overlapping independent cohort of IPF patients ([50]GSE134692 (BMS
   bulk RNA-seq cohort) [[51]18]). We characterized the cellular changes
   associated with each subset of IPF patients using cell type signatures
   derived from recently published single cell RNA sequencing (scRNAseq)
   data obtained from IPF patients and healthy lungs including
   [52]GSE132771 (i.e. ‘Sheppard-UCSF single cell cohort’), [53]GSE135893
   (‘Kropski-Vanderbilt Univ single cell cohort’) and [54]GSE136831
   (‘Kaminski-Yale Univ single cell cohort’) [[55]19, [56]20, [57]21].
   Importantly, we identified coordinated changes in genes associated with
   different cell types in each subset of IPF patients that have important
   implications for the molecular mechanisms driving disease. Finally, we
   developed molecular classifiers using machine learning approaches to
   reliably distinguish subsets of patients.

Methods

Processing of [58]GSE32537 (Schwartz-Univ of Colorado bulk expression cohort)
and [59]GSE47460 (Kaminski-LGRC bulk expression cohort) IPF gene expression
dataset

   We downloaded and reprocessed microarray data from Schwartz and
   colleagues ([60]GSE32537, Schwartz-Univ of Colorado bulk expression
   cohort [[61]10] and Kaminski and colleagues ([62]GSE47460,
   Kaminski-LGRC bulk expression cohort) [[63]14–[64]17, [65]22] using
   ArrayStudio (Qiagen). We applied quantile normalization to the raw data
   and applied the ‘Remove batch effects’ function in ArrayStudio
   (Qiagen). We posted normalized gene expression matrices along with the
   code used to process the data on GitHub
   ([66]https://github.com/JKarmanAbbVie/IPFproject2020).

Processing of additional public bulk IPF gene expression studies used in this
study

   Author-supplied normalized matrix and design files for [67]GSE134692
   (BMS bulk RNA-seq cohort) [[68]18] were downloaded from Gene Expression
   Omnibus. For [69]GSE134692 (BMS bulk RNA-seq cohort) [[70]18], we only
   used samples from ‘Batch 1’ to avoid batch effects [[71]18].
   [72]GSE124685 (‘Kaminski-Yale Univ bulk progression RNA cohort’)
   [[73]12] RNA sequencing dataset was re-processed from SRA files posted
   in Gene Expression Omnibus using ArrayStudio. Count data was normalized
   using the ‘edgeR’ R package [[74]23] (‘TMM’ method) implemented in
   ArrayStudio. Datasets used in this study are summarized in [75]Table 1.

Table 1. Data sets used in this manuscript.

   Accession number Name for dataset used in manuscript Platform IPF
   patients (n) Healthy controls (n) References