Abstract Idiopathic pulmonary fibrosis is a progressive and debilitating lung disease with large unmet medical need and few treatment options. We describe an analysis connecting single cell gene expression with bulk gene expression-based subsetting of patient cohorts to identify IPF patient subsets with different underlying pathogenesis and cellular changes. We reproduced earlier findings indicating the existence of two major subsets in IPF and showed that these subsets display different alterations in cellular composition of the lung. We developed classifiers based on the cellular changes in disease to distinguish subsets. Specifically, we showed that one subset of IPF patients had significant increases in gene signature scores for myeloid cells versus a second subset that had significantly increased gene signature scores for ciliated epithelial cells, suggesting a differential pathogenesis among IPF subsets. Ligand-receptor analyses suggested there was a monocyte-macrophage chemoattractant axis (including potentially CCL2-CCR2 and CCL17-CCR4) among the myeloid-enriched IPF subset and a ciliated epithelium-derived chemokine axis (e.g. CCL15) among the ciliated epithelium-enriched IPF subset. We also found that these IPF subsets had differential expression of pirfenidone-responsive genes suggesting that our findings may provide an approach to identify patients with differential responses to pirfenidone and other drugs. We believe this work is an important step towards targeted therapies and biomarkers of response. Introduction Idiopathic pulmonary fibrosis (IPF) is a chronic and progressive fibrosing disease of the lung with a median survival time of <5 years after diagnosis [[32]1, [33]2]. IPF is characterized histologically by a pattern of usual interstitial pneumonia and the appearance of honeycombing cysts and fibroblastic foci [[34]1, [35]2]. Although the disease is associated with infiltration and accumulation of inflammatory cells, IPF patients typically do not improve with anti-inflammatory therapy and the only approved IPF therapies, nintedanib and pirfenidone, are anti-fibrotic and not curative [[36]3, [37]4]. Despite recent advances in genome-wide association studies (GWAS) [[38]5–[39]7], the mechanisms connecting genetic susceptibility, environmental factors and molecular and pathological changes in IPF are incompletely understood. IPF is a heterogeneous disease with differences in clinical outcome and rates of disease worsening, suggesting that there are subsets of IPF patients with different molecular mechanisms of pathogenesis [[40]8, [41]9]. As such, a better understanding of IPF pathogenesis and subset heterogeneity is essential to advance new therapies for this devastating disease. A previous attempt by Yang et al. [[42]10] to understand the molecular basis of IPF heterogeneity identified two subsets of IPF patients that were primarily differentiated on the basis of high and low expression of genes from ciliated epithelium; the former was associated with greater pulmonary honeycombing. However, this finding has not been replicated in another study and the pathophysiologic correlates of both subsets have not been explored, including the interactions of ciliated epithelium with other cell types. Cell phenotype-based studies of IPF patient blood and lung samples have shown that increases in plasma cells and mast cells and decreases in T cells were respectively associated with mild versus severe disease [[43]11–[44]13]. Importantly, the overlap between subsets identified using different data sources such as gene expression, and cell phenotypes have not been investigated. The development of molecular classifiers to reliably detect and separate subsets of IPF patients using machine learning has not been attempted. In the current study, we found that the subsets described by Yang et al. ([45]GSE32537, referred to henceforth as ‘Schwartz-Univ of Colorado bulk expression cohort’) [[46]10] were replicated in our analysis of a new overlapping cohort of IPF patients from a study by Kaminski and colleagues ([47]GSE47460, referred to henceforth as ‘Kaminski-LGRC bulk expression cohort’) [[48]14–[49]17] and in our analysis of a non-overlapping independent cohort of IPF patients ([50]GSE134692 (BMS bulk RNA-seq cohort) [[51]18]). We characterized the cellular changes associated with each subset of IPF patients using cell type signatures derived from recently published single cell RNA sequencing (scRNAseq) data obtained from IPF patients and healthy lungs including [52]GSE132771 (i.e. ‘Sheppard-UCSF single cell cohort’), [53]GSE135893 (‘Kropski-Vanderbilt Univ single cell cohort’) and [54]GSE136831 (‘Kaminski-Yale Univ single cell cohort’) [[55]19, [56]20, [57]21]. Importantly, we identified coordinated changes in genes associated with different cell types in each subset of IPF patients that have important implications for the molecular mechanisms driving disease. Finally, we developed molecular classifiers using machine learning approaches to reliably distinguish subsets of patients. Methods Processing of [58]GSE32537 (Schwartz-Univ of Colorado bulk expression cohort) and [59]GSE47460 (Kaminski-LGRC bulk expression cohort) IPF gene expression dataset We downloaded and reprocessed microarray data from Schwartz and colleagues ([60]GSE32537, Schwartz-Univ of Colorado bulk expression cohort [[61]10] and Kaminski and colleagues ([62]GSE47460, Kaminski-LGRC bulk expression cohort) [[63]14–[64]17, [65]22] using ArrayStudio (Qiagen). We applied quantile normalization to the raw data and applied the ‘Remove batch effects’ function in ArrayStudio (Qiagen). We posted normalized gene expression matrices along with the code used to process the data on GitHub ([66]https://github.com/JKarmanAbbVie/IPFproject2020). Processing of additional public bulk IPF gene expression studies used in this study Author-supplied normalized matrix and design files for [67]GSE134692 (BMS bulk RNA-seq cohort) [[68]18] were downloaded from Gene Expression Omnibus. For [69]GSE134692 (BMS bulk RNA-seq cohort) [[70]18], we only used samples from ‘Batch 1’ to avoid batch effects [[71]18]. [72]GSE124685 (‘Kaminski-Yale Univ bulk progression RNA cohort’) [[73]12] RNA sequencing dataset was re-processed from SRA files posted in Gene Expression Omnibus using ArrayStudio. Count data was normalized using the ‘edgeR’ R package [[74]23] (‘TMM’ method) implemented in ArrayStudio. Datasets used in this study are summarized in [75]Table 1. Table 1. Data sets used in this manuscript. Accession number Name for dataset used in manuscript Platform IPF patients (n) Healthy controls (n) References