Abstract

   Asthma is a heterogeneous, complex syndrome, and identifying asthma
   endotypes has been challenging. We hypothesize that distinct endotypes
   of asthma arise in disparate genetic variation and life-time
   environmental exposure backgrounds, and that disease comorbidity
   patterns serve as a surrogate for such genetic and exposure variations.
   Here, we computationally discover 22 distinct comorbid disease patterns
   among individuals with asthma (asthma comorbidity subgroups) using
   diagnosis records for >151 M US residents, and re-identify 11 of the 22
   subgroups in the much smaller UK Biobank. GWASs to discern asthma risk
   loci for individuals within each subgroup and in all subgroups combined
   reveal 109 independent risk loci, of which 52 are replicated in
   multi-ancestry meta-analysis across different ethnicity subsamples in
   UK Biobank, US BioVU, and BioBank Japan. Fourteen loci confer asthma
   risk in multiple subgroups and in all subgroups combined. Importantly,
   another six loci confer asthma risk in only one subgroup. The strength
   of association between asthma and each of 44 health-related phenotypes
   also varies dramatically across subgroups. This work reveals
   subpopulations of asthma patients distinguished by comorbidity
   patterns, asthma risk loci, gene expression, and health-related
   phenotypes, and so reveals different asthma endotypes.

   Subject terms: Classification and taxonomy, Asthma
     __________________________________________________________________

   Asthma is a heterogeneous, complex syndrome that arises in individuals
   with various genetic and exposure variations. Here, the authors show
   that disease comorbidity patterns can serve as a surrogate for these
   variations, and identify asthma endotypes distinguished by comorbidity
   patterns, asthma risk loci, gene expression, and health-related
   phenotypes.

Introduction

   Asthma is a prevalent, debilitating, and expensive condition that
   affects about 30 million Americans and about 300 million people
   worldwide^[60]1. It is a heterogeneous complex syndrome that
   undoubtedly represents an amalgam of multiple distinct “diseases,” each
   stemming from a different constellation of genetic variations,
   environmental exposure histories, and molecular mechanisms that results
   in a generally similar clinical diathesis. The heterogeneous nature of
   asthma is evidenced in its varying clinical presentations, spectrum of
   airway inflammation, and differences in individual responses to asthma
   treatments^[61]2–[62]14. Moreover, the risk loci discovered by
   genome-wide association studies (GWASs) in very large samples of
   individuals with “asthma” do not account for all of the genetic risks
   for asthma, indicating that genetic variants in additional loci are yet
   to be discovered. These missing loci likely include those that
   contribute to specific subtypes of asthma – but acquiring sufficiently
   large numbers of individuals with detailed phenotypic and genetic data
   to study the genetics of asthma subgroups has been challenging.

   We and others have performed studies of genetic variation, gene
   expression, and DNA methylation in an attempt to identify patient
   subpopulations based on pathogenetic mechanism
   (“endotypes”)^[63]15–[64]21, but such studies require direct patient
   contact and invasive procedures to obtain airway cells, thereby
   limiting the number of participants.

   The extreme heterogeneity of asthma makes it paradigmatic of many
   complex common diseases. Consequently, designing an approach to
   distinguish asthma patient subgroups within which individuals share
   common pathogenetic mechanisms could provide a beacon for parallel
   approaches in other complex common diseases of the lung (e.g., COPD,
   interstitial lung disease) or of other organ systems (e.g.,
   hypertension, congestive heart failure, type 2 diabetes).

   In this work, we describe a novel approach based on the hypothesis that
   individuals with different asthma endotypes might be separable based on
   the other accompanying (non-asthma) diseases they have. Our reasoning
   is as follows: Each comorbid disease category (e.g., cardiovascular
   disease, gastrointestinal disease, or breast cancer) is characterized
   by sets of variations across many genes and sets of exposures (e.g.,
   neighborhood environment, infections, toxins, in utero, experiential),
   behaviors, and traumas that together predispose to diseases in the
   category^[65]22–[66]27. Thus, comorbid diseases altogether can be
   considered a “surrogate” for a corresponding broad genetic and exposure
   landscape. It seemed likely to us that the asthma diathesis that
   develops in individuals with one of these broad genetic/exposure
   landscapes may well have a different pathophysiological basis compared
   to other asthmatic individuals, whose asthma arises in a very different
   genetic/exposure landscape. The endotypes of asthmatic individuals from
   such different landscapes may manifest in unique sets of asthma risk
   loci and distinct phenotypic characteristics. In this study, we tested
   this hypothesis.

Results

Developing a workflow for asthma subgroup identification

   To identify asthma subgroups with distinct comorbidity patterns from a
   collection of diagnosis records, we applied a “topic modeling”
   approach^[67]28–[68]34, inspired by natural-language processing (NLP).
   In essence, identifying asthma subgroups can be considered as the same
   task as extracting “topics” (such as “US politics” or “biotechnology
   news”) from a collection of newspapers, if the following analogies are
   made: (i) A disease code is a “word;” (ii) A patient’s diagnosis record
   that contains disease codes (each with its respective abundance) is a
   “sentence” that consists of words (with words possibly repeated); (iii)
   A large collection of patient-specific diagnosis histories is a
   “collection of sentences”; and (iv) An asthma subgroup as defined by a
   specific distribution of co-occurring diseases (i.e., a comorbidity
   pattern) is a “topic” (i.e., a probability distribution over words).
   Specifically, we implemented a Hierarchical Dirichlet Process (HDP)
   model^[69]35,[70]36, originally proposed for unsupervised clustering of
   large collections of texts, such as news articles. In our version of
   implementation, we treat chronologically ordered clinical histories of
   individual patients as sentences. In this representation,
   natural-language words map to disease diagnostic codes (a “text”), and
   a large collection of patient histories maps to “text corpus.” The
   underlying generative probabilistic model of data is built on formalism
   of a stochastic Dirichlet process. In this formalism, each disease
   subtype is generated by a unique Dirichlet process, and Dirichlet
   processes for individual disease subtypes share a base distribution
   which itself is drawn from a Dirichlet process. The HDP modeling
   automatically determined the optimal number of subgroups through a
   nonparametric Bayesian model selection approach (see Methods).

   The MarketScan database of diagnosis contains records for over 151
   million US residents^[71]37, covering 567 major groups of diseases
   suggested by ICD code taxonomy^[72]38,[73]39. We selected asthma
   patients aged 15–70 who also had comorbid diseases to construct the
   “collection of sentences” for modeling. The resulting population was
   around six million, of which we used records from one million randomly
   selected individuals each time as input to the HDP modeling, repeating
   the modeling process for 100 times (see a flowchart in Fig. [74]1a). A
   large ensemble of clusters was thus generated, and a cluster therein
   was essentially a specific frequency distribution of comorbid diseases.
   Some resulting clusters were similar, while others were not, partially
   due to the stochastic nature of HDP modeling. The inter-cluster
   dissimilarity, i.e., dissimilarity between frequency distributions, can
   be measured by Jensen-Shannon divergence, and we then applied
   Hierarchical Density-Based Spatial Clustering of Applications with
   Noise (HDBSCAN)^[75]40–[76]42 to discern the stable subgroups of
   recurring clusters from non-recurring ones (outliers). We considered a
   subgroup to be stable and designated it as an “asthma subgroup,” only
   if it enclosed more than 50 cluster points (see Methods for parameter
   selection results). By applying this subgroup-discovery workflow, we
   identified 22 asthma comorbidity subgroups, each with a unique
   distribution of 567 disease frequencies. The specific frequency
   distribution of 567 disease groups defined the “comorbidity pattern” in
   an asthma subgroup and was quantified collectively by its enclosed
   clusters. The median values, as well as minima, the first quartiles,
   the third quartiles, and maxima of the occurring frequencies of
   diseases in the clusters are shown in Supplementary Data [77]1 for each
   subgroup.

Fig. 1. Identification of asthma subgroups through topic modeling.

   [78]Fig. 1
   [79]Open in a new tab

   a Flowchart of asthma subgroup identification. The MarketScan data
   includes around six million asthma patients who have at least one
   comorbid disease (CD). To enable the estimation of sample statistics,
   we randomly selected one million patients and applied topic modeling to
   obtain comorbidity clusters (one cluster is projected as one point in
   the t-SNE plot). This procedure was repeated 100 times, generating a
   large collection of clusters shown as thousands of scattered points in
   the t-SNE projection. We used this t-SNE low-dimensional projection of
   topics only for visualization purpose, rather than for cluster
   discovery. With inter-cluster dissimilarity measured by Jensen-Shannon
   divergence, we applied HDBSCAN to identify stable subgroups of clusters
   as well as their hierarchies. A potential subgroup was deemed to be a
   stable “asthma subgroup”, only if it harbored more than 50 cluster
   points. We also conducted a sensitivity analysis on our identification
   approach in four additional cohorts, and subsequentially show the
   eleven subgroups that were commonly found in all the different cohorts
   above. Then, given the distribution of diagnosis counts shown in an
   individual’s record, we can express it as a linear combination of the
   distributions of diagnosis counts as defined in the asthma subgroups,
   and suggest that the subgroup with the largest assigned coefficient
   could represent the individual’s record best, therefore “assigning” the
   individual to this subgroup (
   [MATH:
   <msub><mrow><mi>W</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   ,
   [MATH:
   <msub><mrow><mi>Φ</mi></mrow><mrow><mi>k</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   , and
   [MATH:
   <msub><mrow><mi>Θ</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>k</mi></mrow
   ></msub> :MATH]
   contain the information about record-diagnosis co-occurrences, subgroup
   profiles, and assignment coefficients, respectively; see Methods for
   more details). b The top ten frequently occurring diseases in the
   identified eleven asthma subgroups. A complete and precise definition
   of an asthma subgroup requires one to specify the frequency
   distribution of 567 disease groups. For each subgroup, we use a bar
   plot to show its top ten frequently occurring diseases, and color-code
   the bars as well as the annotations by the broader categories that the
   diseases belong to. The y axis denotes the normalized occurring
   frequency of a given disease, and we can see that a subgroup is named
   after the broader category to which several most frequently occurring
   diseases belong (see Supplementary Data [80]1 for the subgroup profiles
   in detail).

   Next, we conducted sensitivity analyses on our identification approach
   using other four different cohorts, including (i) individuals in the
   MarketScan data who were aged between 15 and 70, but carried at least
   two asthma codes, (ii) individuals in the MarketScan data who carried
   at least one asthma code, but were aged between 40 and 70, (iii)
   individuals in the MarketScan data who not only were aged between 15
   and 70 and carried at least one asthma code, but also had at least one
   type of asthma drug prescriptions, and (iv) individuals enrolled in UK
   Biobank (UKB). By repeating the exact same procedure as described
   above, we could re-discover 21, 20, 22, and eleven subgroups out of the
   original 22, respectively (see Supplementary Data [81]2–[82]5 and
   Supplementary Table [83]2 for the subgroup profiles). For visualization
   purpose only, the asthma subgroups were projected into a
   two-dimensional space using the t-SNE algorithm^[84]43, and we show the
   eleven subgroups that were found in all the different cohorts above in
   Supplementary Fig. [85]1b. Supplementary Fig. [86]1c shows the
   hierarchical clustering of these subgroups, and for each subgroup, a
   word cloud summarizes comorbid diseases therein contained and their
   occurring frequencies (proportional to the font sizes). For easier
   reference, we labeled each asthma comorbidity subgroup with a serial
   number and the broader category to which several most frequently
   occurring diseases belonged (see Fig. [87]1b for the relative
   frequencies of top 10 comorbid conditions in each asthma subgroup).

   To our knowledge, this is the first such analysis of asthma comorbidity
   patterns over the entire disease spectrum. Some comorbid conditions
   identified in the 2007 American National Asthma Education and
   Prevention Program (NAEPP) guidelines^[88]44 appear prominent in
   certain subgroups, such as gastrointestinal disease in subgroup 3 and
   depression in subgroup 11. More interestingly, some comorbidity
   associations are novel, such as lymphoma in subgroup 4 and joint
   disorder in subgroup 5.

Identifying genetic associations specific to asthma subgroups

   Our underlying premise is that each individual’s comorbid diseases
   arose in a gene-environment background that predisposed to their
   occurrences. Therefore, comorbidities can serve as surrogates for the
   various overall gene-environment settings in which different asthma
   endotypes can arise. By comparing patients with and without asthma who
   all share the same comorbidity pattern (as defined in an asthma
   comorbidity subgroup), we studied asthma risk genes in a
   subgroup-specific manner. For this purpose, we selected unrelated
   individuals of white British background and with high-quality
   genotyping from the UKB^[89]45 as a discovery cohort, including 44,383
   asthma cases and 260,715 non-asthma controls (see Table [90]1 and
   Supplementary Table [91]4). With the profiles of subgroups
   comprehensively defined, we could assign any individual to the most
   appropriate asthma subgroup that best matched an individual’s complete
   collection of disease diagnoses and respective occurring frequencies
   (see Methods).

Table 1.

   Descriptions of used databases
   Database Ethnicity Total sample size (asthma case count) Male
   percentage Median age^g Usage
   MarketScan (select age ≥15)^a White (78.3%), Black (14.5%)^f 84,315,387
   (6,048,247) 44.8% 41 (29–53) Asthma subgroup identification
   UK Biobank^b British white 305,098 (44,383) 45.7% 59 (51–64) GWAS
   discovery, and phenotype association analysis
   Irish white 22,600 (3,186) 41.9% 57 (49–63) Replication of GWAS
   findings via meta-analysis
   African, Caribbean 6,833 (998) 40.5% 51 (46–58)
   BioVU^c White 16,060 (1,668) 50.3% 61 (51–71)
   BioBank Japan^d East Asian 194,413 (3,368) 54.1% 65 (55–73)
   UChicago RNAseq^e White (37.1%), Black (58.6%) 70 (42) 32.9% 38 (27–50)
   Differential gene expression validation
   [92]Open in a new tab

   ^aThe MarketScan insurance claims database in the US, including
   diagnosis records.

   ^bNational health database in the UK, including diagnosis records and
   genotype data.

   ^cPatient-based registry of Vanderbilt University Medical Center,
   including diagnosis records and genotype data.

   ^dPatient-based registry in Japan, including diagnosis records and
   genotype data.

   ^eRNAseq transcriptome profiles of bronchial epithelial cells of
   patients enrolled in the University of Chicago.

   ^fImputed percentage based on county-level distributions of race.

   ^gValues in parentheses are interquartile ranges given in years.

   First, we performed a larger GWAS of asthma by comparing asthma cases
   and non-asthma controls among all individuals with any comorbid
   diseases (“any-CDs group”)^[93]46. We observed 103 independent loci of
   genome-wide significance (p < 5 × 10^−8), 13 of which were not
   previously reported in the NHGRI-EBI GWAS catalog database^[94]47.

   Second, we assigned asthma cases and non-asthma controls to their
   comorbidity subgroups, forming case and control subgroup pairs that
   shared the same comorbidity patterns (see Fig. [95]2a). Within each of
   the eleven subgroups that were re-discovered in UKB, we carried out a
   GWAS of asthma, identifying 14 loci that were also found in the initial
   larger GWAS analysis, plus six additional loci that conferred asthma
   risk in one subgroup, but not in the other subgroups or in the initial
   asthma GWAS. We show Manhattan plots of these results in Fig. [96]2b
   and annotate significant loci with their nearest genes (see
   Supplementary Table [97]1 for the complete loci information).

Fig. 2. Genome-wide significant associations with asthma.

   [98]Fig. 2
   [99]Open in a new tab

   a Study design for association analyses. Starting with the general
   population who may have any comorbid diseases (the any-CDs group) in UK
   Biobank, we were able to assign an individual with 1 of the 11 asthma
   subgroups that were found in UK Biobank. Then, we performed GWASs to
   identify asthma risk loci for the any-CDs group and for each subgroup
   individually (by comparing asthma cases against non-asthma controls
   within each subgroup). b GWAS Manhattan plots. This figure overlays
   GWAS results from the any-CDs group (in black) and from five selected
   subgroups (in multi-colors) that contained genome-wide significant
   asthma risk loci, including subgroups 3 “GI,” 4 “Lymphoma,” 5
   “Musculoskeletal,” 6 “Lung,” and 8 “Cardiovascular.” All the
   association p values are shown on a –log[10] scale on the y axis, and
   genomic locations are shown on the x-axis. The threshold of genome-wide
   significance (5 × 10^−8) is indicated as a horizontal dashed line in
   red. Triangles at top indicate SNPs that have a higher –log[10](p
   value) than shown. In addition, we annotate genome-wide significant
   loci with the names of their nearest genes, and in the case where a
   gene is commonly found in multiple subgroups and in the any-CDs group,
   the subgroup serial numbers and letter “G” are written, respectively,
   in parentheses under the gene name. In particular, we highlight the
   genes nearest to the six subgroup-specific loci by rotating their names
   with an angle of 45 degrees. More details can be found in Supplementary
   Table [100]1.

   For example, in addition to being significantly associated with asthma
   in the initial larger GWAS, variants near IL1RL1, KIAA2026, EMSY, and
   GSDMB were also associated with asthma in subgroup 3 “GI;” variants
   near TSLP, RANBP6, and SLC7A10 in subgroup 8 “Cardiovascular;” and
   variants near D2HGDH, HLA-DQA1, IL33, and SMAD3 in subgroups 3 and 8.
   The lead SNPs at the six subgroup-specific loci include rs11144271
   (near OSTF1, p = 2.50 × 10^−8) and rs113757163 (near COX10,
   p = 1.58 × 10^−9) in subgroup 5 “Musculoskeletal,” rs2249851 (in
   FAM129B, p = 3.30 × 10^−9) in subgroup 3 “GI,” rs76225731 (in SNHG14,
   p = 3.66 × 10^−8) in subgroup 6 “Lung,” rs117262476 (in PCNT,
   p = 1.46 × 10^−8) in subgroup 4 “Lymphoma,” and rs2765400 (near
   KRT8P37, p = 2.56 × 10^−8) in subgroup 8 “Cardiovascular.” Five of the
   six subgroup-specific loci, except for the last one (rs2765400), were
   novel, meaning, never reported in any asthma GWASs before. If a
   Bonferroni correction is further applied to adjust the twelve GWASs in
   total (eleven subgroups and a general asthma population), and the
   adjusted genome-wide significance threshold becomes 4.17 × 10^−9 (i.e.,
   5 × 10^−8/12), then there are two associations that remain significant:
   rs113757163 near COX10 and rs2249851 in FAM129B.

   In summary, we identified a total of 109 independent loci, representing
   the union of all genome-wide significant asthma risk loci found in any
   of the GWASs in our study (Fig. [101]3a). We investigated the
   heterogeneity in the effect sizes of the lead SNPs at these 109 loci
   across the eleven subgroups, using a Cochran’s Q test^[102]48. This
   revealed significant heterogeneity at nine loci (marked with red #
   symbols in Fig. [103]3b), which included all the six subgroup-specific
   loci (Supplementary Data [104]6). To validate these discoveries, we
   conducted a multi-ancestry meta-analysis^[105]49–[106]53 of four
   additional cohorts, including two subsets from UKB that were not
   included in the initial GWAS (a cohort of white Irish and any other
   white background, and a cohort of African, Caribbean and any other
   backgrounds associated with recent African descent, respectively), a
   European ancestry subset of BioVU from the Vanderbilt University
   Medical Center^[107]54,[108]55, and an East Asian ethnic group from
   BioBank Japan (BBJ)^[109]56–[110]58. After multiple testing correction,
   there remained 61 associations (involving 52 loci) successfully
   replicated, consisting of 49 (involving 49 loci) from the any-CDs group
   and twelve (involving ten loci) from subgroups. The latter, in
   particular, included three subgroup-specific loci: rs11144271 (near
   OSTF1), and rs113757163 (near COX10), both in subgroup 5
   “Musculoskeletal,” and rs2765400 (near KRT8P37) in subgroup 8
   “Cardiovascular” (see Supplementary Data [111]7 for summary
   statistics).

Fig. 3. Summary of genome-wide significant loci and differential gene
expression.

   [112]Fig. 3
   [113]Open in a new tab

   a A summary of the significant loci in a Venn diagram. The association
   analysis by comparing asthma cases and non-asthma controls in the
   any-CDs group identified 103 independent loci at genome-wide
   significance level. Similar association analyses within each of the
   eleven asthma subgroups discovered 20 significant loci, of which 14
   were also seen in the any-CDs group, and, interestingly, six more loci
   were specific to one subgroup only. Altogether there were 109
   independent loci identified. b Association results for significant
   loci. The forest plot on the left side summarizes the association
   results seen in the any-CDs group for the 109 loci, at which the lead
   SNPs are listed in the first column. Squares denote the effect sizes,
   i.e., natural logarithm of odds ratios or ln(OR), and horizontal lines
   are the 95% confidence intervals. From top to bottom, the effect sizes
   are in ascending order, from negative (in blue) to positive values (in
   red). The wave-like plot on the right side displays a series of effect
   sizes seen in the eleven subgroups that can be found in UK Biobank for
   each of the 109 SNPs. The subgroup names are labeled along the
   horizontal axis, while for each of the 109 SNPs that are displayed
   along the vertical axis, its effect size is represented as a peak in
   the red shade if it is positive, and as a trough in blue shade if
   negative. The absolute value of the effect size is proportional to the
   height (or depth) of the peak (or trough), and is also color-coded. All
   the genome-wide significant associations between SNPs and subgroups are
   marked with green asterisks, and in particular, the six SNPs that are
   specific to one subgroup only are highlighted in green in the first
   name column. In addition, the heterogeneity of per-locus effect sizes
   across the eleven subgroups was assessed through a Cochran’s Q test,
   finding nine loci with evidence of significant heterogeneity in effect
   sizes (indicated with # symbols in red after the respective SNP names
   in the first column). See Supplementary Data [114]6 for the association
   results in detail and Supplementary Fig. [115]5 for the numbers of
   allocated cases and controls in each subgroup. c Differential gene
   expression. For three of the subgroup-specific SNPs, we confirmed the
   differential expression of their nearby genes (i.e., OSTF1, COX10, and
   FAM129B), using an independent dataset of bronchial epithelial
   transcriptome profiles. The gene OSTF1, for example, has significantly
   lower expression among asthma cases in subgroup 5 “Musculoskeletal”,
   compared to non-asthma controls and asthma cases in other subgroups
   (see the x axis labels and respective sample sizes in parentheses). The
   y-axis shows the normalized transcript count on a log[2] scale, i.e.,
   log[2][(transcript count+0.5)/size factor]; the minimum, the first
   quartile, the median, the third quartile, and the maximum of OSTF1 for
   non-asthma controls are 9.17, 9.42, 9.54, 9.67, and 9.93, for asthma
   cases in subgroup 5 are 8.91, 9.24, 9.31, 9.34, and 9.43, and for
   asthma cases in other subgroups are 9.20, 9.34, 9.45, 9.59, and 9.97;
   these values of COX10 for non-asthma controls are 7.80, 7.97, 8.03,
   8.12, and 8.30, for asthma cases in subgroup 5 are 8.14, 8.26, 8.26,
   8.33, and 8.35, and for asthma cases in other subgroups are 7.15, 7.86,
   8.00, 8.09, and 8.33; these values of FAM129B for non-asthma controls
   are 12.41, 12.71, 12.85, 13.01, and 13.44, for asthma cases in subgroup
   3 are 12.91, 13.06, 13.15, 13.23, and 13.35, and for asthma cases in
   other subgroups are 12.28, 12.62, 12.73, 13.04, and 13.64). The mean
   log[2] fold changes (L2FC) of OSTF1 in subgroup 5 of asthma cases were
   −0.30 (two-sided Wald statistic p value = 0.0019) and −0.25 (p
   value = 0.011), when compared to non-asthma controls and asthma cases
   in other subgroups, respectively. The other comparisons show that
   bronchial epithelial cell expression of COX10 in subgroup 5
   “Musculoskeletal” and FAM129B in subgroup 3 “GI” are significantly
   higher, compared to non-asthma controls and their respective asthma
   cases in other subgroups.

   Third, using transcriptome data from bronchial epithelial cells (BECs)
   obtained by bronchoscopy from a small number of patients (42 asthma
   cases and 28 non-asthma controls) at the University of
   Chicago^[116]18,[117]59, we checked for possible differential
   expression of the genes nearest to the six subgroup-specific loci.
   Based on the available diagnosis information, we assigned the 42 asthma
   cases into comorbidity subgroups; only subgroups 5 and 3 involving
   three genes (OSTF1, COX10, and FAM129B) contained five or more
   individuals, and were included in these analyses. We formularized gene
   transcript counts using a generalized linear model of the negative
   binomial family^[118]60 with age, sex, and ethnicity included as
   covariates. We compared asthma cases within each group to two reference
   (control) groups: 28 non-asthmatic individuals, and the asthma cases
   that fell into those subgroups other than the one being tested (see
   Methods). As shown in Fig. [119]3c, OSTF1 expression was significantly
   reduced while COX10 was overexpressed in asthmatics in subgroup 5
   “Musculoskeletal”, compared to the expression levels in the non-asthma
   controls or in the asthma cases not in subgroup 5. The expression of
   FAM129B was significantly higher among the cases in subgroup 3 “GI”
   compared to either reference group. In addition, we used both the
   HaploReg v4.1^[120]61 and the Genotype-Tissue Expression project
   (GTEx)^[121]62 databases to determine whether the associated SNPs were
   also expression quantitative trait loci (eQTLs). We found that
   rs11144271 is an eQTL for OSTF1 in whole blood (p = 2.5 × 10^−29), and
   rs2249851 is an eQTL for FAM129B in cultured fibroblasts
   (p = 3.1 × 10^−22), in whole blood (p = 1.3 × 10^−6), in pituitary
   (p = 3.6 × 10^−5), and in tibial artery (p = 6.1 × 10^−5). Admittedly,
   differential expression analysis and functional validation of
   additional genes will be needed to infer causal associations between
   the genes and subgroup-specific asthma risk.

   Next, we performed pathway enrichment analyses based on the full
   subgroup association results. Asthma subgroups indeed show distinct
   sets of enriched biological pathways/processes, for example,
   keratinocyte differentiation (p = 7.52 × 10^−16) and the regulation of
   leukocyte proliferation (p = 5.27 × 10^−7) in subgroup 3 “GI”, and
   keratinocyte differentiation (p = 4.40 × 10^−19) and epidermal cell
   differentiation (p = 2.24 × 10^−14) in subgroup 8 “Cardiovascular”.
   These enriched biological pathways or processes could potentially
   inform subgroup-specific asthma pathogeneses. Complete listings for all
   the eleven asthma subgroups can be found in Supplementary Table [122]3.

Asthma associations with health-related phenotypes differ across subgroups

   If the identified subgroups reflect true endotypes, then there should
   be health-related phenotypes (e.g., measurable clinical differences)
   that differentially associate with asthma among comorbidity subgroups
   and possibly suggest distinct pathogenetic mechanisms. To test this, we
   leveraged the phenotypic data in the UKB resource^[123]45, and focused
   on a total of 140 different phenotypes that measured ten health-related
   categories, including spirometry, blood count, blood biochemistry,
   urine biochemistry, early life factors, anthropometry, addictions,
   diet, physical activity, and local environment. We focused these
   studies on the same cohort as we used for the GWAS discovery: unrelated
   individuals of white British ethnicity with available diagnosis
   records. We implemented a multivariate adaptive shrinkage (mash)
   method^[124]63 to assess the heterogeneity of the associations across
   subgroups by benchmarking against the larger group with any
   comorbidities (as benchmarks; see Methods).

   The first step was to examine asthma associations for each phenotype in
   the ten categories, in each subgroup as well as in the larger group. We
   used the slope estimate of an association, i.e., increased likelihood
   of asthma with respect to increasing or decreasing value of the
   phenotypic feature, to denote the association’s direction (by the sign
   of the slope) and strength (by the absolute value of the slope; see
   Supplementary Data [125]11). The analysis revealed 44 phenotypes
   associated with asthma differentially across subgroups (see
   Supplementary Data [126]12 for the estimates of the slope differences
   after benchmarking against any-CDs group). These subgroup-specific
   differential associations are highlighted in color (blue signifies less
   positive than the benchmark, red signifies more positive than the
   benchmark) in the meta-plots in Fig. [127]4 and Supplementary
   Fig. [128]7, which show the posterior means and variances of the
   association slopes. This analysis demonstrated that clinically relevant
   phenotypes indeed varied across subgroups, with some suggesting
   potential subgroup-specific endotypic mechanisms (see Discussion).

Fig. 4. Differential asthma associations with health-related phenotypes
across subgroups.

   [129]Fig. 4
   [130]Open in a new tab

   A total of 10 different categories of health-related phenotypes (140
   different measurements in total) were subjected to phenotype
   association analysis (see Methods for technical details and
   Supplementary Data [131]11 for the numbers of allocated cases and
   controls in each subgroup). We first computed phenotypes’ slope
   estimates of asthma associations within each subgroup and in the
   any-CDs group. The direction and strength of the association are
   characterized by the sign and absolute value of the slope,
   respectively. a Heterogeneous slope estimates related to blood count.
   We assessed the heterogeneity in these slope estimates across subgroups
   for each phenotype, and benchmarked against the slope value for that
   phenotype in the any-CDs group. Each phenotype is presented as a
   meta-plot, which shows the posterior means (as squares) and 95 percent
   confidence intervals (as error bars) of the slopes from subgroups 1 to
   11 that were also discovered in UK Biobank (displayed from top to
   bottom). Slope estimates that are significantly less positive than the
   any-CDs group benchmark (marked by a vertical dashed line) are shown in
   blue, while those that are significantly more positive are shown in
   red; the respective subgroup numbers are also shown for significantly
   different subgroups. For example, subgroup 6 “Lung” exhibits many
   red-blood-cell-related phenotypes that are in significantly stronger
   associations with asthma likelihood than appear for the general
   population in the any-CDs group. b Heterogeneous slope estimates
   related to the local environment, diet, and physical activity. In the
   same fashion as shown in a, we display the meta-plots of the phenotypes
   in the categories of the local environment, diet, and physical
   activity. A distinct pattern of these phenotypes distinguishes subgroup
   1 “Diabetes,” in which stronger associations of greenspace, air
   quality, salt intake, and exercise are evident.

   Finally, we have collated summary statistics of relevant health-related
   phenotypes available in MarketScan and UKB data (including white blood
   cell counts, spirometry measurements, body mass index, smoking status,
   age of asthma onset, and asthma medications), and compare them in a
   subgroup-specific manner in Supplementary Tables [132]5–[133]9. As
   shown in Supplementary Table [134]6, the abnormalities of spirometry
   measures (including reductions of predicted forced vital capacity
   (FVC), forced expiratory volume in one second (FEV[1]), peak expiratory
   flow (PEF), and the ratio of FEV[1] to FVC (FEV[1]/FVC) are greatest in
   subgroup 6 “Lung” and are modest in subgroup 5 “Musculoskeletal;”
   Supplementary Table [135]9 shows that inhaled steroid combinations with
   long-acting beta agonists or antibody inhibitors, both of which
   medication categories are usually prescribed for more severe asthma,
   have the largest fraction of users in the “Lung” subgroup and less than
   half that fraction in the “Musculoskeletal” subgroup. Collectively,
   these suggest that the “Lung” subgroup may comprise individuals with
   more severe asthma than that experienced by individuals in the
   “Musculoskeletal” subgroup.

Discussion

   Currently, the most widely adopted method of asthma classification is
   based on severity, defined by the level of symptoms, lung function, and
   rescue bronchodilator use^[136]44. Asthma has also been classified by
   onset age: early- and late-onset^[137]64; by the presence or absence of
   allergic sensitization: atopic and nonatopic^[138]65,[139]66; by the
   level of symptom control: controlled, partly controlled, and
   uncontrolled^[140]67; or, more recently by the co-occurrence of other
   medical conditions like obesity^[141]68,[142]69,
   rhinosinusitis^[143]70, and depression^[144]71–[145]73, which are
   thought to exacerbate symptoms or even directly contribute to asthma
   pathogenesis. One problem with the current classifications lies in poor
   coherence and subjectivity; studies have shown that poor agreement can
   exist across classification systems^[146]74, official guidelines, and
   physician assessment^[147]75. Additionally, there is increasing
   evidence that the current classifications can sometimes be too broad to
   adequately reflect the highly heterogeneous characteristics observed in
   asthma populations^[148]4,[149]64,[150]76. In this study, we sought to
   discover asthma subtypes in a data-driven, probabilistic modeling-based
   unsupervised way: (i) We gathered large-scale, multi-dimensional
   datasets, including very large diagnosis records and genotype data
   originating from multiple countries (US, UK, and Japan), RNA-sequencing
   profiles (laboratory measurement), and a suite of health-related
   phenotypic measures (see Table [151]1 and Supplementary Fig. [152]5 for
   a brief summary of all the used datasets); (ii) The workflow and
   methodologies proposed in this study are a showcase for the benefits
   from the integration of these multi-dimensional information, and can
   work as machinery that has general applicability towards the
   investigation of other complex diseases.

   The ever-increasing availability of large-scale administrative medical
   records has allowed us to find emerging comorbid conditions among
   asthma patients^[153]77–[154]85 and should allow the investigation of
   their adverse effects, including asthma exacerbation^[155]86, lower
   quality of life^[156]87,[157]88, and increased risk of morbidity and
   mortality^[158]89. Here, we refer to a comorbidity pattern as a
   specific distribution of diseases that co-occur with asthma, and
   hypothesize that such comorbidity patterns, if analyzed systematically
   from country-scale diagnosis records, can be very informative in
   dissecting hidden heterogeneity of asthma and guiding asthma
   endotyping. The rationale for this approach is rooted in the
   hypothetical deep connection between comorbidity patterns and asthma
   endotypes. First, genetic factors can predispose an individual to
   different asthma endotypes as well as to the manifestation of many
   other co-occurring diseases, in other words, genetic origins are
   shared. Studies have shown that trait-associated SNPs discovered by
   previous GWASs are largely pleiotropic, and tend to influence general
   biological functions contributing to numerous traits^[159]90. Second,
   different asthma endotypes and comorbid conditions can also share
   environmental exposures or even possibly cause one another, promoting
   the convergence of certain comorbidities. For these reasons, we suggest
   that comorbidities are effectively working surrogates for
   gene-environment landscapes that lead to different asthma endotypes,
   and that different comorbidity subgroups may harbor unique asthma risk
   loci. In other words, it seemed likely that the additional risk factors
   for asthma to arise in one gene-environment landscape (as prevails in
   one comorbidity subgroup) are different from the additional risk
   factors that make asthma more likely to arise in a different
   gene-environment landscape (e.g., as prevails in a second comorbidity
   subgroup). In this study, we tested and confirmed this possibility. Our
   approach of using comorbid patterns to derive homogeneous endotypes
   resonates with the previous studies that identified novel disease
   subtypes and genetic loci through non-random ascertainment of
   covariates informed by multiple traits and genetics^[160]91,[161]92.
   However, this ascertainment could conceivably induce unintended, biased
   associations^[162]93–[163]95. In an effort to restrain them, we
   replicated the genetic risk loci in multiple ethnic cohorts and
   aggregated the genetic, gene expression, and phenotypic associations
   that collectively may suggest the heterogeneity existing in asthma.

   The subgroup-specific variants that were found significant in GWASs
   here may point to different pathogenetic mechanisms in asthma
   endotypes. For example, we identified an association specific to asthma
   subgroup 5 “Musculoskeletal.” The lead variant was near
   osteoclast-stimulating factor 1 (OSTF1), a gene that interacts with
   fatty acid binding protein 4 (FABP4)^[164]96, which in turn regulates
   airway inflammation in experimental asthma^[165]97,[166]98. OSTF1 also
   regulates cell motility^[167]99, which could be important in bronchial
   epithelial repair and inflammatory cell trafficking. Another nearby
   gene specific to subgroup 5, cytochrome C oxidase assembly homolog 10
   (COX10), regulates T-cell activation and
   differentiation^[168]100–[169]102, and so could regulate asthmatic
   airway inflammation in some way particularly important for this
   subgroup. Family with Sequence Similarity 129 Member B (FAM129B),
   selective for subgroup 3 “GI,” regulates glycolysis, Ras activation,
   oxidative stress, apoptosis^[170]103–[171]106, and more generalized
   cell processes whose contributions to asthma pathogenesis could take
   multiple forms. Experimental studies will be required to identify the
   exact mechanism(s) by which these genes contribute to asthma in a
   subgroup-specific fashion.

   Similarly, unique phenotypic associations also characterize some asthma
   subgroups. For example, we were struck by the strong positive slope
   relationships among multiple measures of red blood cell (RBC)
   production and accumulation in subgroup 6 “Lung,” including
   reticulocytes, erythrocytes, hematocrit, and hemoglobin (Fig. [172]4a).
   Increased RBC production could reflect higher erythropoietin
   elaboration or sensitivity. Erythropoietin is known to reduce airway
   remodeling in experimental murine allergic asthma^[173]107, perhaps
   inducing the activation of regulatory T cells^[174]108 through
   stimulation with TGF-β released from M2 macrophages. However, TGF-β is
   well known to promote airway smooth muscle differentiation and
   accumulation^[175]109–[176]112, and erythropoietin-induced TGF-β
   secretion could conceivably represent the key pathogenetic contributor
   that promotes the emergence of asthma in patients with the comorbidity
   background of subgroup 6, in which COPD is the most frequent comorbid
   disease. Consistent with this notion, the association between asthma
   and blood eosinophil count or percentage is significantly weaker in
   subgroup 6 than in the larger group with any comorbidities, suggesting
   that Th2-type inflammation may be relatively less important for the
   development of asthma in this subgroup.

   As another example (Fig. [177]4b), greater likelihood of asthma in
   subgroup 1 “Diabetes” is related to less greenspace, higher air
   pollution, higher salt intake, and lower physical activity. Indeed,
   both greenspace^[178]113 and air pollution^[179]114 have been
   previously linked to asthma prevalence or severity, and these effects
   are mirrored in the observed slopes for the whole UKB samples analyzed
   here as well. Greenspace reduces the incidence of elevated
   interleukin-8 (IL8) in serum^[180]115, while both NO[2]^[181]116 and
   particulate matter air pollution^[182]117,[183]118 induce IL8
   expression in human airway epithelium. High-intensity interval exercise
   reduces circulating IL8 in both lean and overweight-obese
   individuals^[184]119, and while eating higher salt diets, individuals
   with exercise-induced asthma experienced worsened post-exercise airflow
   obstruction and had greater induced sputum IL8 concentrations than when
   eating a low-salt diet^[185]120. Importantly, IL8 is particularly
   elevated in the lung secretions of severe asthmatics^[186]121. In all,
   the known role of IL8 in asthma and the phenotypic peculiarities of
   subgroup 1 asthmatics suggest that their asthma may be especially
   driven by IL8 secretion. Each of these potential subgroup-specific
   endotypic mechanisms should be explored experimentally. In total, out
   of the tested 140 health-related phenotypes, there are 44 showing
   significant heterogeneity across our subgroups of asthma (see
   Supplementary Fig. [187]7 for the other significant phenotypes); these
   might also contain clues about endotypic mechanisms.

   Additionally, 182 asthma-associated loci (at the suggestive threshold,
   p < 10^−5) had significantly larger effect sizes in specific subgroups
   than in the initial larger GWAS, although these associations did not
   reach genome-wide significance. Another 73 independent genome regions
   had similar effect sizes in one or more subgroups as well as the larger
   group with any comorbidities (see Supplementary Note [188]2 for
   details). Understanding these genetic specificities and commonalities,
   which collectively mapped the genomic landscapes of asthma subgroups,
   can be critical in discovering new asthma endotypes and in elucidating
   their distinct or shared molecular etiologies.

   Admittedly, disentangling genetic and environmental heterogeneity of
   asthma is difficult because (i) sample size diminishes quickly in the
   process of subdividing asthma cases into subgroups; and (ii)
   asthma-associated polymorphisms tend to have small effect
   sizes^[189]122–[190]125. Although a subsampling method (see
   Supplementary Note [191]1 and Methods) alleviates these problems to
   some extent, the detection of genome-wide significant signals was still
   restricted to several relatively large subgroups. Extending our current
   work in the future, it may be possible to represent asthma groupings by
   multi-dimensional, quantitative risk scores: genotypic, phenotypic, or
   both. Advantages are two-fold: (i) Continuous risk scores would be
   assigned to asthma cases instead of binary classifications, allowing
   the samples to be used more effectively and thus providing gains in
   statistical power, while the central challenge in this regard is how to
   best incorporate into these analyses the collection of SNPs and genes,
   and; (ii) Such scores could predict one’s asthma subgroup before the
   actual onset of the score-predicted comorbidities, and so could lead to
   a better understanding of their endotype at an earlier age. Another
   possible extension of our current approach is to allow the intake of
   dynamic data about disease trajectories or progressions. This extension
   will likely be valuable, considering that previous studies have shown
   that the exact timing of specific environmental exposures during
   critical developmental windows could influence risk trajectories that
   ultimately trigger asthma^[192]126, and only the exposures occurring in
   early life may leave observable signatures^[193]127. To this end,
   longitudinal data with a reasonably long period of coverage will be
   required.

Methods

   All relevant ethical regulations have been followed. This study was
   approved by the University of Chicago Institutional Review Board, and
   informed consent was obtained from all research subjects to the work
   involving transcriptome data of BECs. The study design and conduct
   complied with all relevant regulations regarding the use of human study
   participants and was conducted in accordance with the criteria set by
   the Declaration of Helsinki.

The US MarketScan Commercial database and topic modeling for asthma subgroup
identification

   The US MarketScan databases, owned by IBM Watson Health, are a suite of
   administrative claims-based databases that include inpatient and
   outpatient claims, medical procedure claims, prescription claims,
   clinical utilization records, and healthcare expenditures. These data
   were collected from employers, managed care organizations, health plan
   providers, and state Medicaid agencies. The covered patient population
   is mainly composed of relatively more affluent, privately-insured
   segments of US society^[194]37,[195]128. Distinct strengths that lie in
   the MarketScan databases include: (i) comprehensive and high-quality
   coding of diagnoses, procedures, and drug prescriptions, (ii) large
   collection of samples that cover over half of the US population, (iii)
   longitudinal tracking at the individual level, and (iv) full
   integration of inpatient and outpatient care events, emergency care
   services and outpatient pharmaceutical data. More than 900
   peer-reviewed research articles have been published since the launch of
   these databases in 1995, and the number of related publications has
   increased even more rapidly in recent years^[196]129,[197]130.

   In order to identify asthma subgroups in this study, we used one of the
   US MarketScan databases—the US MarketScan Commercial Claims and
   Encounters database (US MarketScan data). The US MarketScan data
   contain the US country-scale collection of diagnosis records for over
   151 million unique individuals who were enrolled in the database during
   the years between 2003 and 2013. We selected those individuals who were
   aged between 15 and 70, and carried an asthma code with at least one
   comorbid disease (in addition to asthma). Here, we used 493.00–493.99
   (for ICD-9-CM) and J45.0–J45.998 (for ICD-10-CM) as asthma codes. The
   resulting population was 6,048,247, and we used their diagnosis records
   to identify comorbidity-based asthma subgroups. Asthma classification
   based on diagnosis records was pursued using a topic modeling approach,
   by analogy with Word documents.

   In topic modeling, a document can be viewed as a mixture of topics,
   where a topic is defined as a distribution over a fixed vocabulary,
   then a topic model describes a probabilistic generative process for the
   document in two stages: first, to specify the topic proportions, and
   second, for the generation of each word in the document, to assign a
   topic according to its specified proportion and draw a word from the
   corresponding distribution^[198]28–[199]33.

   On the basis of our diagnosis records consisting of International
   Classification of Diseases versions 9 and 10 (ICD-9 and ICD-10) codes,
   we only took into account unique ICD codes per day (only keeping unique
   ICD codes on each day) and then grouped these ICD codes into 567 major
   groups of disease diagnoses on the basis of their clinical
   manifestations^[200]38,[201]39. These 567 disease groups constituted
   the basic “vocabulary”, which all the records were built on. An asthma
   subgroup can be analogously defined as a distribution of diseases
   (other than asthma) that reflects an existing common comorbidity
   pattern among asthma patients.

   After terminology conversion from “document–topic–word” to “diagnosis
   record–asthma subgroup–diagnosis”, the probabilistic generative process
   for a diagnosis record (equivalent to a word document in document
   modeling) also involves two stages: first, to assign subgroup
   proportions, and second, for the generation of each diagnosis in the
   record, to choose a subgroup (equivalent to a topic) and to draw a
   diagnosis (equivalent to a word) within accordingly. In reality, we are
   dealing with a statistical inference problem: only diagnosis records
   can be observed, and the goal is to extract the underlying subgroups
   that are most likely to have generated these data. For this purpose, a
   Hierarchical Dirichlet Process (HDP) model^[202]35 was applied, and its
   C++ implementation is publicly available at the Github repository at
   [203]https://github.com/blei-lab/hdp^[204]36. We set the hyperparameter
   “max_iter” (maximal number of iterations) to be 500, which is large
   enough for the modeling process to converge (based on our initial test
   runs). Supplementary Fig. [205]4a shows its basic design: Shaded and
   unshaded nodes indicate observed and latent variables, respectively;
   Arrows denote conditional dependencies between variables, and plate
   notations are used to illustrate repeated sampling steps. For example,
   the inner plate over
   [MATH:
   <msub><mrow><mi>Z</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>W</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   denotes the repeated sampling of asthma subgroup assignments and
   diagnoses until
   [MATH: <msub><mrow><mi>N</mi></mrow><mrow><mi>d</mi></mrow></msub>
   :MATH]
   diagnoses are generated for diagnosis record d. The plate over
   [MATH:
   <msub><mrow><mi>Θ</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>k</mi></mrow
   ></msub> :MATH]
   demonstrates the repeated sampling of a distribution over subgroups for
   each diagnosis record d for a collection of D records, and the plate
   surrounding
   [MATH:
   <msub><mrow><mi>Φ</mi></mrow><mrow><mi>k</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   illustrates the sampling of diagnosis distributions for each subgroup k
   until the total number K is reached. Hyperparameters
   [MATH: <mi>α</mi> :MATH]
   and
   [MATH: <mi>β</mi> :MATH]
   define the HDPs which are the distributions over a set of random
   probability measures over
   [MATH:
   <msub><mrow><mi>Θ</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>k</mi></mrow
   ></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>Φ</mi></mrow><mrow><mi>k</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   , respectively. Therefore, given the observed
   [MATH:
   <msub><mrow><mi>W</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   , statistical inference aims to estimate
   [MATH:
   <msub><mrow><mi>Θ</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>k</mi></mrow
   ></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>Φ</mi></mrow><mrow><mi>k</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   ^[206]34. A nonparametric Bayesian approach was implemented to infer
   these parameters, and the optimal number of subgroups can also be
   learnt in the process instead of being fixed a priori.

   In our implementation of HDP modeling (see the flowchart in
   Fig. [207]1a), we randomly selected one million out of the six million
   records of asthma patients as input each time, and repeated the HDP
   modeling process 100 times, gathering a large collection of clusters.
   Some clusters had similar profiles, while others did not (partially due
   to the stochastic nature of HDP modeling). We measured the
   inter-cluster dissimilarity by Jensen-Shannon divergence and
   considering all the 567 disease dimensions, and applied HDBSCAN
   (Hierarchical Density-Based Spatial Clustering of Applications with
   Noise)^[208]40–[209]42, discovering 22 stable subgroups of recurring
   clusters as well as their hierarchies. A subgroup was deemed to be
   stable if it harbored more than 50 cluster points. The number of
   cluster points enclosed in these 22 subgroups are 103, 103, 112, 93,
   93, 125, 70, 87, 65, 104, 126, 67, 78, 79, 98, 107, 56, 74, 142, 68,
   182, and 100, respectively. In particular, if we only look at the
   eleven subgroups that can be replicated in other cohort settings, their
   number of belonging cluster points are 103, 93, 125, 70, 126, 78, 79,
   98, 56, 74, and 100, respectively. We understood that the threshold
   number of cluster points for claiming a stable subgroup was an
   important hyperparameter. Therefore, at the very beginning, we tested
   different numbers, for example, 25, 50, and 100, yielding 29, 22, and
   two subgroup partitions, respectively. We found that 50 was the optimal
   threshold number, leading to the 22 subgroups that suggested a
   reasonable nosology, as judged by physicians in our team. While the
   comorbidity patterns seen in the 29 subgroups (if the threshold number
   of cluster points for claiming a stable subgroup is set to be 25)
   appeared to be scattered and trivial, and the comorbidity patterns seen
   in the two subgroups (if the threshold number of cluster points for
   claiming a stable subgroup is set to be 100) would be too coarse. We
   additionally justified the hyperparameter selection of 50 cluster
   points using the elbow method. In detail, we tried different threshold
   numbers of cluster points for claiming a stable subgroup, and compute
   their mean stability scores^[210]41 of all the resulting subgroups
   after specifying a threshold number. We then plot these mean stability
   scores against the threshold numbers of cluster points (see
   Supplementary Fig. [211]1a). The location in the plot at which the
   increase of the mean stability scores switches from fast to slow (the
   elbow location) is regarded as the indicator of the optimal threshold
   number. In this work, the optimal number is 50 (indicated by a dashed
   line in the plot).

   The occurring frequency of a given disease in the subgroup can be
   precisely quantified by the median value as well as the minimum, the
   first quartile, the third quartile, and maximum of the frequency values
   of the disease in the enclosed clusters collectively (see Supplementary
   Data [212]1 for the subgroup profiles in detail). Just for
   visualization purposes, we show the t-SNE two-dimensional projection of
   the identified asthma subgroups in Supplementary Fig. [213]1b.

   Furthermore, we examined the sensitivity of modeling results towards
   four different cohort settings, including (i) the 3,152,519 individuals
   in the US MarketScan data who were aged between 15 and 70, but carried
   at least two asthma codes (as opposed to one asthma code used in the
   original configuration), (ii) the 3,401,250 individuals in the US
   MarketScan data who carried at least one asthma code, but were aged
   between 40 (as opposed to 15 used in the original configuration) and
   70, (iii) the 3,687,965 individuals in the US MarketScan data who not
   only were aged between 15 and 70 and carried at least one asthma code,
   but also had at least one type of asthma drug prescriptions (the asthma
   drug prescriptions that are documented in the database include antibody
   inhibitor, inhaled corticosteroids, inhaled steroid combinations with
   long-acting beta agonists, leukotriene modifiers, mast cell
   stabilizers, methylxanthines, short-acting inhaled beta-2 agonists, and
   systemic corticosteroids), and (iv) the 66,448 individuals enrolled in
   UK Biobank who carried at least one asthma code and were aged between
   39 and 72. Note that UK Biobank, different from MarketScan’s
   administrative claims-based database, is a national health registry
   dataset and more skewed towards an older and white-ancestry population
   (see Table [214]1 and Supplementary Table [215]4 for comparison
   details). By repeating the exact same procedure as described above (see
   the flowchart in Fig. [216]1a), we successfully replicated 21, 20, 22,
   and eleven subgroups, respectively, out of the original 22 (see
   Supplementary Data [217]2–[218]5 for the subgroup profiles). In order
   to assess whether any of the subgroups generated based on the cohorts
   for sensitivity analyses can be claimed as successful replications of
   the subgroups discovered based on the discovery cohort, we computed
   their Pearson’s correlations based on the median frequency profiles of
   comorbid diseases in the respective subgroups. We only claim a
   successful replication if the respective correlation is determined to
   be significant. The common set of the successful replications of the
   discovered subgroups using all four different cohort settings comprised
   eleven subgroups (see Supplementary Table [219]2), and we specifically
   termed them “asthma subgroups”. For easier reference, each asthma
   subgroup is named after the broader category to which several most
   frequently-occurring diseases belonged, although it is the distribution
   of 567 disease groups that completely define the subgroup (see
   Supplementary Fig. [220]1c). As a summary, we analyzed independent
   large asthma cohorts and found that the identified asthma subgroups
   were largely consistent. Using the two largest datasets, US Marketscan
   and UK Biobank, we identified eleven stable topics/subgroups. Note that
   we aimed at arriving a not necessarily exhaustive but necessarily
   stable set of topics/subgroups across at least two datasets.

   In next, considering that databases, such as US MarketScan used here,
   contain diagnosis information about individuals in different abundance
   and for different durations, we wanted to examine the extent to which
   the discovered subgroups were proxies of diagnosis code counts or
   observation times, or in other words, to find out whether the eleven
   subgroups end up with similar diagnosis counts and observation times.
   Therefore, we reported the summary statistics of individuals’ diagnosis
   code counts in each of the eleven subgroups and all of them combined
   based on US MarketScan data (see Supplementary Table [221]11 for their
   minimum, the first quartile, the median, the mean, the third quartile,
   and maximum values). Given the individuals’ diagnosis code counts of
   any two subgroups or all the eleven subgroups combined, we can assess
   their distribution similarity by estimating the overlapping area of
   their kernel density estimations^[222]131. In total, we examined 66
   comparisons of subgroup pairs by exhausting all the possible pair
   combinations of the eleven asthma subgroups and all of them combined,
   i.e.,
   [MATH: <mfenced close=")" open="("><mrow><mfrac
   linethickness="0"><mrow><mn>12</mn></mrow><mrow><mn>2</mn></mrow></mfra
   c></mrow></mfenced><mo>=</mo><mn>66</mn> :MATH]
   (see Supplementary Table [223]12). The distribution similarity metric
   is equal to 1 for two identical distributions and 0 for two completely
   dissimilar ones. We found that the median and mean similarity values
   were 0.731 and 0.714, respectively. In addition, we compared enrollment
   patterns (visibility of patients in claims) of patients in all putative
   asthma subgroups and them combined. We computed (a) the total
   enrollment time (the duration when an individual stays enrolled) and
   (b) total diagnosis recording time (the duration from the time of the
   individual’s first diagnosis record to the time of the last diagnosis
   record). Supplementary Table [224]13 summarizes the summary statistics
   of these duration values in a subgroup-specific manner, and
   Supplementary Table [225]14 reports their distribution similarity
   values for any two subgroup (or all the eleven subgroups combined)
   pairs out of 66 possible comparisons. The results show that
   distributions of observation times across subgroups (or combined
   subgroups) are very similar: (a) for the enrollment durations, the
   median and mean similarity values are 0.820 and 0.814, respectively;
   (b) for the recording durations, the median and mean similarity values
   are even higher, 0.835 and 0.840, respectively. Altogether, these large
   similarity values suggest that there exists no systemic difference
   between subgroups and between single subgroups and combined subgroups
   in terms of diagnosis code counts or observation times. In rare cases,
   “GI” and “Lymphoma” subgroups have relatively low similarity value
   (0.4119) in the distribution comparison of their code counts, but still
   the similarity between the distributions of their enrollment or
   recording durations is high (~0.7).

   Lastly, in order to check whether subgroup assignment to individuals
   solely depended on the single, most frequently occurring disease or
   not, we computed two types of assignment fraction values based on the
   diagnosis records of asthma patients. Taking the “Psychiatric” subgroup
   (the most frequently occurring disease is “Depression”) as an example,
   we computed (i) the fraction of patients who are in the “Psychiatric”
   subgroup indeed carry the “Depression” code, and (ii) the fraction of
   patients who carry the “Depression” code are eventually assigned to the
   “Psychiatric” subgroup. As a result, for the “Psychiatric” subgroup,
   the fraction i is 0.902, indicating a large majority of patients in the
   subgroup do carry the top code (interestingly, the remaining 10% of
   patients do not have to carry the top code in order to be assigned to
   the subgroup). The fraction ii is as low as 0.208, suggesting that
   having the top code alone is far from guaranteeing one to be assigned
   to the respective subgroup and other codes as well as their occurring
   frequencies play a role in such subgroup assignment process. Similar
   phenomena can also be observed in the other ten subgroups
   (Supplementary Table [226]15).

Asthma subgroup assignment

   After stable asthma subgroups are identified, the next task is to find
   an appropriate subgroup label for each individual that can best
   describe her/his comorbidity pattern, and to do this assignment for
   both asthmatic and non-asthmatic individuals. In fact, we purposely
   intended to use the subgroups discovered in asthma patients to classify
   non-asthma patients as well, so that we could compare asthma and
   non-asthma individuals who fell into the same subgroup (or in other
   words, shared the same comorbidity pattern), for example, in
   genome-wide association analysis.

   From the perspective of matrix factorization, the statistical inference
   process described in Methods above can be expressed as finding a
   low-dimensional representation for the record-diagnosis (document-word)
   co-occurrence matrix of
   [MATH:
   <msub><mrow><mi>W</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   by decomposing it into the matrix of subgroup (topic) proportions
   [MATH:
   <msub><mrow><mi>Θ</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>k</mi></mrow
   ></msub> :MATH]
   and the matrix of subgroups (topics)
   [MATH:
   <msub><mrow><mi>Φ</mi></mrow><mrow><mi>k</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   (see Supplementary Fig. [227]4b, and its notations are the same as
   those used in Supplementary Fig. [228]4a). Given
   [MATH:
   <msub><mrow><mi>W</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   (observed) and
   [MATH:
   <msub><mrow><mi>Φ</mi></mrow><mrow><mi>k</mi><mo>,</mo><mi>n</mi></mrow
   ></msub> :MATH]
   (identified by HDP modeling), we can estimate
   [MATH:
   <msub><mrow><mi>Θ</mi></mrow><mrow><mi>d</mi><mo>,</mo><mi>k</mi></mrow
   ></msub> :MATH]
   by minimizing the least-square errors between the left- and right-hand
   sides of the equation. Finally, we labeled the individual
   [MATH: <mi>d</mi> :MATH]
   with the subgroup of which the respective proportion value was the
   highest among
   [MATH: <mfenced close="}"
   open="{"><mrow><msub><mrow><mi>Θ</mi></mrow><mrow><mi>d</mi><mo>,</mo><
   mn>1</mn></mrow></msub><mo>,</mo><mo>⋯</mo><mspace
   width="0.25em"></mspace><mo>,</mo><msub><mrow><mi>Θ</mi></mrow><mrow><m
   i>d</mi><mo>,</mo><mi>k</mi></mrow></msub></mrow></mfenced> :MATH]
   . In other words, given the distribution of diagnosis counts shown in
   an individual’s record, we tried to express it as a linear combination
   of the distributions of diagnosis counts as defined in the asthma
   subgroups, and then suggested that the subgroup with the largest
   assigned coefficient could represent the individual’s record best. It
   is worth emphasizing that the subgroup assignment accounts for (i) not
   a few dominant diseases in one’s diagnosis record but the complete
   collection of diseases therein, and (ii) not just the diseases’
   presence but their frequencies of appearance in records.

   This subgroup assignment process was applied to all the participating
   cohorts prior to the analyses of genome-wide associations,
   replications, differential gene expression, and phenotypic associations
   (see Supplementary Fig. [229]5 for the allocations of these cohorts to
   the asthma subgroups).

UK Biobank (UKB) database and GWAS

   The UKB database is a National Health Service registry database in the
   United Kingdom, including around 500,000 participants who were aged
   40–69 years and recruited between 2006 and 2010^[230]45. This database
   was mainly used to find genotypes and phenotypes that appear to be
   significantly different between asthma cases and non-asthma controls in
   each of the eleven asthma subgroups that have been identified using the
   US MarketScan data. We selected the individuals who had diagnosis
   records plus genotype and/or phenotype data available. Diagnosis
   records were retrieved from both self-reports and medical assessments
   during regular visits, and this information was used in assigning
   participants to the identified asthma subgroups.

   First of all, we checked whether there was some skew towards certain
   ancestry admixture for the eleven different asthma subgroups by
   examining the first (PC1) and the second (PC2) genetic principal
   components. We report the summary of PC1 and PC2 in the asthma case and
   non-asthma control pair in each of the eleven subgroups. Supplementary
   Table [231]10 summarizes the minimum, the first quartile, the median
   value, the mean value, the third quartile, and the maximum of PC1 and
   of PC2. Given the PC1 or PC2 values of two subgroups (either case or
   control), we can assess their distribution similarity by estimating the
   overlapping area of their kernel density estimations^[232]131. In
   total, we examine 231 comparisons of subgroup pairs by exhausting all
   the possible pair combinations of the 22 subgroups that include the
   eleven case subgroups and the eleven respective control subgroups,
   i.e.,
   [MATH: <mfenced close=")" open="("><mrow><mfrac
   linethickness="0.0pt"><mrow><mn>22</mn></mrow><mrow><mn>2</mn></mrow></
   mfrac></mrow></mfenced><mo>=</mo><mn>231</mn> :MATH]
   . The distribution similarity metric is equal to 1 for two identical
   distributions and 0 for two completely dissimilar ones. For PC1, the
   minimum, the first quartile, the median value, the mean value, the
   third quartile, and the maximum similarity values are as high as 0.874,
   0.918, 0.947, 0.940, 0.962, and 0.980, respectively. For PC2, the
   minimum, the first quartile, the median value, the mean value, the
   third quartile, and the maximum similarity values are also very high,
   0.848, 0.925, 0.943, 0.940, 0.958, and 0.985, respectively. These
   results suggest that none of the eleven asthma subgroups are enriched
   due to a particular ancestry admixture.

   Within each subgroup, association analyses were performed to discover
   asthma-associated genetic variants and various phenotypes (see Methods
   “Associating with health-related phenotypes based on UKB phenotypic
   data”). In UKB, a total of around 96 million genetic variants,
   including genotyped and imputed variants, were eligible for genome-wide
   association analysis^[233]45. We chose the unrelated participants
   within the white British ancestry subset who were paired with
   high-quality genotype data and diagnosis records for the analysis, and
   the sample size was 305,098 (including 44,383 asthma cases who also had
   at least one comorbid disease). Furthermore, we imposed the following
   quality control thresholds: SNP call rate >0.95, minor allele frequency
   >0.01 and Hardy–Weinberg equilibrium p > 10^−6.

   We used a logistic-regression model to test statistical associations
   between additive SNP effects (i.e., 0, 1, 2 allele dosage coding) and
   asthma^[234]46, within the group of individuals with any comorbid
   diseases (the any-CDs group) or within each of the identified
   subgroups. It is worth noting that the asthma cases were always
   compared against the corresponding non-asthma controls that shared the
   same comorbidity pattern as defined in the respective subgroup. The
   covariates include sex, age of enrollment, and the first ten genetic
   principal components.

   We considered an association to be suggestive and worthy of further
   investigation if its p < 10^−5, and to be genome-wide significant if
   its p < 5 × 10^−8^[235]132. The lead SNPs that met the suggestive
   threshold were subject to further statistical test on whether their
   effects were indeed significantly stronger than those found in the
   any-CDs group (see Methods “Stronger risk loci identification using a
   subsampling method”). Importantly, we identified 103 genome-wide
   significant loci in the any-CDs group and 20 in asthma subgroups (14
   loci overlapped or 109 loci in union). To control the false discovery
   rate (FDR), we subjected all the GWAS results out of the twelve GWASs
   (in eleven subgroups and in a general asthma population) to multiple
   testing corrections using the Benjamini–Hochberg procedure. All the
   genome-wide significant loci we reported in Supplementary Data [236]7
   were still significant after multiple testing corrections, with all FDR
   values <0.001. Out of these loci identified in any-CDs group and in
   asthma subgroups, 49 and 10 loci, respectively, were reproducible in a
   follow-up multi-ancestry meta-analysis across two different ethnicity
   subsets of UK Biobank, BioVU, and BBJ. In particular, there were six
   loci that conferred asthma risk to one asthma subgroup only but not to
   others (see Methods for technical details, Supplementary Table [237]1
   for summary statistics, and Supplementary Fig. [238]2 for selected GWAS
   plots). We also checked whether our identified risk loci were in
   linkage disequilibrium (LD) with any previously reported loci in the
   NHGRI-EBI GWAS catalog database^[239]47, and only claimed a novel
   finding if the LD measured by r^2 was smaller than 0.05 (based on 1000
   Genomes reference panel that is specific to British in England and
   Scotland). As a result, 18 out of the 109 identified loci were novel,
   including five subgroup-specific ones (see Supplementary Data [240]7).

   In addition, we assessed the heterogeneity of per-locus effect sizes,
   i.e., ln(OR) estimates, across all subgroups by applying Cochran’s Q
   test^[241]48. As a result, nine out of the 109 identified loci showed
   evidence of significant heterogeneity in effect sizes across asthma
   subgroups (see Supplementary Data [242]6).

Replicating genome-wide significant associations in multi-ancestry
meta-analysis

   To replicate the genome-wide significant associations discovered using
   the white British subset in UKB, we leveraged another four independent
   cohorts. Two were taken from other ethnic subsets in UKB, and
   specifically, we selected the unrelated individuals with high-quality
   genotyping: (i) 22,600 individuals of white Irish and any other white
   background (including 3186 asthma cases who also had at least one
   comorbid disease), and (ii) 6833 individuals of African, Caribbean and
   any other black background (including 998 asthma cases who also had at
   least one comorbid disease).

   As for the third cohort, we introduced another database—BioVU, a
   de-identified DNA databank from the Vanderbilt University Medical
   Center^[243]54. DNA samples were collected from routine clinical
   testing that would otherwise be discarded, and were linked to
   phenotypic data derived from electronic medical records (EMR) system.
   The clinical information in EMRs is updated every 1–3 months. The DNA
   samples underwent genome-wide genotyping with arrays including the
   Multi-Ethnic Global array, and then genotypes were imputed according to
   the HRC reference panel^[244]133 using the Michigan imputation
   server^[245]134. For replication analysis, we selected 16,060
   individuals of European descent (determined by principal component
   analysis), which included 1,668 asthma cases with at least one comorbid
   disease.

   The fourth cohort was the East Asian ethnic group from BBJ project,
   which was launched in 2003 to implement personalized medicine and is
   being conducted in three 5-year periods. The BBJ is a patient-based
   registry of around 200,000 participants who are of East Asian descent
   and diagnosed with any of 47 target common diseases. These target
   diseases, covering 15 broad categories, were selected owing to their
   clinical importance related to morbidity or mortality in Japan. Through
   the cooperation of 12 medical institutes, consisting of 66 hospitals,
   clinical information was collected and DNA samples were sequenced for
   genomic analyses^[246]58. Details about genotyping and imputation can
   be found in reference^[247]56. Previous analyses and comparisons
   against other Japanese databases using BBJ revealed largely consistent
   trends in common clinical variables, indicating that BBJ can represent
   the general patient population in Japan^[248]57. For the replication
   analysis, we selected a total of 194,413 individuals who had both
   diagnostic records and high-quality genotyping data, in which there
   were 3,368 asthma patients with at least one comorbid disease.

   Based on these four independent cohorts, we performed a multi-ancestry
   meta-analysis in the following three steps. First, as described in
   Methods “Asthma subgroup assignment”, we assigned asthma cases and
   non-asthma controls to the identified asthma comorbidity subgroups (see
   Supplementary Fig. [249]5 for the numbers of allocated cases and
   controls). Second, focusing one cohort at a time, we conducted a
   multivariate logistic-regression analysis using sex, age, and the first
   ten genetic principal components as covariates, except for BioVU data,
   in which the covariates included sex, age, the first three genetic
   principal components of ancestry, and genotyping array type/batch. In
   the case of BBJ, several target SNPs were neither genotyped nor
   imputed, we used the SNPs in the highest LD with respect to the target
   SNPs if available (LD measured by r^2, according to 1000 Genomes East
   Asian reference panel, March 2012 release; see Supplementary
   Fig. [250]3 for details). The final step was to merge these individual
   summary statistics, and we performed a meta-analysis by assuming a
   fixed effects model with inverse variance weighting^[251]49–[252]51.
   The merged effect size can be calculated as the weighted average of all
   individual effect sizes:
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mi>F</mi></mrow></msub><mo>=</mo><mfrac><mrow><msub><mrow><m
   i>w</mi></mrow><mrow><mn>1</mn></mrow></msub><msub><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mn>1</mn></mrow></msub><mo>+</mo><msub><mrow><mi>w</mi></mro
   w><mrow><mn>2</mn></mrow></msub><msub><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mn>2</mn></mrow></msub><mo>+</mo><msub><mrow><mi>w</mi></mro
   w><mrow><mn>3</mn></mrow></msub><msub><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mn>3</mn></mrow></msub><mo>+</mo><msub><mrow><mi>w</mi></mro
   w><mrow><mn>4</mn></mrow></msub><msub><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mn>4</mn></mrow></msub></mrow><mrow><msub><mrow><mi>w</mi></
   mrow><mrow><mn>1</mn></mrow></msub><mo>+</mo><msub><mrow><mi>w</mi></mr
   ow><mrow><mn>2</mn></mrow></msub><mo>+</mo><msub><mrow><mi>w</mi></mrow
   ><mrow><mn>3</mn></mrow></msub><mo>+</mo><msub><mrow><mi>w</mi></mrow><
   mrow><mn>4</mn></mrow></msub></mrow></mfrac> :MATH]
   1

   and the merged variance is
   [MATH: <mi>v</mi><mi>a</mi><mi>r</mi><mfenced close=")"
   open="("><mrow><msub><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mi>F</mi></mrow></msub></mrow></mfenced><mo>=</mo><mfrac><mr
   ow><mn>1</mn></mrow><mrow><msub><mrow><mi>w</mi></mrow><mrow><mn>1</mn>
   </mrow></msub><mo>+</mo><msub><mrow><mi>w</mi></mrow><mrow><mn>2</mn></
   mrow></msub><mo>+</mo><msub><mrow><mi>w</mi></mrow><mrow><mn>3</mn></mr
   ow></msub><mo>+</mo><msub><mrow><mi>w</mi></mrow><mrow><mn>4</mn></mrow
   ></msub></mrow></mfrac> :MATH]
   2

   where
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mn>1</mn></mrow></msub> :MATH]
   ,
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mn>2</mn></mrow></msub> :MATH]
   ,
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mn>3</mn></mrow></msub> :MATH]
   , and
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mn>4</mn></mrow></msub> :MATH]
   are effect sizes (i.e., logarithm of odds ratios) using the white Irish
   and black subsets of UKB, the European-descent subset of BioVU, and the
   East Asian group of BBJ, respectively;
   [MATH: <msub><mrow><mi>w</mi></mrow><mrow><mn>1</mn></mrow></msub>
   :MATH]
   ,
   [MATH: <msub><mrow><mi>w</mi></mrow><mrow><mn>2</mn></mrow></msub>
   :MATH]
   ,
   [MATH: <msub><mrow><mi>w</mi></mrow><mrow><mn>3</mn></mrow></msub>
   :MATH]
   , and
   [MATH: <msub><mrow><mi>w</mi></mrow><mrow><mn>4</mn></mrow></msub>
   :MATH]
   are their associated weights (i.e., the reciprocal of the respective
   squared standard errors)^[253]52,[254]53. Since an association
   replicates only if the sign of effect sizes matches between the
   discovery and replication analyses, we used a one-sided p value to test
   replication, with an expected association direction based on the
   discovery analysis^[255]135,[256]136. Out of the 128 discovered
   associations (involving 109 independent loci), 127 associations
   (involving 108 loci) were eligible for replication, and the only one
   exception was due to the small sample size (i.e., none of the four
   cohorts had more than 100 asthma cases allocated to the subgroup).
   After controlling the FDR using Benjamini–Hochberg
   procedure^[257]137,[258]138, we successfully replicated 61 associations
   (involving 52 loci, FDR < 0.10). The detailed results are summarized in
   Supplementary Data [259]7. Among the 61 associations that were
   successfully replicated at an overall meta-analysis FDR of 0.1, there
   are ten associations that have FDR values right around 0.05 (from 0.05
   to 0.06) and another ten associations that have FDR values greater than
   0.06. By carefully examining these 20 replication results for which FDR
   values fall between 0.05 and 0.1, we find different degrees of
   inconsistency in the direction of SNP effects found in the four
   replication cohorts: Compared to the effect direction found in the
   discovery cohort (UKB British white group), there are one, three, six,
   and one replication showing effects of opposite directions in UKB Irish
   and other white groups; UKB African, Caribbean, and other black group;
   BioVU European-descent group; and BBJ (only nine out of 20 associations
   have enough samples for replication attempts in the first place),
   respectively. Such inconsistency in effect directions would be greater
   for the other 66 associations that were not replicated (FDR > 0.1),
   particularly in the UKB black and BioVU groups which show 34 and 26
   cases with inconsistent directions, respectively.

Differential gene expression analysis

   We wanted to test for differential expression of the genes to which the
   six subgroup-specific SNPs were mapped. Thus, we introduced an
   independent dataset, containing transcriptome profiles of bronchial
   epithelial cells (BECs) in 42 asthma cases and 28 non-asthma controls
   enrolled in the University of Chicago hospitals^[260]19. The involved
   cDNA libraries were constructed using the TruSeq RNA Sample Preparation
   v2 Guide (Illumina) and run on the Illumina HiSEquation 2000 platform.
   Reads were mapped to the transcriptome using BWA (Burrows-Wheeler
   Aligner)^[261]139. BEDTools was used to determine the sequences that
   would overlap with protein-coding regions^[262]140. The mapped reads
   per individual ranged from 10,100,000 to 51,150,000, with median value
   to be 19,210,000. The reads were adjusted for gene length and variation
   in sample read depth, and then normalized using upper quartile
   normalization.

   Using diagnosis history information, we first assigned the 42 asthma
   patients to the five subgroups that the six SNPs (meeting genome-wide
   significance threshold) related to. Only two subgroups involving three
   SNPs had five or more individuals: subgroup 5 “Musculoskeletal” had
   five cases and subgroup 3 “GI” had seven cases. The three SNPs,
   including rs11144271, rs113757163, and rs2249851, closest to genes
   OSTF1, COX10, and FAM129B, respectively, which were subject to
   differential gene expression analysis. Two types of control groups were
   compared against: (i) the 28 non-asthmatic individuals, and (ii) the
   remaining asthma cases that were assigned to the subgroups other than
   the one to be tested.

   In this analysis, we first normalized the raw gene transcript counts by
   size factors to account for sequencing depth differences, estimated
   gene-wise dispersions, and then modeled the counts using a generalized
   linear model of the negative binomial family^[263]60. The confounding
   factors considered in the model included age, sex, and ethnicity. The
   significance of the test associations between gene counts and asthma
   subgroups were determined using the two-sided Wald test. In subgroup 11
   (joint disorder), OSTF1 was significantly lower expressed, while COX10
   was higher expressed, if compared with the respective expression levels
   in controls (i) and (ii). In subgroup 3 “GI,” the expression of FAM129B
   was significantly higher than those in the controls (Fig. [264]3c).

Associating with health-related phenotypes based on UKB phenotypic data

   To examine heterogeneity in phenotypic associations across the asthma
   subgroups, we made use of the phenotypic data in the UKB
   resource^[265]45 by focusing on a collection of 140 phenotypes that
   measured ten general categories related to health, including
   spirometry, blood count, blood biochemistry, urine biochemistry, early
   life factors, anthropometry, addictions, diet, physical activity, and
   local environment. Spirometry, in particular, includes pulmonary
   function measures on FVC, FEV[1], FEV[1]/FVC, and PEF. After computing
   their respective predicted values using the prediction equations for
   Caucasian male and female adults developed from the third US National
   Health and Nutrition Examination Survey^[266]141, we further derived
   their percentage predicted values by normalizing the measured against
   the predicted values. Finally, min-max normalization was applied to all
   the phenotypic measures, so that their values all varied from 0 to 1
   and the slope estimates of their associations could be compared to each
   other.

   This analysis was based on the same samples as used in GWAS discovery,
   i.e., the unrelated individuals who had diagnosis records available and
   were in the white British ethnic group of UKB, including about 44,383
   asthma cases and 260,715 non-asthma controls. The analysis consists of
   four steps:
    1. Find appropriate subgroup assignment for all the samples, with or
       without asthma.
    2. In a given subgroup
       [MATH: <mi>i</mi> :MATH]
       , pick a phenotypic measure and associate it with asthma diagnosis
       (yes or no) in a multivariate logistic-regression analysis using
       sex, age of enrollment, and the first ten genetic principal
       components as covariates (height is also included, if the
       phenotypic measure relates to spirometry). The resulting slope
       estimate of the phenotype (
       [MATH: <msub><mrow><mi>β</mi></mrow><mrow><mi>i</mi></mrow></msub>
       :MATH]
       ) characterizes how asthma likelihood associates with the
       phenotype: a positive (or negative) value indicates a positive (or
       negative) association; greater the absolute value is, stronger the
       association is.
    3. Repeat step 2 for all the 140 phenotypic measures and for all the
       eleven asthma subgroups as well as the any-CDs group. The false
       discovery rate was controlled via Benjamini–Hochberg
       procedure^[267]137,[268]138. Particularly, the slope (
       [MATH: <msub><mrow><mi>β</mi></mrow><mrow><mn>0</mn></mrow></msub>
       :MATH]
       ) from the any-CDs group would serve as a benchmark value to be
       used in the next step. The detailed results generated in this step
       can be found in Supplementary Data [269]11 (the raw slope estimates
       before benchmarking against any-CDs group).
    4. Estimate the deviation of
       [MATH: <msub><mrow><mi>β</mi></mrow><mrow><mi>i</mi></mrow></msub>
       :MATH]
       from
       [MATH: <msub><mrow><mi>β</mi></mrow><mrow><mn>0</mn></mrow></msub>
       :MATH]
       and test its statistical significance, allowing for a quantitative
       assessment of heterogeneity in
       [MATH: <msub><mrow><mi>β</mi></mrow><mrow><mi>i</mi></mrow></msub>
       :MATH]
       across different subgroups by comparing to the common benchmark
       [MATH: <msub><mrow><mi>β</mi></mrow><mrow><mn>0</mn></mrow></msub>
       :MATH]
       . To this end, we implemented a multivariate adaptive shrinkage
       (mash) method, which took the
       [MATH: <msub><mrow><mi>β</mi></mrow><mrow><mi>i</mi></mrow></msub>
       :MATH]
       estimates as well as their standard errors as inputs and adopted an
       empirical Bayes procedure^[270]63. Out of the 140 phenotypes, 44
       showed significant heterogeneity in
       [MATH: <msub><mrow><mi>β</mi></mrow><mrow><mi>i</mi></mrow></msub>
       :MATH]
       across asthma subgroups. The final results are summarized in
       Fig. [271]4, Supplementary Fig. [272]7, and Supplementary
       Data [273]12 (the estimates of the slope differences after
       benchmarking against any-CDs group).

Stronger risk loci identification using a subsampling method

   Here, we asked among the asthma associations found in the subgroups
   that had passed the suggestive threshold (p < 10^−5), how many of them
   were indeed significantly stronger than those found in the any-CDs
   group. To make a fair comparison of GWAS statistics, however, we needed
   to equate their statistical detection powers first.

   As statistical power is largely influenced by sample size, detecting an
   association within an asthma subgroup, which is a subset of the
   undivided general population, is relatively less powered. This can be
   demonstrated using the mathematical formula for
   [MATH: <mi>Z</mi> :MATH]
   score, written as below:
   [MATH: <mi>Z</mi><mo>=</mo><mfrac><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><msub><mrow><mover
   accent="true"><mrow><mi>σ</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row></msub></mrow></mfrac><mo>=</mo><mfrac><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mover accent="true"><mrow><mspace
   width="0.25em"></mspace><mi>σ</mi></mrow><mrow><mo>^</mo></mrow></mover
   ><mo>/</mo><msqrt><mrow><mi>n</mi></mrow></msqrt></mrow></mfrac> :MATH]
   3

   where
   [MATH: <mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover>
   :MATH]
   is a SNP effect size (i.e., the logarithm of the odds ratio),
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>σ</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row><mrow><mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover></m
   row></msub> :MATH]
   is the standard error, and
   [MATH: <mover
   accent="true"><mrow><mi>σ</mi></mrow><mrow><mo>^</mo></mrow></mover>
   :MATH]
   is the sample standard deviation.
   [MATH: <mi>n</mi> :MATH]
   denotes (effective) sample size and can be approximated via
   [MATH: <mn>1</mn><mo>/</mo><mfenced close=")"
   open="("><mrow><mfrac><mrow><mn>1</mn></mrow><mrow><msub><mrow><mi>n</m
   i></mrow><mrow><mi>c</mi><mi>a</mi><mi>s</mi><mi>e</mi></mrow></msub></
   mrow></mfrac><mo>+</mo><mfrac><mrow><mn>1</mn></mrow><mrow><msub><mrow>
   <mi>n</mi></mrow><mrow><mi>c</mi><mi>t</mi><mi>r</mi><mi>l</mi></mrow><
   /msub></mrow></mfrac></mrow></mfenced> :MATH]
   , where
   [MATH:
   <msub><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>a</mi><mi>s</mi><mi>e<
   /mi></mrow></msub> :MATH]
   and
   [MATH:
   <msub><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>t</mi><mi>r</mi><mi>l<
   /mi></mrow></msub> :MATH]
   are the numbers of asthma cases and non-asthma controls, respectively.
   Here,
   [MATH: <mi>Z</mi> :MATH]
   score is preferred over p value, in order to encode not only the
   significance level (reflected by the magnitude of
   [MATH: <mi>Z</mi> :MATH]
   score) but also the direction of SNP effect (reflected by the sign).
   Detecting a SNP-asthma association, although its actual
   [MATH: <mover
   accent="true"><mrow><mi>β</mi></mrow><mrow><mo>^</mo></mrow></mover>
   :MATH]
   and
   [MATH: <mover
   accent="true"><mrow><mi>σ</mi></mrow><mrow><mo>^</mo></mrow></mover>
   :MATH]
   remain unchanged, would yield different
   [MATH: <mi>Z</mi> :MATH]
   scores if cohorts with different sample sizes
   [MATH: <mi>n</mi> :MATH]
   were used. Therefore, for an association in the any-CDs group (based on
   the general population who may have any comorbid diseases,
   [MATH:
   <msubsup><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>a</mi><mi>s</mi><mi
   >e</mi></mrow><mrow><mi>g</mi></mrow></msubsup> :MATH]
   cases and
   [MATH:
   <msubsup><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>t</mi><mi>r</mi><mi
   >l</mi></mrow><mrow><mi>g</mi></mrow></msubsup> :MATH]
   controls), we should re-estimate what its
   [MATH: <msup><mrow><mi>Z</mi></mrow><mrow><mi>g</mi></mrow></msup>
   :MATH]
   score would have been if it had been based on the cases and controls of
   the same sizes (
   [MATH:
   <msubsup><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>a</mi><mi>s</mi><mi
   >e</mi></mrow><mrow><mi>s</mi></mrow></msubsup> :MATH]
   and
   [MATH:
   <msubsup><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>t</mi><mi>r</mi><mi
   >l</mi></mrow><mrow><mi>s</mi></mrow></msubsup> :MATH]
   , respectively) as a subgroup had (we called it the projected
   [MATH: <msup><mrow><mi>Z</mi></mrow><mrow><mi>g</mi></mrow></msup>
   :MATH]
   here), in order to make a fair comparison against the subgroup-based
   [MATH: <msup><mrow><mi>Z</mi></mrow><mrow><mi>s</mi></mrow></msup>
   :MATH]
   score.

   In this analysis, we inferred the projected
   [MATH: <msup><mrow><mi>Z</mi></mrow><mrow><mi>g</mi></mrow></msup>
   :MATH]
   empirically using a stratified subsampling algorithm. From each
   subgroup of cases (or controls), we randomly drew a number of samples,
   and this number was proportional to the original size of the cases’ (or
   controls’) subgroup; the total number of cases (or controls) we drew
   from all the subgroups should equal to
   [MATH:
   <msubsup><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>a</mi><mi>s</mi><mi
   >e</mi></mrow><mrow><mi>s</mi></mrow></msubsup> :MATH]
   (or
   [MATH:
   <msubsup><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>t</mi><mi>r</mi><mi
   >l</mi></mrow><mrow><mi>s</mi></mrow></msubsup> :MATH]
   ). In other words, the original
   [MATH:
   <msubsup><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>a</mi><mi>s</mi><mi
   >e</mi></mrow><mrow><mi>g</mi></mrow></msubsup> :MATH]
   and
   [MATH:
   <msubsup><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>t</mi><mi>r</mi><mi
   >l</mi></mrow><mrow><mi>g</mi></mrow></msubsup> :MATH]
   seen in the any-CDs group were shrunk to
   [MATH:
   <msubsup><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>a</mi><mi>s</mi><mi
   >e</mi></mrow><mrow><mi>s</mi></mrow></msubsup> :MATH]
   and
   [MATH:
   <msubsup><mrow><mi>n</mi></mrow><mrow><mi>c</mi><mi>t</mi><mi>r</mi><mi
   >l</mi></mrow><mrow><mi>s</mi></mrow></msubsup> :MATH]
   , respectively, with their respective compositions of subgroups
   proportionally unchanged. Then, based on the newly generated
   subsamples, we performed the logistic-regression analysis as described
   in Methods “UK Biobank (UKB) database and GWAS” to compute the
   empirical estimates of the projected
   [MATH: <msup><mrow><mi>Z</mi></mrow><mrow><mi>g</mi></mrow></msup>
   :MATH]
   . But this was just one empirical estimate based on one possible set of
   subsamples. In practice, we repeated this subsampling process followed
   by the regression analysis for 20,000 times, thus generating a
   collection of 20,000 projected
   [MATH: <msup><mrow><mi>Z</mi></mrow><mrow><mi>g</mi></mrow></msup>
   :MATH]
   scores.

   Finally, we can test the null hypothesis: the subgroup-based
   [MATH: <msup><mrow><mi>Z</mi></mrow><mrow><mi>s</mi></mrow></msup>
   :MATH]
   score followed the same distribution as defined by the projected scores
   collected above. Assuming this hypothesis was true, we computed an
   empirical two-tailed p value, which suggested the probability of
   getting the test statistic at least as extreme as
   [MATH: <msup><mrow><mi>Z</mi></mrow><mrow><mi>s</mi></mrow></msup>
   :MATH]
   . In this manner, we computed p values for all the possible
   associations between the lead SNPs of interest and asthma subgroups.
   Then we controlled the FDR and adjusted the p values using
   Benjamini–Hochberg procedure^[274]137,[275]138. If an FDR was <0.05,
   then we would reject the null hypothesis about the respective
   association, declaring that in fact the association had an
   extremer-than-expected
   [MATH: <mi>Z</mi> :MATH]
   score, and was significantly stronger in the subgroup and in the
   any-CDs group. Altogether, there were 182 associations of this kind
   (involving 182 loci) identified (see Supplementary Data [276]8 for a
   detailed summary).

Identifying genomic regions that share influences on asthma

   First, the 22 autosomes were divided into 1703, approximately
   independent regions based on patterns of LD that were derived from the
   European population in 1000 Genomes reference panel^[277]142, and on
   average each region contained 3054 SNPs. We wanted to know whether
   there existed genomic regions that shared asthma-associated influences
   (i) between asthma subgroups and the any-CDs group, and (ii) between
   the subgroups. For this purpose, by comparing GWAS summary statistics,
   we implemented an established hierarchical Bayesian model to estimate
   the probability that a genomic region contained at least one variant
   that influenced asthma susceptibility in (i) or (ii)^[278]143. More
   specifically, we performed a scan for genomic regions, computed a
   regional Bayes factor that measured the support for an association in a
   given genomic region, and inferred the posterior probability by
   maximizing a log-likelihood function. At a threshold of the posterior
   probability greater than 0.9 (i.e., at an FDR of 0.10), 73 unique
   genomic regions were identified for the pairs in (i) (Supplementary
   Data [279]9), and 21 unique genomic regions for the pairs in (ii)
   (Supplementary Data [280]10 and see Supplementary Fig. [281]6 for most
   conserved genomic regions that were shared by the any-CDs group and at
   least four subgroups).

Pathway enrichment analysis based on GWAS summary statistics

   Here, we aimed to find out unique biological pathways that were
   enriched in an asthma subgroup-specific manner. In each subgroup, we
   selected the lead SNPs that surpassed the suggestive threshold
   (p < 10^−5), and mapped these SNPs to genes using positional, eQTL, and
   chromatin interaction information. In order to find possible
   overrepresentation of biological pathways and agreement with GWAS
   catalog, these mapped genes were tested against “background” gene sets
   obtained from MSigDB (i.e., hallmark gene sets, positional gene sets,
   curated gene sets, motif gene sets, computational gene sets, GO gene
   sets, oncogenic signatures, and immunologic signatures), WikiPathways
   (19,283 protein-coding genes), and GWAS catalog genes. Hypergeometric
   test was used and the resulting p values per category (i.e., canonical
   pathways, GO biological processes, and GWAS catalog, separately) were
   further adjusted via Benjamini–Hochberg correction^[282]144. Finally,
   we reported significant findings (Benjamini–Hochberg adjusted p value
   <0.05) in Supplementary Table [283]3.

Reporting summary

   Further information on research design is available in the [284]Nature
   Research Reporting Summary linked to this article.

Supplementary information

   [285]Supplementary Information^ (7.4MB, pdf)
   [286]41467_2022_33628_MOESM2_ESM.pdf^ (188KB, pdf)

   Description of Additional Supplementary Files
   [287]Reporting Summary^ (342.7KB, pdf)
   [288]Supplementary Data 1^ (678.9KB, xlsx)
   [289]Supplementary Data 2^ (768KB, xlsx)
   [290]Supplementary Data 3^ (716.2KB, xlsx)
   [291]Supplementary Data 4^ (671.3KB, xlsx)
   [292]Supplementary Data 5^ (425.9KB, xlsx)
   [293]Supplementary Data 6^ (38.2KB, xlsx)
   [294]Supplementary Data 7^ (96.3KB, xlsx)
   [295]Supplementary Data 8^ (48.7KB, xlsx)
   [296]Supplementary Data 9^ (26.3KB, xlsx)
   [297]Supplementary Data 10^ (20KB, xlsx)
   [298]Supplementary Data 11^ (223.7KB, xlsx)
   [299]Supplementary Data 12^ (87.9KB, xlsx)

Acknowledgements