Abstract Asthma is a heterogeneous respiratory disease characterized by airway inflammation and obstruction. Despite recent advances, the genetic regulation of asthma pathogenesis is still largely unknown. Gene expression profiling techniques are well suited to study complex diseases including asthma. In this study, differentially expressed genes (DEGs) followed by weighted gene co-expression network analysis (WGCNA) and machine learning techniques using dataset generated from airway epithelial cells (AECs) and nasal epithelial cells (NECs) were used to identify candidate genes and pathways and to develop asthma classification and predictive models. The models were validated using bronchial epithelial cells (BECs), airway smooth muscle (ASM) and whole blood (WB) datasets. DEG and WGCNA followed by least absolute shrinkage and selection operator (LASSO) method identified 30 and 34 gene signatures and these gene signatures with support vector machine (SVM) discriminated asthmatic subjects from controls in AECs (Area under the curve: AUC = 1) and NECs (AUC = 1), respectively. We further validated AECs derived gene-signature in BECs (AUC = 0.72), ASM (AUC = 0.74) and WB (AUC = 0.66). Similarly, NECs derived gene-signature were validated in BECs (AUC = 0.75), ASM (AUC = 0.82) and WB (AUC = 0.69). Both AECs and NECs based gene-signatures showed a strong diagnostic performance with high sensitivity and specificity. Functional annotation of gene-signatures from AECs and NECs were enriched in pathways associated with IL-13, PI3K/AKT and apoptosis signaling. Several asthma related genes were prioritized including SERPINB2 and CTSC genes, which showed functional relevance in multiple tissue/cell types and related to asthma pathogenesis. Taken together, epithelium gene signature-based model could serve as robust surrogate model for hard-to-get tissues including BECs to improve the molecular etiology of asthma. Subject terms: Computational models, Gene expression Introduction Asthma is a complex heterogeneous disease characterized by recurring symptoms of reversible airflow obstruction, bronchial hyperresponsiveness, and airway inflammation. Genetic, environmental and other social determinants risk factors play key roles in asthma etiology^[34]1. A family history of asthma is an associated with an increase in asthma risk in the offspring, demonstrating a strong genetic component with estimated heritability as high as 80%^[35]2,[36]3. Clinical outcome such as lung function, Immunoglobulin and skin prick test have been suggested as clinical biomarker. However, clinical biomarkers are not specific to capture the allergic inflammation signal presented in asthmatic patients. There is a need for an effective molecular biomarker that are closely linked to disease mechanisms. Gene expression profiling techniques that simultaneously analyze a large quantity of transcripts are well suited to identify novel genes and pathways involved in asthma pathogenesis^[37]4,[38]5. Transcriptomic signatures that differentiate asthmatic and healthy state based on genome-wide gene expression profile in different human tissue/cell types including nasal airway epithelium cells (NECs), airway epithelium cells (AECs), bronchial airway epithelium cells (BECs), and whole blood (WB) cells were reported previously^[39]6. Furthermore, gene signatures from airway smooth muscle (ASM) cells data that discriminate asthmatic and healthy subjects were reported^[40]7. Ideally, gene expression based diagnostic model should be constructed in the tissue/cell types relevant to the disease development^[41]8. For example, identifying biomarkers and constructing diagnostic genetic models using bronchial airway epithelium as target tissue have great potential for elucidating pathophysiological changes in bronchial airways of asthma^[42]8. However, one of the major challenges in asthma research is obtaining sufficient bronchial epithelial samples to construct diagnostic and prediction models. This is not realistic specifically in children because obtaining these samples requires performance of invasive bronchoscopies. Previous studies demonstrated that nasal upper bronchial airway epithelium tissue/cell types shared the same airway biology with lower respiratory tracts^[43]8–[44]10. Thavagnanam et al. suggested to use easily accessible tissue (e.g., NECs) as a surrogate for less accessible tissue (e.g., bronchial epithelium) in asthma studies^[45]6. The WB cells is another surrogate sample and used in many asthma studies. In addition, samples obtained from easily accessible surrogate tissue can help to get large sample size required for developing diagnostic model with sufficient statistical power. Identifying gene signatures obtained from easily accessible tissue sampled during asthma attacks can aid to elucidate the pathogenesis and changes in asthmatic easy-to-access bronchial airways^[46]11,[47]12. However, previous studies have typically considered a single tissue or a small number of tissues and there are limited studies that comprehensively evaluated whether diagnostic models derived from surrogate tissue samples can provide comparable diagnostic performance with less accessible target tissues (i.e., cells from lung tissue). In this study, we systematically determine which of the genes have tissue-specific effects or broadly shared among tissue types. In addition, most of the previous studies in asthma mainly focused on identifying DEGs. The analysis and interpretation of DEGs is important for defining key genes which may be driving changes in asthma. However, individual genes may fail to fully capture the molecular pathways as genes do not function in isolation. Most importantly, each gene has limited contribution to complex diseases including asthma^[48]13. Co-expression analysis considers all genes together and constructs networks among genes to form co-expressed modules and these modules are potentially used to infer regulatory association between target genes and transcription factors^[49]14. Furthermore, co-expression module genes can be used to discover interaction network and hidden biological models relevant to disease pathogenesis^[50]15,[51]16. In recent years, machine learning approaches such as LASSO logistic regression, SVM and random forest (RF) were used to unravel new biological insights from the genomic data^[52]17,[53]18. Although these approaches identified potential gene signatures in asthma, their approach only considered each gene individually. However, the individual genes may not always decipher meaningful biological functions. To address these limitations, we analyzed gene expression data from 257 asthmatic and 136 control subjects in NECs and 62 asthmatic and 43 control subjects in AECs and developed co-expression network graph combined with machine learning methods to prioritize and select potential genes related to asthma risk. We further validated the performance of the risk models in an independent and different tissue types including BECs, ASM and WB datasets. Materials and methods Data sets and filtering criteria Our overall workflow strategy is shown Fig. [54]1. Initially, gene expression profiles and the corresponding clinical information were downloaded from publicly available NCBI Gene Expression Omnibus (GEO) database. Eligible gene expression asthma datasets were chosen using the following inclusion criteria. (1) Homo sapiens, (2) sample size ≥ 50, (3) consists of gene expression profiles of asthmatic and control subjects and published within seven years. Three asthma RNA-seq datasets (accession: [55]GSE152004, [56]GSE201955 and [57]GSE58434) and two asthma microarray datasets (accession: [58]GSE67472 and [59]GSE69683) satisfied the inclusion criteria were used for subsequent analysis. The raw count dataset, [60]GSE152004 derived from nasal epithelium cells (NECs) contains 257 asthmatic and 136 control subjects. The normalized microarray dataset, [61]GSE67472 derived from airway epithelium cells (AECs) contains 62 asthmatic and 43 control subjects. The normalized RNA-seq data in [62]GSE201955 derived from bronchial epithelial cells (BECs) contains 79 asthmatic and 39 control subjects. The RNA-seq data in [63]GSE58434 derived from airway smooth muscle (ASM) cells contains 17 asthmatic and 36 control subjects. The normalized microarray data in [64]GSE69683 derived from whole blood (WB) cells contains 324 asthmatic and 87 control subjects. The summary of the datasets used in this study are shown Table [65]1. Figure 1. [66]Figure 1 [67]Open in a new tab The overall workflow of this study. Initially, RNA-sequencing or microarray based gene expression data were collected, preprocessed, normalized, and analyzed for differential expression analysis and weighted correlation network analysis (WGCNA) to generate DEGs and modules associated with asthma status. To identify differentially co-expressed genes (DCEGs), an intersecting analysis between DEGs (adjusted p-value < 0.05) and genes within modules significantly correlated with asthma status (P-value < 0.05) was performed. Candidate DCEGs were further analyzed by four machine learning algorithms to identify gene-signatures and constructed different asthma classification models and prediction, which were then validated in independent datasets. CV-coefficient of variation. Table 1. Asthma gene expression datasets used in current study. Tissue class GEO ID Sample size (cases /controls) Gender (% female) Age (years), mean ± SD Tissue type Platform Surrogate [68]GSE152004 257/136 51 14.35 [MATH: ± :MATH] 3.19 NECs Illumina HiSeq 2000 Primary/target tissue [69]GSE67472 62/43 51 35.34 [MATH: ± :MATH] 10.83 AECs Affymetrix Human Genome U133 Plus 2.0 Array Surrogate [70]GSE69683 324/87 55 NA WB Affymetrix HT HG-U133 + PM Array Primary [71]GSE201955 79/39 71 38.54 [MATH: ± :MATH] 12.07 BECs Illumina HiSeq 2500 and 400 Primary [72]GSE58434 17/36 NA NA ASM Illumina HiSeq 2000 [73]Open in a new tab AECs Airway epithelial cells, ASM Airway smooth muscle, BECs Bronchial epithelial cells, NECs Nasal epithelial cells, WB Whole blood. Data processing and selection of DEGs For raw count RNA-seq expression matrix in the NECs ([74]GSE152004) dataset, DESeq2^[75]19 package was used to pre-process, filter out genes showing less than 10 reads based on the sum of rows and normalize the background. Batch effect, treatment effect and/or unrelated variables in the datasets derived from ASMs and BECs were eliminated using surrogate variable analysis (SVA) package^[76]20. After each dataset was preprocessed and normalized separately, the normalized gene expression datasets derived from AECs and NECs tissue types were used for model development and the normalized datasets derived from BECs, ASM and WB tissue/cell type samples were used for model validation. Differentially expressed genes (DEGs) in asthmatic subjects compared with controls were identified using limma^[77]21 package and a significance threshold adjusted P-value < 0.05 based on Benjamin-Hochberg procedure was used to identify DEGs in AECs and NECs datasets. We used the ggplot2 package to generate a volcano plot and show both the adjusted P-value and fold change. The statistical software R was used to conduct all statistical analyses. Gene co-expression network analysis Initially, genes were filtered by coefficient of variation (CV) to avoid non-varying or low-expressed genes in both AECs and NECs datasets, and genes with CV > 4% (5833 and 7496 hypervariable genes in AECs and NECs, respectively) were utilized to construct a gene coexpression network using WGCNA R package^[78]22. To characterize the correlation structure of these hypervariable genes, gene similarity matrix was constructed using pairwise correlation [MATH: Sij :MATH] = [MATH: cor(xi,xj :MATH] ), where [MATH: xi :MATH] and [MATH: xj :MATH] represent the ith row and the jth row of gene expression data matrix X, respectively. The similarity matrix was transformed into an adjacency matrix, represented by [MATH: Aij=cor(xi,xj)β :MATH] , where the suitable soft-thresholding candidate power β that ranges from 1 to 20 and the appropriate power were determined based on index value in the dataset ( usually greater than 0.85) using the pick Soft Threshold function^[79]22. Second, the adjacency gene network was transformed into a topological overlap matrix (TOM), and corresponding dissimilarity (1-TOM) matrix was computed. Finally, average linkage hierarchical clustering with Dynamic Tree Cut was used to identify modules, and minimum number of genes in each module was set to be 50^[80]23. Selection of asthma correlated modules and differentially co-expressed genes (DCEGs) To select asthma correlated modules, module eigengenes (ME), which is the principal component of each gene module and could be considered as a representative of all genes in a given module was computed. The ME values were correlated with asthmatic and control subjects using Pearson’s correlation and the modules significantly correlated with asthmatic subjects were selected (|r| [MATH: :MATH] 0.2 and P-value < 0.05). The genes within the modules that had significant association with asthmatic subjects and controls in AECs and NECs dataset were selected and named as module genes (co-expressed genes). Then, an overlapping analysis was conducted between co-expressed genes and DEGs to screen differentially co-expressed genes (DCEGs) for further analysis. Gene prioritization using four machine learning algorithms For identification of prioritized gene-signatures associated with asthma, we used gene expression data from AECs (n = 105) and NECs (n = 393) and selected the respective DCEGs as input features in four different supervised ML algorithms: RF, recursive feature elimination (RFE), LASSO, and Boruta^[81]24–[82]27. RF is a supervised ML algorithm, which creates decision trees on randomly selected data samples, obtains prediction from individual tree and choose best solution by means of majority voting. RF also uses mean decrease accuracy for ranking individual gene-importance^[83]24. RFE is an effective gene selection algorithm that fits a diagnostic model recursively and removes weakest gene features per iteration until a specific optimal number of gene features is selected, while attempts to eliminate collinearity among gene features in the model^[84]25. The genes are ranked by gene importance of the model^[85]25. Logistic regression with LASSO penalty is gene-selection method, which uses regularization parameter to shrink insignificant regression coefficients to zero and this method will automatically select those genes that are useful, discarding redundant or non-informative genes in asthma prediction^[86]26. The Boruta algorithm uses a random replicate of the original data to create shuffled copies of all features which are called shadow features. Then, the algorithm performs a classification matrix using all features to compute the most important features. The shadows’ feature importance is used as a reference for evaluating the scores obtained by the actual features^[87]27. The potential features yielded the ‘‘confirmed’’ status in Boruta iterations and achieved higher importance than the best shadow was selected. Boruta algorithm is an extended version of RF and widely used for selecting gene-signature associated with response variable^[88]28,[89]29. Despite each method has its own strength, there are limited studies that examine which method perform better in risk prediction including asthma; particularly when there is high correlation among gene features. The four methods were used to screen potential gene-features and their asthma classification performance were compared. Construction of asthma classification models and validation To compare which method outperforms in classifying asthmatic from control subjects based on the same number of gene-signatures obtained from the four methods (LASSO, RF, RFE, and Boruta) in multiple tissue/cell types, two broadly used classifiers RF and SVM were selected. RF algorithm was used for both feature selection/prioritization and classification^[90]24,[91]25. We also used SVM algorithm as classifier to evaluate classification performance of the identified gene-signatures based on different models in multiple tissue/cell types^[92]30,[93]31. The SVM and RF algorithms were used to predict asthma in the discovery sets: NECs and AECs tissue/cell types and validation sets: BECs, ASM and WB tissue/cell types and finally the best risk prediction method for different tissue/cell type datasets in discriminating asthmatic from control subjects was selected. Evaluating classification performance The model diagnostic performance of different feature selection and classification methods were evaluated based on different performance metrics including AUC, Matthew's correlation coefficient (MCC), and F1-score (F-measure). Multiple model prediction performance metrics were used to avoid overoptimistic results when the number of asthmatic and control subjects are unbalanced, as previous studies suggested^[94]32,[95]33. The AUC of ROC curve is the approximation of the area under precision-recall curve, whereas F1-score and MCC are defined as follow. [MATH: F1=TPTP+ 12(FP+FN) :MATH] [MATH: MCC=TP×TN-FP×FN(TP+FP)(TP+FN)(TN+FP)(TN+FN) :MATH] where, TP, TN, FP and FN represent the number of correctly predicted asthma class, the number of correctly predicted control class, the number of incorrectly predicted asthma class and the number of incorrectly predicted control class, respectively. F1 equal to 1 shows perfect model classification performance and 0 implies the model is imperfect. MCC values range from –1 to 1, the model classification performance is perfect at 1 and completely incorrect classification at -1. Functional annotation and enrichment analysis To identify the biological function underlying differentially co-expressed genes in each significant asthma associated module, we performed pathway enrichment analysis by Ingenuity Pathway Analysis (IPA) software ([96]http://www.ingenuity.com/products/ipa). The IPA method evaluates proportional representation of module genes from a defined set in a canonical pathway in all set of known genes. Canonical pathways of the input module genes were evaluated to identify significantly enriched pathways adjusting for multiple testing. The p-value is calculated based on a right-tailed Fisher Exact test. For canonical pathway analysis, a ‑log (P‑value) > 2 was taken as threshold to define significant canonical pathways^[97]34. Results Identification of differentially expressed genes (DEGs) in asthma The genome wide DEG analysis results for asthma in the AECs and NECs were visualized via volcano plot (Fig. [98]2a,b). The results showed that a total of 3564 genes from AECs in Fig. [99]2a and 8669 genes from NECs data in Fig. [100]2b were differentially expressed with adjusted p-value < 0.05, and these DEGs were retained for subsequent analysis. Figure 2. [101]Figure 2 [102]Open in a new tab Identification of asthma related genes for the AECs and the NECs. (a, b) Volcano plot showing 3564 DEGs for the AECs and 8669 DEGs for NECs respectively. DEGs- differentially expressed genes (adjusted p-value < 0.05). (c) The correlation between 13 modules and asthma status in the AECs data. The modules associated with asthma include purple module, magenta module, pink module, greenyellow module, yellow module, green module and brown module in the AECs dataset. (d) The correlation between 10 modules and asthma status in the NECs data. The modules associated with asthma include blue module, brown module, pink module and black module in the NECs dataset. We kept genes within the selected modules for each dataset for subsequent analysis. Identification of asthma associated key modules and co-expressed genes To characterize the correlation structure of 5833 and 7496 hypervariable genes and further examine their gene-regulatory networks in AECs and NECs, respectively, we conducted WGCNA analysis using hierarchical agglomerative clustering with average linkage. For AECs dataset, the suitable soft threshold power (β) = 8 (scale-free [MATH: R2 :MATH] = 0.9) was used as the correlation coefficient threshold to ensure relatively balanced mean connectivity and scale free network (Fig. [103]S1a). WCGNA revealed a total of 13 modules in the AECs dataset (Figs. [104]S2a and [105]2c). For NECs dataset, power (β) = 5 (scale-free R^2 = 0.86) was used as the correlation coefficient threshold to ensure balanced connectivity and scale free network (Fig. [106]S1b) and the result of WGCNA analysis showed a total of 10 modules in the NECs dataset (Figs. [107]S2b and [108]2d). To identify module-trait association, the estimated eigengenes values were correlated with the clinical traits of asthmatic and control subjects in the AECs and NECs datasets as indicated in the heatmap (Fig. [109]2c,d; |r| [MATH: :MATH] 0.2 and P-value < 0.001). Seven modules (purple, pink, greenyellow, brown, magenta, yellow, and green) in AECs and four modules (blue, brown, pink and black) in NECs were identified as significantly correlated with asthmatic subjects. The purple, pink and green modules were positively correlated with asthmatic subjects, while the brown, yellow, greenyellow and magenta were negatively correlated with asthmatic subjects in AECs dataset. The blue module was positively correlated with asthmatic subjects, while the brown, pink and black modules were negatively correlated with asthmatic subjects in NECs dataset. A total of 2495 co-expressed genes were found in seven significant modules including purple module (170 genes), magenta module (244 genes), pink module (283 genes), greenyellow module (86 genes), yellow module (467 genes), green module (455 genes) and brown module (790 genes) in the AECs dataset (Table [110]S1), while a total of 2634 co-expressed genes were found in four significant modules including blue module (1225), brown module (961), pink module (211) and black module (237 genes) in the NECs dataset (Table [111]S1). Selection of DCEGs in asthma correlated modules Next, overlapping analysis between 3564 DEGs and 2495 co-expressed genes in six asthma correlated modules derived from AECs dataset resulted a total of 854 DCEGs (Table [112]S1). Similarly, overlapping analysis between 8669 DEGs and 2634 co-expressed genes in four asthma correlated modules derived from NEC data resulted a total of 725 DCEGs (Table [113]S1). These identified DCEGs in both AECs and NECs dataset were used for functional enrichment and asthma diagnostic gene-signature based model development. Functional analysis of the DCEGs in asthma correlated modules To obtain further insights into the biological function of the DCEGs in significant asthma associated modules derived from AECs and NECs datasets, the biological function enrichment analyses were performed using IPA software and the results are shown Fig. [114]3a,b and Tables [115]S2, [116]S3. The functional enrichment analysis of 132 unique DCEGs in the purple module derived from AECs dataset enriched in key biological functions such as IL-13 Signaling, role of IL-17A in arthritis, glutamate removal from folates, histamine biosynthesis (Fig. [117]3a). The enrichment analysis of 163 correlated genes in the pink module involved in several biological functions, for example mitochondrial dysfunction, PI3K/AKT Signaling, and others (Fig. [118]3a). Other functional enrichment of correlated genes in asthma correlated modules (greenyellow and brown modules) derived from AECs dataset are shown in Fig. [119]3a and Table [120]S2. Meanwhile, pathway analysis of DCEGs in the two most asthma correlated modules (blue and brown modules) derived from NECs dataset showed enrichment in several biological functions. The most enriched pathways of 1225 correlated genes for blue module included integrin signaling, CAMP-mediated signaling, protein coupled receptor signaling, S100 family signaling, IL-13 Signaling (Fig. [121]3b and Table [122]S2). The 961 correlated genes in brown module involved in pathogen induced cytokine storm signaling, Th1 and Th2 Activation, crosstalk between dendritic cells and natural killer cells (Fig. [123]3b and Table [124]S3). The functional enrichment overlapping analysis of correlated genes associated with asthma relevant modules of purple and pink modules derived from AECs and blue and brown modules derived from NECs were enriched in biological functions including IL-13 Signaling and PI3K/AKT signaling and apoptosis signaling (Fig. [125]3C and Tables [126]S2, [127]S3). Figure 3. [128]Figure 3 [129]Open in a new tab The significant canonical pathways of DCEGs associated with (a) purple module, pink module, greenyellow module, and brown module derived from AECs dataset (b) blue module and brown module derived from NECs dataset and (c) common canonical pathways of correlated genes derived from AECs and NECs datasets. Selection of potential genes associated with asthma Based on the diagnostic gene selection methods discussed in material and method section, four ML-methods were applied to further select and prioritize asthma associated gene-signature in AECs dataset (n = 105) from the total of 854 DCEGs. Logistic regression with LASSO penalty and five-fold cross-validation was implemented to identify optimal [MATH: λ :MATH] value = 0.02 which is derived from minimum binomial deviance, which was related to 30 DCEGs in predicting asthma in AECs dataset (Fig. [130]S3a,b). Three other ML-algorithms including RF, Boruta and RFE were also used to prioritize and select top 30 DCEGs based on the relative importance of each DCEG in asthma prediction (Fig. [131]S3c–e). Constructing gene expression-based asthma classifier models To compare the power of discrimination between asthmatic and control subjects, we examined 30-gene signature identified by distinct ML feature selection algorithms: LASSO, RF, RFE and Boruta. The diagnostic performance of selected genes by four methods are shown in Fig. [132]4a–d in AECs dataset. The diagnostic ability of LASSO using 30-gene signature and AECs dataset showed AUC = 0.99 and AUC = 1 based on RF and SVM classifiers, respectively (Fig. [133]4a,b). LASSO-based genes performed better compared with other methods in discriminating asthmatic from control subjects in AECs dataset. Moreover, the AUC precision-recall curve (AUC-PR) was used as additional measure of model to control potential misleading of AUC curve. Importantly, AUC-PR measure of LASSO selected genes showed superior ability in classifying asthmatic from control subjects (Fig. [134]4c,d). Furthermore, the sensitivity, specificity, MCC and F-score values of LASSO gene selection method revealed better performance in classifying asthmatic from control subjects in AECs dataset (Fig. [135]S3f and Table [136]S4). Figure 4. [137]Figure 4 [138]Open in a new tab Model comparison of different gene selection methods (a, b) AUC values of different gene feature ranking methods-based RF and SVM classifiers in AECs dataset, respectively (c, d) AUC precision-recall (AUC-PR) curve values of different gene feature ranking methods based RF and SVM classifiers in AECs dataset, respectively. The methods to construct diagnostic model, evaluate and validate for asthma prediction in NECs dataset, are analogues to the diagnostic model in AECs dataset. LASSO method with tenfold cross-validation identified 34 DCEGs in predicting asthma in NECs dataset (Fig. [139]S4a,b) with optimal λ value = 0.004. Moreover, three methods (RF, Boruta and RFE) were used to prioritize and select top 34 potential gene signatures based on the relative importance of each DCEG in asthma prediction. The corresponding results are shown in Fig. [140]S4c–e. Next, the diagnostic performance of four methods were examined using RF and SVM classifiers and their corresponding results are indicated in Fig. [141]5a–d. Notably, the diagnostic performance of LASSO identified 34-gene signature based on RF (AUC = 1) and SVM (AUC = 1) classifiers showed higher diagnostic performance in classifying asthmatic subjects from controls in NECs dataset (Fig. [142]5a,b). In addition, the AUC-PR values indicated that LASSO method with RF (AUR-PR = 0.92 and SVM (AUC-PR = 0.99) classifiers revealed in superior ability for classifying asthmatic subjects from control compared with other methods (Fig. [143]5c,d). Furthermore, the specificity, MCC and F-score values of LASSO with SVM classifier showed that the LASSO method had better classifying ability compared with other methods in NECs dataset (Fig. [144]S4d and Table [145]S5). Figure 5. [146]Figure 5 [147]Open in a new tab Model comparison of different gene selection methods (a, b) AUC values of different gene feature ranking methods-based RF and SVM classifiers in NECs dataset, respectively (c, d) AUC precision-recall (AUC-PR) curve values of different gene feature ranking methods based RF and SVM classifiers in NECs dataset, respectively. Evaluation of the diagnostic models using independent data To evaluate and compare whether the 30 and 34 gene-signatures derived from AECs and NECs datasets perform well in distinguishing asthmatic subjects from controls, various tissue/cell types of datasets including BECs, ASM and WB tissue/cell types were used as model validation datasets. Initially, the differential co-expressed AEC-and NCE-derived gene signatures between asthmatic subjects and controls were compared in validation datasets. Notably, the identified gene signatures-derived from AECs and NECs were found to be expressed in all three validation datasets (Fig. [148]6a,b), where nine genes including CPA3, SERPINB2, CHCHD5, EMC6, RPUSD3, POSTN, SEC14L1 and UPK1B derived from AECs dataset were persistently upregulated in asthmatics subjects compared with controls. Out of 34 gene-signatures derived from NECs, two genes (CTSC and UPK1B) were persistently upregulated while one gene TMEM8B were persistently downregulated in asthmatic subjects compare with controls in all validation datasets. Other gene-signatures showed tissue specific differential expression. Figure 6. [149]Figure 6 [150]Open in a new tab The heatmap showing the multiple tissue/cell type datasets logFC distribution of the gene signatures derived from AECs and NECs datasets. (a) 30-gene signature in various tissue types including NECs: nasal epithelial cells, AECs: airway epithelial cells, ASM: airway smooth Muscle and BECs: bronchial epithelial cells and WB: whole blood cells. (b) 34-gene signature in NECs, AECs, ASMs, BECs, and WB cells. Log2 fold-change is the log2-ratio of (expression in asthmatic subjects/expression in control subjects).Upregulation and downregulation in asthmatic compared with control subjects are reflected by log2 FC > 0 and < 0, respectively. FC = fold-change. The heat map of multiple tissue/cell type datasets of log FC values of gene signatures were generated using the pheatmap Version: 1.0.12 package (([151]https://cran.r-project.org/web/packages/pheatmap/index.html) in R. After evaluating the differential expression of 30 and 34 gene-signatures, we examined their diagnostic performance in various cell/tissue types. The diagnostic performance of 30-gene signature-based RF and SVM classifier algorithms are shown in Fig. [152]7. Using SVM classifier with 30-gene-signture-derived from AECs data, the AUC values achieved were 0.72, 0.97, 0.74 and 0.66 in BECs, NECs, ASM and WB, respectively (Fig. [153]7a–d). Using RF classifier, AEC-derived gene-signature, the AUC values for the four validation datasets were 0.76, 0.97, 0.82 and 0.65, respectively (Fig. [154]7a–d). For RF classifier, the AUC-PR values in the in BECs, NECs, ASM and WB were equals 0.57, 0.94, 0.87 and 0.31, respectively (Fig. [155]S5a-d). Moreover, model performance measures including sensitivity, specificity, MCC and F1-score of the 30-gene signature-based model derived from AEC data are shown in Fig. [156]S5e and Table [157]S6. Using SVM classifier in BECs, NECs, ASM and WB datasets, 30 gene-signature based diagnostic model derived from AECs exhibited a performance with MCC of 0.44, 0.79, 0.44 and 0.24, respectively (Table [158]S6 and Fig. [159]S5e). Furthermore, 30-gene signature using SVM classifier in BECs, NECs, ASMs and WB datasets exhibited a performance with F1-score of 0.64, 0.86, 0.85 and 0.41, respectively (Table [160]S6). The results showed that AECs-derived diagnostic model had better classification performance in the BECs, NECs and ASM data sets. Relatively, diagnostic model showed lower classification ability when the model was tested on WB dataset. Figure 7. [161]Figure 7 [162]Open in a new tab Validation of the 30-gene-signature based diagnostic model derived from AEC data. The classification performance is presented in terms of AUC values based RF and SVM methods in discriminating asthmatic subjects from control for (a) BECs, (b) NECs, (c) ASM, and (d) WB datasets. Similarly, the diagnostic performance of model derived from NECs data showed that the AUC value of SVM classifiers were 0.75, 0.89, 0.82 and 0.69 in BECs, AECs, ASM and WB, respectively (Fig. [163]8a–d). The diagnostic performance of model derived from NECs data showed that the AUC values based on RF classifier equals to 0.77, 0.91, 0.87 and 0.66 in BECs, AECs, ASM and WB, respectively (Fig. [164]8a–d). The AUC-PR values of SVM classifier in the in BECs, NECs, ASM and WB attained 0.57, 0.83, 0.88 and 0.32 to respectively (Fig. [165]S6a–d). Moreover, the sensitivity, specificity, MCC and F1-score of the 34-gene signature-based model derived from NECs data are shown in Fig. [166]S6e and Table [167]S6. As indicated in Fig. [168]S6e and Table [169]S6, the 34-gene signature using SVM classifier was tested in BECs, AECs, ASM and WB validation sets and the model performance of MCC value for each dataset was equal to 0.44, 0.65, 0.63, and 0.26, respectively. The diagnostic model showed a performance with F1-score value of 0.65, 0.80, 0.86 and 0.44 in the BECs, AECs, ASM and WB validation sets, respectively. The diagnostic model derived from NEC dataset also indicated that model perform well in the BECs, AECs and ASM compared with WB validation set. Figure 8. [170]Figure 8 [171]Open in a new tab Validation of the 34-gene-signature based diagnostic model derived from NEC dataset. The classification performance is presented in terms of AUC values based RF and SVM methods in discriminating asthmatic subjects from control for (a) BECs, (b) NECs, (c) ASM and (d) WB datasets. Discussion In our study, we developed diagnostic models based on asthma associated gene signatures obtained from a total of 105 AECs and 393 NECs subjects and validated in various tissue/cell types. We performed an integrated analysis of differential gene expression analysis,WGCNA and machine learning to identify potential gene signatures that discriminate asthmatic subjects from controls. Frist, we identified 854 and 725 asthma associated DCEGs in AECs and NECs datasets, respectively based integrated analysis of DEGs and WGCNA methods. Then, four machine learning algorithms including LASSO, RF, RFE and Boruta methods were used to select potential asthma associated DCEGs and their discriminating power and model performance measures were evaluated in both AECs and NECs datasets. The results showed that LASSO method identified 30 and 34 gene-signatures and showed better asthma prediction performance in AECs and NECs datasets, respectively. The validation datasets in independent multiple tissue/cells suggested that gene-signatures-derived from nasal/upper airways epithelium gene signature-based model could distinguish asthmatic subjects from controls in multiple tissue/cell types including BECs, ASMs and WB cells. The results suggested that the identified gene-signatures may be serve as promising a minimally invasive biomarker for asthma diagnosis. Despite it is ideal to develop gene-signature based model-derived to obtain samples from target tissues in diseases development (e.g. from lung tissue), it is not feasible and difficult specifically when a large sample size is needed for developing diagnostic tools with robust statistical power. Similar to previous studies, our asthma diagnostic classifiers were developed based on surrogate cell/tissue types and target cell/tissue^[172]9,[173]10,[174]35. An experimental study suggested to use nasal epithelial cells as surrogate for bronchial epithelium cells for asthma^[175]10. Despite several previous studies developed classification models to predict asthma, most of the studies focused on gene expression data from single tissue^[176]11,[177]36. Previous study compared different tissue types including AECs, NECs and peripheral blood mononuclear cells to predict asthma using DNA methylation data and showed that asthma diagnostic model derived from AECs and NECs tissue/cell types resulted better asthma prediction performance compared with peripheral blood mononuclear cells^[178]37. To characterize and understand DCEGs and their functional enrichment, WGCNA analysis was used in tissues that are quite related, which increases the expectation that a gene network signature in a tissue like NECs will replicate in AECs. Four modules (purple, pink, greenyellow and brown) in AECs and three modules (blue, brown and pink) in NECs were identified as significantly correlated with asthmatic subjects. The purple and pink modules were positively correlated with asthmatic subjects in AECs. The blue and brown were positively and negatively correlated with asthmatic subjects, respectively in NECs. We showed that DCEGs within asthma associated modules in AECs and NECs datasets were correlated with the expression of different genes that revealed distinct biological signaling and harbored gene-network signature associated with asthmatic subjects. Asthma associated pink and purple modules in AECs and blue and brown modules in NECs, and associated pathways showed an overlapped asthma related pathways including IL-13 Signaling and PI3K/AKT Signaling and apoptosis signaling. Notably, AEC and NEC-derived correlated gene signatures including CCL26, CLCA1 and POSTN were involved in IL-13 signaling, where IL-13 signaling of the airway epithelium is associated pathophysiology of asthma and airway inflammation^[179]38. The purple asthma associated module derived from AECs were enriched in mitochondrial dysfunction and PI3K/AKT signaling. The DCEGs uniquely correlated with asthma associated brown module derived from NECs was enriched in pathogen induced cytokine storm signaling, Th1 and Th2 Activation, crosstalk between dendritic cells and natural killer cells that suggests potential mechanism among these enriched pathways. Overall, our findings showed that NECs and AECs derived DCEGs enriched in asthma related pathways that may drive asthma pathology. Next, considering DCEGs derived from AECs and NECs datasets as candidate features, we used four ML methods to select potential gene-signatures that were important for subsequent validation. ML methods were used to develop asthma diagnostic model in predicting asthmatic subjects from controls^[180]17. However, relatively limited studies were focused integrating analysis of DEGs, WGCNA and ML methods in asthma prediction. In this study, different models comparison showed that DEGs, WGCNA followed by LASSO method identified 30 and 34 potential gene signatures, respectively. In AECs and NECs datasets with higher performance in discriminating asthmatic subjects from controls. Several previous transcriptomics studies used DEGs and ML approach to construct diagnostic and or prognostic models^[181]39,[182]40. We compared our approach with standard combined DEG + ML approach and the result showed similar performance (Supplementary Fig. [183]S7 and Supplementary Table [184]S7). However, integrated analysis of DEG, WGCNA and ML approach is essential approach to select key co-expressed genes for the exploration of biological function, pathway, etc. and also to alleviate multiple testing problem by reducing feature size and hence minimize computational cost compared with combined analysis of DEG and ML approach^[185]41. Hence, we implemented DEG + WGCNA + LASSO model to select candidate genes for downstream analyses and validation. LASSO identified potential DCEGs includes CPA3, SERPINB2, CHCHD5, EMC6, RPUSD3, POSTN, SEC14L1 and UPK1B in AECs dataset and these DCEGs were also persistently upregulated in multiple tissue/cell type datasets from asthmatic subjects. The previous study reported that elevated expression of CPA3 gene was observed in asthmatic subjects compared to controls and CPA3 gene correlated with sputum mast cells, asthma and rhinitis^[186]42. Recent study reported that the expression level of SERPINB2 gene was increased in airway epithelial cells of asthmatic and in atopic asthmatic subjects compared controls^[187]43. The LASSO identified five potential DCEGs in NECs dataset includes SIX2, CDH26, NEBL, CTSC and SLC2A9A. Several DCEGs identified in this study demonstrated biological function relevant to asthma. For example, a previous study showed that abnormality CDH26 gene are characterized by IL-13 stimulation of the airway epithelium and T2 inflammation of the airway epithelium in asthma development^[188]44. Yang et al. (2017) reported that CTSC gene was elevated in asthmatic subjects, which was also associated with methylation marks of subjects with asthmatic and allergy^[189]45. It has been reported that CTSC gene is maturated by a multistep proteolytic process and is secreted by activated cells during inflammatory lung diseases^[190]46. Our study also confirmed that CTSC gene was not only upregulated and co-expressed with other potential asthma related genes in nasal epithelium of asthmatic subjects but also persistently upregulated in multiple tissue/cell types of asthmatic subjects, which reflects that upregulation of CTSC gene in multiple tissue/cell may have functional association with the development and progression asthma disease. To the best of our knowledge, our study is one of the first to develop asthma diagnostic models using differential expression analysis and co-expression network combined with machine learning based on microarray and RNAseq datasets of AECs and NECs tissue/cell sample types. Prioritizing and identifying potential gene signatures to construct asthma diagnostic model from easily accessible tissue/cell types are vital to elucidate pathological process of asthma at molecular level, and to extend adequate evidence for the development of therapeutic target. The main contribution of the current study is to identify potential gene signatures and to compare diagnostic performance of different machine learning methods in classifying asthmatic from control subjects based on AECs and NECs tissue/cell datasets and validate the diagnostic models, which are stable and show robust performance in classifying asthmatic from control subjects. Our method prioritized and identified potential asthma associated DCEGs, suggests several of which are implicated in asthma pathology. More recently, machine learning and statistical methods haven been commonly used in RNA-seq and microarray data analysis of biomedical studies^[191]47,[192]48. However, the analysis of high-dimensional genomic data has a number of challenges including model overfitting and multicollinearity problems (e.g., existence of DCEGs in modules). To address such problems, appropriate statistical machine learning methods are required. Here, to identify the appropriate gene selection method in distinguishing asthmatic subjects from controls, we evaluated different gene selection methods based on the results of DEGs and WGCNA in the derivation datasets and independent validation datasets. From classification performance, LASSO algorithm was identified as robust method to select potential gene signature to improve the diagnostic performance. Notably, all methods showed better diagnostic performance in the derivation sets. However, the robustness of the model should be validated in external validation datasets. In our study, we developed gene signatures based diagnostic models using NECs and AECs datasets, and validated to examine whether they can perform well in external datasets with different tissue/cell types including BECs, ASM cells and WB datasets. Moreover, we examined whether diagnostic models that derived from easily accessible cell/ tissues (NECs and AECs) can also serve as robust surrogate model for target cell/tissue (e.g., ASM cells, BECs) and easily accessible cells (e.g. WB cells) regardless of sequencing technology (microarray or RNA-seq). Notably, gene signature based diagnostic model derived from microarray gene expression AECs dataset and gene signature based diagnostic model derived from RNA-seq NECs dataset were validated and the analysis indicated that both diagnostic models showed a better performance in the BECs and ASM dataset compared with WB dataset. The reason could be gene expression derived from WB tissue may not specific to asthma conditions. Whereas validation of diagnostic model based on gene expression comes from the target tissue sources-BECs and ASM tissue/cell types showed better performance, where these target tissue/cell types have well known role in asthma exacerbations and airways remodeling^[193]7,[194]49. Overall, the results showed that diagnostic models derived from NECs and AECs datasets can serve as surrogate source of biological samples for hard-to-get tissues including BECs dataset. Most models perform better prediction in training dataset but predict poorly in external validation dataset^[195]50, may be due to overfitting problem. The best model should have high AUC, F1-score and MCC values^[196]32. Our gene-signature based diagnostic models derived from AECs and NECs data showed higher accuracy and stable performance in external different tissue/cell type datasets. The multiple tissue/cell validation datasets circumvent overoptimistic results and assure general reproducibility. Despite our developed diagnostic models showed promising performance in predicting asthma, the current study has still some limitations. Since this study focused on computational analysis based on retrospective samples, future validation of the identified signatures should be performed with functional experiments. The sample size in some public dataset is small, which may hide potential correlations between gene expression signatures and outcome variable. Future study should consider increasing sample size and other feature selection strategies to improve diagnostic prediction performance of asthma and other airway diseases. In conclusion, we identified small number of differentially co-expressed gene signatures and established diagnostic models based on an integrated analysis of bioinformatics and machine learning methods to predict asthma diagnosis using airway epithelium gene expression data. Based on multiple-diagnostic performance criteria, we found that comparable diagnostic performance between AECs and NECs, which highlight the importance of gene-signature –based diagnostic models derived from AECs and NECs data as suitable surrogate model in predicting asthma diagnosis. More importantly, our diagnostic models are promising tool to improve decision making, which may provide potential gene signatures for diagnosis of asthma and other airway diseases. Supplementary Information [197]Supplementary Figures.^ (1.3MB, docx) [198]Supplementary Tables.^ (48.3KB, docx) Acknowledgements