Abstract Background Allergic rhinitis (AR), as a prevailing disease worldwide, is a common chronic inflammatory disease of the upper airway; however, its prognosis did not meet expectations of the general public. Regarding the importance of treatment for AR, biomarkers with high sensitivity and precision are in urgent need. Methods To detect the potential mechanisms associated with the AR pathology, the differentially expressed gene data from the clinical dataset [32]GSE19187 and [33]GSE46171 were used for weighted gene co-expression network analysis. Additionally, 5 machine-learning algorithms were integrated to screen for novel diagnostic biomarkers for AR. Results This analysis identified 133 genes enriched in the black and purple module pathways from weighted gene co-expression network analysis (WGCNA). The 4 key diagnostic genes—CST1, CST2, SERPINB4, and TOX—were identified through machine learning algorithms and validated for their diagnostic potential with high area under the curve (AUC) values in external datasets. Immune infiltration analysis indicated a higher proportion of resting mast cells in AR samples compared to controls, with CST1, CST2, and SERPINB4 showing high expression in these cells. Real-time PCR validation confirmed that CST1, CST2, and SERPINB4 were upregulated in AR samples while TOX was downregulated. Conclusion Our study identifies critical molecular markers and immune cell alterations associated with AR, providing insights into its pathogenesis and potential diagnostic biomarkers for clinical application. Future research should focus on validating these findings in larger cohorts to facilitate the development of targeted therapies for AR. Keywords: Allergic rhinitis, Machine learning algorithms, Biomarkers, Immune infiltration Introduction Allergic rhinitis (AR) is a common chronic inflammatory disorder that affects millions of individuals worldwide and has become a significant global health concern.[34]^1 It is primarily driven by atopic sensitization, an immunoglobulin E (IgE)-mediated immune response to environmental allergens, and manifests with symptoms such as nasal itching, sneezing, rhinorrhea, and nasal obstruction across all age groups.[35]^2 Although considerable progress has been made in understanding the clinical features of AR, the underlying molecular mechanisms that contribute to nasal mucosal inflammation and remodeling remain poorly understood.[36]^3 A deeper exploration of these mechanisms may provide new insights into the development of more effective therapeutic strategies for AR. Previous studies have identified several genes—such as TLR4, CD14, STAT1, CST1, and CLC—as being implicated in the susceptibility to AR and as potential biomarkers for disease development.[37]^4^,[38]^5 However, these findings alone do not fully explain the high prevalence and complex pathogenesis of AR, highlighting the need for further investigation. Recent studies suggest that the nasal epithelium plays a crucial role in the pathogenesis of AR, acting as the first line of defense against inhaled allergens and orchestrating downstream immune responses.[39]^6^,[40]^7 To explore the molecular underpinnings of AR, transcriptomic approaches such as microarray analysis have proven valuable for profiling gene expression patterns. While RNA sequencing is a newer and more expensive technology, microarrays remain widely used due to their cost-effectiveness, reproducibility, and abundance of publicly available datasets.[41]^8 Using microarray data from the Gene Expression Omnibus (GEO), researchers can systematically identify differentially expressed genes (DEGs) between allergic patients and healthy controls, revealing key inflammatory pathways like JAK-STAT, NF-κB, and toll-like receptor signaling.[42]9, [43]10, [44]11, [45]12, [46]13 In addition to epithelial cells, immune cells including eosinophils, mast cells, and dendritic cells contribute to AR pathogenesis. For example, epithelial-derived cytokines such as IL-25 and IL-33 promote Th2 immune responses, while mast cells play a central role in acute allergic reactions by releasing inflammatory mediators upon antigen re-exposure.[47]^14 Leveraging advances in bioinformatics and machine learning, this study aimed to identify novel gene targets in AR by analyzing microarray datasets. We performed differential expression analysis on dataset [48]GSE19187, followed by weighted gene co-expression network analysis (WGCNA) to detect modules associated with AR. Cross-referencing WGCNA results with gene ontology (GO) and KEGG pathway enrichment highlighted 81 crosstalk genes strongly related to AR. Subsequently, 5 machine learning algorithms—LASSO, SVM-RFE, random forest, XGBoost, and Boruta—were applied to prioritize hub genes. Four intersecting genes—CST1, CST2, SERPINB4, and TOX—were identified, and their expression was validated using dataset [49]GSE46171 and human NP69 cells stimulated with house dust mite extract. Our findings suggest these genes play key roles in AR pathogenesis and warrant further validation in larger cohorts. Materials and methods Microarray data Allergic rhinitis (AR)-related expression datasets ([50]GSE19187[51]^15 and [52]GSE46171[53]^16) were obtained through the GEO database ([54]http://www.ncbi.nlm.nih.gov/geo/). Besides, the [55]GSE19187 dataset, involving collected nasal epithelium samples from 11 control subjects and 14 subjects with rhinitis, was used to recognize DEGs and the mechanisms of asthma and AR.[56]^17 This study explored the detailed mechanism of rhinitis utilizing nasal epithelia expression profiles. Nasal epithelial samples from [57]GSE46171 were collected, including 11 samples of adults with allergic rhinitis and 17 samples from individuals without underlying respiratory diseases.[58]^18 The workflow of this study is shown in ([59]Fig. 1). Fig. 1. [60]Fig. 1 [61]Open in a new tab Workflow diagram of data collection and analysis. Differential expression analysis Raw microarray data were downloaded from the GEO database. Quality control was performed by visualizing boxplots, density plots, and principal component analysis (PCA) to detect potential outliers and batch effects. The data were then background-corrected and normalized using the Robust Multi-array Average (RMA) method implemented in the R/Bioconductor software package limma.[62]^19 Low-expression probes were filtered out to reduce noise and improve analysis accuracy. Differential expression analysis between AR and control samples was carried out using limma, with initial criteria of |logFC| > 0.5 and P-value <0.05. Statistical significance was further assessed using moderated t-tests, and multiple testing correction was applied via the Benjamini-Hochberg false discovery rate (FDR) method. Genes with an adjusted p-value (FDR) less than 0.05 were considered significantly differentially expressed. The results were visualized through heatmaps and volcano plots to highlight the significantly upregulated and downregulated genes. Weighted gene co-expression network analysis (WGCNA) WGCNA[63]^20 was performed to identify gene modules correlated with AR. First, to reduce noise and focus on biologically relevant genes, we selected the top 50% of genes ranked by variance across samples. This filtering step is commonly used to improve network detection sensitivity. Second, quality control was conducted by removing genes and samples with excessive missing values or low expression levels to ensure data reliability. Third, the soft-thresholding power (β) was chosen based on the criterion that the scale-free topology fit index reached 0.85 for the first time, ensuring that the constructed network approximates a scale-free topology. This thresholding was used to calculate the adjacency matrix representing gene co-expression similarity. Fourth, genes were clustered into modules using average linkage hierarchical clustering based on topological overlap, and each module was assigned a unique color. Finally, gene dendrograms and module-trait relationships were visualized to interpret the results. GO and KEGG pathway enrichment analysis GO and KEGG pathway enrichment analyses were performed using the Xiantao bioinformatics platform ([64]https://www.xiantaozi.com/products/apply). This platform provides an integrated, user-friendly interface for comprehensive functional enrichment analysis, enabling efficient identification of significantly enriched biological processes and pathways. Although the website is primarily in Chinese, it offers robust analytical tools widely used in related studies, facilitating streamlined data analysis.[65]21, [66]22, [67]23 Machine learning Five machine learning algorithms—LASSO, SVM-RFE, Boruta, XGBoost, and random forest—were applied to identify key genes associated with AR diagnosis. LASSO regression was implemented using the glmnet R package.[68]^24 The random forest model was developed using the randomForest R package,[69]^25 commonly used for classification and feature selection. For SVM-RFE, Boruta, and XGBoost, the respective R packages e1071,[70]^26 Boruta,[71]^27 and xgboost[72]^28 were utilized. The intersecting genes identified by all 5 algorithms were selected as candidate biomarkers. Immune infiltration analysis The CIBERSORT algorithm[73]^29 was used to estimate the relative proportions of 22 immune cell subtypes in the nasal epithelial brushing samples from [74]GSE19187, Immune cell abundance was visualized using heatmaps generated by the “corrplot” package.[75]^30 Recipient operating characteristics curve (ROC) and area under the curve (AUC) To evaluate the diagnostic performance of the identified feature genes for AR, ROC curve analysis was performed, and AUC values were calculated using the “pROC” R package. The analysis was conducted for both the training dataset ([76]GSE19187) and the external validation dataset ([77]GSE46171) to assess the sensitivity and specificity of each gene in distinguishing AR patients from healthy controls. Real-time quantitative PCR for validation NP69 cells, an immortalized human nasopharyngeal epithelial cell line sourced from Vigen Biotechnology (Zhenjiang, China), were utilized to validate gene expression changes in vitro.[78]^31 The cells were treated with 5 μg/mL House Dust Mite (HDM, Dermatophagoides pteronyssinus) extract (Cat. No. XPB91D3A2.5, Greer Laboratories, USA) for 24 h, a condition optimized through preliminary experiments evaluating epithelial cytokine responses. Total RNA was isolated and purified using the FastPure Cell/Tissue Total RNA Isolation Kit V2 (Vazyme, RC112-01), and reverse transcribed into cDNA using the HiScript II One Step RT-PCR Kit (Vazyme, P611-01). Real-time quantitative PCR (RT-qPCR) was performed on an Applied Biosystems QuantStudio 5 system. Relative gene expression levels were calculated using the 2ˆ-ΔΔCt method. Statistical analyses and P-values were obtained using Prism GraphPad (version 8.0). The primer sequences were as follows: CST1: Forward 5′-TGGCCCAGTATCTGAGTACCCT-3′, Reverse 5′-CCGCAGCGGACGT CTGTAGT-3′; CST2: Forward 5′-GGGCCGAACCATATGTACCA-3′, Reverse 5′-GCAGTTCTG GCTGTTCATGGA-3′; SERPINB4: Forward 5′-CCACTGATGCATA TGAGCTGAAG-3′, Reverse 5′-CTGGTAAAATTTCTTGATGGCATCT-3′; TOX: Forward 5′-AAGGGCCAAAATCCAAACG-3′, Reverse 5′-CCCACATTG AAGCCA CAATTTTA-3’. Statistical analysis All statistical analyses were performed using R software (version 4.2.1) and the Xiantao platform. Normality of continuous variables was assessed by the Shapiro–Wilk test. For comparisons between 2 groups, independent Student's t-tests were applied to data meeting normality assumptions; for comparisons among multiple groups, one-way ANOVA was used. When data did not meet normality criteria, appropriate non-parametric tests were applied. Multiple comparisons were corrected using the Benjamini–Hochberg false discovery rate method where applicable. Statistical significance was defined as P-value <0.05. Results Identification of DEGs We identified a total of 133 DEGs between AR patients and healthy controls from the [79]GSE19187 dataset using the limma package with multiple testing correction. Applying a p value of <0.05, 87 genes were significantly upregulated and 46 genes were significantly downregulated in AR patients compared to controls. The expression patterns of these DEGs are shown in the heatmap and volcano plot ([80]Fig. 2A and B). Fig. 2. [81]Fig. 2 [82]Open in a new tab The heat map and volcano map of the DEG identified in the AR dataset [83]GSE19187. (A) Each row represents DEG, and each column represents each sample of healthy or AR cases. Red and blue represent upregulated and downregulated DEGs, respectively. (B) The red and blue circles refer to DEGs with upregulated and downregulated gene expression, respectively. AR, Allergic rhinitis; DEGs, differentially expressed genes. WGCNA analysis and hub module selection The flow diagram of our sample clustering dendrogram from WGCNA is presented in ([84]SFig. 1A). To analyze the gene expression patterns of multiple samples, we clustered 20159 genes and analyzed the association between modules and specific traits or phenotypes from [85]GSE19187. As illustrated in ([86]SFig. 1B and C), the soft power threshold we set to 11 based on scale-free Rˆ2 = 0.9. Totally 13 modules were identified, which were the strongly correlated modules based on a 0.25 clustering height limit ([87]Fig. 3A). From the selected modules, the black (R = 0.47, P = 0.02) and purple (R = 0.56, P = 0.004) modules showed moderate but statistically significant correlations with AR ([88]Fig. 3B and C). These 2 modules had the highest module-trait correlation coefficients in our analysis. A total of 1158 candidate genes from the black and purple modules were included in subsequent analyses ([89]Table S1). Fig. 3. [90]Fig. 3 [91]Open in a new tab The number of modules and genes from WGCNA. (A) Cluster tree diagram of genes in [92]GSE19187, where each branch represents a gene and each color represents a co-expressed module. (B) The correlation of co-expressed module genes obtained by WGCNA and the numbers in parentheses represent the P-value of genes in each module. WGCNA, Weighted gene co-expression network analysis. C, Control; P, Patients. (C) Scatter plot of gene significance (GS) between weight and module member relationship (MM) in black and purple modules. In this module, there is a highly significant correlation between GS and MM. KEGG and GO enrichment analyses of overlapping DEGs To analyze the biological features of the key gene sets, we identified 81 overlapping genes by intersecting the DEGs obtained from the limma analysis with the hub genes derived from the black and purple modules identified through WGCNA ([93]Fig. 4A). Additionally, these genes were classified into amoebiasis, ECM (Extracellular matrix)-receptor interaction, hematopoietic cell lineage, estrogen signaling pathway, linoleic acid metabolism, mucin-type o-glycan biosynthesis, ferroptosis, malaria, and arachidonic acid metabolism in the KEGG and GO enrichment analysis ([94]Fig. 4B and C), and enriched in cellular component terms, including the apical part of the cell, apical plasma membrane, basal part of the cell, basolateral plasma membrane, basal plasma membrane, platelet alpha granule, and cornified envelope ([95]Fig. 4C). Regarding the molecular function terms, genes were mainly enriched in endopeptidase inhibitor activity, endopeptidase regulator activity, peptidase inhibitor activity, enzyme inhibitor activity, peptidase regulator activity, serine-type endopeptidase inhibitor activity, and cysteine-type endopeptidase inhibitor activity ([96]Fig. 4C). The pathway analysis results were involved in AR development. The number of genes associated with ECM-receptor interaction occurred in the KEGG pathway analysis, suggested a higher correlation with AR. Fig. 4. [97]Fig. 4 [98]Open in a new tab Enrichment analysis of gene intersections in AR. (A) The Venn plot shows that 81 genes were identified at the intersection of DEGs using black and purple module genes from WGCNA. (B) KEGG pathway analysis of gene crossover. Different colors represent various important pathways and related enriched genes. (C) GO analysis of gene crossover, including biological processes, cellular components, and molecular functions. The y-axis represents different GO terms, the x-axis represents the proportion of genes enriched in relative GO terms, the size of the circle represents the number of genes, and the color represents the P-value. AR, Allergic rhinitis. KEGG, Kyoto Encyclopedia of Genes and Genomes. GO, Gene ontology; WGCNA, Weighted gene co-expression network analysis; DEGs, Differentially Expressed Genes. Four feature genes identified by integrating 5 machine-learning algorithms To improve the robustness of feature gene selection, we applied 5 machine learning methods—LASSO, SVM-RFE, Random Forest (RF), XGBoost, and Boruta—to the 81 candidate genes. LASSO identified 8 genes, SVM-RFE selected 46 genes, RF identified 19 genes with importance >0.2, XGBoost yielded 8 genes, and Boruta selected 18 genes ([99]Fig. 5A–E, [100]Table S2–S4). To enhance specificity and reduce potential overfitting, we focused on the intersection of genes consistently identified across all 5 methods, resulting in 4 common genes: CST1, CST2, SERPINB4, and TOX ([101]Fig. 6A). These genes were subsequently validated in an independent dataset ([102]GSE18574), where they maintained consistent expression patterns and diagnostic potential ([103]Fig. 6B and C), supporting the robustness and reproducibility of this integrative approach. Fig. 5. [104]Fig. 5 [105]Open in a new tab Five machine learning methods for identifying the best central genes. (A) Screening of biomarkers in the LASSO model. Coefficient distribution map, where 1 line represents genes and the vertical axis represents coefficients. The solid line in the left figure represents partial likelihood deviation, and the number of genes corresponding to the lowest point of cure (n = 81) is most suitable for LASSO. The vertical axis of the right panel represents the binary deviation, which can be understood as the size of the model error. The left row displays the smallest error, while the right row displays fewer features. (B) The SVM-RFE algorithm is used to further select the best candidate genes, which is based on the highest accuracy and lowest error obtained from the curve. The x-axis displays the number of feature selections, while the y-axis displays the prediction accuracy. (C) The random forest algorithm displays the errors in AR; Sort the control group and genes based on importance scores. Use random forest to calculate the relative importance of overlapping candidate genes (importance>0.2). The importance index is plotted on the x-axis, while genetic variables are plotted on the y-axis. (D) The characteristic importance of factors predicting mortality in AR patients in XGBoost. (E) Boruta selected 84 feature genes with importance ranking. The green bar chart is considered meaningful. Fig. 6. [106]Fig. 6 [107]Open in a new tab The susceptibility genes of AR were identified. (A) The flower chart shows the 4 best feature genes shared by LASSO: random forest, SVM-RFE algorithm, XGBoost, and Boruta algorithm. (B) The expression levels of CST1, CST2, SERPINB4, and TOX in [108]GSE19187. (C) The expression levels of CST1, CST2, SERPINB4, and TOX in [109]GSE46171. (D) The ROC curves of 4 genes serve as a gene set for AR patients in [110]GSE19187. (E) The ROC curves of 4 genes in [111]GSE46171. C, Control; P, Patients; AR, Allergic rhinitis; DEGs, Differentially expressed genes; ROC, Receiver operating characteristics. ∗ P < 0.05, ∗ ∗ P < 0.01 and ∗∗∗ P < 0.0001. Verification of the expression and diagnostic significance of characteristic genes The expression levels of the 4 feature genes were checked in 14 AR samples and 11 healthy samples in [112]GSE19187. While CST1, CST2, and SERPINB4 were upregulated in AR, TOX was decreased in this disease. Moreover, the same changes of 4 genes were verified in 11 patients with AR and 17 healthy individuals in [113]GSE46171 ([114]Fig. 6A–C; P < 0.05). Furthermore, ROC curve analysis was conducted to quantitatively assess the diagnostic and predictive value of the characteristic genes ([115]Fig. 6D and E). The AUC value of CST1 was 0.981, CST2 was 0.942, SERPINB4 was 0.792, and TOX was 0.955 in [116]GSE19187 ([117]Fig. 6D). Simultaneously, the external validation dataset [118]CSE46171 presented high AUC values: CST1 (AUC:0.738), CST2 (AUC:0.754), SERPINB4 (AUC:0.706), and TOX (AUC:0.738), demonstrating that these hub feature genes predicted progression and exhibited a convincing diagnostic value for AR ([119]Fig. 6E). Immune cell infiltration of characteristic genes Immune infiltration analysis was conducted using the CIBERSORT algorithm to investigate the role of four genes in immune regulation. The graphs for differential immune infiltration are revealed in ([120]Fig. 7A and B). Compared with control samples, the proportion of resting mast cells was upregulated in AR samples, while the others demonstrated no significant difference. Additionally, the above 4 genes showed greater differences in immune infiltration analysis. While CST1, CST2, and SERPINB4 showed higher expression in resting mast cells, TOX exhibited no significant difference in this analysis. Overall, 4 genes screened from immune infiltration may be potential regulators for AR treatment. Fig. 7. [121]Fig. 7 [122]Open in a new tab Analysis of immune cell infiltration between control and AR. (A) The proportion of 22 types of immune cells in different samples can be seen from the box diagram. (B) Comparing the P-values of 19 immune cells between the AR group and the control group through a correlation matrix diagram. ∗P < 0.05, ∗ ∗ P < 0.01 and ∗∗∗ P < 0.001. The vertical axis displays subtypes of immune cells. AR, Allergic rhinitis. QPCR for mRNA levels of predicted hub genes To verify differential expressions of CST1, CST2, SERPINB4, and TOX, the human nasopharyngeal epithelial cell line (NP69) was used to assess mRNA expression levels of these genes via Real-time quantitative PCR. The expression trends of CST1, CST2, and SERPINB4 followed public databases, and the expression of TOX was lower in the AR group ([123]Fig. 8A–D). Consequently, the changes in the 4 gene expressions did not differ in the common stable cell line, suggesting that the 4 genes were of high clinical diagnostic value. Fig. 8. [124]Fig. 8 [125]Open in a new tab Validation of diagnostic gene expression by RT-qPCR. (A–D) The expressions of (A) CST1. (B) CST2. (C) SERPINB4. (D) TOX. between 5 μg/mL HDM-treatment NP69 cell line and Control subjects. ∗P < 0.05 and ∗∗P < 0.01. Discussion AR is a prevalent chronic inflammatory condition of the nasal mucosa, triggered by IgE-mediated reactions to allergens such as pollen, dust mites, and animal dander.[126]^32 This condition significantly impairs the quality of life, leading to symptoms such as nasal congestion, rhinorrhea, sneezing, and itching, which can further result in sleep disturbances, fatigue, and decreased cognitive function.[127]^33 The global burden of AR is substantial, affecting approximately 10–30% of the population, with increasing prevalence in urban areas.[128]^34 Despite its high prevalence and impact, the underlying molecular mechanisms of AR remain incompletely understood, necessitating further research to develop more effective diagnostic and therapeutic strategies. In this study, we focused on the nasal epithelial transcriptome to elucidate the detailed mechanisms of AR, leveraging advanced bioinformatics and machine learning techniques. The nasal epithelium is the primary site of allergen exposure and immune response initiation, making it a critical area for understanding AR pathophysiology.[129]^35 By analyzing gene expression profiles from publicly available datasets ([130]GSE19187 and [131]GSE46171), we identified DEGs and constructed gene co-expression networks to uncover key molecular pathways involved in AR. Our integrative approach, combining differential expression analysis, WGCNA, and multiple machine learning algorithms, has the potential to identify novel biomarkers and therapeutic targets for AR. The identification of key diagnostic genes and the development of predictive models could significantly enhance the accuracy of AR diagnosis and pave the way for personalized treatment strategies.[132]^36 Using 5 machine learning analyses and CIBERSORT to infer specific immune cell infiltration, we found that the expression status of 3 genes (CST1, CST2, and SERPINB4) was highly correlated with resting mast cell infiltration, except for TOX, which is associated with naive B cells. As members of the type 2 cystatin (CST) superfamily, CST1(Cystatin SN) and CST2(Cystatin SA) have been reported to be capable of inhibiting the proteolytic activities of cysteine proteases and is implicated in progression of several human cancers.[133]^37^,[134]^38 CST1 has been implicated in various inflammatory and immune responses. Recent studies have shown that CST1 is significantly upregulated in patients with AR, suggesting its role in the pathogenesis of this condition.[135]^39 The elevated expression of CST1 in AR patients may contribute to the modulation of protease activity, thereby influencing the inflammatory milieu in the nasal epithelium. NP-69 cell line is a classic nasal epithelial cell used to study nasopharyngeal related cancers. Our study corroborates these findings, demonstrating a marked increase in CST1 expression in AR samples compared to controls. This upregulation was further validated through qPCR in NP69 cells treated with HDM extract, reinforcing the potential of CST1 as a diagnostic biomarker for AR. Therefore, CST1 not only serves as a critical marker for AR diagnosis but also provides insights into the molecular mechanisms underlying the disease. Similar to CST1, CST2 has been found to be upregulated in allergic conditions, including AR.[136]^40 Our analysis identified CST2 as one of the keys differentially expressed genes in AR, with significant upregulation observed in both the primary dataset and external validation dataset. The qPCR results in NP69 cells further confirmed this trend, highlighting CST2's potential as a diagnostic marker. The role of CST2 in modulating protease activity and its impact on immune cell function, particularly in the context of mast cell activation, aligns with the observed immune cell infiltration patterns in AR.[137]^40 These findings suggest that CST2 may play a crucial role in the pathophysiology of AR, making it a valuable target for further research and potential therapeutic intervention. It is well known that serine protease inhibitors could be one of the treatments for inflammation. SERPINB4, also known as squamous cell carcinoma antigen 2, is a member of the serpin family of protease inhibitors.[138]^41 This gene has been associated with various inflammatory and immune responses, particularly in epithelial tissues.[139]^42 Our study identified SERPINB4 as one of the key genes upregulated in AR patients. The increased expression of SERPINB4 in AR samples, validated through qPCR in NP69 cells, suggests its involvement in the inflammatory process's characteristic of AR. The role of SERPINB4 in inhibiting serine proteases may contribute to the regulation of protease activity and the maintenance of epithelial integrity in the nasal mucosa.[140]^42 Additionally, the association of SERPINB4 with immune cell infiltration, particularly mast cells, underscores its significance in the immune response observed in AR. These findings highlight the potential of SERPINB4 as a diagnostic biomarker and a therapeutic target for AR, providing new avenues for understanding the molecular mechanisms underlying the disease. TOX, or thymocyte selection-associated high mobility group box protein, is a transcription factor involved in the regulation of T-cell development and function, and is associated with thymocyte selection, is expressed differently in all stages of T lymphocyte development.[141]^43 TOX is a nuclear deoxyribonucleic acid-binding factor[142]^44 that induces a set of chemokine genes in immune cells and modulates inflammation through the production of interleukin-10.[143]^45 However, TOX has not been previously associated with AR, suggesting it may be a valuable potential research target. Unlike the other identified genes, TOX was found to be downregulated in AR patients. This downregulation was consistent across both the primary and validation datasets, as well as in qPCR validation using NP69 cells. The decreased expression of TOX in AR suggests a potential role in modulating immune responses, particularly in the context of T-cell function and differentiation. These findings indicate that TOX may play a protective role in the immune response associated with AR, and its downregulation could be a contributing factor to the disease's pathogenesis. Further research is needed to elucidate the precise mechanisms by which TOX influences AR and to explore its potential as a therapeutic target. It is interesting that 3 of the 4 genes are associated with protease. There was other evidence that ADAM10 and ADAM17 influenced allergic rhinitis,[144]^46 which are also transmembrane metalloproteases. It is well known that serine protease inhibitors were selected as treatments for inflammation, because proteases’ dysregulation may induce severe consequences, like inflammation of neurology, skin, lung and arthritis.[145]^47 It could be predicted that the inhibition of proteases may be a promising target of AR. The enrichment analysis results from our study provide significant insights into the biological processes and pathways involved in allergic rhinitis (AR). KEGG pathway analysis revealed that the differentially expressed genes (DEGs) are involved in pathways such as amoebiasis, ECM-receptor interaction, hematopoietic cell lineage, and estrogen signaling pathway. These pathways are crucial for understanding the underlying mechanisms of AR. For instance, the ECM-receptor interaction pathway is essential for cell adhesion and communication, which are critical in inflammatory responses and tissue remodeling observed in AR.[146]^48 The hematopoietic cell lineage pathway is involved in the development and differentiation of immune cells, which are pivotal in the immune response associated with AR.[147]^49 Additionally, the estrogen signaling pathway has been implicated in modulating immune responses and inflammation, suggesting a potential link between hormonal regulation and AR pathogenesis.[148]^50 GO enrichment analysis further highlighted the involvement of DEGs in biological processes such as the negative regulation of proteolysis and regulation of endopeptidase activity. These processes are important for maintaining protease-antiprotease balance, which is often disrupted in inflammatory conditions like AR.[149]^51 The cellular component analysis indicated significant enrichment in apical and basal plasma membranes, suggesting that these genes may play roles in maintaining epithelial barrier integrity and function, which is often compromised in AR.[150]^52 In conclusion, the enriched pathways and biological processes identified in this study provide a comprehensive understanding of the molecular mechanisms underlying AR. Future studies should focus on further elucidating the roles of these pathways and genes in AR, and collecting peripheral blood samples from patients to validate the diagnostic value of these 4 genes in large samples, in order to enhance our understanding and management of this condition. The immune infiltration analysis conducted using the CIBERSORT algorithm revealed significant differences in immune cell subtype proportions between healthy and AR groups, highlighting the role of immune cells in the pathogenesis of AR. Specifically, the analysis identified an increased proportion of resting mast cells in AR samples compared to controls. Mast cells are known to play a crucial role in allergic reactions by releasing histamine and other mediators that contribute to inflammation and symptom manifestation in AR patients.[151]^53 The elevated expression of CST1, CST2, and SERPINB4 in resting mast cells suggests that these genes may be involved in the regulation of mast cell activity and, consequently, in the allergic response observed in AR. Certainly, our study has several limitations, such as using only ROC curves to predict the value of susceptible genes, small sample sizes, lack of validation cohorts, and absence of AR severity data, which may affect our conclusions. Conclusion In summary, this study successfully identified differentially expressed genes associated with allergic rhinitis and visualized these findings through heatmaps and volcano plots. WGCNA analysis revealed gene modules closely related to AR, and GO and KEGG enrichment analyses highlighted key biological processes and pathways involved in AR mechanisms. Five machine learning algorithms identified 4 key diagnostic genes, and a predictive model was developed. The CIBERSORT algorithm demonstrated significant differences in immune cell subtype proportions between healthy and AR groups. The ROC curve and AUC values indicated high accuracy of the classification model. Finally, real-time quantitative PCR validated the differential expression of key genes in the NP69 cell line. These findings provide valuable insights into the molecular mechanisms of AR and offer potential biomarkers for diagnosis and therapeutic targets. Future studies should aim to address the current limitations by incorporating larger sample sizes, clinical validation, and wet lab experiments to further substantiate these findings. Abbreviations AR, allergic rhinitis; CD109, C3 And PZP-Like Alpha-2-Macroglobulin Domain-Containing Protein 7; CD14, lipopolysaccharide receptor; CDH26, cadherin-Like Protein 26; CLC, charcot-Leyden Crystal Protein; CST1, cystain-SA-I; CST2, cystatin 2; CYP2E1, cytochrome P450 Family 2 Subfamily E Member 1’GCNT3, glucosaminyl (N-Acetyl) Transferase 3; GEO, gene Expression Omnibus; HGD, homogentisate 1,2-Dioxygenase; KDELR3, KDEL Endoplasmic Reticulum Protein Retention Receptor 3; LMRGs, lipid metabolism-related genes; PCSK6, proprotein Convertase Subtilisin/Kexin Type 6; SERPINB4, serpin Family B Member 4; SLC16A9, solute Carrier Family 16 Member 9; SLC39A8, solute Carrier Family 39 Member 8; SLC7A1, solute Carrier Family 7 Member 1;SPDEF, SAM Pointed Domain Containing ETS Transcription Factor; STAT1, signal transducer and activator of transcription 1; TLR4 , toll-like receptor 4; TOX , thymocyte Selection Associated High Mobility Group Box; TRPM6, transient Receptor Potential Cation Channel Subfamily Member 6; UPK1B, uroplakin 1B; WGCNA, weighted gene co-expression network analysis Availability of data and materials The datasets analyzed during the current study are available in the Gene Expression Omnibus (GEO) website, numbers were [152]GSE19187 and [153]GSE46171. Author contributions HL and YC designed the study. QC, JF and YL analyzed the data and wrote the primary manuscript. QC and JF prepared the figures and tables from the literature. All authors reviewed and approved the final manuscript. Ethics approval and consent to participate Not applicable. Authors' consent for publication We guarantee our manuscript is original, has not been published before, is not currently being considered for publication elsewhere, and has not been posted to a preprint server. The publication is approved by all authors and that, if accepted, it will not be published elsewhere in the same form, in English or in any other language, without the written consent of the copyright-holder. Funding This study was supported by the Taihu Lake talent plan (Top-Level, no. 2020THRC-GD-7), the 333 project of Jiangsu Province in 2022 (ZUZHIBU 202221001). Declaration of competing interest The authors declare that they have no competing interests. Acknowledgment