Abstract

Background

   Allergic rhinitis (AR), as a prevailing disease worldwide, is a common
   chronic inflammatory disease of the upper airway; however, its
   prognosis did not meet expectations of the general public. Regarding
   the importance of treatment for AR, biomarkers with high sensitivity
   and precision are in urgent need.

Methods

   To detect the potential mechanisms associated with the AR pathology,
   the differentially expressed gene data from the clinical dataset
   [32]GSE19187 and [33]GSE46171 were used for weighted gene co-expression
   network analysis. Additionally, 5 machine-learning algorithms were
   integrated to screen for novel diagnostic biomarkers for AR.

Results

   This analysis identified 133 genes enriched in the black and purple
   module pathways from weighted gene co-expression network analysis
   (WGCNA). The 4 key diagnostic genes—CST1, CST2, SERPINB4, and TOX—were
   identified through machine learning algorithms and validated for their
   diagnostic potential with high area under the curve (AUC) values in
   external datasets. Immune infiltration analysis indicated a higher
   proportion of resting mast cells in AR samples compared to controls,
   with CST1, CST2, and SERPINB4 showing high expression in these cells.
   Real-time PCR validation confirmed that CST1, CST2, and SERPINB4 were
   upregulated in AR samples while TOX was downregulated.

Conclusion

   Our study identifies critical molecular markers and immune cell
   alterations associated with AR, providing insights into its
   pathogenesis and potential diagnostic biomarkers for clinical
   application. Future research should focus on validating these findings
   in larger cohorts to facilitate the development of targeted therapies
   for AR.

   Keywords: Allergic rhinitis, Machine learning algorithms, Biomarkers,
   Immune infiltration

Introduction

   Allergic rhinitis (AR) is a common chronic inflammatory disorder that
   affects millions of individuals worldwide and has become a significant
   global health concern.[34]^1 It is primarily driven by atopic
   sensitization, an immunoglobulin E (IgE)-mediated immune response to
   environmental allergens, and manifests with symptoms such as nasal
   itching, sneezing, rhinorrhea, and nasal obstruction across all age
   groups.[35]^2 Although considerable progress has been made in
   understanding the clinical features of AR, the underlying molecular
   mechanisms that contribute to nasal mucosal inflammation and remodeling
   remain poorly understood.[36]^3 A deeper exploration of these
   mechanisms may provide new insights into the development of more
   effective therapeutic strategies for AR. Previous studies have
   identified several genes—such as TLR4, CD14, STAT1, CST1, and CLC—as
   being implicated in the susceptibility to AR and as potential
   biomarkers for disease development.[37]^4^,[38]^5 However, these
   findings alone do not fully explain the high prevalence and complex
   pathogenesis of AR, highlighting the need for further investigation.

   Recent studies suggest that the nasal epithelium plays a crucial role
   in the pathogenesis of AR, acting as the first line of defense against
   inhaled allergens and orchestrating downstream immune
   responses.[39]^6^,[40]^7 To explore the molecular underpinnings of AR,
   transcriptomic approaches such as microarray analysis have proven
   valuable for profiling gene expression patterns. While RNA sequencing
   is a newer and more expensive technology, microarrays remain widely
   used due to their cost-effectiveness, reproducibility, and abundance of
   publicly available datasets.[41]^8 Using microarray data from the Gene
   Expression Omnibus (GEO), researchers can systematically identify
   differentially expressed genes (DEGs) between allergic patients and
   healthy controls, revealing key inflammatory pathways like JAK-STAT,
   NF-κB, and toll-like receptor signaling.[42]9, [43]10, [44]11, [45]12,
   [46]13

   In addition to epithelial cells, immune cells including eosinophils,
   mast cells, and dendritic cells contribute to AR pathogenesis. For
   example, epithelial-derived cytokines such as IL-25 and IL-33 promote
   Th2 immune responses, while mast cells play a central role in acute
   allergic reactions by releasing inflammatory mediators upon antigen
   re-exposure.[47]^14

   Leveraging advances in bioinformatics and machine learning, this study
   aimed to identify novel gene targets in AR by analyzing microarray
   datasets. We performed differential expression analysis on dataset
   [48]GSE19187, followed by weighted gene co-expression network analysis
   (WGCNA) to detect modules associated with AR. Cross-referencing WGCNA
   results with gene ontology (GO) and KEGG pathway enrichment highlighted
   81 crosstalk genes strongly related to AR. Subsequently, 5 machine
   learning algorithms—LASSO, SVM-RFE, random forest, XGBoost, and
   Boruta—were applied to prioritize hub genes. Four intersecting
   genes—CST1, CST2, SERPINB4, and TOX—were identified, and their
   expression was validated using dataset [49]GSE46171 and human NP69
   cells stimulated with house dust mite extract. Our findings suggest
   these genes play key roles in AR pathogenesis and warrant further
   validation in larger cohorts.

Materials and methods

Microarray data

   Allergic rhinitis (AR)-related expression datasets ([50]GSE19187[51]^15
   and [52]GSE46171[53]^16) were obtained through the GEO database
   ([54]http://www.ncbi.nlm.nih.gov/geo/). Besides, the [55]GSE19187
   dataset, involving collected nasal epithelium samples from 11 control
   subjects and 14 subjects with rhinitis, was used to recognize DEGs and
   the mechanisms of asthma and AR.[56]^17 This study explored the
   detailed mechanism of rhinitis utilizing nasal epithelia expression
   profiles. Nasal epithelial samples from [57]GSE46171 were collected,
   including 11 samples of adults with allergic rhinitis and 17 samples
   from individuals without underlying respiratory diseases.[58]^18 The
   workflow of this study is shown in ([59]Fig. 1).

Fig. 1.

   [60]Fig. 1
   [61]Open in a new tab

   Workflow diagram of data collection and analysis.

Differential expression analysis

   Raw microarray data were downloaded from the GEO database. Quality
   control was performed by visualizing boxplots, density plots, and
   principal component analysis (PCA) to detect potential outliers and
   batch effects. The data were then background-corrected and normalized
   using the Robust Multi-array Average (RMA) method implemented in the
   R/Bioconductor software package limma.[62]^19 Low-expression probes
   were filtered out to reduce noise and improve analysis accuracy.
   Differential expression analysis between AR and control samples was
   carried out using limma, with initial criteria of |logFC| > 0.5 and
   P-value <0.05. Statistical significance was further assessed using
   moderated t-tests, and multiple testing correction was applied via the
   Benjamini-Hochberg false discovery rate (FDR) method. Genes with an
   adjusted p-value (FDR) less than 0.05 were considered significantly
   differentially expressed. The results were visualized through heatmaps
   and volcano plots to highlight the significantly upregulated and
   downregulated genes.

Weighted gene co-expression network analysis (WGCNA)

   WGCNA[63]^20 was performed to identify gene modules correlated with AR.
   First, to reduce noise and focus on biologically relevant genes, we
   selected the top 50% of genes ranked by variance across samples. This
   filtering step is commonly used to improve network detection
   sensitivity. Second, quality control was conducted by removing genes
   and samples with excessive missing values or low expression levels to
   ensure data reliability. Third, the soft-thresholding power (β) was
   chosen based on the criterion that the scale-free topology fit index
   reached 0.85 for the first time, ensuring that the constructed network
   approximates a scale-free topology. This thresholding was used to
   calculate the adjacency matrix representing gene co-expression
   similarity. Fourth, genes were clustered into modules using average
   linkage hierarchical clustering based on topological overlap, and each
   module was assigned a unique color. Finally, gene dendrograms and
   module-trait relationships were visualized to interpret the results.

GO and KEGG pathway enrichment analysis

   GO and KEGG pathway enrichment analyses were performed using the
   Xiantao bioinformatics platform
   ([64]https://www.xiantaozi.com/products/apply). This platform provides
   an integrated, user-friendly interface for comprehensive functional
   enrichment analysis, enabling efficient identification of significantly
   enriched biological processes and pathways. Although the website is
   primarily in Chinese, it offers robust analytical tools widely used in
   related studies, facilitating streamlined data analysis.[65]21, [66]22,
   [67]23

Machine learning

   Five machine learning algorithms—LASSO, SVM-RFE, Boruta, XGBoost, and
   random forest—were applied to identify key genes associated with AR
   diagnosis. LASSO regression was implemented using the glmnet R
   package.[68]^24 The random forest model was developed using the
   randomForest R package,[69]^25 commonly used for classification and
   feature selection. For SVM-RFE, Boruta, and XGBoost, the respective R
   packages e1071,[70]^26 Boruta,[71]^27 and xgboost[72]^28 were utilized.
   The intersecting genes identified by all 5 algorithms were selected as
   candidate biomarkers.

Immune infiltration analysis

   The CIBERSORT algorithm[73]^29 was used to estimate the relative
   proportions of 22 immune cell subtypes in the nasal epithelial brushing
   samples from [74]GSE19187, Immune cell abundance was visualized using
   heatmaps generated by the “corrplot” package.[75]^30

Recipient operating characteristics curve (ROC) and area under the curve
(AUC)

   To evaluate the diagnostic performance of the identified feature genes
   for AR, ROC curve analysis was performed, and AUC values were
   calculated using the “pROC” R package. The analysis was conducted for
   both the training dataset ([76]GSE19187) and the external validation
   dataset ([77]GSE46171) to assess the sensitivity and specificity of
   each gene in distinguishing AR patients from healthy controls.

Real-time quantitative PCR for validation

   NP69 cells, an immortalized human nasopharyngeal epithelial cell line
   sourced from Vigen Biotechnology (Zhenjiang, China), were utilized to
   validate gene expression changes in vitro.[78]^31 The cells were
   treated with 5 μg/mL House Dust Mite (HDM, Dermatophagoides
   pteronyssinus) extract (Cat. No. XPB91D3A2.5, Greer Laboratories, USA)
   for 24 h, a condition optimized through preliminary experiments
   evaluating epithelial cytokine responses. Total RNA was isolated and
   purified using the FastPure Cell/Tissue Total RNA Isolation Kit V2
   (Vazyme, RC112-01), and reverse transcribed into cDNA using the
   HiScript II One Step RT-PCR Kit (Vazyme, P611-01). Real-time
   quantitative PCR (RT-qPCR) was performed on an Applied Biosystems
   QuantStudio 5 system. Relative gene expression levels were calculated
   using the 2ˆ-ΔΔCt method. Statistical analyses and P-values were
   obtained using Prism GraphPad (version 8.0). The primer sequences were
   as follows: CST1: Forward 5′-TGGCCCAGTATCTGAGTACCCT-3′, Reverse
   5′-CCGCAGCGGACGT CTGTAGT-3′; CST2: Forward 5′-GGGCCGAACCATATGTACCA-3′,
   Reverse 5′-GCAGTTCTG GCTGTTCATGGA-3′; SERPINB4: Forward
   5′-CCACTGATGCATA TGAGCTGAAG-3′, Reverse
   5′-CTGGTAAAATTTCTTGATGGCATCT-3′; TOX: Forward
   5′-AAGGGCCAAAATCCAAACG-3′, Reverse 5′-CCCACATTG AAGCCA CAATTTTA-3’.

Statistical analysis

   All statistical analyses were performed using R software (version
   4.2.1) and the Xiantao platform. Normality of continuous variables was
   assessed by the Shapiro–Wilk test. For comparisons between 2 groups,
   independent Student's t-tests were applied to data meeting normality
   assumptions; for comparisons among multiple groups, one-way ANOVA was
   used. When data did not meet normality criteria, appropriate
   non-parametric tests were applied. Multiple comparisons were corrected
   using the Benjamini–Hochberg false discovery rate method where
   applicable. Statistical significance was defined as P-value <0.05.

Results

Identification of DEGs

   We identified a total of 133 DEGs between AR patients and healthy
   controls from the [79]GSE19187 dataset using the limma package with
   multiple testing correction. Applying a p value of <0.05, 87 genes were
   significantly upregulated and 46 genes were significantly downregulated
   in AR patients compared to controls. The expression patterns of these
   DEGs are shown in the heatmap and volcano plot ([80]Fig. 2A and B).

Fig. 2.

   [81]Fig. 2
   [82]Open in a new tab

   The heat map and volcano map of the DEG identified in the AR dataset
   [83]GSE19187. (A) Each row represents DEG, and each column represents
   each sample of healthy or AR cases. Red and blue represent upregulated
   and downregulated DEGs, respectively. (B) The red and blue circles
   refer to DEGs with upregulated and downregulated gene expression,
   respectively. AR, Allergic rhinitis; DEGs, differentially expressed
   genes.

WGCNA analysis and hub module selection

   The flow diagram of our sample clustering dendrogram from WGCNA is
   presented in ([84]SFig. 1A). To analyze the gene expression patterns of
   multiple samples, we clustered 20159 genes and analyzed the association
   between modules and specific traits or phenotypes from [85]GSE19187. As
   illustrated in ([86]SFig. 1B and C), the soft power threshold we set to
   11 based on scale-free Rˆ2 = 0.9. Totally 13 modules were identified,
   which were the strongly correlated modules based on a 0.25 clustering
   height limit ([87]Fig. 3A). From the selected modules, the black (R =
   0.47, P = 0.02) and purple (R = 0.56, P = 0.004) modules showed
   moderate but statistically significant correlations with AR ([88]Fig.
   3B and C). These 2 modules had the highest module-trait correlation
   coefficients in our analysis. A total of 1158 candidate genes from the
   black and purple modules were included in subsequent analyses
   ([89]Table S1).

Fig. 3.

   [90]Fig. 3
   [91]Open in a new tab

   The number of modules and genes from WGCNA. (A) Cluster tree diagram of
   genes in [92]GSE19187, where each branch represents a gene and each
   color represents a co-expressed module. (B) The correlation of
   co-expressed module genes obtained by WGCNA and the numbers in
   parentheses represent the P-value of genes in each module. WGCNA,
   Weighted gene co-expression network analysis. C, Control; P, Patients.
   (C) Scatter plot of gene significance (GS) between weight and module
   member relationship (MM) in black and purple modules. In this module,
   there is a highly significant correlation between GS and MM.

KEGG and GO enrichment analyses of overlapping DEGs

   To analyze the biological features of the key gene sets, we identified
   81 overlapping genes by intersecting the DEGs obtained from the limma
   analysis with the hub genes derived from the black and purple modules
   identified through WGCNA ([93]Fig. 4A). Additionally, these genes were
   classified into amoebiasis, ECM (Extracellular matrix)-receptor
   interaction, hematopoietic cell lineage, estrogen signaling pathway,
   linoleic acid metabolism, mucin-type o-glycan biosynthesis,
   ferroptosis, malaria, and arachidonic acid metabolism in the KEGG and
   GO enrichment analysis ([94]Fig. 4B and C), and enriched in cellular
   component terms, including the apical part of the cell, apical plasma
   membrane, basal part of the cell, basolateral plasma membrane, basal
   plasma membrane, platelet alpha granule, and cornified envelope
   ([95]Fig. 4C). Regarding the molecular function terms, genes were
   mainly enriched in endopeptidase inhibitor activity, endopeptidase
   regulator activity, peptidase inhibitor activity, enzyme inhibitor
   activity, peptidase regulator activity, serine-type endopeptidase
   inhibitor activity, and cysteine-type endopeptidase inhibitor activity
   ([96]Fig. 4C). The pathway analysis results were involved in AR
   development. The number of genes associated with ECM-receptor
   interaction occurred in the KEGG pathway analysis, suggested a higher
   correlation with AR.

Fig. 4.

   [97]Fig. 4
   [98]Open in a new tab

   Enrichment analysis of gene intersections in AR. (A) The Venn plot
   shows that 81 genes were identified at the intersection of DEGs using
   black and purple module genes from WGCNA. (B) KEGG pathway analysis of
   gene crossover. Different colors represent various important pathways
   and related enriched genes. (C) GO analysis of gene crossover,
   including biological processes, cellular components, and molecular
   functions. The y-axis represents different GO terms, the x-axis
   represents the proportion of genes enriched in relative GO terms, the
   size of the circle represents the number of genes, and the color
   represents the P-value. AR, Allergic rhinitis. KEGG, Kyoto Encyclopedia
   of Genes and Genomes. GO, Gene ontology; WGCNA, Weighted gene
   co-expression network analysis; DEGs, Differentially Expressed Genes.

Four feature genes identified by integrating 5 machine-learning algorithms

   To improve the robustness of feature gene selection, we applied 5
   machine learning methods—LASSO, SVM-RFE, Random Forest (RF), XGBoost,
   and Boruta—to the 81 candidate genes. LASSO identified 8 genes, SVM-RFE
   selected 46 genes, RF identified 19 genes with importance >0.2, XGBoost
   yielded 8 genes, and Boruta selected 18 genes ([99]Fig. 5A–E,
   [100]Table S2–S4). To enhance specificity and reduce potential
   overfitting, we focused on the intersection of genes consistently
   identified across all 5 methods, resulting in 4 common genes: CST1,
   CST2, SERPINB4, and TOX ([101]Fig. 6A). These genes were subsequently
   validated in an independent dataset ([102]GSE18574), where they
   maintained consistent expression patterns and diagnostic potential
   ([103]Fig. 6B and C), supporting the robustness and reproducibility of
   this integrative approach.

Fig. 5.

   [104]Fig. 5
   [105]Open in a new tab

   Five machine learning methods for identifying the best central genes.
   (A) Screening of biomarkers in the LASSO model. Coefficient
   distribution map, where 1 line represents genes and the vertical axis
   represents coefficients. The solid line in the left figure represents
   partial likelihood deviation, and the number of genes corresponding to
   the lowest point of cure (n = 81) is most suitable for LASSO. The
   vertical axis of the right panel represents the binary deviation, which
   can be understood as the size of the model error. The left row displays
   the smallest error, while the right row displays fewer features. (B)
   The SVM-RFE algorithm is used to further select the best candidate
   genes, which is based on the highest accuracy and lowest error obtained
   from the curve. The x-axis displays the number of feature selections,
   while the y-axis displays the prediction accuracy. (C) The random
   forest algorithm displays the errors in AR; Sort the control group and
   genes based on importance scores. Use random forest to calculate the
   relative importance of overlapping candidate genes (importance>0.2).
   The importance index is plotted on the x-axis, while genetic variables
   are plotted on the y-axis. (D) The characteristic importance of factors
   predicting mortality in AR patients in XGBoost. (E) Boruta selected 84
   feature genes with importance ranking. The green bar chart is
   considered meaningful.

Fig. 6.

   [106]Fig. 6
   [107]Open in a new tab

   The susceptibility genes of AR were identified. (A) The flower chart
   shows the 4 best feature genes shared by LASSO: random forest, SVM-RFE
   algorithm, XGBoost, and Boruta algorithm. (B) The expression levels of
   CST1, CST2, SERPINB4, and TOX in [108]GSE19187. (C) The expression
   levels of CST1, CST2, SERPINB4, and TOX in [109]GSE46171. (D) The ROC
   curves of 4 genes serve as a gene set for AR patients in [110]GSE19187.
   (E) The ROC curves of 4 genes in [111]GSE46171. C, Control; P,
   Patients; AR, Allergic rhinitis; DEGs, Differentially expressed genes;
   ROC, Receiver operating characteristics. ∗ P < 0.05, ∗ ∗ P < 0.01 and
   ∗∗∗ P < 0.0001.

Verification of the expression and diagnostic significance of characteristic
genes

   The expression levels of the 4 feature genes were checked in 14 AR
   samples and 11 healthy samples in [112]GSE19187. While CST1, CST2, and
   SERPINB4 were upregulated in AR, TOX was decreased in this disease.
   Moreover, the same changes of 4 genes were verified in 11 patients with
   AR and 17 healthy individuals in [113]GSE46171 ([114]Fig. 6A–C; P <
   0.05). Furthermore, ROC curve analysis was conducted to quantitatively
   assess the diagnostic and predictive value of the characteristic genes
   ([115]Fig. 6D and E). The AUC value of CST1 was 0.981, CST2 was 0.942,
   SERPINB4 was 0.792, and TOX was 0.955 in [116]GSE19187 ([117]Fig. 6D).
   Simultaneously, the external validation dataset [118]CSE46171 presented
   high AUC values: CST1 (AUC:0.738), CST2 (AUC:0.754), SERPINB4
   (AUC:0.706), and TOX (AUC:0.738), demonstrating that these hub feature
   genes predicted progression and exhibited a convincing diagnostic value
   for AR ([119]Fig. 6E).

Immune cell infiltration of characteristic genes

   Immune infiltration analysis was conducted using the CIBERSORT
   algorithm to investigate the role of four genes in immune regulation.
   The graphs for differential immune infiltration are revealed in
   ([120]Fig. 7A and B). Compared with control samples, the proportion of
   resting mast cells was upregulated in AR samples, while the others
   demonstrated no significant difference. Additionally, the above 4 genes
   showed greater differences in immune infiltration analysis. While CST1,
   CST2, and SERPINB4 showed higher expression in resting mast cells, TOX
   exhibited no significant difference in this analysis. Overall, 4 genes
   screened from immune infiltration may be potential regulators for AR
   treatment.

Fig. 7.

   [121]Fig. 7
   [122]Open in a new tab

   Analysis of immune cell infiltration between control and AR. (A) The
   proportion of 22 types of immune cells in different samples can be seen
   from the box diagram. (B) Comparing the P-values of 19 immune cells
   between the AR group and the control group through a correlation matrix
   diagram. ∗P < 0.05, ∗ ∗ P < 0.01 and ∗∗∗ P < 0.001. The vertical axis
   displays subtypes of immune cells. AR, Allergic rhinitis.

QPCR for mRNA levels of predicted hub genes

   To verify differential expressions of CST1, CST2, SERPINB4, and TOX,
   the human nasopharyngeal epithelial cell line (NP69) was used to assess
   mRNA expression levels of these genes via Real-time quantitative PCR.
   The expression trends of CST1, CST2, and SERPINB4 followed public
   databases, and the expression of TOX was lower in the AR group
   ([123]Fig. 8A–D). Consequently, the changes in the 4 gene expressions
   did not differ in the common stable cell line, suggesting that the 4
   genes were of high clinical diagnostic value.

Fig. 8.

   [124]Fig. 8
   [125]Open in a new tab

   Validation of diagnostic gene expression by RT-qPCR. (A–D) The
   expressions of (A) CST1. (B) CST2. (C) SERPINB4. (D) TOX. between 5
   μg/mL HDM-treatment NP69 cell line and Control subjects. ∗P < 0.05 and
   ∗∗P < 0.01.

Discussion

   AR is a prevalent chronic inflammatory condition of the nasal mucosa,
   triggered by IgE-mediated reactions to allergens such as pollen, dust
   mites, and animal dander.[126]^32 This condition significantly impairs
   the quality of life, leading to symptoms such as nasal congestion,
   rhinorrhea, sneezing, and itching, which can further result in sleep
   disturbances, fatigue, and decreased cognitive function.[127]^33 The
   global burden of AR is substantial, affecting approximately 10–30% of
   the population, with increasing prevalence in urban areas.[128]^34
   Despite its high prevalence and impact, the underlying molecular
   mechanisms of AR remain incompletely understood, necessitating further
   research to develop more effective diagnostic and therapeutic
   strategies.

   In this study, we focused on the nasal epithelial transcriptome to
   elucidate the detailed mechanisms of AR, leveraging advanced
   bioinformatics and machine learning techniques. The nasal epithelium is
   the primary site of allergen exposure and immune response initiation,
   making it a critical area for understanding AR pathophysiology.[129]^35
   By analyzing gene expression profiles from publicly available datasets
   ([130]GSE19187 and [131]GSE46171), we identified DEGs and constructed
   gene co-expression networks to uncover key molecular pathways involved
   in AR. Our integrative approach, combining differential expression
   analysis, WGCNA, and multiple machine learning algorithms, has the
   potential to identify novel biomarkers and therapeutic targets for AR.
   The identification of key diagnostic genes and the development of
   predictive models could significantly enhance the accuracy of AR
   diagnosis and pave the way for personalized treatment
   strategies.[132]^36 Using 5 machine learning analyses and CIBERSORT to
   infer specific immune cell infiltration, we found that the expression
   status of 3 genes (CST1, CST2, and SERPINB4) was highly correlated with
   resting mast cell infiltration, except for TOX, which is associated
   with naive B cells.

   As members of the type 2 cystatin (CST) superfamily, CST1(Cystatin SN)
   and CST2(Cystatin SA) have been reported to be capable of inhibiting
   the proteolytic activities of cysteine proteases and is implicated in
   progression of several human cancers.[133]^37^,[134]^38

   CST1 has been implicated in various inflammatory and immune responses.
   Recent studies have shown that CST1 is significantly upregulated in
   patients with AR, suggesting its role in the pathogenesis of this
   condition.[135]^39 The elevated expression of CST1 in AR patients may
   contribute to the modulation of protease activity, thereby influencing
   the inflammatory milieu in the nasal epithelium. NP-69 cell line is a
   classic nasal epithelial cell used to study nasopharyngeal related
   cancers. Our study corroborates these findings, demonstrating a marked
   increase in CST1 expression in AR samples compared to controls. This
   upregulation was further validated through qPCR in NP69 cells treated
   with HDM extract, reinforcing the potential of CST1 as a diagnostic
   biomarker for AR. Therefore, CST1 not only serves as a critical marker
   for AR diagnosis but also provides insights into the molecular
   mechanisms underlying the disease.

   Similar to CST1, CST2 has been found to be upregulated in allergic
   conditions, including AR.[136]^40 Our analysis identified CST2 as one
   of the keys differentially expressed genes in AR, with significant
   upregulation observed in both the primary dataset and external
   validation dataset. The qPCR results in NP69 cells further confirmed
   this trend, highlighting CST2's potential as a diagnostic marker. The
   role of CST2 in modulating protease activity and its impact on immune
   cell function, particularly in the context of mast cell activation,
   aligns with the observed immune cell infiltration patterns in
   AR.[137]^40 These findings suggest that CST2 may play a crucial role in
   the pathophysiology of AR, making it a valuable target for further
   research and potential therapeutic intervention. It is well known that
   serine protease inhibitors could be one of the treatments for
   inflammation.

   SERPINB4, also known as squamous cell carcinoma antigen 2, is a member
   of the serpin family of protease inhibitors.[138]^41 This gene has been
   associated with various inflammatory and immune responses, particularly
   in epithelial tissues.[139]^42 Our study identified SERPINB4 as one of
   the key genes upregulated in AR patients. The increased expression of
   SERPINB4 in AR samples, validated through qPCR in NP69 cells, suggests
   its involvement in the inflammatory process's characteristic of AR. The
   role of SERPINB4 in inhibiting serine proteases may contribute to the
   regulation of protease activity and the maintenance of epithelial
   integrity in the nasal mucosa.[140]^42 Additionally, the association of
   SERPINB4 with immune cell infiltration, particularly mast cells,
   underscores its significance in the immune response observed in AR.
   These findings highlight the potential of SERPINB4 as a diagnostic
   biomarker and a therapeutic target for AR, providing new avenues for
   understanding the molecular mechanisms underlying the disease.

   TOX, or thymocyte selection-associated high mobility group box protein,
   is a transcription factor involved in the regulation of T-cell
   development and function, and is associated with thymocyte selection,
   is expressed differently in all stages of T lymphocyte
   development.[141]^43 TOX is a nuclear deoxyribonucleic acid-binding
   factor[142]^44 that induces a set of chemokine genes in immune cells
   and modulates inflammation through the production of
   interleukin-10.[143]^45 However, TOX has not been previously associated
   with AR, suggesting it may be a valuable potential research target.
   Unlike the other identified genes, TOX was found to be downregulated in
   AR patients. This downregulation was consistent across both the primary
   and validation datasets, as well as in qPCR validation using NP69
   cells. The decreased expression of TOX in AR suggests a potential role
   in modulating immune responses, particularly in the context of T-cell
   function and differentiation. These findings indicate that TOX may play
   a protective role in the immune response associated with AR, and its
   downregulation could be a contributing factor to the disease's
   pathogenesis. Further research is needed to elucidate the precise
   mechanisms by which TOX influences AR and to explore its potential as a
   therapeutic target.

   It is interesting that 3 of the 4 genes are associated with protease.
   There was other evidence that ADAM10 and ADAM17 influenced allergic
   rhinitis,[144]^46 which are also transmembrane metalloproteases. It is
   well known that serine protease inhibitors were selected as treatments
   for inflammation, because proteases’ dysregulation may induce severe
   consequences, like inflammation of neurology, skin, lung and
   arthritis.[145]^47 It could be predicted that the inhibition of
   proteases may be a promising target of AR.

   The enrichment analysis results from our study provide significant
   insights into the biological processes and pathways involved in
   allergic rhinitis (AR). KEGG pathway analysis revealed that the
   differentially expressed genes (DEGs) are involved in pathways such as
   amoebiasis, ECM-receptor interaction, hematopoietic cell lineage, and
   estrogen signaling pathway. These pathways are crucial for
   understanding the underlying mechanisms of AR. For instance, the
   ECM-receptor interaction pathway is essential for cell adhesion and
   communication, which are critical in inflammatory responses and tissue
   remodeling observed in AR.[146]^48 The hematopoietic cell lineage
   pathway is involved in the development and differentiation of immune
   cells, which are pivotal in the immune response associated with
   AR.[147]^49 Additionally, the estrogen signaling pathway has been
   implicated in modulating immune responses and inflammation, suggesting
   a potential link between hormonal regulation and AR
   pathogenesis.[148]^50

   GO enrichment analysis further highlighted the involvement of DEGs in
   biological processes such as the negative regulation of proteolysis and
   regulation of endopeptidase activity. These processes are important for
   maintaining protease-antiprotease balance, which is often disrupted in
   inflammatory conditions like AR.[149]^51 The cellular component
   analysis indicated significant enrichment in apical and basal plasma
   membranes, suggesting that these genes may play roles in maintaining
   epithelial barrier integrity and function, which is often compromised
   in AR.[150]^52

   In conclusion, the enriched pathways and biological processes
   identified in this study provide a comprehensive understanding of the
   molecular mechanisms underlying AR. Future studies should focus on
   further elucidating the roles of these pathways and genes in AR, and
   collecting peripheral blood samples from patients to validate the
   diagnostic value of these 4 genes in large samples, in order to enhance
   our understanding and management of this condition.

   The immune infiltration analysis conducted using the CIBERSORT
   algorithm revealed significant differences in immune cell subtype
   proportions between healthy and AR groups, highlighting the role of
   immune cells in the pathogenesis of AR. Specifically, the analysis
   identified an increased proportion of resting mast cells in AR samples
   compared to controls. Mast cells are known to play a crucial role in
   allergic reactions by releasing histamine and other mediators that
   contribute to inflammation and symptom manifestation in AR
   patients.[151]^53 The elevated expression of CST1, CST2, and SERPINB4
   in resting mast cells suggests that these genes may be involved in the
   regulation of mast cell activity and, consequently, in the allergic
   response observed in AR.

   Certainly, our study has several limitations, such as using only ROC
   curves to predict the value of susceptible genes, small sample sizes,
   lack of validation cohorts, and absence of AR severity data, which may
   affect our conclusions.

Conclusion

   In summary, this study successfully identified differentially expressed
   genes associated with allergic rhinitis and visualized these findings
   through heatmaps and volcano plots. WGCNA analysis revealed gene
   modules closely related to AR, and GO and KEGG enrichment analyses
   highlighted key biological processes and pathways involved in AR
   mechanisms. Five machine learning algorithms identified 4 key
   diagnostic genes, and a predictive model was developed. The CIBERSORT
   algorithm demonstrated significant differences in immune cell subtype
   proportions between healthy and AR groups. The ROC curve and AUC values
   indicated high accuracy of the classification model. Finally, real-time
   quantitative PCR validated the differential expression of key genes in
   the NP69 cell line. These findings provide valuable insights into the
   molecular mechanisms of AR and offer potential biomarkers for diagnosis
   and therapeutic targets. Future studies should aim to address the
   current limitations by incorporating larger sample sizes, clinical
   validation, and wet lab experiments to further substantiate these
   findings.

Abbreviations

   AR, allergic rhinitis; CD109, C3 And PZP-Like Alpha-2-Macroglobulin
   Domain-Containing Protein 7; CD14, lipopolysaccharide receptor; CDH26,
   cadherin-Like Protein 26; CLC, charcot-Leyden Crystal Protein; CST1,
   cystain-SA-I; CST2, cystatin 2; CYP2E1, cytochrome P450 Family 2
   Subfamily E Member 1’GCNT3, glucosaminyl (N-Acetyl) Transferase 3; GEO,
   gene Expression Omnibus; HGD, homogentisate 1,2-Dioxygenase; KDELR3,
   KDEL Endoplasmic Reticulum Protein Retention Receptor 3; LMRGs, lipid
   metabolism-related genes; PCSK6, proprotein Convertase Subtilisin/Kexin
   Type 6; SERPINB4, serpin Family B Member 4; SLC16A9, solute Carrier
   Family 16 Member 9; SLC39A8, solute Carrier Family 39 Member 8; SLC7A1,
   solute Carrier Family 7 Member 1;SPDEF, SAM Pointed Domain Containing
   ETS Transcription Factor; STAT1, signal transducer and activator of
   transcription 1; TLR4 , toll-like receptor 4; TOX , thymocyte Selection
   Associated High Mobility Group Box; TRPM6, transient Receptor Potential
   Cation Channel Subfamily Member 6; UPK1B, uroplakin 1B; WGCNA, weighted
   gene co-expression network analysis

Availability of data and materials

   The datasets analyzed during the current study are available in the
   Gene Expression Omnibus (GEO) website, numbers were [152]GSE19187 and
   [153]GSE46171.

Author contributions

   HL and YC designed the study. QC, JF and YL analyzed the data and wrote
   the primary manuscript. QC and JF prepared the figures and tables from
   the literature. All authors reviewed and approved the final manuscript.

Ethics approval and consent to participate

   Not applicable.

Authors' consent for publication

   We guarantee our manuscript is original, has not been published before,
   is not currently being considered for publication elsewhere, and has
   not been posted to a preprint server. The publication is approved by
   all authors and that, if accepted, it will not be published elsewhere
   in the same form, in English or in any other language, without the
   written consent of the copyright-holder.

Funding

   This study was supported by the Taihu Lake talent plan (Top-Level, no.
   2020THRC-GD-7), the 333 project of Jiangsu Province in 2022 (ZUZHIBU
   202221001).

Declaration of competing interest

   The authors declare that they have no competing interests.

Acknowledgment