Abstract

   Rheumatoid arthritis (RA) is a common autoimmune disease that can lead
   to severe joint damage and disability. And early diagnosis and
   treatment of RA can avert or substantially slow the progression of
   joint damage in up to 90% of patients, thereby preventing irreversible
   disability. Previous research indicated that 50% of the risk for the
   development of RA is attributable to genetic factors, but the
   pathogenesis is not well understood. Thus, it is urgent to identify
   biomarkers to arrest RA before joints are irreversibly damaged. Here,
   we first use the Robust Rank Aggregation method (RRA) to identify the
   differentially expressed genes (DEGs) between RA and normal samples by
   integrating four public RA patients’ mRNA expression data.
   Subsequently, these DEGs were used as the input for the weighted gene
   co-expression network analysis (WGCNA) approach to identify RA-related
   modules. The function enrichment analysis suggested that the RA-related
   modules were significantly enriched in immune-related actions. Then the
   hub genes were defined as the candidate genes. Our analysis showed that
   the expression levels of candidate genes were significantly associated
   with the RA immune microenvironment. And the results indicated that the
   expression of the candidate genes can use as predictors for RA. We hope
   that our method can provide a more convenient approach for the early
   diagnosis of RA.

   Subject terms: Data mining, Databases, Functional clustering,
   High-throughput screening, Microarrays, Network topology,
   Bioinformatics

Introduction

   Rheumatoid arthritis (RA) is an autoimmune disease, that typically
   causes damage to the synovial membrane, cartilage, and bone and
   includes a wide range of symptoms, such as painful and swollen
   joints^[32]1. Besides intra-articular symptoms, RA may also be
   accompanied by other extra-articular symptoms, including those of
   cardiovascular, pulmonary, metabolic, and musculoskeletal
   diseases^[33]2. In the past decade, with the advent of novel
   therapeutics, the introduction of early therapy, the development of new
   classification criteria, and the application of new effective treatment
   strategies, the therapeutic revolution has dramatically improved
   articular and systemic outcomes^[34]3. Targeted therapies such as
   Methotrexate and biological reagents are important as first-line RA
   medications^[35]1. These therapies, coupled with emerging
   bioengineering and cell therapy result in a diverse group of multi-drug
   treatment regimens^[36]4. Despite advances in targeted biological and
   pharmacologic interventions that have recently come to market, many
   patients with RA continue to have inadequate responses to therapies, or
   intolerable side effects, with the resultant progression of their
   disease^[37]5. Considering many patients still suffer from uncontrolled
   symptoms, it is an urgent need for us to understand the pathogenesis
   and develop more precise biomarkers for early diagnosis.

   Recently, high-throughput sequencing technologies have been used for
   whole-genome sequencing (WGS) in synovial tissues or peripheral blood
   from RA patients, which facilitates disease pathobiology research and
   novel biomarker identification^[38]6. Researchers suggest that early
   diagnosis and treatment of RA can avert or substantially slow the
   progression of joint damage in up to 90% of patients, thereby
   preventing irreversible disability^[39]7. Previous research indicated
   that 60% of the risk for the development of RA is attributable to
   genetic factors^[40]8. Thus, better utilization of these
   high-throughput sequencing data may provide us with a basis for an
   in-depth understanding of pathogenesis. In terms of the pathogenesis of
   RA, the uncontrolled hyperplasia of synovial fibroblasts and
   macrophages play a vital role in the destruction of cartilage and
   bone^[41]9,[42]10. Pro-inflammatory cytokines secreted by fibroblasts
   and infiltrated immune cells can gradually cause cartilage
   degeneration^[43]11,[44]12. Since RA is an autoimmune disease, it is
   necessary to further explore the immune infiltration of the samples.
   Recently, there have been several algorithms that can be used to infer
   the patient’s immune microenvironment using mRNA expression data, which
   may also provide a new perspective for us to have a deeper
   understanding of the pathogenesis of RA.

   Here, we first use the Robust Rank Aggregation method (RRA) to identify
   the differentially expressed genes (DEGs) between RA and normal samples
   by integrating four public RA patients mRNA expression data.
   Subsequently, these DEGs were used as the input for the weighted gene
   co-expression network analysis (WGCNA) approach to identify RA-related
   modules. The function enrichment analysis suggested that the RA-related
   modules were significantly enriched in immune-related actions. Then the
   hub genes were defined as the candidate genes. Our analysis showed that
   the expression levels of candidate genes were significantly associated
   with the RA immune microenvironment and could be used as predictors for
   RA. We hope that our method can provide a more convenient approach for
   clinical RA early diagnosis.

Methods

Data source

   The gene expression datasets for RA were retrieved from the Gene
   Expression Omnibus (GEO) database ([45]https://www.ncbi.nlm.nih.gov/).
   The study type was restricted to expression profiling by array and
   human species. We used the following words to search datasets:
   “rheumatoid arthritis,” “RA,” “Homo sapiens,” “Microarray,” and
   “high-throughput sequencing”. The following inclusion criteria were
   used: (1) the mRNA profiling of RA patients and controls involved in
   the datasets was detected by a high-throughput array or next-generation
   sequencing. (2) datasets containing more than ten RA specimens and
   healthy controls, respectively; (3) total RNA was extracted from human
   synovial tissues of the knee joint. As a result, five GEO datasets
   ([46]GSE12021, [47]GSE55235, [48]GSE55457, [49]GSE77298, and
   [50]GSE89408) were selected. Then the gene expression profiles and
   corresponding clinical annotation information were downloaded for
   further analysis. The detailed information on the datasets used in our
   study was listed in Table [51]1.

Table 1.

   Summary of the RA datasets involved in our study.
  GSE accession Source types Platform     Analysis type Total number Control
   RA
  [52]GSE77298  Synovium     [53]GPL570   Array         23           7       16
  [54]GSE55457  Synovium     [55]GPL96    Array         23           10      13
  [56]GSE55235  Synovium     [57]GPL96    Array         20           10      10
  [58]GSE12021  Synovium     [59]GPL96    Array         21           9       12
  [60]GSE89408  Synovium     [61]GPL11154 RNA-seq       180          28      152
   [62]Open in a new tab

Differential gene expression analysis

   Previous research indicated that 60% of the risk for the development of
   RA is attributable to genetic factors^[63]8. To explore the mechanism
   of RA, differential expression analysis was performed between RA and
   control patients. Robust rank aggregation (RRA) was used to integrate
   the four datasets before differentially expressed genes (DEGs)
   identification using the “RobustRankAggreg” package in R (version
   3.6.3; [64]https://www.r-project.org/). Significant probability values
   obtained were analyzed for multiple testing using Bonferroni correction
   (p.adj). Then the fold change (FC) value was also calculated. The genes
   with adjust P value (p.adj) < 0.05 and |log2 (FC)| > 1 were identified
   as DEGs between RA and control patients, including up-regulated
   significant DEGs and down-regulated significant DEGs.

Functional enrichment analysis and protein–protein interaction analysis

   To explore the possible biological functions involved in DEGs, the Gene
   Ontology (GO) functional analysis and Kyoto Encyclopedia of Genes and
   Genomes (KEGG) pathway enrichment analysis were performed using the
   “clusterProfiler” package in R software^[65]13,[66]14. The GO
   classification includes biological process (GO-BP), cell component
   (GO-CC), and molecular function (GO-MF). As for both GO and pathway
   enrichment analyses, a P value < 0.05 was considered as statistically
   significant, respectively. To further screen the possible hub genes or
   proteins that might be critical in the progression of RA, a PPI network
   was generated by the GeNets website, a web platform for network-based
   analysis of genetic data ([67]http://apps.broadinstitute.org/genets).
   The hub genes were determined from the network by using the cytoHubba
   plugin with Cytoscape software (version 3.8.2).

The weighted gene co-expression network analysis

   Due to the heterogeneous and varying disease course, RA classification
   is critical for the clinical management of patients. Although
   identification of DEGs is necessary, determining their interconnection
   is also important. Correlation networks are increasingly being used in
   bioinformatics applications, with weighted co-expression networks
   (WGCNA) commonly used to describe the molecular mechanism and
   reconstruct co-expression networks of genes in different samples. The
   “WGCNA” package in R was used for gene co-expression network analysis.
   As input data, four RA expression profiles were first integrated into
   one using the “combat” function in the “sva” package. Then the DEGs
   between RA and control patients were identified using the “limma”
   package in R, and the genes with p < 0.05 and |log FC| > 0.25 were
   considered to be DEGs. The expression of DEGs was used as the input
   data for the WGCNA. A power of β = 4, a scale-free R2 = 0.85, a cut
   height of 0.25, and a minimal module size of 50, were selected as
   soft-threshold parameters to ensure a signed scale-free co-expression
   gene network. Finally, a total of 12 modules were generated. Among
   these modules, the blue module was highly correlated with clinical
   traits. We further set gene significance (GS) > 0.5 and module
   membership (MM) > 0.8 to define the candidate genes in WGCNA.

Immune microenvironment infiltration in RA

   Increasing evidence showed that, the uncontrolled hyperplasia of
   synovial fibroblasts and macrophages plays a vital role in the
   destruction of cartilage and bone in terms of the pathogenesis of
   RA^[68]9,[69]10. And pro-inflammatory cytokines secreted by fibroblasts
   and infiltrated immune cells can gradually cause cartilage
   degeneration^[70]11,[71]12. Infiltrating stromal and immune cells form
   the major fraction of normal cells in tumour tissue and not only
   perturb the tumour signal in molecular studies but also have an
   important role in cancer biology. Thus, ESTIMATE (Estimation of Stromal
   and Immune cells in MAlignant Tumor tissues using Expression data) was
   first performed to estimate tumor purity and the presence of
   infiltrating stromal cells/immune cells in tumor tissues utilizing gene
   expression according to the enrichment analysis of single sample gene
   set^[72]15. Next, we further explore the immune microenvironment
   infiltration in RA patients. Immune infiltration analysis by
   single-sample GSEA (ssGSEA) was used to evaluate 24 types of immune
   cells that may infiltrate into the RA immune microenvironment^[73]16.
   Then the ssGSEA scores, which represent the abundance of immune cell
   infiltration, were calculated using the “GSVA” Bioconductor package.
   And the Wilcoxon rank-sum test was applied to compare the immune
   infiltration between RA and control patients. To investigate the
   relationship between the expression of hub genes and immune cell
   infiltration, the Spearman’s correlation analysis was performed.
   Furthermore, to investigate the predictive ability of the hub genes,
   the receiver operating characteristic (ROC) curve analysis was
   performed in the training and validation cohorts. To further validate
   the predictive value of the hub genes, another external validation
   cohort was downloaded, and ROC curve analysis was performed.

Statistical analysis

   All statistical analysis was conducted with R (version 3.6.3)
   ([74]https://cran.r-project.org/bin/windows/base/old/3.6.3/). The
   following packages were employed: clusterProfiler, limma, GSVA, WGCNA,
   pROC, sva. P < 0.05 was considered to indicate a statistically
   significant difference.

Results

Rheumatoid arthritis microarray datasets

   We use network analysis to discover effective biomarkers for RA
   diagnosis. A detailed flowchart of our study is shown in Supplementary
   Fig. [75]S1. According to the eligibility criteria, five
   high-throughput datasets of RA were obtained from the GEO database in
   the present study. The detailed information on the datasets used in our
   study was listed in Table [76]1. The five cohorts include: (1)
   [77]GSE77298, consisting of 16 RA patients and 7 controls; (2)
   [78]GSE55457, consisting of 13 RA patients and 10 controls; (3)
   [79]GSE55235, consisting of 10 RA patients and 10 controls; (4)
   [80]GSE12021, consisting of 12 RA patients and 9 controls; (5)
   [81]GSE89408, consisting of 152 RA patients and 28 controls. There were
   a total of 203 RA patients and 64 controls involved in our study.
   Considering the number of samples in the prior four datasets was small,
   thus the four datasets were integrated as the training set, and the
   [82]GSE89408 was used as the validation set.

Identification of the robust differentially expressed genes

   To explore the mechanism of RA, differential expression analysis was
   performed between RA and control patients. The RRA analysis showed that
   there were a total of 184 DEGs identified between RA and control
   patients, including 136 up-regulated genes and 48 down-regulated genes
   (Supplementary Table [83]S1). Among the up-regulated genes, CXCL13 was
   the most differently expressed up-regulated gene (p = 2.37E−14,
   adjusted p = 5.12E−10), followed by IGLV1-44 (p = 7.49E−14, adjusted
   p = 1.62E−09). Otherwise, FOSB (p = 1.27E−11, adjusted p = 2.74E−07)
   and FKBP5 (p = 8.10E−10, adjusted p = 1.75E−05) were the most
   differently expressed down-regulated genes in the RRA analysis. The top
   20 up-regulated and down-regulated genes in RA were illustrated by a
   heatmap (Supplementary Fig. [84]S2A).

Functional enrichment analysis of differentially expressed genes

   To explore the possible biological functions involved in DEGs, the Gene
   Ontology (GO) functional analysis and the Kyoto Encyclopedia of Genes
   and Genomes (KEGG) pathway enrichment analysis were performed. To put
   the DEGs into a biological context, significant enriched GO terms were
   identified according to their P value and enrichment factor
   (Supplementary Table [85]S2). The top 10 significant enriched GO terms
   are summarized in Fig. [86]1A. As for BP, the DEGs were significantly
   enriched in the processes of the immune response, including antigen
   receptor-mediated signaling pathway, immune response-activating cell
   surface receptor signaling pathway, immune response-activating signal
   transduction, T cell activation, and B cell activation. In the CC
   category, DEGs were also significantly enriched in immune-related CC
   terms, such as external side of plasma membrane, immunological synapse,
   immunoglobulin complex, and T cell receptor complex. As expected, DEGs
   were also significantly enriched in immune-related MF terms, including
   antigen binding, chemokine activity, immunoglobulin receptor binding,
   and immune receptor activity. Moreover, the KEGG pathway enrichment
   analysis also indicated that the DEGs were significantly enriched in
   immune-related pathways, for example, Th17 cell differentiation, Th1
   and Th2 cell differentiation, primary immunodeficiency, IL-17 signaling
   pathway, and T cell receptor signaling pathway (Fig. [87]1B and
   Supplementary Table [88]S3). The enrichment analysis suggested that the
   DEGs play an important role in immune response-related biological
   functions.

Figure 1.

   [89]Figure 1
   [90]Open in a new tab

   Function enrichment analysis. (A) GO functional analysis of the DEGs
   between RA and control. (B) KEGG pathway analysis of the DEGs between
   RA and control.

   Next, we performed a network analysis of DEGs using the GeNets. GeNets
   defines “communities” as sets of genes with a high degree of
   connectivity based on a random forest classifier trained on datasets of
   established protein–protein interactions. The network analysis
   clustered DEGs into three distinct communities (Supplementary Fig.
   [91]S2B). We found that the most differently up-regulated gene, CXCL13,
   was also identified in the communities, highlighting that the gene with
   a greater different expression and it may play a vital role in the
   occurrence and development of the disease. The greater the node degree,
   the more edges this node has, and therefore could exert greater
   influence on the network, indicating the more biological functions they
   participated in. Thus, the top 20 genes based on the degree centrality
   included CCR5, LCK, CCL5, CXCR3, CXCL13, SDC1, ITGA4, SPP1, MMP3,
   ADAM28, DDX3Y, ACACB, EGR1, CD247, TNFSF11, IGHM, ITGB2, LDB3, RRM2,
   MS4A1, which were defined as candidate genes for further analysis.

Weighted gene co-expression network analysis

   It has been reported that genes are not independent of each other,
   co-expressed genes may have similar biological functions, and the
   effect of grouping genes is relatively strong^[92]17. WGCNA algorithm,
   which has been widely used for studying network changes, can be used to
   identify network topologies and sub-networks (called modules), using
   topological overlap dissimilarity as a measure of the distance between
   genes. Considering the complexity of RA pathogenesis, the WGCNA
   algorithm was performed to identify biologically significant gene
   modules and to better understand genes associated with RA pathogenesis
   (Fig. [93]2). Due to the heterogeneous and varying disease course, RA
   classification is critical for the clinical management of patients.
   Although identification of DEGs is necessary, determining their
   interconnection is also important. Thus, the DEGs with p < 0.05 and
   logFCs > 0.25 were selected as the input data for WGCNA to generate the
   gene co-expression modules. Applying a soft-thresholding power of 4
   (scale-free R2 = 0.85) (Fig. [94]2B,C) and cutting height as 0.25
   (Fig. [95]3B), 12 RA-related modules were identified (Fig. [96]3A).

Figure 2.

   [97]Figure 2
   [98]Open in a new tab

   Adjacency function parameters and construction of WGCNA modules. (A)
   Sample clustering to detect outliers. (B) The mean connectivity of
   eigengenes. (C) Scale independence of eigengenes. The red line
   represents the correlation coefficient square (r2) and mean
   connectivity of eigengenes under a soft threshold power of 4. (D)
   Cluster dendrogram of the DEGs. Highly similar modules are identified
   by clustering and then merged dynamically.

Figure 3.

   [99]Figure 3
   [100]Open in a new tab

   WGCNA modules analysis. (A) Heatmap of the WGCNA module correlations
   with RA and normal clinical traits. (B) Cluster dendrogram of the
   module eigengenes. The dissimilarity of module eigengenes is calculated
   to merge some similar modules with a height cut-off value of 0.25. (C)
   Scatter plot of the gene significance for the RA vs. the module
   membership in the purple module.

   In the WGCNA algorithm, the hub module was identified by the highest
   correlation between the module eigengene and the clinical traits, as
   well as the most significant correlation between the module membership
   and the gene significance. Then the relationship between the modules
   and the clinical traits was evaluated to identify the hub modules
   (Supplementary Figs. [101]S3 and [102]S4). The result showed that the
   blue module was significantly associated with RA (correlation
   coefficient = 0.77, p = 3e−08) (Fig. [103]3C). Consequently, the blue
   module was identified as the hub module in the WGCNA network and we
   selected the top 20 genes based on the gene significance and module
   membership as the candidate genes: ADAMDEC1, TRBC1, CD27, SEL1L3, LCK,
   IGLL5, IGKC, IGLJ3, CD3D, CXCL13, TNFRSF17, IGHM, CD2, MS4A1, TRAF3IP3,
   HLA-DOB, IGLV1-44, TRAC, TPD52, and PSMB9 (Supplementary Fig.
   [104]S2C).

Identification of hub genes and diagnostic efficacy verification

   Then the top 20 candidate genes in the PPI and WGCNA networks were
   overlapped to identify as the common hub genes, including LCK, CXCL13,
   IGHM, and MS4A1 (Supplementary Fig. [105]S2C).

   Then receiver operating characteristic (ROC) curve analysis was
   performed to evaluate the individual predictive power of the four
   common hub genes. The results show that the LCK has the highest AUC
   value (AUC = 0.773), followed by CXCL13 (AUC = 0.771), IGHM
   (AUC = 0.757), and MS4A1 (AUC = 0.739) (Fig. [106]4), suggesting that
   the four common hub genes have better predictive abilities.

Figure 4.

   [107]Figure 4
   [108]Open in a new tab

   ROC curve analysis in the training cohort. (A) ROC curves of the
   expression of CXCL13 to predict RA. (B) ROC curves of the expression of
   IGHM to predict RA. (C) ROC curves of the expression of LCK to predict
   RA. (D) ROC curves of the expression of MS4A1 to predict RA.

   To further validate the predictive ability of the common hub genes, the
   ROC curve analysis was performed in the [109]GSE89408 dataset. The
   results showed that the four common hub genes also existed good
   performance (CXCL13: AUC = 0.899; IGHM: AUC = 0.874; LCK: AUC = 0.772;
   MS4A1: AUC = 0.877; Fig. [110]5). Moreover, another rheumatoid
   arthritis external validation cohort was obtained from GEO database to
   further validate the predictive power of four common hub genes. In the
   external validation cohort ([111]GSE121894), which consisted of 36
   rheumatoid arthritis and 22 control patients, the ROC curve analysis
   showed that the four common hub genes possess a good predictive ability
   (CXCL13: AUC = 0.789; IGHM: AUC = 0.726; LCK: AUC = 0.622; MS4A1:
   AUC = 0.652; Supplementary Fig. [112]S5). These findings indicated that
   the four common hub genes may serve as effective biomarkers for RA
   diagnosis.

Figure 5.

   [113]Figure 5
   [114]Open in a new tab

   ROC curve analysis in the validation cohort. (A) ROC curves of the
   expression of CXCL13 to predict RA. (B) ROC curves of the expression of
   IGHM to predict RA. (C) ROC curves of the expression of LCK to predict
   RA. (D) ROC curves of the expression of MS4A1 to predict RA.

Correlation between the biomarker genes and immune-related features

   Previous studies indicated that RA is an autoimmune disease, and the
   pro-inflammatory cytokines secreted by fibroblasts and infiltrated
   immune cells can gradually cause cartilage degeneration^[115]9,[116]10.
   Therefore, exploring the correlation between the biomarker genes and
   immune-related features is important for our further understanding of
   the pathogenesis of RA. Then, we evaluated the immune state between RA
   and control samples with immune score and stromal score to estimate
   infiltrating overall immune cells and stromal cells in patients, which
   were calculated by the ESTIMATE algorithm. The results showed that the
   RA patients have significantly higher immune and stromal scores than
   that of control cases, as well as the ESTIMATE score (Fig. [117]6). The
   ESTIMATE results proved that there was indeed a large difference in
   immune microenvironment between RA samples and control samples. To
   further investigate which immune cells are responsible for the
   difference and the potential mechanism of RA. We estimated the
   infiltration of 24 immune cells using the ssGSEA method via the “GSVA”
   package in R. Compared with control samples, the RA samples had
   significantly higher infiltration of T cells, aDC, and B cells
   (Fig. [118]7A). Moreover, the Pearson correlation analysis was
   performed and the results indicated that there were 11 immune cells
   positively associated with the expression of four biomarker genes,
   including aDC, B cells, CD8 T cells, Cytotoxic cells, DC, NK CD56
   bright cells, T cells, T helper cells, Tcm cells, Tfh cells, and Th2
   cells. Conversely, the infiltration of NK cells was negatively
   associated with the expression of target genes (Fig. [119]7B).

Figure 6.

   Figure 6
   [120]Open in a new tab

   ESTIMATE analysis was performed between RA and controls. (A) Comparison
   of the ESTIMATE sore between RA and controls. (B) Comparison of the
   immune sore between RA and controls. (C) Comparison of the stromal sore
   between RA and controls. (D) Comparison of the tumor purity between RA
   and controls.

Figure 7.

   [121]Figure 7
   [122]Open in a new tab

   Correlation between the biomarker genes and immune infiltration. (A)
   Heatmap of the infiltration of 24 immune cells between the RA and
   controls drawn by the “pheatmap” package in R (version 3.6.3)
   ([123]http://cran.r-project.org/bin/windows/base/old/3.6.3/). (B)
   Correlation between the biomarker genes and 24 immune cells
   infiltration.

Conclusion

   In summary, our study integrated relatively large sample size data to
   uncover the transcriptomic diversities between normal and RA synovial
   tissue and identified 4 essential genes (LCK, MS4A1, CXCL13, IGHM) by
   bioinformatics analysis. Moreover, the findings in the present research
   may provide deepening insights into the pathogenesis of RA and explore
   useful immunotherapeutic targets of RA susceptibility.

Discussion

   Rheumatoid arthritis (RA) is an autoimmune disease of the joints, that
   is characterized by the destruction of cartilage and bone^[124]9.
   Despite the advances in the treatment of RA, the clinical course of
   this disorder is marked by a continuous pattern of relapse and
   remission^[125]3. In recent years, rapid advances in bioinformatics
   have enabled us to screen out novel biomarkers. Therefore, identifying
   more novel immune-related biomarkers that could be vital in the
   diagnosis, treatment, and prognosis of RA is urgent.

   In our study, we first used the Robust Rank Aggregation method (RRA) to
   identify the differentially expressed genes (DEGs) between RA and
   normal samples by integrating four public RA patients’ mRNA expression
   data. Subsequently, these DEGs were used as the input for the WGCNA
   approach to identify RA-related modules. The function enrichment
   analysis suggested that the RA-related modules were significantly
   enriched in immune-related functions. Then the four hub genes were
   defined as the biomarker genes (LCK, MS4A1, CXCL13, IGHM), which
   possess a good predictive ability of RA. Furthermore, our analysis also
   found that the expression levels of candidate genes were significantly
   associated with the RA immune microenvironment. The exploration of
   potential biomarker genes of RA may help to identify novel therapeutic
   targets of rheumatoid arthritis susceptibility. We hope that our method
   can provide a more convenient approach for clinical RA early diagnosis.

   Increasing studies indicated that the DEGs, such as CXCL13, IGLV1-44,
   FOSB and FKBP5, were associated with the development of RA. It is
   well-known that chemokine signaling networks with the immune
   microenvironment are highly versatile and context-dependent: exert both
   pro-tumoral and anti-tumoral activities^[126]16. The CXCL13, and its
   cognate receptor CXCR5, represent an emerging example of chemokine
   signaling axes, which express the ability to modulate tumor growth and
   progression in other ways^[127]18, indicating the CXCL13 may also play
   an important role in the development of RA. Ye et al. suggested that
   IGLV1-44 might play a key role in the synovial membrane of
   osteoarthritis^[128]19. FosB recruits KAT5 to potentiate the growth and
   metastasis of papillary thyroid cancer in a DPP4-dependent
   manner^[129]20. The researcher also suggested that higher FKBP5
   promotes inflammation by strengthening the interactions of NF-κB
   regulatory kinases, whereas opposing FKBP5 either by genetic deletion
   (CRISPR/Cas9-mediated) or selective pharmacological inhibition
   prevented the effects on NF-κB^[130]20.

   The results of GO and KEGG assays indicated that 184 DEGs were mainly
   involved in the processes of the immune response. To further explore
   potential diagnostic biomarkers for RA, we performed PPI analysis and
   screened out the top 20 genes using the stress method. Previous
   research suggested that most of the top 20 genes were associated with
   the occurrence and development of the disease. For example, Hemmatazad
   et al. indicated that CCR5 is a potential therapeutic target for cancer
   and its higher expression is associated with poor outcomes in various
   malignancies and cells expressing CCR5 modulate the immune response and
   tumor progression^[131]21. Studies have shown that the LCK regulates
   the initiation of TCR signaling, T-cell development, and T-cell
   homeostasis^[132]21.

   Correlation networks are increasingly being used in bioinformatics
   applications, with WGCNA commonly used to describe the molecular
   mechanism and reconstruct co-expression networks of genes in different
   samples. Therefore, using network analysis may lead to the discovery of
   effective biomarkers for RA diagnosis. Herein, we eventually obtained
   four hub genes (LCK, MS4A1, CXCL13, IGHM) from the intersection of the
   PPI and WGCNA methods.

   Increasing evidence indicated that the four common hub genes were
   associated with RA^[133]22–[134]25. For example, Hu et al. found that
   the IgD-Fc-Ig fusion protein restrains the activation of T and B cells
   by inhibiting IgD-IgDR-Lck signaling in rheumatoid arthritis^[135]22.
   The CXCL13/CXCR5 axis facilitates endothelial progenitor cell homing
   and angiogenesis during rheumatoid arthritis progression^[136]23. And
   the MS4A1 has been already developed as a target for RA
   treatment^[137]24,[138]25. These studies suggested that the common hub
   genes may play important roles in the pathogenesis of RA. Thus we only
   focused on the common hub genes, considering the strong connection
   between the common hub genes and the other hub genes in the PPI and
   WGCNA networks.

   Previous studies also indicated that the basic feature of RA is the
   body’s immune system disorders, in which autoreactive CD4 + T cells,
   pathogenic B cells, M1 macrophages, inflammatory cytokines, chemokines
   and autoantibodies abnormally increase in the body of RA
   patients^[139]26. And the B cell depletion therapy has well proved the
   important role of B cells in the pathogenesis of RA^[140]27,[141]28.
   Fathollahi et al. found that the NK cells play an important role in the
   pathogenesis of RA^[142]29. These findings suggested that the
   expression of biomarker genes was associated with the patient’s immune
   microenvironment, and thus partly explains why the biomarker genes have
   better predictive power.

   The advantage of our study is that we used the network analysis
   approach to identify potential RA biomarkers, and we also
   comprehensively analyzed the correlation between the biomarker genes
   and the patients’ immune microenvironment. Considering the limitations
   of RA high-throughput omics data, our method should be further
   validated in more RA patient cohorts in the future.

Supplementary Information

   [143]Supplementary Information 1.^ (4.4MB, docx)
   [144]Supplementary Table S2.^ (45.4KB, xlsx)

Author contributions

   Y.A. wrote the manuscript and revised it. Z.W. and J.H. performed data
   analysis and data interpretation. M.Y. and Y.A. conducted the
   bioinformatics and statistical analyses and prepared Figs. [145]1,
   [146]2, [147]3, [148]4, [149]5, [150]6, and [151]7. W.Z. supervised the
   project and revised the manuscript. All authors reviewed the
   manuscript.

Data availability

   The datasets supporting the conclusions of this article are available
   in the GEO (Gene Expression Omnibus)
   ([152]https://www.ncbi.nlm.nih.gov/geo/) repository.
   [[153]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE77298/GSE55
   457/GSE55235/GSE12021/GSE89408].

Competing interests

   The authors declare no competing interests.

Footnotes

   Publisher's note

   Springer Nature remains neutral with regard to jurisdictional claims in
   published maps and institutional affiliations.

   These authors contributed equally: You Ao and Wei Zhang.

Supplementary Information

   The online version contains supplementary material available at
   10.1038/s41598-023-29153-3.

References