Abstract

Background

   Rheumatoid arthritis (RA) is a chronic systemic autoimmune disease
   characterized by inflammatory cell infiltration, which can lead to
   chronic disability, joint destruction and loss of function. At present,
   the pathogenesis of RA is still unclear. The purpose of this study is
   to explore the potential biomarkers and immune molecular mechanisms of
   rheumatoid arthritis through machine learning-assisted bioinformatics
   analysis, in order to provide reference for the early diagnosis and
   treatment of RA disease.

Methods

   RA gene chips were screened from the public gene GEO database, and
   batch correction of different groups of RA gene chips was performed
   using Strawberry Perl. DEGs were obtained using the limma package of R
   software, and functional enrichment analysis such as gene ontology
   (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), disease ontology
   (DO), and gene set (GSEA) were performed. Three machine learning
   methods, least absolute shrinkage and selection operator regression
   (LASSO), support vector machine recursive feature elimination (SVM-RFE)
   and random forest tree (Random Forest), were used to identify potential
   biomarkers of RA. The validation group data set was used to verify and
   further confirm its expression and diagnostic value. In addition,
   CIBERSORT algorithm was used to evaluate the infiltration of immune
   cells in RA and control samples, and the correlation between confirmed
   RA diagnostic biomarkers and immune cells was analyzed.

Results

   Through feature screening, 79 key DEGs were obtained, mainly involving
   virus response, Parkinson's pathway, dermatitis and cell junction
   components. A total of 29 hub genes were screened by LASSO regression,
   34 hub genes were screened by SVM-RFE, and 39 hub genes were screened
   by Random Forest. Combined with the three algorithms, a total of 12 hub
   genes were obtained. Through the expression and diagnostic value
   verification in the validation group data set, 7 genes that can be used
   as diagnostic biomarkers for RA were preliminarily confirmed. At the
   same time, the correlation analysis of immune cells found that γδT
   cells, CD4^+ memory activated T cells, activated dendritic cells and
   other immune cells were positively correlated with multiple RA
   diagnostic biomarkers, CD4^+ naive T cells, regulatory T cells and
   other immune cells were negatively correlated with multiple RA
   diagnostic biomarkers.

Conclusions

   The results of novel characteristic gene analysis of RA showed that
   KYNU, EVI2A, CD52, C1QB, BATF, AIM2 and NDC80 had good diagnostic and
   clinical value for the diagnosis of RA, and were closely related to
   immune cells. Therefore, these seven DEGs may become new diagnostic
   markers and immunotherapy markers for RA.

   Keywords: Rheumatoid arthritis, Biomarkers, Diagnostic genes, Immune
   cells, Machine learning, Bioinformatics

1. Introduction

   Rheumatoid arthritis is an autoimmune disease dominated by inflammatory
   arthritis [[45]1]. It is characterized by multi-joint, symmetrical, and
   invasive arthritis of the small joints of the hand and foot. In the
   course of rheumatoid arthritis, the immune system attacks the joint
   tissue, often accompanied by extra-articular organ involvement and
   serum rheumatoid factor positive, which can eventually lead to chronic
   disability, joint destruction and loss of function [[46][2], [47][3],
   [48][4]]. According to the 2017 Global Burden of Disease (GBD)
   epidemiological report, the global prevalence of RA is 0.27 % [[49]5].
   The actual cause of RA is still unknown, which has a serious impact on
   individuals and society. However, early diagnosis of RA disease
   provides an opportunity for effective therapeutic response. At present,
   rheumatoid factor (RF) and anti-cyclic citrullinated peptide antibody
   (ACPA) are used as serum biomarkers for early diagnosis of RA [[50]6].
   However, these two serum markers are not positive in all early RA
   patients. Therefore, it is of great significance to find novel and
   feasible biomarkers for early diagnosis and effective treatment of RA
   [[51]7,[52]8].

   Recent studies have shown that immune cells interfere with the
   information transmission between cells in the immune response of RA,
   and stimulate cells to move to inflammation, infection and trauma
   sites, participate in the whole process of RA occurrence, development
   and repair, and play a crucial role in the pathological changes caused
   by RA [[53]9,[54]10]. Among them, B cells secrete rheumatoid factors
   that recognize immunoglobulin Fe segments and form immune complexes,
   and can release chemokines and fix complements, so that inflammatory
   cells accumulate in the joints of patients [[55]11]. Macrophages are
   abundant in cartilage tissue and synovium of RA patients. When
   activated, they can overexpress major histocompatibility complex (MHC)
   II molecules, chemokines and inflammatory factors, which are closely
   related to the degree of joint injury and clinical symptoms of patients
   [[56]12]. The number of neutrophils increases significantly during RA
   activity and is always activated [[57]13]. T cells bind to MHC II and
   antigen peptides, activate macrophages, release pro-inflammatory
   cytokines, activate synovial fibroblasts and chondrocytes, and secrete
   a variety of enzymes that degrade glycoproteins and collagen, thereby
   destroying tissues [[58]14]. Th17 cells can induce synovial matrix and
   innate lymphocytes to secrete granulocyte-macrophage colony stimulating
   factor (MG-CSF), thereby triggering and enhancing RA [[59]15].
   Therefore, exploring the correlation between RA and immune cells from
   the perspective of the immune system is of great significance for
   elucidating the molecular system of RA and finding new diagnostic
   biomarkers and immunotherapy targets.

   Based on the gene chip data in the GEO public database, this study used
   multiple sets of GEO data chips to obtain RA differentially expressed
   genes, and performed GO, KEGG, DO, and GSEA functional enrichment
   analysis. Three machine learning methods, such as LASSO regression,
   SVM-RFE and Random Forest, were used to identify potential biomarkers
   of RA, and their expression and diagnostic value were verified.
   Finally, the potential biomarkers of RA were analyzed for immune cell
   infiltration and immune cell correlation. This study will help to
   further understand the pathogenesis of RA and provide reference for its
   diagnosis, prevention and treatment as well as new biomarkers.

2. Materials and methods

2.1. Data acquisition and preprocessing

2.1.1. Data acquisition

   Five datasets were obtained from the GEO database (Gene Expression
   Omnibus, [60]https://www.ncbi.nlm.nih.gov/geo), as detailed in
   [61]Table 1. The four data sets of [62]GSE1919, [63]GSE93272,
   [64]GSE10500 and [65]GSE29746 were used as the training group, and the
   [66]GSE55235 data set was used as the verification group for subsequent
   analysis.

Table 1.

   Information on RA-related datasets obtained from the GEO database.
   GEO data set name   Country   Platform file name Number of RA samples
   Number of control samples
   [67]GSE1919
   Germany           [68]GPL91   5                  5
   [69]GSE93272
   Japan             [70]GPL570  232                43
   [71]GSE10500
   America           [72]GPL8300 5                  3
   [73]GSE29746
   Spain             [74]GPL4133 9                  11
   [75]GSE55235
   Germany           [76]GPL96   10                 10
   [77]Open in a new tab

2.1.2. Data processing

   The gene expression matrix was extracted using the GEOquery package of
   R 4.3.1 software, and the gene probes and non-specific probes
   containing missing values were removed and annotated according to the
   official website information. DEGs were screened with |log2FC|≥0.5,
   P < 0.05 as the screening criteria, and heat maps and volcano maps were
   drawn. The data with large values in the gene expression matrices of
   [78]GSE1919, [79]GSE93272, [80]GSE10500 and [81]GSE29746 were processed
   by log2, and the R 4.3.1 software 'limma' and 'sva' packages were used.
   After merging the data, the data set was corrected to eliminate the
   differences caused by batch effects.

2.2. Enrichment analysis

   Gene ontology (GO) enrichment analysis was performed using R 4.3.1
   software 'clusterProfiler', 'org.Hs.eg.db', 'enrichplot', 'ggplot2',
   'stringi', 'limma' packages with p-value and q-value less than 0.05 as
   screening conditions. At the same time, based on the KEGG database
   ([82]www.kegg.jp/kegg/kegg1.html) [[83][16], [84][17], [85][18]], the
   Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis was
   performed using the above R 4.3.1 software toolkit, and the results
   with significant P values were selected for visual analysis. With
   p-value and q-value less than 0.05 as the filtering conditions, disease
   ontology enrichment analysis (DO) was performed, and the top 30
   diseases with q-value significance were selected for visual analysis.
   Gene set enrichment analysis (GSEA) was performed with p-adjust<0.05 as
   the filtering condition, and the enrichment of the top five biological
   functions or pathways in the normal group samples and RA disease group
   samples was selected for visual analysis.

2.3. Machine learning to identify potential biomarkers

   The LASSO regression analysis was performed using the 'glmnet' package
   in R 4.3.1 software. The point with the smallest cross-validation error
   was used as the filtering condition. The number of genes corresponding
   to the point with the smallest cross-validation error was selected as
   the number of potential biomarkers identified by LASSO machine
   learning. The genes corresponding to the average ranking of LASSO
   regression analysis were used as RA potential biomarkers. SVM-RFE
   analysis was performed using R software 'e1071', 'kernlab', and 'caret'
   packages. The number of genes corresponding to the minimum
   cross-validation error in the analysis results was used as the number
   of potential biomarkers identified by SVM-RFE machine learning. The
   genes corresponding to the average ranking of SVM-RFE analysis were
   used as RA potential biomarkers. Random Forest analysis was performed
   using the 'randomForest' package of R software. The importance score of
   each differential gene was obtained at the minimum error of the
   cross-validation curve. Genes with an importance score greater than 1
   were used as potential biomarkers for RA. The venn package was used to
   intersect the intersection genes obtained by LASSO, SVM-RFE, and Random
   Forest to obtain the final RA potential biomarkers obtained by the
   three machine learning methods.

2.4. Expression and diagnostic value verification of potential biomarkers for
RA

   The Extra Trees classifier model was used to screen out biomarkers with
   discriminative ability in the training test set and the verification
   test set. Meanwhile, The [86]GSE55235 data set was used as the
   verification group to analyze the differences of RA potential
   biomarkers under '2.3′ and verify their expression differences in other
   RA sample data. Using the gene expression matrix of the validation
   group, the survival analysis of the RA potential markers with
   significant differences in expression was performed to verify its
   diagnostic value.

2.5. To obtain the correlation analysis of immune cell matrix and immune cell
infiltration

   Perl 5.32.1.1 software was used to organize the gene expression data
   into a gene matrix with row name as gene name and column name as sample
   name. The ‘limma’ package in R 4.3.1 language BioManager was used to
   correct the gene expression matrix of RA group and control group. The
   CIBERSORT method was used to perform deconvolution analysis on the
   expression matrix of human immune cell subtypes. The relative
   proportions of 22 immune cells were calculated, and a P value was
   obtained for each sample. The samples were screened according to
   P < 0.05 to obtain the immune cell composition matrix. The barplot
   function of R language ‘graphics’ package was used to draw the
   histogram of each immune cell composition ratio in the two groups of
   samples. The ‘corrplot’ package in R language was used to analyze the
   correlation of immune cells in the samples. The 'vioplot' package in R
   language was used to compare the proportion of immune cells in the
   samples of RA patients and control groups, and a violin map was drawn.

2.6. Correlation analysis of immune cells

   The CIBERSORT algorithm was used to evaluate the immune cell
   infiltration of normal samples and RA samples in multiple sets of GEO
   data, and P < 0.05 was used as the screening standard. The immune cells
   evaluated include: B cells naive, B cells memory, Plasma cells, T cells
   CD8, T cells CD4 naive, T cells CD4 memory resting, T cells CD4 memory
   activated, T cells CD4 memory activated, T cells follicular helper, T
   cells regulatory, T cells gamma delta, NK resting, NK activated,
   Monocytes, Macrophages M0, Macrophages M1, Macrophages M2, Macrophages
   M2, Dendritic mast cells activated, Monocytes, Macrophages M0,
   Macrophages M1, Macrophages M2, Dendritic mast cells activated.
   Correlation analysis was used to study the correlation between RA
   potential biomarkers and immune infiltrating cells to explore the
   effect of RA potential biomarkers on immune microenvironment. Finally,
   the correlation between RA potential biomarkers and immune cells was
   analyzed.

3. Result

3.1. DEGs identification

   Four RA gene chip data, including 251 RA and 62 normal samples, were
   used to obtain 79 DEGs by GEOquery package of R 4.3.1 software,
   including 77 up-regulated genes and 2 down-regulated genes. The results
   of DEGS identification were visualized by volcano map and heat map, as
   shown in [87]Fig. 1A and. B. It can be seen from the figure that among
   the four genes with significant differences in RA gene chip data,
   up-regulated genes account for the majority, and DEGs have obvious
   expression differences in normal samples and RA samples.

Fig. 1.

   [88]Fig. 1
   [89]Open in a new tab

   Differential gene expression in RA. A: Heat map, B: Volcano map.

3.2. Biological function analysis of DEGs

   GO analysis obtained 227 BP items. It mainly involves the response to
   virus (GO:0009615), defense response to virus (GO:0051607), defense
   response to symbiont (GO:0140546), antimicrobial humoral response
   (GO:0019730), mucosal innate immune response (GO:0002227),
   antimicrobial peptide-mediated antimicrobial humoral immune response
   (GO:0061844), etc. MF entry 27. It mainly involves vitamin D-dependent
   calcium-binding protein (GO:0048306), RAGE receptor binding
   (GO:0050786), structural composition of ribosome (GO:0003735),
   toll-like receptor binding (GO:0035325), etc. CC item 40. It mainly
   involves cytochrome complex (GO:0070069) respiratory chain complex
   (GO:0098803), mitochondrial respiratory complex (GO:0005746), secretory
   granule lumen (GO:0034774), cytoplasmic vesicle lumen (GO:0060205),
   vesicle lumen (GO:0031983), etc ([90]Fig. 2A).

Fig. 2.

   [91]Fig. 2
   [92]Open in a new tab

   Diagram of biological function analysis of differential genes. A
   represents GO enrichment analysis map, B represents DO enrichment
   analysis map, C represents GSEA enrichment analysis map, and D
   represents KEGG enrichment analysis map.

   DO enrichment analysis showed that RA differential genes were enriched
   in 84 diseases, including Dermatitis, Prostate cancer, Male
   reproductive organ cancer, Intracranial hypertension, Atopic
   dermatitis, etc ([93]Fig. 2B). A total of 38 KEGG pathways were
   significantly enriched, including Parkinson disease, Prion disease,
   Huntington disease, Non-alcoholic fatty liver disease, Alzheimer
   disease, etc ([94]Fig. 2D).

   GSEA enrichment analysis is shown in [95]Fig. 2C. Biological processes
   such as cell junction assembly, cell junction organization, response to
   epidermal growth factor and molecular functions such as actin binding
   and chromatin binding were significantly active in the control group.
   It is highly expressed in KEGG pathways such as Alzheimers disease,
   Huntington disease, Oxidative phosphorylation, Parkinson disease and
   Ribosome in RA samples.

3.3. Potential biomarkers of RA

   Using LASSO regression analysis from DEGs expression matrix data, when
   the cross validation error is the smallest. LASSO regression analysis
   obtained 29 potential RA biomarkers: EVI2A, COX7A2, CGRRF1, KYNU,
   RPL39, FCGBP, CD52, PDCD10, AIM2, ZNF267, BATF, SPAG1, PSMA4, TNFSF10,
   TSPAN2, NDC80, ENTPD1, S100A12, RPL34, LRRN3, C1QB, KLRB1, BMX, IFI6,
   CXCL10, TNFRSF17, CRISP3, IFI27 and IFI44L ([96]Fig. 3A). SVM-RFE
   machine learning method is used to minimize the cross-validation error.
   CKS2, PSMA2, CD58, S100A8, UQCRQ, C14orf2, CLEC2B, CAPZA2, CD52, FCGBP,
   EVI2A, IL15, CGRRF1, LY96, KYNU, COX7C, DBI, C1QB, CASP3, NDUFB3,
   SNRPG, COX7A2, ANXA3, GPR65, FAS, SUB1, IFI6, BATF, NDC80, RPL39,
   TMCO1, AIM2, SPAG1 and LRRN3 can be used as potential biomarkers of RA
   obtained by SVM-RFE method ([97]Fig. 3B). Random Forest machine
   learning method was used to obtain CKS2, LY96, S100A8, COX7C, IL15,
   EVI2A, NDC80, LRRN3, C1QB, PSMA2, CD58, CASP3, CGRRF1, LMNB1, NAT1,
   AIM2, C14orf2, BMX, SPAG1, KLRB1, BATF, CD86, DBI, TSPAN2, UQCRQ,
   ENTPD1, COX7A2, FCGBP, TFEC, POLR2K, IFI27, ANXA3, BCL2A1, RNASE2,
   CD52, RSAD2, RPS7, HAT1 and KYNU 39 RA potential biomarkers obtained by
   Random Forest method ([98]Fig. 3C). A total of 12 RA potential
   biomarkers such as EVI2A, COX7A2, CGRRF1, KYNU, FCGBP, CD52, AIM2,
   BATF, SPAG1, NDC80, LRRN3, and C1QB were obtained by intersecting the
   results obtained by the three machine learning methods ([99]Fig. 3D).

Fig. 3.

   [100]Fig. 3
   [101]Open in a new tab

   Machine learning screens potential biomarker maps. a represents the
   LASSO regression analysis chart, showing 29 potential biomarkers when
   the cross-validation error is minimal. b represents the SVM-RFE
   analysis plot, and there are 34 potential biomarkers when the
   cross-validation error is the smallest. c represents a random forest
   tree map, which is screened with an importance score greater than 1,
   showing 39 potential biomarkers. d represents the intersection graph of
   veen machine learning results, showing that the three machine learning
   algorithms have 12 common potential biomarkers.

3.4. Diagnostic value of biomarkers for rheumatoid arthritis

   The [102]GSE55235 data set was used to verify the expression
   differences of RA potential biomarkers in control samples and RA
   samples ([103]Fig. 4). P < 0.05 was considered to have significant
   differences. From [104]Fig. 4, it can be seen that 8 RA biomarkers such
   as KYNU, FCGBP, EVI2A, CD52, C1QB, BATF, AIM2 and NDC80 have
   significant differences in the expression of control samples and RA
   samples in the validation dataset, while 4 RA biomarkers such as
   COX7A2, CGRRF1, SPAG1 and LRRN3 have no significant differences in the
   expression of control samples and RA samples in the validation dataset.
   The Extra Trees classifier model was used to construct multiple
   decision trees, and their prediction results were integrated. At the
   same time, the ROC analysis of RA markers was performed using the
   validation group data set to verify their diagnostic efficacy for RA
   ([105]Fig. 5A and. B). AUC>0.800 was considered to have diagnostic
   ability. It can be seen from [106]Fig. 5 that the ROC curve of the
   Extra Trees classifier model shows that the AUC of all RA feature genes
   is greater than 0.800, and the main diagonal of the confusion matrix
   shows higher prediction accuracy, indicating that this model has higher
   generalization ability and prediction accuracy([107]Fig. 5C). The ROC
   of single RA characteristic gene showed that the AUC of seven RA
   biomarkers, such as KYNU, EVI2A, CD52, C1QB, BATF, AIM2 and NDC80, were
   all greater than 0.800, which could be preliminarily determined as the
   diagnostic biomarkers of RA. The R software 'dplyr', 'pROC', 'ggplot2',
   'survival', 'regplot', 'rms', 'ggsci', 'survminer', 'timeROC', 'ggDCA'
   and 'limma' packages were used to identify seven RA diagnostic
   biomarkers in the training group data expression matrix using the
   Norman logistic regression model to analyze their weight coefficients.
   The nomogram model is further established ([108]Fig. 6A). The R
   software 'pROC' package was used to verify the diagnostic nomogram
   model in the training group data set ([109]Fig. 6B). When the AUC value
   in nomoscore was greater than 0.6, it could be considered that the
   established diagnostic nomogram model had good diagnostic value. It can
   be seen from [110]Fig. 6B that the AUC value in nomoscore was 0.601,
   indicating that the established RA diagnostic nomogram model had good
   diagnostic value. The AUC values of each potential biomarker were all
   greater than 0.7, and the AUC value of EVI2A even reached 0.836,
   further indicating that the established RA diagnostic nomogram model
   had good diagnostic value.

Fig. 4.

   [111]Fig. 4
   [112]Open in a new tab

   Differential expression of potential biomarkers of RA in the validation
   dataset. When P＜0.05 was considered as statistically significant for
   the potential biomarker in the validation dataset.

Fig. 5.

   [113]Fig. 5
   [114]Open in a new tab

   Analysis of RA biomarkers in the validation group data set. A: ROC
   analysis of RA biomarkers with significant differences in the
   validation group data set; B: Common ROC analysis of RA biomarkers
   under the Extra Tree classifier model; C: Confusion matrix of training
   set and test set. AUC greater than 0.800 was considered to be a
   potential biomarker for RA with diagnostic ability.

Fig. 6.

   [115]Fig. 6
   [116]Open in a new tab

   Nomogram model for RA diagnosis. According to the sum of the test
   results of each index, the possibility of RA was judged. A: Nomogram
   model. B: Validation of ROC curve.

3.5. Results of immune cell infiltration

   The histogram of the proportion of immune cells showed the infiltration
   ratio of 22 immune cells in RA samples and control samples ([117]Fig.
   7A). Compared with the control group, the T cells gamma delta and
   Macrophages M0 infiltration levels in the RA group were higher. The
   infiltration level of Mast cells activated was low. The correlation
   heat map ([118]Fig. 7B) showed that Neutrophils was positively
   correlated with Macrophages M0 (r = 0.31, P < 0.05), that is, when
   Neutrophils decreased, Macrophages M0 also decreased relatively; there
   was a negative correlation between T cells CD8 and Neutrophils
   (r = −0.560, P < 0.05). The difference of immune cell infiltration
   between RA group and control group was analyzed by violin diagram
   ([119]Fig. 7C). We found that T cells CD4 naive (P = 0.009), T cells
   CD4 memory activated (P < 0.001), T cells regulatory (P < 0.001), T
   cells gamma delta (P = 0.001), Dendritic cells resting (P < 0.001),
   Mast cells activated (P = 0.012) and Neutrophils (P = 0.043) were
   significantly different between RA samples and control samples.

Fig. 7.

   [120]Fig. 7
   [121]Open in a new tab

   Results of immune cell infiltration. A represents bars of 22 immune
   cells. B represents the heat map of immune cell correlation, with a
   reder color indicating a greater positive correlation and a bluer color
   indicating a greater negative correlation. C represents violin plot,
   showing the difference in the expression of 22 immune cells between RA
   samples and control samples, and P < 0.05 was considered as significant
   difference. (For interpretation of the references to color in this