Abstract Background Rheumatoid arthritis (RA) is a chronic systemic autoimmune disease characterized by inflammatory cell infiltration, which can lead to chronic disability, joint destruction and loss of function. At present, the pathogenesis of RA is still unclear. The purpose of this study is to explore the potential biomarkers and immune molecular mechanisms of rheumatoid arthritis through machine learning-assisted bioinformatics analysis, in order to provide reference for the early diagnosis and treatment of RA disease. Methods RA gene chips were screened from the public gene GEO database, and batch correction of different groups of RA gene chips was performed using Strawberry Perl. DEGs were obtained using the limma package of R software, and functional enrichment analysis such as gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), disease ontology (DO), and gene set (GSEA) were performed. Three machine learning methods, least absolute shrinkage and selection operator regression (LASSO), support vector machine recursive feature elimination (SVM-RFE) and random forest tree (Random Forest), were used to identify potential biomarkers of RA. The validation group data set was used to verify and further confirm its expression and diagnostic value. In addition, CIBERSORT algorithm was used to evaluate the infiltration of immune cells in RA and control samples, and the correlation between confirmed RA diagnostic biomarkers and immune cells was analyzed. Results Through feature screening, 79 key DEGs were obtained, mainly involving virus response, Parkinson's pathway, dermatitis and cell junction components. A total of 29 hub genes were screened by LASSO regression, 34 hub genes were screened by SVM-RFE, and 39 hub genes were screened by Random Forest. Combined with the three algorithms, a total of 12 hub genes were obtained. Through the expression and diagnostic value verification in the validation group data set, 7 genes that can be used as diagnostic biomarkers for RA were preliminarily confirmed. At the same time, the correlation analysis of immune cells found that γδT cells, CD4^+ memory activated T cells, activated dendritic cells and other immune cells were positively correlated with multiple RA diagnostic biomarkers, CD4^+ naive T cells, regulatory T cells and other immune cells were negatively correlated with multiple RA diagnostic biomarkers. Conclusions The results of novel characteristic gene analysis of RA showed that KYNU, EVI2A, CD52, C1QB, BATF, AIM2 and NDC80 had good diagnostic and clinical value for the diagnosis of RA, and were closely related to immune cells. Therefore, these seven DEGs may become new diagnostic markers and immunotherapy markers for RA. Keywords: Rheumatoid arthritis, Biomarkers, Diagnostic genes, Immune cells, Machine learning, Bioinformatics 1. Introduction Rheumatoid arthritis is an autoimmune disease dominated by inflammatory arthritis [[45]1]. It is characterized by multi-joint, symmetrical, and invasive arthritis of the small joints of the hand and foot. In the course of rheumatoid arthritis, the immune system attacks the joint tissue, often accompanied by extra-articular organ involvement and serum rheumatoid factor positive, which can eventually lead to chronic disability, joint destruction and loss of function [[46][2], [47][3], [48][4]]. According to the 2017 Global Burden of Disease (GBD) epidemiological report, the global prevalence of RA is 0.27 % [[49]5]. The actual cause of RA is still unknown, which has a serious impact on individuals and society. However, early diagnosis of RA disease provides an opportunity for effective therapeutic response. At present, rheumatoid factor (RF) and anti-cyclic citrullinated peptide antibody (ACPA) are used as serum biomarkers for early diagnosis of RA [[50]6]. However, these two serum markers are not positive in all early RA patients. Therefore, it is of great significance to find novel and feasible biomarkers for early diagnosis and effective treatment of RA [[51]7,[52]8]. Recent studies have shown that immune cells interfere with the information transmission between cells in the immune response of RA, and stimulate cells to move to inflammation, infection and trauma sites, participate in the whole process of RA occurrence, development and repair, and play a crucial role in the pathological changes caused by RA [[53]9,[54]10]. Among them, B cells secrete rheumatoid factors that recognize immunoglobulin Fe segments and form immune complexes, and can release chemokines and fix complements, so that inflammatory cells accumulate in the joints of patients [[55]11]. Macrophages are abundant in cartilage tissue and synovium of RA patients. When activated, they can overexpress major histocompatibility complex (MHC) II molecules, chemokines and inflammatory factors, which are closely related to the degree of joint injury and clinical symptoms of patients [[56]12]. The number of neutrophils increases significantly during RA activity and is always activated [[57]13]. T cells bind to MHC II and antigen peptides, activate macrophages, release pro-inflammatory cytokines, activate synovial fibroblasts and chondrocytes, and secrete a variety of enzymes that degrade glycoproteins and collagen, thereby destroying tissues [[58]14]. Th17 cells can induce synovial matrix and innate lymphocytes to secrete granulocyte-macrophage colony stimulating factor (MG-CSF), thereby triggering and enhancing RA [[59]15]. Therefore, exploring the correlation between RA and immune cells from the perspective of the immune system is of great significance for elucidating the molecular system of RA and finding new diagnostic biomarkers and immunotherapy targets. Based on the gene chip data in the GEO public database, this study used multiple sets of GEO data chips to obtain RA differentially expressed genes, and performed GO, KEGG, DO, and GSEA functional enrichment analysis. Three machine learning methods, such as LASSO regression, SVM-RFE and Random Forest, were used to identify potential biomarkers of RA, and their expression and diagnostic value were verified. Finally, the potential biomarkers of RA were analyzed for immune cell infiltration and immune cell correlation. This study will help to further understand the pathogenesis of RA and provide reference for its diagnosis, prevention and treatment as well as new biomarkers. 2. Materials and methods 2.1. Data acquisition and preprocessing 2.1.1. Data acquisition Five datasets were obtained from the GEO database (Gene Expression Omnibus, [60]https://www.ncbi.nlm.nih.gov/geo), as detailed in [61]Table 1. The four data sets of [62]GSE1919, [63]GSE93272, [64]GSE10500 and [65]GSE29746 were used as the training group, and the [66]GSE55235 data set was used as the verification group for subsequent analysis. Table 1. Information on RA-related datasets obtained from the GEO database. GEO data set name Country Platform file name Number of RA samples Number of control samples [67]GSE1919 Germany [68]GPL91 5 5 [69]GSE93272 Japan [70]GPL570 232 43 [71]GSE10500 America [72]GPL8300 5 3 [73]GSE29746 Spain [74]GPL4133 9 11 [75]GSE55235 Germany [76]GPL96 10 10 [77]Open in a new tab 2.1.2. Data processing The gene expression matrix was extracted using the GEOquery package of R 4.3.1 software, and the gene probes and non-specific probes containing missing values were removed and annotated according to the official website information. DEGs were screened with |log2FC|≥0.5, P < 0.05 as the screening criteria, and heat maps and volcano maps were drawn. The data with large values in the gene expression matrices of [78]GSE1919, [79]GSE93272, [80]GSE10500 and [81]GSE29746 were processed by log2, and the R 4.3.1 software 'limma' and 'sva' packages were used. After merging the data, the data set was corrected to eliminate the differences caused by batch effects. 2.2. Enrichment analysis Gene ontology (GO) enrichment analysis was performed using R 4.3.1 software 'clusterProfiler', 'org.Hs.eg.db', 'enrichplot', 'ggplot2', 'stringi', 'limma' packages with p-value and q-value less than 0.05 as screening conditions. At the same time, based on the KEGG database ([82]www.kegg.jp/kegg/kegg1.html) [[83][16], [84][17], [85][18]], the Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis was performed using the above R 4.3.1 software toolkit, and the results with significant P values were selected for visual analysis. With p-value and q-value less than 0.05 as the filtering conditions, disease ontology enrichment analysis (DO) was performed, and the top 30 diseases with q-value significance were selected for visual analysis. Gene set enrichment analysis (GSEA) was performed with p-adjust<0.05 as the filtering condition, and the enrichment of the top five biological functions or pathways in the normal group samples and RA disease group samples was selected for visual analysis. 2.3. Machine learning to identify potential biomarkers The LASSO regression analysis was performed using the 'glmnet' package in R 4.3.1 software. The point with the smallest cross-validation error was used as the filtering condition. The number of genes corresponding to the point with the smallest cross-validation error was selected as the number of potential biomarkers identified by LASSO machine learning. The genes corresponding to the average ranking of LASSO regression analysis were used as RA potential biomarkers. SVM-RFE analysis was performed using R software 'e1071', 'kernlab', and 'caret' packages. The number of genes corresponding to the minimum cross-validation error in the analysis results was used as the number of potential biomarkers identified by SVM-RFE machine learning. The genes corresponding to the average ranking of SVM-RFE analysis were used as RA potential biomarkers. Random Forest analysis was performed using the 'randomForest' package of R software. The importance score of each differential gene was obtained at the minimum error of the cross-validation curve. Genes with an importance score greater than 1 were used as potential biomarkers for RA. The venn package was used to intersect the intersection genes obtained by LASSO, SVM-RFE, and Random Forest to obtain the final RA potential biomarkers obtained by the three machine learning methods. 2.4. Expression and diagnostic value verification of potential biomarkers for RA The Extra Trees classifier model was used to screen out biomarkers with discriminative ability in the training test set and the verification test set. Meanwhile, The [86]GSE55235 data set was used as the verification group to analyze the differences of RA potential biomarkers under '2.3′ and verify their expression differences in other RA sample data. Using the gene expression matrix of the validation group, the survival analysis of the RA potential markers with significant differences in expression was performed to verify its diagnostic value. 2.5. To obtain the correlation analysis of immune cell matrix and immune cell infiltration Perl 5.32.1.1 software was used to organize the gene expression data into a gene matrix with row name as gene name and column name as sample name. The ‘limma’ package in R 4.3.1 language BioManager was used to correct the gene expression matrix of RA group and control group. The CIBERSORT method was used to perform deconvolution analysis on the expression matrix of human immune cell subtypes. The relative proportions of 22 immune cells were calculated, and a P value was obtained for each sample. The samples were screened according to P < 0.05 to obtain the immune cell composition matrix. The barplot function of R language ‘graphics’ package was used to draw the histogram of each immune cell composition ratio in the two groups of samples. The ‘corrplot’ package in R language was used to analyze the correlation of immune cells in the samples. The 'vioplot' package in R language was used to compare the proportion of immune cells in the samples of RA patients and control groups, and a violin map was drawn. 2.6. Correlation analysis of immune cells The CIBERSORT algorithm was used to evaluate the immune cell infiltration of normal samples and RA samples in multiple sets of GEO data, and P < 0.05 was used as the screening standard. The immune cells evaluated include: B cells naive, B cells memory, Plasma cells, T cells CD8, T cells CD4 naive, T cells CD4 memory resting, T cells CD4 memory activated, T cells CD4 memory activated, T cells follicular helper, T cells regulatory, T cells gamma delta, NK resting, NK activated, Monocytes, Macrophages M0, Macrophages M1, Macrophages M2, Macrophages M2, Dendritic mast cells activated, Monocytes, Macrophages M0, Macrophages M1, Macrophages M2, Dendritic mast cells activated. Correlation analysis was used to study the correlation between RA potential biomarkers and immune infiltrating cells to explore the effect of RA potential biomarkers on immune microenvironment. Finally, the correlation between RA potential biomarkers and immune cells was analyzed. 3. Result 3.1. DEGs identification Four RA gene chip data, including 251 RA and 62 normal samples, were used to obtain 79 DEGs by GEOquery package of R 4.3.1 software, including 77 up-regulated genes and 2 down-regulated genes. The results of DEGS identification were visualized by volcano map and heat map, as shown in [87]Fig. 1A and. B. It can be seen from the figure that among the four genes with significant differences in RA gene chip data, up-regulated genes account for the majority, and DEGs have obvious expression differences in normal samples and RA samples. Fig. 1. [88]Fig. 1 [89]Open in a new tab Differential gene expression in RA. A: Heat map, B: Volcano map. 3.2. Biological function analysis of DEGs GO analysis obtained 227 BP items. It mainly involves the response to virus (GO:0009615), defense response to virus (GO:0051607), defense response to symbiont (GO:0140546), antimicrobial humoral response (GO:0019730), mucosal innate immune response (GO:0002227), antimicrobial peptide-mediated antimicrobial humoral immune response (GO:0061844), etc. MF entry 27. It mainly involves vitamin D-dependent calcium-binding protein (GO:0048306), RAGE receptor binding (GO:0050786), structural composition of ribosome (GO:0003735), toll-like receptor binding (GO:0035325), etc. CC item 40. It mainly involves cytochrome complex (GO:0070069) respiratory chain complex (GO:0098803), mitochondrial respiratory complex (GO:0005746), secretory granule lumen (GO:0034774), cytoplasmic vesicle lumen (GO:0060205), vesicle lumen (GO:0031983), etc ([90]Fig. 2A). Fig. 2. [91]Fig. 2 [92]Open in a new tab Diagram of biological function analysis of differential genes. A represents GO enrichment analysis map, B represents DO enrichment analysis map, C represents GSEA enrichment analysis map, and D represents KEGG enrichment analysis map. DO enrichment analysis showed that RA differential genes were enriched in 84 diseases, including Dermatitis, Prostate cancer, Male reproductive organ cancer, Intracranial hypertension, Atopic dermatitis, etc ([93]Fig. 2B). A total of 38 KEGG pathways were significantly enriched, including Parkinson disease, Prion disease, Huntington disease, Non-alcoholic fatty liver disease, Alzheimer disease, etc ([94]Fig. 2D). GSEA enrichment analysis is shown in [95]Fig. 2C. Biological processes such as cell junction assembly, cell junction organization, response to epidermal growth factor and molecular functions such as actin binding and chromatin binding were significantly active in the control group. It is highly expressed in KEGG pathways such as Alzheimers disease, Huntington disease, Oxidative phosphorylation, Parkinson disease and Ribosome in RA samples. 3.3. Potential biomarkers of RA Using LASSO regression analysis from DEGs expression matrix data, when the cross validation error is the smallest. LASSO regression analysis obtained 29 potential RA biomarkers: EVI2A, COX7A2, CGRRF1, KYNU, RPL39, FCGBP, CD52, PDCD10, AIM2, ZNF267, BATF, SPAG1, PSMA4, TNFSF10, TSPAN2, NDC80, ENTPD1, S100A12, RPL34, LRRN3, C1QB, KLRB1, BMX, IFI6, CXCL10, TNFRSF17, CRISP3, IFI27 and IFI44L ([96]Fig. 3A). SVM-RFE machine learning method is used to minimize the cross-validation error. CKS2, PSMA2, CD58, S100A8, UQCRQ, C14orf2, CLEC2B, CAPZA2, CD52, FCGBP, EVI2A, IL15, CGRRF1, LY96, KYNU, COX7C, DBI, C1QB, CASP3, NDUFB3, SNRPG, COX7A2, ANXA3, GPR65, FAS, SUB1, IFI6, BATF, NDC80, RPL39, TMCO1, AIM2, SPAG1 and LRRN3 can be used as potential biomarkers of RA obtained by SVM-RFE method ([97]Fig. 3B). Random Forest machine learning method was used to obtain CKS2, LY96, S100A8, COX7C, IL15, EVI2A, NDC80, LRRN3, C1QB, PSMA2, CD58, CASP3, CGRRF1, LMNB1, NAT1, AIM2, C14orf2, BMX, SPAG1, KLRB1, BATF, CD86, DBI, TSPAN2, UQCRQ, ENTPD1, COX7A2, FCGBP, TFEC, POLR2K, IFI27, ANXA3, BCL2A1, RNASE2, CD52, RSAD2, RPS7, HAT1 and KYNU 39 RA potential biomarkers obtained by Random Forest method ([98]Fig. 3C). A total of 12 RA potential biomarkers such as EVI2A, COX7A2, CGRRF1, KYNU, FCGBP, CD52, AIM2, BATF, SPAG1, NDC80, LRRN3, and C1QB were obtained by intersecting the results obtained by the three machine learning methods ([99]Fig. 3D). Fig. 3. [100]Fig. 3 [101]Open in a new tab Machine learning screens potential biomarker maps. a represents the LASSO regression analysis chart, showing 29 potential biomarkers when the cross-validation error is minimal. b represents the SVM-RFE analysis plot, and there are 34 potential biomarkers when the cross-validation error is the smallest. c represents a random forest tree map, which is screened with an importance score greater than 1, showing 39 potential biomarkers. d represents the intersection graph of veen machine learning results, showing that the three machine learning algorithms have 12 common potential biomarkers. 3.4. Diagnostic value of biomarkers for rheumatoid arthritis The [102]GSE55235 data set was used to verify the expression differences of RA potential biomarkers in control samples and RA samples ([103]Fig. 4). P < 0.05 was considered to have significant differences. From [104]Fig. 4, it can be seen that 8 RA biomarkers such as KYNU, FCGBP, EVI2A, CD52, C1QB, BATF, AIM2 and NDC80 have significant differences in the expression of control samples and RA samples in the validation dataset, while 4 RA biomarkers such as COX7A2, CGRRF1, SPAG1 and LRRN3 have no significant differences in the expression of control samples and RA samples in the validation dataset. The Extra Trees classifier model was used to construct multiple decision trees, and their prediction results were integrated. At the same time, the ROC analysis of RA markers was performed using the validation group data set to verify their diagnostic efficacy for RA ([105]Fig. 5A and. B). AUC>0.800 was considered to have diagnostic ability. It can be seen from [106]Fig. 5 that the ROC curve of the Extra Trees classifier model shows that the AUC of all RA feature genes is greater than 0.800, and the main diagonal of the confusion matrix shows higher prediction accuracy, indicating that this model has higher generalization ability and prediction accuracy([107]Fig. 5C). The ROC of single RA characteristic gene showed that the AUC of seven RA biomarkers, such as KYNU, EVI2A, CD52, C1QB, BATF, AIM2 and NDC80, were all greater than 0.800, which could be preliminarily determined as the diagnostic biomarkers of RA. The R software 'dplyr', 'pROC', 'ggplot2', 'survival', 'regplot', 'rms', 'ggsci', 'survminer', 'timeROC', 'ggDCA' and 'limma' packages were used to identify seven RA diagnostic biomarkers in the training group data expression matrix using the Norman logistic regression model to analyze their weight coefficients. The nomogram model is further established ([108]Fig. 6A). The R software 'pROC' package was used to verify the diagnostic nomogram model in the training group data set ([109]Fig. 6B). When the AUC value in nomoscore was greater than 0.6, it could be considered that the established diagnostic nomogram model had good diagnostic value. It can be seen from [110]Fig. 6B that the AUC value in nomoscore was 0.601, indicating that the established RA diagnostic nomogram model had good diagnostic value. The AUC values of each potential biomarker were all greater than 0.7, and the AUC value of EVI2A even reached 0.836, further indicating that the established RA diagnostic nomogram model had good diagnostic value. Fig. 4. [111]Fig. 4 [112]Open in a new tab Differential expression of potential biomarkers of RA in the validation dataset. When P<0.05 was considered as statistically significant for the potential biomarker in the validation dataset. Fig. 5. [113]Fig. 5 [114]Open in a new tab Analysis of RA biomarkers in the validation group data set. A: ROC analysis of RA biomarkers with significant differences in the validation group data set; B: Common ROC analysis of RA biomarkers under the Extra Tree classifier model; C: Confusion matrix of training set and test set. AUC greater than 0.800 was considered to be a potential biomarker for RA with diagnostic ability. Fig. 6. [115]Fig. 6 [116]Open in a new tab Nomogram model for RA diagnosis. According to the sum of the test results of each index, the possibility of RA was judged. A: Nomogram model. B: Validation of ROC curve. 3.5. Results of immune cell infiltration The histogram of the proportion of immune cells showed the infiltration ratio of 22 immune cells in RA samples and control samples ([117]Fig. 7A). Compared with the control group, the T cells gamma delta and Macrophages M0 infiltration levels in the RA group were higher. The infiltration level of Mast cells activated was low. The correlation heat map ([118]Fig. 7B) showed that Neutrophils was positively correlated with Macrophages M0 (r = 0.31, P < 0.05), that is, when Neutrophils decreased, Macrophages M0 also decreased relatively; there was a negative correlation between T cells CD8 and Neutrophils (r = −0.560, P < 0.05). The difference of immune cell infiltration between RA group and control group was analyzed by violin diagram ([119]Fig. 7C). We found that T cells CD4 naive (P = 0.009), T cells CD4 memory activated (P < 0.001), T cells regulatory (P < 0.001), T cells gamma delta (P = 0.001), Dendritic cells resting (P < 0.001), Mast cells activated (P = 0.012) and Neutrophils (P = 0.043) were significantly different between RA samples and control samples. Fig. 7. [120]Fig. 7 [121]Open in a new tab Results of immune cell infiltration. A represents bars of 22 immune cells. B represents the heat map of immune cell correlation, with a reder color indicating a greater positive correlation and a bluer color indicating a greater negative correlation. C represents violin plot, showing the difference in the expression of 22 immune cells between RA samples and control samples, and P < 0.05 was considered as significant difference. (For interpretation of the references to color in this