Abstract

Rheumatoid arthritis (RA) is a highly heterogeneous autoimmune disease marked by unpredictable flares and substantial variation in response to available treatments. The lack of optimal stratification of RA patients may contribute to the poor efficacy of current treatment options. The objective of this study is to elucidate the molecular characteristics of RA using mitochondrial genes and then to construct and validate a diagnostic framework for RA.

Mitochondrial proteins were obtained from the MitoCarta database, and the R package limma was used to filter for differentially expressed mitochondrial genes (MDEGs). Metascape was used for enrichment analysis, followed by unsupervised clustering with the ConsensusClusterPlus package to identify distinct subtypes based on MDEGs. The immune microenvironment, biological pathways, and drug response were then explored in these subtypes. Finally, a multi-biomarker diagnostic model was constructed using machine learning algorithms.

Using 88 MDEGs present in the transcript profiles, RA patients were classified into three distinct subtypes, each with unique molecular and cellular signatures. Subtype A exhibited marked activation of inflammatory cells and pathways, while subtype C was characterized by the presence of specific innate lymphocytes. Inflammatory and immune cells in subtype B displayed a more modest level of activation (Wilcoxon test, P < 0.05). Notably, subtype C showed a stronger association with a superior response to biologics such as infliximab, anti-TNF, rituximab, and methotrexate/abatacept (Fisher's exact test, P = 0.001). Furthermore, the mitochondrial diagnostic SVM model showed high discriminatory ability for RA in both the training set (AUC = 100%) and the validation set (AUC = 80.1%).
This study presents a pioneering analysis of mitochondrial modifications in RA, offering a novel framework for patient stratification and potentially enhancing therapeutic decision-making.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12967-023-04426-7.

Keywords: Rheumatoid arthritis, Mitochondrial proteins, Immune microenvironment, Unsupervised machine learning, Stratification

Introduction

Rheumatoid arthritis (RA) is a heterogeneous and prevalent autoimmune inflammatory arthritis [1], leading to a worldwide rise in the number of life years lived with disability attributed to RA. However, these trends vary by region and country. Furthermore, RA is an autoimmune disease of unknown etiology; reported risk factors include respiratory exposure, genetics, intestinal health, oral health, gender, lifestyle, and habits [2]. Currently, nonsteroidal anti-inflammatory drugs [3], glucocorticoids [4, 5], and disease-modifying anti-rheumatic drugs (methotrexate, sulfasalazine, minocycline, hydroxychloroquine, and azathioprine) are the primary pharmacological interventions for managing patients with RA. These drugs exert their therapeutic effects through immunosuppressive and anti-inflammatory mechanisms [6–9]. Despite the wide range of treatment options available to RA patients, the current standard treatment regimen is associated with a multitude of adverse effects [10]. In research, unsupervised analyses of big data have the potential to reveal novel (sub-)phenotypes, thereby improving the precision of medical interventions by enabling innovative targeted therapeutic strategies. Dana E. Orange et al. conducted comprehensive analyses of patient samples and identified three distinct subtypes of rheumatoid arthritis, which were strongly associated with disease activity [11].
Additionally, Rodrigo Cánovas et al. [12] discovered different subtypes of juvenile idiopathic arthritis, which has significantly advanced our understanding of this disease. Therefore, it is essential to understand RA subtypes and their molecular characteristics in order to better select patients and develop individualized therapy based on phenotypes and molecular signatures.

The disruption of mitochondrial homeostasis has been implicated in the development of RA [13–15]. The imbalance of the endostatin environment resulting from mitochondrial impairment plays a crucial role in the pathology of RA [16]. In RA, METTL3 mediates inflammatory responses by activating the NF-κB pathway and facilitating fibroblast-like synoviocyte (FLS) activation [17]. Heightened expression of SIRT4 promotes the secretion of TNF-α and IL-6, thereby accelerating bone destruction in individuals with osteoarthritis [18, 19]. Moreover, PTEN methylation has been found to promote inflammation and the activation of fibroblast-like synoviocytes in RA [20]. Both mitochondrial metabolism and immune inflammation contribute significantly to the pathogenesis of RA. However, their interplay in RA remains unexplored and requires further investigation.

This study employed unsupervised clustering methods to identify different subtypes in patients with RA based on mitochondrial gene expression profiles from whole blood. The subtypes were thoroughly characterized using cellular, molecular, and clinical features to gain a deeper understanding of the underlying biological mechanisms. The identified characteristic genes were then applied to independent groups of RA patients to evaluate the therapeutic outcomes of conventional triple infliximab and anti-TNF. Additionally, machine learning was used to develop a diagnostic tool based on the identified features.
This study aims to provide a reference for clinical precision treatment and early diagnosis of RA patients.

Materials and methods

Processing of RA gene expression data

Microarray gene expression data for rheumatoid arthritis samples were obtained from the Gene Expression Omnibus (GEO) database. The study design, data preprocessing, and data interpretation for the six microarray datasets (GSE110169, GSE93272, GSE58795, GSE15258, GSE37107, and GSE68215) are detailed in Additional file 1: Table S1. Several biologic agents were included, namely infliximab (GSE58795), anti-TNF (GSE15258), rituximab (GSE37107), and methotrexate/abatacept (GSE68215). Drug information was extracted from medical records. Additionally, microarray datasets GSE110169 and GSE93272 were divided into training and test datasets. To correct for background noise and apply quantile normalization to the microarray data, we retrieved raw files in 'CEL' format and processed them with the affy and simpleaffy packages using robust multi-array averaging (RMA).

Differentially expressed mitochondrial genes: screening and functional and pathway enrichment analysis

The present study used the MitoCarta 3.01 database to identify gene sets associated with mitochondria, focusing on the 1,136 unique human mitochondrial genes [21]. The R package limma was used to filter mitochondria-linked genes that were differentially expressed between samples from individuals with RA and healthy control samples. False-positive results were controlled using the false-discovery rate (FDR). The criteria for identifying mitochondria-associated differentially expressed genes (MDEGs) were an adjusted p-value of less than 0.05 and a log fold change (logFC) of greater than 0.32. To assess pathway enrichment, a Metascape analysis was performed for GO and KEGG pathways, and functional pathways with p < 0.05 were deemed significantly enriched [22].
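The MDEG filtering criteria described above can be sketched as follows. This is a minimal Python illustration, not the authors' limma workflow: it assumes that per-gene raw p-values and logFC values have already been computed (the gene names and numbers below are made up), it uses the Benjamini-Hochberg procedure as one common way of obtaining FDR-adjusted p-values, and it assumes the logFC cutoff is applied to the absolute value so that both up- and down-regulated genes are retained.

```python
from typing import Dict, List, Tuple


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values stay monotone (each is the minimum of the scaled values above it).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def filter_mdegs(
    stats: Dict[str, Tuple[float, float]],
    adj_p_cutoff: float = 0.05,   # adjusted p-value threshold from the text
    logfc_cutoff: float = 0.32,   # logFC threshold from the text
) -> List[str]:
    """Keep genes with adjusted p < cutoff and |logFC| > cutoff.

    `stats` maps gene name -> (raw p-value, logFC). Using the absolute
    logFC is an assumption of this sketch.
    """
    genes = list(stats)
    adjusted = benjamini_hochberg([stats[g][0] for g in genes])
    return [
        g for g, q in zip(genes, adjusted)
        if q < adj_p_cutoff and abs(stats[g][1]) > logfc_cutoff
    ]


# Hypothetical per-gene statistics: (raw p-value, logFC).
toy_stats = {
    "GENE_A": (0.0001, 0.85),
    "GENE_B": (0.004, -0.50),
    "GENE_C": (0.03, 0.10),   # significant, but effect size too small
    "GENE_D": (0.60, 0.90),   # large effect, but not significant
}

print(filter_mdegs(toy_stats))  # ['GENE_A', 'GENE_B']
```

Applying both thresholds together means a gene is kept only if the difference is statistically supported after multiple-testing correction and its magnitude exceeds the chosen cutoff.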
Pearson correlation coefficients were used to examine gene expression correlations.

Clustering of mitochondria-related expression-driven subgroups in RA

To gain further insight into molecular subtype heterogeneity within the MDEG profiles associated with RA, the R package ConsensusClusterPlus was used to perform consensus clustering with the 'km' algorithm and Euclidean distance. The parameter settings were as follows: maxK = 6, reps = 1000, pItem = 0.8, pFeature = 1, clusterAlg = "km", distance = "euclidean". The "km" option performs k-means clustering directly on a data matrix, with items and features resampled. This process was repeated 1000 times to ensure clustering stability [23]. The optimal number of clusters was determined from the cumulative distribution function (CDF). Principal component analysis (PCA) was used to visualize the differences between subtypes. Differentially expressed genes (MDEGs) were identified across the three subtypes.

Characterization of RA subtypes based on cellular, molecular, and clinical characteristics

The present study assessed immune cell infiltration in patients with RA using the 'xCell' R package, which computes enrichment scores for 64 immune and stromal cell types [24]. Additionally, the immune function of the three subgroups of participants was determined via single-sample gene set enrichment analysis (ssGSEA) [25]. Pathways linked to RA were curated based on literature references and GSEA outcomes, and gene sets were sourced from the KEGG