Abstract Osteoarthritis (OA), a degenerative disease of the joints, has one of the highest disability rates worldwide. This study investigates the role of pyroptosis-related genes in osteoarthritis and their expression in different chondrocyte subtypes at the individual cell level. Using OA-related datasets for single-cell RNA sequencing and RNA-seq, the study identified PRDEGs and DEGs and conducted Cox regression analysis to identify independent prognostic factors for OA. CASP6, NOD1, and PYCARD were found to be prognostic factors. Combined Weighted Gene Correlation Network Analysis with PPI network, a total of 15 hub genes related to pyroptosis were involved in the notch and oxidative phosphorylation pathways, which could serve as biomarkers for the diagnosis and prognosis of OA patients. The study also explored the heterogeneity of chondrocytes between OA and normal samples, identifying 19 single-cell subpopulation marker genes that were significantly different among 7 chondrocyte cell clusters. AGT, CTSD, CYBC, and THYS1 were expressed differentially among different cell subpopulations, which were associated with cartilage development and metabolism. These findings provide valuable insights into the molecular mechanisms underlying OA and could facilitate the development of new therapeutic strategies for this debilitating disease. Subject terms: Computational biology and bioinformatics, Immunology Introduction Osteoarthritis (OA) is a degenerative disease characterized by joint and muscle decline, synovial inflammation, and bone and cartilage decomposition. It is a prevalent condition among the elderly and a leading cause of disability worldwide, imposing a significant health and economic burden on individuals and society^[30]1,[31]2. While the exact cause of OA remains unclear, it has been associated with age, obesity, inflammation, trauma, and genetics. Current treatments, including medication, non-pharmacological approaches, and surgery, have limited efficacy and can have adverse side effects^[32]3,[33]4. Mesenchymal stem cell therapy has shown promise in promoting cartilage tissue repair and reducing inflammation, but more evidence is needed to support its use as a primary treatment option^[34]5. Pyroptosis, a form of active cell death triggered by endogenous or exogenous stimuli, has been implicated in the pathophysiology of OA^[35]6. Mechanisms of pyroptosis involve mitochondrial dysfunction, inflammatory responses, and oxidative stress. The elevated levels of oxidative stress and expression of pyroptosis-related proteins found in joint tissues of OA patients further suggest the significance of pyroptosis in the pathogenesis of OA^[36]7–[37]10. To gain insight into the pathogenesis of OA related to pyroptosis, a bioinformatics analysis was performed using available RNA-seq data for OA. Our research aims to comprehensively understand the molecular status of osteoarthritis (OA) and identify key genes and pathways associated with cell apoptosis in this disease. To achieve this goal, we analyzed a large amount of transcriptomic data from both OA and control synovial tissues. Additionally, we also analyzed single-cell RNA sequencing (scRNA-seq) data from chondrocytes (cartilage cells). Differential analysis was conducted to identify differentially expressed genes (DEGs) between OA samples with high and low immune estimates. Pyroptosis-related differentially expressed genes (PRDEGs) were identified using the Mann–Whitney U test based on obtained pyroptosis-related genes (PRGs). Weighted Gene Correlation Network Analysis (WGCNA) was then applied to screen for pyroptosis-related modules, and Pearson correlation analysis was used to screen module genes. Key genes were obtained by intersecting DEGs and module genes, and a protein–protein interaction (PPI) network was constructed to identify hub genes. GO/KEGG function enrichment analysis, Gene set enrichment analysis (GSEA), and Gene set variation analysis (GSVA) were conducted based on hub genes, and mRNA-miRNA, mRNA-TF, and mRNA-drug networks of hub genes were constructed. Single-cell RNA sequencing (scRNA-seq) was used to reveal different chondrocyte subtypes and compare gene expression differences in different cell subpopulations. Immune infiltration analysis was performed to obtain immune cells with different levels of infiltration between the High-Pyroptosis score group and the Low-Pyroptosis score group. This bioinformatics analysis provides insights into the pathogenesis of OA related to pyroptosis and identifies potential prognostic indicators. It also explores cell heterogeneity and expression differences of hub genes in different OA chondrocyte subpopulations, providing a new perspective on the management of OA. Methods Data collection and processing We obtained the gene expression profiling microarray [38]GSE12021^[39]11, [40]GSE55235^[41]12,[42]GSE82107^[43]13 and [44]GSE89408^[45]14 of osteoarthritis (OA) from the Gene Expression Omnibus (GEO, [46]https://www.ncbi.nlm.nih.gov/geo/) database using “GEOquery” R package^[47]15. Thereinto, the [48]GSE12021 contains 19 samples, including 10 OA samples and 9 control samples. The [49]GSE55235 consists of 20 samples, including 10 OA samples and 10 control samples. [50]GSE82107 contains 17 samples, including 10 OA samples and 7 control samples. The [51]GSE89408 consists of 49 samples, including 22 OA samples and 27 control samples. It is important to emphasize that we excluded all RA (rheumatoid arthritis) samples because our research primarily focuses on the pathophysiological processes of osteoarthritis, rather than conducting a comparative study involving various types of arthritis. Although both RA and osteoarthritis are arthritic conditions, they have significantly different pathogenic mechanisms and pathological processes. To maintain the clarity and specificity of our research, we chose to include only samples from osteoarthritis patients and excluded samples from other types of arthritis. The details of the 4 GEO dataset samples are shown in Table [52]1. We first merged the datasets [53]GSE12021, [54]GSE55235, [55]GSE82107 and [56]GSE89408, then removed the batch effect using the “sva” R package^[57]16. “Sva” (Surrogate Variable Analysis) is a commonly used R software package in gene expression data analysis. It is a method employed to detect and adjust for potential batch effects, aiming to enhance the reliability and accuracy of differential expression analysis and expression level clustering in various research studies. Next, we applied the “limma” R package^[58]17 to standardize the merged datasets to obtain an OA dataset. The “limma” is a widely used R software package in gene expression data analysis. It is a linear modeling method designed for differential expression analysis and is applicable for both microarray and RNA sequencing data. Then Principal Component Analysis (PCA) was performed on the expression matrix of the datasets before and after the removal of batch effect to verify the effect of removing the batch effect. PCA is a method of data dimensionality reduction. It extracts feature vectors (components) of data from high-dimensional data, converts them into low-dimensional data, and uses two-dimensional or three-dimensional graphs to display these features. Table 1. Dataset details. GEO ID Species OA sample Control sample Organization source GPL ID [59]GSE12021 Homo sapiens 10 9 synovial membrane [60]GPL96[HG-U133A] Affymetrix Human Genome U133A Array [61]GSE55235 Homo sapiens 10 10 synovial membrane [62]GPL96[HG-U133A] Affymetrix Human Genome U133A Array [63]GSE82107 Homo sapiens 10 7 synovial membrane [64]GPL96[HG-U133A] Affymetrix Human Genome U133A Array [65]GSE89408 Homo sapiens 22 27 synovial membrane [66]GPL570[HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array [67]Open in a new tab This scRNA-seq data set, [68]GSE169454, was retrieved from the GEO database and was generated using the [69]GPL555 Illumina HiSeq 2500 platform (Homo sapiens). The samples were collected from 4 patients with knee osteoarthritis and 3 normal subjects. We selected 5 single cell samples of chondrocytes, among which 3 were normal (Normal1: [70]GSM5203389, Normal2: [71]GSM5203390, Normal3: [72]GSM5203391) and 2 were OA (OA1: [73]GSM5203392, OA2: [74]GSM5203393). Seurat is a widely used R software package in the analysis of single-cell RNA sequencing (scRNA-seq) data. It is a highly flexible and powerful tool for processing, visualizing, and analyzing gene expression data at the single-cell level. For single-cell data processing, we used the “Seurat” R package^[75]18 to create five single-cell data as Seurat objects, and calculated the percentage of mitochondria in each cell through the "PercentageFeatureSet" function of the “Seurat” package. It is generally believed that when a cell has a high proportion of mitochondrial genes, it might be in a state of apoptosis or lysis. Therefore, for each cell, the proportion of mitochondrial genes was computed, and cells having a mitochondrial fraction more than 5% were filtered away. Due to the fact that cells with fewer than 200 genes were found to be of poor quality or to be empty droplets, and that cell duplication or multiplication might result in an excessively large gene count, we also filtered out cells with feature < 500. In addition, we filtered cells with total UMI > 20,000. Subsequently, we obtained 22,839 chondrocytes from patients with osteoarthritis and 6,360 normal chondrocytes. After completing the quality control procedure, OA samples (OA1, OA2) and normal samples (Normal1, Normal2, Normal3) were integrated respectively, and batch effects were removed using the CCA method in Seurat with the functions named the "FindIntegrationAnchors" and "IntegrateData". For Seurat objects, utilising the most differentially expressed genes, we did linear dimensional decrease using principal components (PC) calculation^[76]19. Then, we used "FindNeighbors" and "FindClusters" functions of Seurat to group cells into clusters with the optimum cluster size to identify the cell type. Then to compress the data gathered in the chosen important main components to two dimensions, we employed tSNE (t-Distributed Neighbor Embedding), realizing a visual graph-based clustering of cells. In addition, 33 pyroptosis-related genes (PRGs) were obtained based on previous studies. In addition, 33 pyroptosis-related genes (PRGs) were obtained based on previous studies^[77]20. Cell annotation Cell annotation is the process of associating and labeling each cell in single-cell sequencing data with its corresponding cell type, which helps researchers identify the cell types of individual cells and explore transcriptional differences in different physiological states and diseases. For osteoarthritis samples, 14 clusters were visualized using tSNE. For the normal samples, a total of 10 clusters were visualized using tSNE, and seven distinct chondrocyte subtypes were revealed by artificial annotation of cell type marker genes. The marker genes of homeostatic chondrocytes (HomC) contain MMP3, FOSB and JUN. The marker genes of hypertrophic chondrocytes (HTC) contain COL10A1, IBSP and JUN. The marker genes of prehypertrophic chondrocytes (preHTC) include COL10A1, IBSP, COL2A1 and TGFBI. The marker genes of regulatory chondrocytes (RegC) include CHI3L1 and CHI3L2. The marker genes of fibrochondrocytes (FC) contain COL1A1, COL1A2, S100A4, PRG4 and TMSB4X. The marker genes of reparative chondrocytes (RepC) include COL2A1, CILP, COL3A1 and COMP. The marker genes of preFC contain IL11, COL2A1, CILP and OGN^[78]18,[79]19,[80]21–[81]23. To understand the expression patterns of diagnostic markers in patients with osteoarthritis, we compared gene expression in different cell groups. Immunologic estimate The level of tumor microenvironment cell, immune cell, and stromal cell invasion have significant influence on the prognosis. To clarify the influence of genes involved in immunity and stromal cells on prognosis, the “ESTIMATE” R package^[82]24 could leverage the uniqueness of the tumor sample's transcriptional pattern to deduce the composition of tumour cells and the various invading normal cells. We used the “ESTIMATE” to estimate the immune activity of samples by using the expression profile matrix data of the OA dataset. Then ESTIMAE score, Immune score and Stromal score of the characteristics of the expression matrix were calculated based on the ESTIMATE algorithm to quantify the immune and stromalscore components of the sample. Next, we separated OA samples into two categories, one with a high ImmuneScore and another with a low ImmuneScore, based on the midpoint of the distribution. The “limma” R package was applied to conduct differential expression analysis, and differentially expressed genes (DEGs) were discovered in distinct groups. We chose | logFC |> 0.5 and p value < 0.05 as the standard screened genes for further study. Thereinto, genes with logFC > 0.5 and p value < 0.05 were up-regulated genes and those with logFC < − 0.5 and p value < 0.05 were down-regulated genes. The consequences were presented as the volcano map drawn by “ggplot2” R package^[83]25 and the heatmap drawn by “pheatmap” R package^[84]26. Calculation of pyroptosis score based on OA dataset Firstly, the Mann–Whitney U test was used to evaluate the difference between OA and control samples with regards to PRG expression (WilCoxon rank sum test).With a P-value less than 0.05, the results were considered significant, and pyroptosis-related differentially expressed genes (PRDEGs) were discriminated. The single-sample gene-set enrichment analysis (ssGSEA) algorithm could determine how many times each gene appears in the database samples. The “GSVA” R package^[85]27 was applied to count the Pyroptosis score of each sample in the OA group by the ssGSEA algorithm on account of the PRDEGs expression matrix of each sample in the OA dataset. Based on the median Pyroptosis value, we separated the OA samples into two groups: High-Pyroptosis and Low-Pyroptosis. Enrichment analysis between the High-Pyroptosis score group and the Low-Pyroptosis score group For the goal of comprehending the distinctions between the two groups' biological systems, GSEA^[86]28 was performed using gene expression profile data of OA patients. GSEA is a computer tool that permits the evaluation of whether a particular gene set demonstrates numerically substantial improvements between two biological states. The “clusterProfiler” R package^[87]29 was used to perform GSEA, with a P-Value of less than 0.05 indicating considerably enhanced data. GSVA is a method of gene set enrichment analysis for unduplicated samples. The score of the corresponding gene set could be obtained for each sample through GSVA, and the differentially expression analysis could be performed for pathways of each sample to obtain differentially expressed pathways between groups. We downloaded the reference gene set from MSigDB database^[88]30 ([89]https://www.gsea-msigdb.org/gsea/msigdb/). Finally, the "GSVA" R program was utilized to find commonalities between samples from the high-risk and low-risk groups. Also, we used “limma” R package to conduct differentially expression analysis on scores of various pathways between two groups. The parameter is set to p value < 0.05 and | logFC |> 0, which is displayed by heatmap. Identification of key genes The goals of WGCNA^[90]31 are to identify modules of co-expressed genes, study how genetic network correlate with phenotypes, and dissect key genes within these networks. The pickSoftTreshold method ultimately determined a value of 12 to be the best possible soft threshold. Subsequently, using this flexible limit, we formed a topological matrix, clustered the data hierarchically, and built a scale-free network. With 50 as the minimum number of modules, gene modules were actively slashed and identified, and Eigengenes were computed. Correlation between modules was built on account of module Eigengenes, hierarchical clustering was conducted, and modules with correlation greater than 0.5 were combined again, and finally 5 modules were gained. The association between module pyroptosis scores was analysed using Pearson's method, and candidate module genes (MEGs) were eliminated by this process (MEGs). Subsequently, we took the intersection of DEGs and MEGs to obtain genes as key genes in this study. Construction of PPI network and identification of hub genes STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is an online tool used to search and analyze the interactions between proteins and genes. It provides extensive information on protein interactions, helping researchers explore protein–protein interaction networks and functional regulatory mechanisms. PPI network predictions and network construction were performed using the online STRING (a search engine used to find interacting genes) tool^[91]32. The PPI network was seen using Cytoscape^[92]33, and hub genes were filtered using the cytoHubba^[93]34 extension. GO and KEGG enrichment analysis Gene Ontology (GO) analysis has been utilized as a general approach for mass functional enrichment, including biological processes (BP), molecular functions (MF) and Cellular Component (CC)^[94]35. The Genomes, biological processes, illnesses, and medications material is stored in the Kyoto Encyclopedia of Genes and Genomes (KEGG)^[95]36. For both DEGs and hub genes, we performed GO annotation analysis and KEGG pathway enrichment analysis using the "clusterProfiler" R package. We arbitrarily chose a significance level of 0.05 for the FDR threshold. The top 8 outcomes of GO and the top 10 results of KEGG are shown in the bar chart and dot chart, respectively. Construction of mRNA-miRNA, mRNA-TF, mRNA-drugs interaction network ENCORI^[96]37 database (https: // Starbase.sysu.edu.cn/) is the 3.0 version of the starBase database. The ENCORI database provides multiple graphical user interfaces for discussing microRNA targets, including those between miRNA and ncRNA, miRNA and mRNA, ncRNA and RNA, RNA and RNA, and RBP and ncRNA or RBP and mRNA. These interactions are supported by data mined from CLIP-seq and degradation sequencing experiments. Predictions of miRNA target genes and their functions were made using the miRDB database^[97]38. In this study, miRNAs interacting with hub genes (mRNA) were forecasted in accordance with ENCORI and miRDB databases, and the mRNA-miRNA connection system was plotted using the intersection of the two databases. CHIPBase database^[98]39 ([99]https://rna.sysu.edu.cn/chipbase/) found a matrix of binding sites for hundreds of DNA motifs, and predicted millions of Transcription factor (Transcription factors, TF) and transcriptional regulatory relationships between genes from the DNA binding protein CHIP-seq data. hTFtarget database^[100]40 ([101]http://bioinfo.life.hust.edu.cn/hTFtarget) refers to an integrated database of human TF and target regulation. Using the CHIPBase (version 3.0) and hTFtarget databases, we looked for transcription factors (TFS) that bind to hub genes. Comparative Toxicogenomics Database (CTD)^[102]41 ([103]http://ctdbase.org/) is based on an innovative digital ecosystem linking chemicals, genes, phenotypes, diseases and known toxicological information, which could be used to facilitate access to human health related information. With References ≥ 3 and Organisms ≥ 3 as screening