Abstract Objective Diabetes mellitus combined with nonalcoholic fatty liver disease is a prevalent and intricate metabolic disorder that presents a significant global health challenge, imposing economic and emotional burdens on society and families. An in-depth understanding of the disease pathogenesis is crucial for enhancing diagnostic and therapeutic efficacy. Therefore, the study aims to identify and validate autophagy-related diagnostic biomarkers associated with T2DM-associated MAFLD, investigate regulatory mechanisms in disease progression, and explore cellular diversity within the same tissue using single-cell sequencing data. Methods This study utilized four datasets retrieved from the Gene Expression Omnibus (GEO) database: [35]GSE15653, [36]GSE89632, [37]GSE24807 and [38]GSE23343. The analysis involved variance analysis, WGCNA analysis, PPI network construction, machine learning application, examination of autophagy-related gene sets, and diagnostic ROC analysis to identify and validate autophagy-related biomarkers in T2DM combined with MAFLD within an independent external dataset. Functional enrichment analysis, immune infiltration analysis, and validation of gene significance in T2DM combined with MAFLD progression were conducted using animal experiments to understand the biological functions and immunomodulatory roles of key biomarkers. Cellular diversity within liver tissues was characterized at the single-cell level, exploring interrelationships, differentiation, and developmental trajectories among cell populations through cellular communication and pseudo-temporal analyses. Results The study identified four key biomarkers (IRAK3, TNFRSF1A, CX3CR1, JUNB). Real-time fluorescence quantitative PCR analysis in animal experiments demonstrated significantly higher mRNA expression levels of IRAK3, TNFRSF1A, CX3CR1, and JUNB in T2DM and MAFLD rat liver tissues compared to the control group. Quantitative immunohistochemical analysis revealed notably elevated protein expression levels of IRAK3, TNFRSF1A, CX3CR1, and JUNB in liver tissues of rats with T2DM and MAFLD when contrasted with the control group (P < 0.05). Enrichment analysis indicated associations of T2DM combined with MAFLD pathogenesis with pathways such as the NF-kappa B signaling pathway, MAPK signaling pathway, Fluid shear stress and atherosclerosis, Insulin resistance, and Cytokine-cytokine receptor interaction. Correlative analysis uncovered connections between immune infiltration and the identified genes. Single-cell transcriptomic analysis highlighted the differentiation of CX3CR1, JUNB, and TFRC in various single-cell-annotated populations. The pseudo-temporal analysis of epithelial cells identified enriched genes at crucial nodes related to “Leukocyte transendothelial migration”, “Lipid and atherosclerosis”, and “Type II diabetes mellitus” signaling pathways. Additionally, four cellular communication signaling pathways (TNF, CXCL, VEGF, and MIF) potentially significant in T2DM combined with MAFLD progression were identified through cell communication analysis. Conclusion This study unveiled potential associations and key biomarkers (IRAK3, TNFRSF1A, CX3CR1, JUNB) concerning T2DM combined with MAFLD and relevant pathways, offering novel insights for the investigation of these two conditions. Keywords: T2DM, MAFLD, single-cell RNA sequencing, biomarkers, immune signaling pathway 1. Introduction Type 2 Diabetes Mellitus (T2DM) is a prevalent metabolic disorder characterized by insulin resistance and relative insulin deficiency, resulting in hyperglycemia. The International Diabetes Federation reported approximately 537 million adults living with diabetes globally in 2021, with T2DM representing about 90% of these cases ([39]1). Metabolic dysfunction-associated fatty liver disease (MAFLD), formerly termed non-alcoholic fatty liver disease (NAFLD), is characterized by excessive fat accumulation in the liver and affects nearly 25% of the global population. The surge in MAFLD cases mirrors shifts in the environment and lifestyle driven by rapid global industrialization; environmental elements like air pollution may heighten MAFLD risk, compounding socio-economic challenges. In essence, MAFLD represents an escalating global health issue necessitating robust preventive and therapeutic strategies. The pathogenesis of MAFLD is multifactorial, heavily influenced by obesity, insulin resistance, and dyslipidemia. The co-occurrence of MAFLD in T2DM patients raises concerns as it increases the risk of liver-related complications and cardiovascular events ([40]2, [41]3). This relationship is intricate, with overlapping metabolic pathways and inflammatory processes that may drive disease progression. Furthermore, MAFLD is closely linked to metabolic dysregulation, primarily insulin resistance, which exacerbates liver fat accumulation and inflammation ([42]4). The interplay between T2DM and MAFLD is significant; both conditions share common risk factors and pathophysiological mechanisms, including obesity and dyslipidemia, creating a vicious cycle that further deteriorates metabolic health ([43]5). Given the insidious progression of T2DM with MAFLD, identification of novel diagnostic markers significantly impacts the early detection and treatment such kind of disease. The correlation between autophagy and MAFLD has garnered substantial attention. Autophagy, a conserved catabolic process, plays a crucial role in physiological functions by degrading misfolded proteins, eliminating dysfunctional organelles, and influencing growth and aging. Autophagy’s formation of autophagosomes, which ensnare substances for degradation and unite with lysosomes housing hydrolytic enzymes, results in the breakdown of encapsulated components into biomolecule monomers like amino acids, fatty acids, and nucleotides for cellular recycling. An integral mechanism of MAFLD involves diminished autophagic capacity; the accumulation of excess lipids in hepatocytes can hinder the autophagic process, impeding efficient lipid removal and exacerbating fatty liver progression. Moreover, oxidative stress emerges as a significant determinant of autophagic levels in MAFLD; augmented reactive oxygen species (ROS) production can disrupt autophagy regulation and functionality. Research indicates a prevalent magnesium deficiency in individuals with obesity and metabolic syndrome. Magnesium supplementation demonstrates effectiveness in ameliorating metabolic disorders such as obesity and fatty liver by modulating cellular lipid metabolism through autophagy stimulation via the AMPK/mTOR pathway, thereby curbing lipid buildup. Recognizing autophagy’s pivotal role in MAFLD pathogenesis, drug development targeting the autophagic process holds promise as a novel therapeutic avenue for MAFLD treatment. The interplay between T2DM and MAFLD has not been fully elucidated, particularly regarding the role of autophagy-related genes. To address this knowledge gap, our study aimed to examine the relationship between autophagy-related genes and endoplasmic reticulum stress-associated genes in the context of T2DM and MAFLD. We utilized relevant datasets from the GEO database for differential analysis to identify differentially expressed genes (DEGs). Subsequently, we employed WGCNA, PPI network analysis, machine learning techniques, diagnostic ROC analyses, and enrichment assessments to identify key biological markers and elucidate their pathways, gaining deeper insights into the underlying biological mechanisms. In our rat model of T2DM combined with MAFLD, we verified the expression levels of predicted genes, identifying IRAK3, TNFRSF1A, CX3CR1, and JUNB as potentially significant candidates. Further, leveraging single-cell analysis, we elucidated potential mechanisms underlying the progression of T2DM and MAFLD, focusing on precise cellular localization, differentiation, developmental trajectories within pertinent cell populations, and potential intercellular communication networks. This study not only enhances our understanding of the pathogenesis of T2DM combined with MAFLD but also identifies diagnostic factors linked to autophagy. Such advancements are vital for improving early identification, precise diagnosis, personalized treatment, and disease monitoring for these conditions. The findings provide novel opportunities and challenges for the prevention and treatment of T2DM and MAFLD. The flow diagram of this study is showed in the [44]Supplementary Figure 1 . 2. Material and methods 2.1. Identification of differential expression genes The datasets [45]GSE15653, [46]GSE89632, [47]GSE24807 and [48]GSE23343 were obtained from the GEO database through the GEOquery package (version 2.72.0) ([49]6),please refer to [50]Supplementary Table S1 for more details.[51]GSE15653 and [52]GSE89632 served as training sets, while [53]GSE24807 and [54]GSE23343 was utilized as the validation set. The MAFLD samples within the [55]GSE89632 dataset were categorized by different stages: Simple Steatosis (SS) and Non-Alcoholic Steatohepatitis (NASH), and were designated as GSE89632_Simple Steatosis (GSE89632_SS) and GSE89632_Non-Alcoholic Steatohepatitis (GSE89632_NASH), respectively. The data from [56]GSE15653, GSE89632_SS, and GSE89632_NASH were normalized using the normalizeBetweenArrays function from the limma package ([57]7). Principal component analysis (PCA) was performed on the normalized dataset. Difference analysis was performed separately to obtain the respective DEGs, using limma package (version 3.58.1), with screening criteria: |log2FC| > 0.5 and P < 0.05 ([58]8, [59]9). Subsequent analyses included the generation of PCA plots, volcano plots, and basic numerical heatmaps using “ggplot2,” alongside the creation of complex numerical heatmaps utilizing the “ComplexHeatmap” package (version 2.20.0) ([60]10). 2.2. WGCNA analysis Weighted gene co-expression networks were analyzed using the WGCNA package (version 1.72-5) ([61]11) for the datasets GSE89632_SS and GSE89632_NASH. Initially, the analysis focused on the top 25% of genes based on expression variance, identifying and excluding potential outlier samples through clustering analysis. Next, the pickSoftThreshold function was employed to ascertain the optimal soft threshold. After establishing the threshold, the cutreeDynamic function identified dynamic modules and set the minimum gene count required for each module ([62]12). Initial modules were identified through the Dynamic Tree Cut (DTC) method, with similar modules subsequently merged based on the clustering relationships of modular eigengene (ME). Subsequently, the TOMsimilarity function calculated the topological molecular similarity matrix among genes, with 1,000 genes randomly selected for heatmap visualization. Pearson correlation coefficients were computed to evaluate associations between modules and clinical traits, with statistical significance assessed via the corPvalueStudent function (WGCNA R package). Modules demonstrating significant correlations with both SS and NASH (p < 0.05) were selected for downstream analysis. KEGG pathway enrichment analysis was then systematically performed on the genes within these modules. 2.3. Protein-protein interaction network The results of DEGs of [63]GSE15653 were intersected with DEGs of GSE89632_SS, GSE89632_NASH, and WGCNA modular genes, respectively, and the results were visualized in an “UpSet plot”. In order to explore the potential interactions of the above genes, we used the STRING database ([64]https://string-db.org/) for the analysis ([65]13), and the results were imported into Cytoscape (version 3.10.2) for the analysis of protein interaction networks ([66]14). Meanwhile, the Maximal Clique Centrality (MCC), Degree and Molecular Network Centrality (MNC) algorithms in CytoHubba plug-in were applied to select the top 15 genes at key positions in the PPI network ([67]15). Subsequently, the results of the three algorithms were intersected and visualized using a Wayne diagram. Chromosomal localization analysis of the intersection hub genes was performed using the “circlize package”. Finally, the correlation analysis was performed using Spearman, and the correlation pie charts were used to show the degree of association and interactions between genes, which were visualized using the “ggplot2 package”. 2.4. Acquisition of autophagy-related genes In order to deeply investigate the molecular mechanisms related to autophagy in T2DM combined with MAFLD, we combined the resources from multiple databases. We obtained the autophagy-associated molecular mechanisms from PubMedGene database ([68]https://www.ncbi.nlm.nih.gov/gene/), Genecard database ([69]https://www.genecards.org/), and GSEA database ([70]https://www.gsea-msigdb.org/), autophagy-related websites ([71]http://www.autophagy.lu/), ([72]http://hamdb.scbdd.com/home/index/) to obtain autophagy-related genes. 2.5. Machine learning The GSE89632_SS and GSE89632NASH datasets underwent individual screenings using LASSO, RFE-SVM, and Boruta methods, respectively. The diagnostic LASSO coefficient screening leveraged the glmnet package (Version 4.1.7) to analyze the cleaned data, derive variable lambda, likelihood, and L1 regularization values, along with visualization. We employed ten-fold cross-validation (seed number: 2022) for validation ([73]16). Additionally, we utilized two feature selection strategies: Recursive Feature Elimination (RFE) and the Boruta algorithm. RFE eliminates features without significant impact on Accuracy during model training repetitions ([74]17), while the Boruta algorithm selects or eliminates significant and non-significant features in each base learner training iteration based on shadow features and Z-scores ([75]18). This study integrated SVM base learner scores from the e1071 package (Version 1.7-14) with the RFE strategy ([76]17) through 10*10 fold cross-validation and applied the Boruta algorithm using the Boruta package (Version 8.0.0) ([77]18). Subsequently, the outcomes of Lasso, Boruta, and RFE-SVM screenings, combined with autophagy-related genes, were intersected to identify diagnostic biomarkers linked to autophagy in T2DM and NAFLD. These results were visually represented in Wayne diagrams. 2.6. Receiver operating characteristic analysis In datasets GSE89632_SS, GSE89632_NASH and [78]GSE15653, ROC curve analysis was performed on each of the three datasets using the pROC software package to determine the sensitivity and specificity of the above genes ([79]19), to predict the ROC-related information and data of the variables at their respective cut-off values, and to assess the diagnostic accuracy, and the results were presented as ROC The results were quantified by area under the curve (AUC), and the AUC ranges between 0.5 and 1, and values closer to 1 indicate superior diagnostic performance of the variable in predicting clinical outcomes and visualized using ggplot2. 2.7. Enrichment analysis To investigate the functions and pathways of the above biomarkers, we used the DAVID online database ([80]https://david.ncifcrf.gov/) for GO and KEGG enrichment analysis ([81]20–[82]22). The “org.Hs.eg.db” package was used to convert the ID of the input gene list, and the “clusterProfiler package” was used for enrichment analysis ([83]23). The significance of the enrichment results was assessed by calculating the z-score value for each enriched entry using the “GOplot” package ([84]24). The results were screened for significance and biological significance at P<0.05 and FDR<0.2 ([85]25). At the same time, GSEA(Gene Set Enrichment Analysis) was performed ([86]26), and the results were screened according to the following criteria: normalized enrichment score |NES| > 1, FDR < 0.25, p.adjust < 0.05. Finally, the results were visualized using the “ggplot2 package”. 2.8. Immune cell infiltration analysis In order to explore the immune cell infiltration in liver tissues of patients with T2DM combined with MAFLD, the “Single Sample Gene Set Enrichment Analysis” (ssGSEA) method was used in this study ([87]27). The ssGSEA algorithm provided by the GSVA package was utilized for this purpose. Using markers specific to each class of immune cells ([88]28) as gene sets, the enrichment score for each class of immune cells in each sample was calculated to assess the infiltration of immune cells in each sample. All analyses and visualizations were performed in R 4.2.1. We used the ggplot2 package to draw histograms to visualize the differences in immune cell infiltration between the normal and control groups. In addition, Spearman statistics were used to analyze the correlation between them one by one, and the linkET package was used for the calculation of the data portion of the network graph. The analysis results were visualized by ggplot2 package for group comparison plots, lollipop plots, correlation scatter plots, and correlation network heatmaps, thus demonstrating more intuitively the infiltration of immune cells in liver tissues of patients with T2DM combined with MAFLD. 2.9. Preparation of T2DM combined with MAFLD rat model Thirty healthy clean-grade male Sprague–Dawley rats, aged 6–8 weeks, weighing 180–220 g, were provided by Wuhan Yunkron Technology Company. The animal production and use license number is DCXR(E)2018-0021. SYXK (E) 2013-0069, respectively. After 1 week of normal feeding, high-fat and high-sugar diets and regular diets were prepared according to the formula. Twenty rats were fed with high-fat and high-sugar diet, high-fat and high-sugar diet feed formula: 10% lard, 20% sucrose, 2% cholesterol, 60% ordinary feed, 8% egg yolk powder(Jiangsu Xietong Pharmaceutical Bio-engineering Co., Ltd.) and 10 rats in normal control group were fed with normal diet. Twenty rats were fed a 100 g high-fat and high-sugar diet daily with free access to food and water. Corn oil 5 mL/kg was given by gavage on an empty stomach at 8:00 am every morning. At the end of the 12th week, 10 rats in the model group were given intraperitoneal injection of streptozotocin (STZ) (30 mg/kg) overnight fasting, and the other 10 rats in the model group were not given STZ injection. Ten control rats fed a regular diet were injected with the corresponding volume of citrate buffer. Blood samples were collected from the tail vein of 10 rats fed a high-fat and high-sugar diet and injected with STZ 3 days later, and random blood glucose was≥16.7 mmol/L. Blood glucose was monitored after 2 weeks, and fasting blood glucose was detected on the 2nd, 4th, 6th, 8th, 10th, 12th and 14th days. At the 14th week, 4 rats in each group were randomly selected to complete liver ultrasound, and fatty liver results were formed to determine the success of modeling. Finally, the rats were divided into 3 groups: (1) T2DM+MASLD group (n=10); (2) High-fat and high-glucose group (n=10); (3) normal control group (n=10); After 15 weeks, the rats were sacrificed under anesthesia, and the livers were collected, rinsed with normal saline at 4°C, weighed and placed in preservation solution and stored in a refrigerator at −80°C. 2.10. Quantitative real time PCR Real-time fluorescence quantitative PCR (Polymerase Chain Reaction): 0.15 g of rat liver tissue was taken from each group, and total RNA was extracted by the Trizol method. cDNA was reverse-transcribed into the corresponding cDNA in accordance with the reverse-transcription kit, and then subjected to real time Polymerase Chain Reaction (PCR)、RELB、S100A9 and SOCS1 primer sequences are shown in [89]Supplementary Table S2 , Chain Reaction), PCR reaction total system 20μL, PCR amplification conditions: denaturation at 95°C for 10 min, annealing at 60°C for 1 min, extension at 95°C for 15 s, 40 cycles, each sample set up 3 replicate wells, internal reference GAPDH. The results of the experiments were analyzed by Bio-Rad Fluorescence Quantitative Analysis. The results of the experiments were read by Bio-Rad fluorescence quantitative analysis software, and the quantitative calculation of the mRNA expression levels of and in each group was expressed by 2−ΔΔCt (CT is the number of cycles). The main observation indexes were RELB, S100A9, SOCS1 mRNA expression levels in rat liver tissues in each group. 2.11. Immunohistochemical staining Kidney tissue specimens were fixed for 48h and paraffin-embedded. The paraffin-embedded tissue sections were serial sectioned at 3 μm, dewaxed and hydrated. 3% H 202 was incubated at room temperature for 10 min to remove endogenous peroxidase. After 10 min of citrate antigen repair, primary antibodies RELB (1:50), S100A9(1:300), SOCS1 (1:300), were added dropwise and incubated at 37°C for 25 min, and then freshly prepared DAB color development solution was added dropwise. The film was stained with hematoxylin and sealed with neutral gum. At the same time, PBS was used as negative control instead of primary antibodies. The immunohistochemical pathology pictures were analyzed using ImageScope software as follows: three different 400x fields of view were intercepted from each section, enter the analysis module of lmageScope software, and set all the dark brown color on the tissue sections as strong positive, brownish yellow as moderate positive, light yellow as weak positive, and blue nuclei as negative. Each tissue point was then identified and analyzed to find out the area (in pixels) of strong positive, moderate positive, weak positive and negative, the percentage of positivity and finally the H-score rating. 2.12. Single-cell data preprocessing and clustering annotation High-throughput sequencing expression profile data ([90]GSE136103) were downloaded from the Gene Expression Omnibus (GEO) database ([91]http://www.ncbi.nlm.nih.gov/) ([92]29). Twenty liver tissue samples were selected, including 9 MAFLD patients and 11 healthy individuals. Doublets in each sample were identified using DoubletFinder ([93]30). Subsequently, doublets were filtered out, retaining cells meeting the following criteria: nFeature_RNA > 500 and < 3000; nCount_RNA < 10,000; percent_mito < 10. Cells from different samples were then merged using the merge function and analyzed using the Seurat package ([94]31). Using the NormalizeData function, data were normalized for each library with the LogNormalize method and scale.factor of 10,000. The top 2,000 variable features were identified through the FindVariableFeatures function ([95]32), and linear dimensionality reduction was performed using the RunPCA function with default parameters (npcs = 50). Batch effects were corrected using the RunHarmony function ([96]33), followed by nonlinear dimensionality reduction using the RunUMAP function. The nearest neighbor graph was constructed using the FindNeighbors function (k.param = 20) with 30 principal components. Cell clustering was completed using the FindClusters function. Cluster marker genes were identified through the FindAllMarkers function with min.pct = 0.5 and logfc.threshold = 1. Single-cell samples were clustered and characterized using known lineage markers and manually annotated with reference to the human liver cell atlas on the CellMarker2 website ([97]http://117.50.127.228/CellMarker/) ([98]34). Differential expression analysis between cell clusters was performed using the FindMarkers function (Wilcoxon rank-sum test, p_val < 0.05 & abs(avg_log2FC) > 0.5). 2.13. Exploring cell clusters associated with MAFLD To further explore intergroup heterogeneity across cell clusters, we analyzed proportional differences in cell counts between control and disease groups within distinct clusters, intergroup density variations, and distributional biases across conditions. Additionally, to test changes in cellular abundance at high resolution between groups/conditions, we employed the neighborhood abundance method in Milo (v2.2.0) ([99]35) using the functions “buildGraph” and “makeNhoods” with parameters: “K=30”, “d=50”, “prop=0.2”, “refined=TRUE”, and “refinement_scheme=reduced_dim”, indicating sampling using dimensionality-reduced data, with remaining parameters selected according to Milo’s standard workflow recommendations. For pathway activity scoring of each cell cluster, we used the R package AUCell (v1.28.0): first, gene expression rankings per cell were computed via AUCell_buildRankings using the expression matrix with default parameters; then, custom-defined gene sets constructed from key genes identified in bulk-RNA analyses were applied to score individual cells; during this process, for each cell, the Area Under the Curve (AUC) value was calculated with AUCell_calcAUC based on gene expression rankings, where the AUC value represents the proportion of genes in the custom-defined set appearing among top-ranked genes per cell ([100]36). Gene Set Variation Analysis implemented in the GSVA package (v2.0.0) was applied for gene set enrichment analysis using the “HALLMARK gene set” exported via GSEABase (v1.44.0). Intergroup differences in pathway activity scores per cell cluster between control and disease groups were computed using the LIMMA package (v3.62.2). Finally, based on intergroup heterogeneity patterns across cell clusters, we extracted Liver Sinusoidal Endothelial Cells (LSECs) and repeated the standard Seurat V5 workflow on these cells to identify and visualize endothelial subpopulations exhibiting significant distributional differences between groups. 2.14. Pseudo-temporal analysis Unsupervised pseudotemporal analysis was performed using the “Monocle” package (v2.34.0) ([101]37). First, a cell dataset containing the expression matrix, phenotypic data, and feature data was constructed with the newCellDataSet function (expressionFamily = negbinomial.size). Next, size factor dispersion and gene expression across cells were corrected via estimateSizeFactors and estimateDispersions functions. Subsequently, dimensionality reduction was conducted using the DDRTree method (parameter max_components = 2), and cell ordering and visualization were performed with plot_cell_trajectory ([102]38), while filtering highly variable genes correlated with pseudotime; expression changes of these genes along pseudotime were analyzed using the differentialGeneTest function. Finally, the filtered genes were clustered into distinct groups based on expression patterns and visualized using plot_pseudotime_heatmap. To identify genes bifurcating cells into different branches while exploring specific roles of key genes from bulk RNA analyses in single-cell pseudotime trajectories, we conducted “Branched Expression Analysis Modeling” (BEAM) analysis ([103]39), further enhancing the filtering threshold; genes from BEAM analysis were visualized via plot_genes_branched_heatmap. Subsequently, KEGG functional enrichment analysis was performed for genes within each cluster using clusterProfiler (v4.14.4) and org.Hs.eg.db (v3.20.0) packages. 2.15. Cell–cell communication Initially, to further explore the LSEC clusters derived from prior research that may be closely associated with MAFLD progression, we processed and re-clustered LSECs using the standard Seurat V5 workflow. Then, by comparing these clusters through UMAP visualization of MAFLD and healthy control groups, the LSEC subpopulation exhibiting the most distinct cellular distribution was identified. Additionally, beyond LSECs, we identified cell clusters specifically distributing “bulk-RNA key genes” through scatter plots of gene distribution across cells; these cell clusters were simultaneously analyzed with LSECs for cell-cell communication analysis using the “CellChat” package (v2.1.2) ([104]40). A random seed was set (seed=0528), and 8000 cells were randomly selected to create a CellChat object. Ligand-receptor interactions of “secretory signaling” were analyzed using “human” data from CellChatDB. The netVisual_circle function displayed the number and strength of inter-cell connections, while the netVisual_bubble function illustrated communications between different cell types. Subsequently, the netVisual_aggregate function demonstrated the communication network of a specific signaling pathway and calculated the contribution of various ligands. The mutual communication network between cells was visualized using the netVisual_individual function. Network centrality analysis was conducted using the netAnalysis_computeCentrality function ([105]41) and represented as a heatmap. 2.16. Statistical analyses Data analysis utilized SPSS 26.0 statistical software. Normally distributed measurement data were expressed as mean ± standard deviation and compared between two groups using independent samples t-test. Non-normally distributed measurement data were presented using the median and interquartile range, and comparisons between groups were conducted using the Mann-Whitney test. Statistical significance was set at P < 0.05. 3. Results 3.1. Screening of DEGs The datasets [106]GSE15653 and [107]GSE89632 were downloaded from GEO data. The [108]GSE89632 dataset was divided into GSE89632_SS group and GSE89632_NASH group according to SS and NASH samples. PCA principal component analysis was performed separately, and in the [109]GSE15653 dataset, PC1 was 26.2% and PC2 was 8.7% ([110] Figure 1A ); in the GSE89632_SS dataset, PC1 was 30.6% and PC2 was 12.1% ([111] Figure 1B ); in the GSE89632_NASH dataset, PC1 was 30.3% and PC2 was 12.1% ([112] Figure 1C ). The volcano plot showed that 1686 differentially expressed genes were identified in the [113]GSE15653 dataset with |log2(FC)|>0.5, p-value<0.05 as the screening threshold, of which 864 were up-regulated and 822 were down-regulated ([114] Figure 1D ). In the GSE89632_SS dataset, 2219 differentially expressed genes were identified, of which 1003 genes were up-regulated in expression and 1216 genes were down-regulated in expression, as in ([115] Figure 1E ). In the GSE89632_NASH dataset, 2248 differentially expressed genes were identified, of which 1062 genes were up-regulated in expression and 1186 genes were down-regulated in expression, as in ([116] Figure 1F ). The ring value heatmap and complex value heatmap showed the genes ranked top40 in differential folds in [117]GSE15653 ([118] Figure 1G ), GSE89632_SS ([119] Figure 1H ) and GSE89632_NASH ([120] Figure 1H ), respectively. Figure 1. [121]Graphs and heatmaps illustrating gene expression data. Panels A-C show PCA plots with distinct groupings: T2DM, SS, and NASH, compared to Control groups. Panels D-F depict scatter plots with data from GSE15653, GSE89632_SS, and GSE89632_NASH; points indicate significant expression changes. Panel G is a circular heatmap showing gene expression in GSE15653, indicating up- and down-regulated genes. Panel H shows a heatmap for GSE89632, with color codes for gene expression levels, comparing Normal and Disease across SS and NASH. [122]Open in a new tab Genes of interest. (A–C) Principal component analysis of [123]GSE15653, GSE89632_SS and GSE89632_NASH. (D–F) Volcano plots of differentially expressed mRNAs in [124]GSE15653, GSE89632_SS, and GSE89632_NASH, |logFC| > 0.5, P < 0.05. (G, H) Expression heatmaps of the top40 DEGs in the [125]GSE15653 and [126]GSE89632 datasets. 3.2. WGCNA analysis Weighted gene co-expression network analysis (WGCNA) was performed on [127]GSE89632SS, [128]GSE89632NASH. Hierarchical clustering was performed based on similarity or correlation between samples and visualized as a hierarchical dendrogram ([129] Figures 2A, E ). The distance between samples indicated the degree of similarity or correlation between samples, and outlier samples were removed by pruning operation, and the outlier samples rejected were [130]GSM2385767 and [131]GSM2385782, and the most representative subset of samples was selected, and the clustering differences between different samples in the Normal group and the MAFLD group were visualized in a joint heat map ([132] Figures 2C, G ). The top 25% of genes with the largest fluctuations were selected for WGCNA analysis based on standard deviation ranking. The scale-free fit indices and average connectivity of various soft-threshold powers were evaluated on the basis of scale-free R2. Among them, GSE89632_SS selected soft threshold powers with β = 11 and scale-free R2 = 0.8 ([133] Figure 2B ), and [134]GSE89632NASH selected soft threshold powers with β = 5 and scale-free R2 = 0.8 ([135] Figure 2F ). Gene clustering tree with module identification demonstrates the results of hierarchical clustering of genes in the two datasets and categorizes genes into different modules, further merging similar modules to reduce redundancy ([136] Supplementary Figures 2A, D ). The hierarchical clustering results between different modules were visualized using module clustering dendrogram ([137] Supplementary Figures 2B, E ), while the correlations between different modules were visualized with modal correlation heatmap ([138] Supplementary Figures 2C, F ). Finally, [139]GSE89632SS identified 7 modules ([140] Figure 2D ), GSE89632_NASH identified 7 modules ([141] Figure 2H ). We performed KEGG enrichment analysis on genes within the "brown" modules of GSE89632_SS and GSE89632_NASH. The results demonstrated that in the SS group, pathways including TNF-JNK signaling pathway,IL-1/IL-1R-JNK signaling cascade, and TLR2/4-MAPK signaling axis collectively indicate the role of pro-inflammatory signaling in driving steatosis progression toward fibrosis. Concurrently, pathways such as Environmental factor-induced PI3K signaling pathway and Environmental factor-mediated RAS-ERK signaling cascade (activated by heavy metals) suggest synergistic interactions between environmental carcinogens and metabolic dysregulation. In the NASH group, the recurrent enrichment of TNF-JNK signaling pathway and Environmental factor-mediated RAS-ERK signaling cascade further validates conserved mechanisms underlying inflammation-carcinogenesis transition. The observed negative correlation may reflect compensatory dysregulation of protective mechanisms during disease progression, where downregulation of module genes attenuates their inhibitory effects on pro-inflammatory/carcinogenic pathways, thereby accelerating pathological advancement ([142] Supplementary Files 1 , [143]2 ). Based on this observation, we selected the "brown" modules from both SS and NASH WGCNA cohorts. Figure 2. [144]Panel A and E show hierarchical cluster trees for sample outlier detection. Panels B and F display scale independence and mean connectivity plots. Panels C and G present gene dendrograms with module colors. Panels D and H feature heatmaps of module-trait relationships with different color scales, showing correlations between modules and traits like “Normal” vs. “SS” and “Control” vs. “NASH”. [145]Open in a new tab WGCNA.(A–D) GSE89632_SS.(E–H) GSE89632_NASH.(A, E) Sample dendrogram with outlier samples above the red line. (B, F) Scale-free fit indices and average connectivity for different soft thresholds. (C, G) Clustering dendrogram after fusing similar modules. (D, H) Module-trait association plot. 3.3. Protein-protein interaction network UpSet plots visualize the results of intersections of DEGs of [146]GSE15653 taken with GSE89632_SS, GSE89632_NASH, SS WGCNA, NASH WGCNA, respectively. ([147] Figure 3A ). In order to better understand the interactions between genes, we used the STRING database and constructed the PPI network ([148] Figures 3B, C ), while the CytoHubba plugin was used to identify the top 15 hub genes in the GSE89632_SS group and the GSE89632_NASH group, respectively, based on the MCC, MNC, and Degree algorithms ([149] Figures 3D–F , [150]3H–J ), and then these three algorithms took the intersection to obtain 13 hub genes each ([151] Figures 3G , [152]K ). Next, we investigated the chromosomal localization ([153] Figures 3L , [154]M ) and interrelationships ([155] Figures 3N , [156]O ) of the 13 hub genes. Figure 3. [157]Image showing multiple panels of graphical data analysis related to gene expression. Panel A presents an upset plot indicating shared features among datasets. Panels B and C display circular network graphs with nodes representing genes, colored by degree of connectivity, from dark purple to light. Panels D to J feature similar network graphs with genes highlighted in red, orange, or yellow based on specific criteria. Panels G and K display Venn diagrams illustrating intersections among different metrics. Panels L and M show circular diagrams mapping gene positions. Panels N and O contain heatmaps detailing gene expression levels across conditions, using red and green scales. [158]Open in a new tab Protein-Protein Interaction (PPI) Network and 13 Hub Genes. (A) UpSet Plot. (B, C) PPI networks for GSE89632_SS and GSE89632_NASH. (D–F) GSE89632_SS: Top 15 genes identified by the MCC, MNC, and Degree algorithms. (H–J) GSE89632_NASH: Top 15 genes identified by the MCC, MNC, and Degree algorithms. (G) GSE89632_SS: Venn diagrams illustrating the intersecting genes across the three algorithms. (K) GSE89632_NASH: Venn diagram depicting the intersection of genes identified by the three algorithms. (L) GSE89632_SS: Chromosomal localization of 13 hub genes. (M) GSE89632_NASH: Chromosomal localization of 13 hub genes. (N) GSE89632_SS: Correlation heatmap of 13 hub genes. (O) GSE89632_NASH: Correlation heatmap of 13 hub genes. 3.4. Shared screen of multiple machine learning algorithms We employed Lasso, Boruta, and RFE-SVM algorithms to identify hub genes associated with T2DM and MAFLD in the GSE89632_SS and GSE89632_NASH datasets. In GSE89632_SS, Lasso analysis revealed four significant variables: CLDN7, IRAK3, TFRC, and TNFRSF1A ([159] Figures 4A, B ). The Boruta algorithm identified ten significant variables, including CDKN1A, GJA1, IL1B, IRAK3, LBP, SOCS1, STAT3, TFRC, TLR2, and TNFRSF1A ([160] Figure 4C ). RFE_SVM highlighted eight features: IRAK3, MMP9, TNFRSF1A, TLR2, JUNB, TFRC, RELB, and CDKN1A ([161] Figure 4D ). In GSE89632_NASH, Lasso analysis identified five significant variables: ASPM, CCNF, CLEC4E, CX3CR1, and JUNB ([162] Figures 4F, G ). Furthermore, the Boruta algorithm filtered eleven significant variables, specifically ASPM, CCNF, CLEC4E, CX3CR1, GJA1, IL1B, IRAK3, JUNB, SOCS1, TLR2, and TNFRSF1A ([163] Figure 4H ). Two features were identified through RFE_SVM: JUNB and CX3CR1 ([164] Figure 4I ). Subsequently, we intersected the machine learning results from both datasets with autophagy-related genes to derive diagnostic biomarkers associated with autophagy in type 2 diabetes mellitus combined with NAFLD ([165] Figures 4E, J ). Figure 4. [166]Composite image of multiple data visualizations and analyses. Panel A shows a line plot of binomial deviance versus log lambda with error bars. Panel B displays a line graph of coefficients over log lambda values. Panel C is a boxplot of importance Z-scores for various attributes. Panel D shows a line plot of out-of-bag error rate versus number of features. Panel E is a Venn diagram illustrating overlaps in selected features across methods. Panels F and G repeat Panels A and B with different data. Panel H is a similar boxplot to Panel C. Panel I repeats Panel D. Panel J mirrors Panel E with different overlaps. Panels K to O are ROC curves comparing sensitivity and specificity across different studies. [167]Open in a new tab Machine learning screening biomarker. (A–E) Dataset GSE89632_SS. (F–J) Dataset GSE89632_NASH. (A, F) Cross validation of parameter selection in Lasso regression. (B, G) Lasso regression for 13 hub genes. (C, H) Boruta algorithm feature gene screening. (D, I) SVM-RFE algorithm feature gene screening. (E, J) Wayne plots. (K) ROC curves for IRAK3, TFRC, TNFRSF1A in GSE89632_SS. (L) ROC curves for CX3CR1, JUNB in GSE89632_NASH. (M) ROC curves for IRAK3, TFRC, TNFRSF1A, CX3CR1, JUNB in [168]GSE15653. (N) ROC curves for IRAK3, TFRC, TNFRSF1A, CX3CR1, JUNB in [169]GSE24807. (O) ROC curves for IRAK3, TFRC, TNFRSF1A, CX3CR1, JUNB in [170]GSE23343. AUC, area under the curve; TPR, true positive rate; FPR, false positive rate. 3.5. Performance of five diagnostic biomarkers We assessed the diagnostic performance of IRAK3, TFRC, TNFRSF1A, CX3CR1, and JUNB using the datasets GSE89632_SS, GSE89632_NASH, and [171]GSE15653 for training, followed by plotting ROC curves. The results indicated that in GSE89632_SS, the AUC for IRAK3, TFRC, and TNFRSF1A exceeded 0.8 ([172] Figure 4K ). In GSE89632_NASH, the AUC for CX3CR1 and JUNB also surpassed 0.8 ([173] Figure 4L ), reflecting their high predictive accuracy. In [174]GSE15653, the AUCs for all five genes were above 0.8 ([175] Figure 4M ). Subsequently, we conducted external validation using independent datasets [176]GSE24807 and [177]GSE23343. The results showed that in [178]GSE24807, the AUC values of IRAK3, TNFRSF1A, CX3CR1, and JUNB were all greater than 0.6, while in [179]GSE23343, the AUC values of IRAK3, TNFRSF1A, CX3CR1, and JUNB were all greater than 0.5 ([180] Figures 4N, O ). These results indicate that in the two independent external validation sets, these indicators also possess certain diagnostic value, which is consistent with our previous predictions. Ultimately, we identified IRAK3, TNFRSF1A, CX3CR1, and JUNB as diagnostic biomarkers for T2DM complicated with MAFLD associated with autophagy and endoplasmic reticulum stress. 3.6. Enrichment analysis of GO and KEGG During the GO and KEGG enrichment analysis, TNFRSF1A and JUNB were predominantly enriched in the TNF signaling pathway, while TNFRSF1A and TFRC showed significant enrichment in the HIF-1 signaling pathway. Furthermore, TNFRSF1A demonstrated substantial enrichment in the NF-kappa B signaling pathway, MAPK signaling pathway, Fluid shear stress and atherosclerosis, and Insulin resistance ([181] Figure 5A , [182]Figure 5C ). CX3CR1 exhibited notable enrichment in the Cytokine-cytokine receptor interaction pathway, and TFRC displayed significant enrichment in the Hematopoietic cell lineage ([183] Figures 5B, D ). Figure 5. [184]Four panels labeled A, B, C, and D depict different pathway analyses. Panels A and B are bubble charts showing pathways enriched in biological processes (BP), molecular functions (MF), cellular components (CC), and KEGG categories, with significance measured on the x-axis. Panels C and D are heat maps displaying gene-pathway interactions for KEGG pathways, with panel C highlighting IL1B, JUNB, RELB, TFRC, TNFRSF1A and panel D showing CXCR1, IL1B, JUNB, RELB, SOCS1, TFRC, TNFRSF1A. Ontology labels BP, CC, MF, and KEGG are specified in legends. [185]Open in a new tab Functional enrichment analysis. (A, B) Lollipop diagrams showing GO and KEGG enrichment analysis of GSE89632_SS group and GSE89632_NASH group, respectively. (C, D) Heatmaps showing the pathway enrichment of some hub genes in GSE89632_SS and GSE89632_NASH groups, respectively. 3.7. Enrichment analysis of GSEA In Enrichment analysis of GSEA, IRAK3、TNFRSF1A 和 JUNB commonly enriched in REACTOME_SIGNALING_BY_INTERLEUKINS([186] Supplementary Figure 3A , [187]Supplementary Figure 3E , [188]Supplementary Figure 3M ).IRAK3 mainly enriched in BIOCARTA_IL1R_PATHWAY([189] Supplementary Figure 3B );REACTOME_TOLL_LIKE_RECEPTOR_TLR1_TLR2_CASCADE([190] Supplementary Figure 3C );REACTOME_TOLL_LIKE_RECEPTOR_CASCADES ([191] Supplementary Figure 3D ).TNFRSF1A mainly enriched in KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION ([192] Supplementary Figure 3F );REACTOME_INTERLEUKIN_10_SIGNALING ([193] Supplementary Figure 3G );KEGG_MAPK_SIGNALING_PATHWAY([194] Supplementary Figure 3H ).TFRC mainly enriched in PID_HIF1_TFPATHWAY([195] Supplementary Figure 3I );KEGG_HEMATOPOIETIC_CELL_LINEAGE ([196] Supplementary Figure 3J ). JUNB mainly enriched in REACTOME_INTERLEUKIN_4_AND_INTERLEUKIN_13_SIGNALING([197] Supplementary Figure 3K );WP_TGFBETA_SIGNALING_PATHWAY([198] Supplementary Figure 3L ); PID_IL6_7_PATHWAY([199] Supplementary Figure 3N );REACTOME_SIGNALING_BY_TGF_BETA_RECEPTOR_COMPLEX ([200] Supplementary Figure 3O ). 3.8. Infiltration and functional exploration of immune cells To assess immune cell infiltration and functional differences between the T2DM and control groups, we utilized the ssGSEA method to evaluate various immune cell subpopulations. The group comparison plot revealed increased levels of Activated dendritic cells, Monocytes, and Regulatory T cells in the T2DM group, with decreased levels of Memory B cells ([201] Figure 6A ). The correlation coefficient’s absolute value indicates the strength of correlation: 0.3-0.5 denotes weak correlation, 0.5-0.8 signifies moderate correlation, and 0.8–1 represents strong correlation; P < 0.05 indicates statistical significance. In the T2DM group, IRAK3 exhibited a negative correlation with Natural killer cell infiltration level (R=-0.933) ([202] Figure 6B , [203]Supplementary Figure 4A ). The correlations between TNFRSF1A and Effector memory CD8T cells, Gamma delta T cells, Macrophages, Natural killer cells, Type 1 T helper cells, and Regulatory T cells were positive (R=0.733, 0.700, 0.717, 0.717, 0.700, 0.733) ([204] Figure 6C , [205]Supplementary Figures 4B–G ). CX3CR1 showed a negative correlation with Type 17 T helper cell infiltration (R=-0.983) ([206] Figure 6D , [207]Supplementary Figure 4H ). JUNB displayed a positive correlation with Monocyte infiltration (R=0.700) ([208] Figure 6E , [209]Supplementary Figure 4I ). Additionally, correlations existed between different types of immune cells ([210] Supplementary Figure 4J ). Figure 6. [211]Panel A shows a bar graph comparing cell value differences between control and MAFLD groups. Panels B to E present correlation plots for genes IRAK3, TNFRSF1A, CX3CR1, and JUNB with different immune cells, highlighting correlations, p-values, and variance for each. [212]Open in a new tab Immune cell infiltration assessment. (A) Subgroup comparison plot demonstrating the difference in immune cell infiltration between the two groups as calculated by the ssGSEA algorithm. (B–E) Lollipop plots showing the correlation between IRAK3 (B), TNFRSF1A (C), CX3CR1 (D), JUNB (E) and immune cells. Note: Significance levels are denoted as follows: ns p ≥0.05; * p < 0.05; ** p < 0.01; *** p < 0.001. 3.9. Real-time fluorescence quantitative PCR results The results of real-time fluorescence quantitative PCR analysis showed that there were significant differences in the expression levels of IRAK3, TNFR1, CX3CR1 and JUNB between the two groups (P < 0.05); among them, the mRNA expression levels of IRAK3, TNFR1, CX3CR1 and JUNB in the liver tissues of rats in the T2DM combined with MAFLD group were significantly higher than that in the control group (P > 0.05).), as shown in [213]Figure 7F . Figure 7. [214]Histological and graphical analysis of biomarkers in liver conditions. Panels A-D show histology images comparing control and T2DM & MAFLD groups at 40x magnification, focusing on CX3CR1, LRAK3, JUNB, and TNFR1. Panels E and F display bar graphs indicating integrated optical density and mRNA expression levels for these biomarkers, with statistical significance noted. Comparison shows higher levels in T2DM & MAFLD than controls. [215]Open in a new tab (F). IRAK3, TNFR1,CX3CR1, JUNB mRNA expression in liver tissues of rats in T2DM and MAFLD group and control group. (A–E), Expression levels of antigen-antibody complexes in the liver tissues of rats in the T2DM and MAFLD groups and the Control group. Note: Significance levels are denoted as follows: ns p ≥0.05; * p < 0.05; ** p < 0.01; *** p < 0.001. 3.10. Immunohistochemical results Immunohistochemical picture analysis showed that the expression levels of the four factors, CX3CR1, IRAK3, JUNB, and TNFR1, were significantly different between the T2DM & MAFLD groups and the Control group (P < 0.05); and the T2DM combined with MAFLD group were significantly higher than the Control group. For details, see [216]Supplementary Table S3 and [217]Figures 7A–E . 3.11. Single-cell data preprocessing and clustering annotation A comprehensive scRNA-seq analysis was conducted using publicly available data from healthy and MAFLD liver samples. The clustering results, illustrated through dendrograms([218] Supplementary Figure 5A ) and marker genes bubble plots ([219] Figure 8B ), culminated in the identification of 13 major cell clusters at a resolution of 0.6. Distinct cell types were pinpointed, including Cycling(cycling cells) (STMN1,MKI67), B_cell (CD79A,CD79B,MS4A1), Plasma(Plasma cell) (CD79A,IGHA1), pDC (plasmacytoid dendritic cell) (LILRA4,CLEC4C), Mes(mesenchymal cell) (PDGFRB,ACTA2,COL1A1,COL1A2,COL3A1,DCN), Cho (Cholangiocyte) (EPCAM,KRT19,FXYD2), T_cell (CD3D,CD3E,CD3G,CD8A), ILC (innate lymphoid cell) (KLRF1,KLRC1,GZMB,NKG7), Kupffer (CD163,MARCO), Neu(Neutrophils) (S100A8,S100A9), cDC(Conventional Dendritic Cell) (CLEC10A,CD86), LSEC(Liver Sinusoidal Endothelial Cel) (CLEC14A,CD34,VWF,STAB2,CLDN5) and LEC(Lymphatic Endothelial Cells) (PROX1,TFF3,CCL21). The findings were further visualized using UMAP ([220] Figure 8A ). Figure 8. [221]Panel A shows a UMAP plot clustering cell types with different colors. Panel B is a dot plot presenting gene expression across identities with varying dot sizes and shades. Panel C displays stacked bar charts comparing cell type ratios between control and MAFLD groups. Panel D illustrates a stacked bar chart of cell type distribution across samples. [222]Open in a new tab Cell types annotation. (A) UMAP plot showing the clustering results of cells at 0.6 resolution. (B) Bubble diagram showing Marker genes used to identify major cell types. (C) Percentage of different cell types between Control and MAFLD groups. (D) Percentage of different cell types among different samples between Control and MAFLD groups. 3.12. Exploring cell clusters associated with MAFLD We analyzed the proportional differences in cell counts between control and disease groups across distinct cell clusters, with LSEC showing a significantly higher proportion in the MAFLD group ([223] Figure 8C ), along with proportional variations of cell clusters among different samples ([224] Figure 8D ). We performed density analysis of inter-group cell counts at a two-dimensional level, where relative brightness indicates higher density of the corresponding cell cluster in its respective group ([225] Figures 9A, B ). Subsequently, we evaluated the distributional preference between MAFLD and Control groups using the Ro/e index ([226] Figure 9C ), as well as distributional preferences across individual samples ([227] Figure 9D