Abstract 5-methylcytosine modifications play a significant role in carcinogenesis; however, studies exploring 5-methylcytosine-related genes in diffuse large B-cell lymphoma patients are lacking. In this study, we aimed to understand the potential role and clinical prognostic impact of 5-methylcytosine regulators in diffuse large B-cell lymphoma and identify a prognostic biomarker based on 5-methylcytosine-associated genes. Gene expression profiles and corresponding clinical information of diffuse large B-cell lymphoma patients and normal controls were obtained from The Cancer Genome Atlas, Gene Expression Omnibus, and Genotype-Tissue Expression databases. Diffuse large B-cell lymphoma was divided into three clusters according to the 5-methylcytosine regulators, and differentially expressed genes were screened among the three clusters. Univariate Cox and Lasso-Cox regression analyses were used to screen prognostic genes and construct a prognostic risk model. Kaplan-Meier curve analysis, univariate and multivariate Cox regression analyses, and time-dependent receiver operator characteristic curve analysis were used to evaluate prognostic factors. GSVA was used to enrich potential pathways associated with 5-methylcytosine modification patterns. SsGSEA and CIBERSORT were used to assess immune cell infiltration. Six 5-methylcytosine-related genes (TUBB4A, CD3E, ZNF681, HAP1, IL22RA2, and POSTN) were used to construct a prognostic risk model, which was proved to have a good predictive effect. In addition, univariate and multivariate Cox regression risk scores were independent prognostic factors for diffuse large B-cell lymphoma. Further analysis showed that the 5-methylcytosine risk score was significantly correlated with immune cell infiltration and immune checkpoint of diffuse large B-cell lymphoma. Our study reveals for the first time a potential role for 5-methylcytosine modifications in diffuse large B-cell lymphoma, provides novel insights for future studies on diffuse large B-cell lymphoma, and offers potential prognostic biomarkers and therapeutic targets for patients with diffuse large B-cell lymphoma. Keywords: Diffuse large B-Cell lymphoma, 5-Methylcytosine, Risk model, Prognosis, Biomarker Highlights * • Finding prognostic markers is critical for the detection and treatment of DLBCL patients. * • M5C modification plays an important role in the occurrence and development of DLBCL. * • Construction of a DLBCL prognostic model of m5C-related genes for the first time. * • The strong relationship between m5C risk and the immune microenvironment reveals its potential link to immunotherapy. 1. Introduction Diffuse large B-cell lymphoma (DLBCL), the most common type of non-Hodgkin's lymphoma, is a heterogeneous disease with different clinical manifestations, genetic characteristics, treatment responses, and prognoses [[51]1]. Rituximab has been a breakthrough in the treatment of DLBCL, with over half of patients with DLBCL receiving the R–CHOP regimen, which includes cyclophosphamide, doxorubicin, vincristine, and prednisone plus rituximab; however, 30–40 % of patients still relapse or even fail to respond to R–CHOP therapy [[52]2]. Over 50 % of chemotherapy-sensitive patients with relapsed/refractory DLBCL who are administered high-dose chemotherapy followed by autologous stem cell transplantation eventually relapse [[53]3]. The advent of anti-CD19 chimeric antigen receptor (CAR) T-cell therapy has solved the treatment dilemma for some relapsed/refractory DLBCL patients. Although CAR T-cell therapies have achieved response rates in relapsed/refractory DLBCL patients, they are associated with significant toxicities in the form of cytokine-release syndrome and immune effector cell-associated neurotoxicity syndrome [[54][4], [55][5], [56][6]]. Due to these toxicities, the application of CAR T-cell therapies has been limited among elderly and unfit patients. Therefore, novel and more effective treatment strategies for patients with DLBCL are urgently required. The advent of high-throughput sequencing technology has helped researchers gain a comprehensive understanding of the gene expression profile of DLBCL and identify many diagnostic, prognostic, and therapeutic biomarkers for DLBCL [[57][7], [58][8], [59][9]]. RNA modifications, including N6-methyladenosine (m6A), 5-methylcytosine (m5C), N1-methyladenosine (m1A), and pseudouridine (Ψ), play important roles in carcinogenesis [[60]10,[61]11]. Recently, the less studied m5C modification has received increasing attention, and growing evidence suggests that m5C regulates RNA stabilization, splicing, nuclear export, transcription, and translation, thereby mediating biological functions, such as cell proliferation, differentiation, apoptosis, and senescence [[62]12,[63]13]. Similar to m6A modifications, m5C RNA methylation is dynamically regulated by the corresponding m5C regulators and can be functionally classified into three isoforms -"writers”, “erasers”, and " readers”. The RNA C5-cytosine methyltransferase NSUN2 is upregulated in several types of cancer including pancreatic cancer, nasopharyngeal carcinoma, uveal melanoma, and gastric cancer [[64][14], [65][15], [66][16]]. In gastric cancer, NSUN2 has been found to promote cancer cell proliferation, migration, and invasion [[67]17]. Similarly, the enzymes NSUN4, NSUN5, NSUN6, and NSUN7 have been associated with the development of colorectal cancer, hepatocellular carcinoma, pancreatic cancer, and glioma [[68][18], [69][19], [70][20], [71][21]]. Abnormal expression and mutations of the TET family and ALKBH1 ″erasers” are also associated with several malignancies [[72][22], [73][23], [74][24], [75][25]]. In renal cell carcinoma, high expression of ALKBH1 is correlated with malignant features of the tumor [[76]26]. Furthermore, the m5C binding protein YBX1 has been found to maintain the stability of m5C-containing oncogenes and promote bladder cancer progression [[77]27]. YBX1 has also been found to promote tumor progression in breast, pancreatic, and non-small cell lung cancers [[78][28], [79][29], [80][30]], and mediate resistance to the first-line chemotherapy drug sorafenib in hepatocellular carcinoma [[81]31]. As far as we are aware, no studies have specifically investigated the potential of RNA m5C modification as a combined prognostic factor in DLBCL. The purpose of this research was to explore the clinical value of m5C regulators in DLBCL and to establish a prognostic biomarker for this condition based on m5C regulators, which has not been done before. Additionally, we examined the relationship between the risk of m5C modification and the distribution of immune cells, and found evidence indicating a potential correlation with immunotherapy. Our findings offer new insights for further research on m5C modifications and personalized treatment of DLBCL. 2. Materials and methods 2.1. Data collection RNA sequencing data and clinical information (FPKM values) of patients with DLBCL and normal subjects were obtained from the Genotype-Tissue Expression (GTEx), The Cancer Genome Atlas (TCGA), and Gene Expression Omnibus (GEO) databases. We gathered 48 DLBCL sample datasets from TCGA database and 444 normal control datasets from the GTEx database, removed batch effects, merged the data, and normalized the data using log[2](FPKM+1) values. The [82]GSE10846 and [83]GSE181063 datasets were extracted from the GEO database, and log[2] transformation was performed to normalize mRNA expression to eliminate batch effects for subsequent analyses. Finally, the training cohort (n = 228) and internal validation cohort (n = 152) were randomly (5:2) stratified from the 380 DLBCL samples in [84]GSE10846. The external validation cohort consisted of 1310 DLBCL samples from the [85]GSE181063 dataset to validate the prognostic value of m5C regulator signatures. The RCircos package in R was used to analyze and visualize the copy number variants (CNVs) derived from the TCGA database [[86]32]. 2.2. Collection and gene expression analysis of m5C regulators Based on the relevant literature [[87]10,[88][33], [89][34], [90][35], [91][36], [92][37]], we selected 22 m5C regulators for further analysis, excluding those that are not expressed in DLBCL, including eight “writers” (DNMT3B, NSUN2, NSUN3, NSUN4, NSUN5, NSUN6, NSUN7, and NOP2), four “erasers” (TET1, TET2, TET3, and TDG), and ten “readers” (MBD1, MBD2, MBD3, MBD4, MECP2, NEIL1, NTHL1, SMUG1, UHRF1, and UHRF2). We removed m5C regulators with very low expression (mean expression <0.5) in DLBCL samples and used the “limma” R package to compare the expression of the m5C regulators between different groups. Spearman correlation analysis of DLBCL samples from TCGA database was performed for differentially expressed m5C regulators using “corrplot” R package, and statistical significance was defined as p value < 0.05. 2.3. Consensus clustering analysis based on m5C regulators According to the different expression patterns of 22 m5C regulators, DLBCL samples in the [93]GSE10846 dataset were clustered into several subgroups by the “ConsensusClusterPlus” R package [[94]38]. The “pca3d” and “rgl” R packages were used for principal component analysis (PCA) to evaluate sample clustering [[95]39]. The association of different m5C subgroups with clinical information, such as age, gender, stage, molecular subtype, Eastern Cooperative Oncology Group (ECOG) performance status, and lactate dehydrogenase (LDH) ratio, was assessed using chi-square test. In addition, the Kaplan-Meier (K-M) overall survival curves for different subgroups were plotted by the R package “survival”. The “GSEABase”, “GSVA”, and “pheatmap” R packages were used for gene differential analysis to evaluate the enriched m5C-related pathways [[96]40]. The gene set “c2. cp.kegg.v7.4. symbs” was retrieved from the MSigDB database as the background, and statistical significance was defined as |log Fold change (FC)| >1 and FDR value < 0.05. 2.4. Construction and validation of a m5C prognostic risk score model in DLBCL Differentially expressed genes (DEGs) between different clusters were screened with the empirical Bayesian approach using the “limma” R package (|log FC| > 1, FDR <0.05) [[97]41]. Moreover, we performed univariate Cox regression analysis to screen out candidate genes with independent prognostic value. Further, relevant genes were filtered again to construct an m5C risk prediction model with minimal risk of overfitting using the “glmnet” and “survival” R packages, together with Lasso-Cox regression analysis [[98]42]. Finally, a risk score based on m5C modification features was constructed using the formula: Risk score = [MATH: i=1nCoeffici< mi>enti*< msub>Expres sioni :MATH] . The formula was applied to calculate risk scores for the training cohort, internal validation cohort, and external validation cohort and to divide DLBCL patients into low-risk and high-risk groups depending on their median m5C risk score. The survival difference between the two subgroups was compared using the “survivor” and “survminer” R packages. To confirm the predictive reliability of the model, receiver operating characteristic (ROC) curve analysis at 1-, 3-, and 5-years was performed using the “time-ROC ″R package [[99]43]. 2.5. Establishment of the predictive nomogram Data on clinical and pathological characteristics of patients, with respect to age, subtype, ECOG status, staging, LDH ratio, and risk score, were obtained from the GEO database. Univariate and multivariate Cox regression analyses were conducted to determine independent prognostic factors in patients with DLBCL. Using the 'rms' R package, a nomogram was constructed based on risk scores and clinical variables significantly correlated with DLBCL prognosis, and the predictive prognostic ability for the nomogram was assessed using a calibration plot and ROC curve. 2.6. Immune analyses Using the R packages “GSEABase” and “GSVA”, we performed single-sample gene set enrichment analysis (ssGSEA) [[100]40] to evaluate the enrichment scores for 16 different classes of immune cells and 13 different immune-related pathway activities in the three m5C clusters of DLBCL. Further, the Cell Type Identification by Estimation of Relative Subtypes of RNA Transcripts (CIBERSORT) [[101]44] algorithm was used to assess the relative proportions of 22 types of infiltrating immune cells to examine differences in immune cell subtypes among the m5C high-risk and low-risk groups. Subsequently, we calculated the correlation between risk scores and the enrichment fraction of immune cells, as well as the activity of immune-related pathways. Differences in the expression of potential immune checkpoint genes between the high-risk and low-risk groups were compared using the software package “ggpubr” R. 2.7. GSEA and functional enrichment analysis Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) pathway analysis was applied to determine the potential molecular mechanisms associated with m5C risk [[102]45]. Subsequently, GSEA was performed to screen for the most prominent enrichment pathways in the high-risk and low-risk groups (|log FC| > 1 and FDR< 0.05) [[103]46]. 2.8. RNA extraction and RT-PCR Total RNA from DLBCL tissues and normal tissues was extracted by TRIzol reagent (TaKaRa, Shiga, Japan). Immediately afterwards, the Evo M-MLV RT Mixing Kit (Accurate Biotechnology (Hunan) Co., Ltd., China) was employed for cDNA synthesis, followed by the qRT-PCR analysis using the SYBR® Green Premix Pro Taq HS qPCR Kit (Accurate Biotechnology (Hunan) Co, Ltd., China). The mRNA levels between the experimental and control groups were normalized using β-actin. The sequences of the primers used are presented in [104]Table S1. 2.9. Statistical analysis All statistical analyses performed in our study were carried out using R software. Differences with p < 0.05 (*), 0.01 (**), and 0.001 (***) were considered to be statistically significant unless otherwise stated. 3. Results 3.1. Landscape and multi-omics analysis of m5C regulators in DLBCL Apart from DNMT3A, NSUN1, DNMT1, DNMT2, YBX1, ALYREF, and UNG, which are genes with very low expression levels in DLBCL, a comparison of mRNA expression levels for 22 m5C regulators between DLBCL samples and normal tissues was performed using datasets obtained from TCGA and GTEx databases. Compared to normal controls, 19 of the 22 m5C regulators were differentially expressed in DLBCL samples. Among these genes, 13 m5C regulators (DNMT3B, NSUN2, NSUN3, NSUN4, NSUN5, NSUN6, TET3, TDG, MBD2, MBD4, NTHL1, UHRF1, and NOP2) were upregulated and six (NSUN7, TET2, MBD1, MECP2, NEIL1, and UHRF2) were downregulated ([105]Fig. 1A). The 22 m5C regulators were significantly correlated with each other ([106]Fig. 1B). Further CNV analysis revealed that alterations in CNV of genes were very frequent; among the m5C regulators, MBD2, SMUG1, MECP2, NSUN5, UHRF2, NTHL1, MBD1, NSUN3, NSUN2, NSUN6, and DNMT3B showed increased copy numbers, whereas the copy numbers of TET3, TDG, NEIL1, NSUN4, MBD4, TET2, TET1, NOP2, and UHRF1 were reduced ([107]Fig. 1C). [108]Fig. 1D summarizes the categories, correlations, and prognosis of m5C regulators in TCGA-DLBCL cohort. Fig. 1. [109]Fig. 1 [110]Open in a new tab Expression of 22 m5C regulators in DLBCL tissues and normal tissues. (A) Gene expression of m5C regulators in DLBCL tissues and normal tissues. ***p < 0.001; **p < 0.01; *p < 0.05. (B) Correlation between the expression levels of m5C regulators in DLBCL. Darker shades of blue indicate a stronger negative correlation, and darker shades of red indicate a stronger positive correlation. * indicates p<0.05. (C) CNV frequency of m5C regulators in TCGA-DLBCL cohort. Red dots: increased copy number; green dots: decreased copy number. (D) Gene network of m5C regulators. M5C, 5-methylcytosine; DLBCL, diffuse large B-cell lymphoma; CNV, copy number variation. (For interpretation of the references to color in this figure legend, the reader is referred to