Graphical abstract graphic file with name ga1.jpg [35]Open in a new tab Keywords: Lung adenocarcinoma, Gene expression, H3K79me2, Driver genes Highlights * • The efficacy of H3K79me2 on gene expression regulation is affirmed in LUAD. * • An open-source algorithm for identifying LUAD-related driver genes is presented. * • 12 H3K79me2-targeted driver genes with clinical values are verified by qPCR. * • The regions with obvious H3K79me2 signals changes on driver genes are pinpointed. Abstract Lung adenocarcinoma is a malignancy with a low overall survival and a poor prognosis. Studies have shown that lung adenocarcinoma progression relates to locus-specific/global changes in histone modifications. To explore the relationship between histone modification and gene expression changes, we focused on 11 histone modifications and quantitatively analyzed their influences on gene expression. We found that, among the studied histone modifications, H3K79me2 displayed the greatest impact on gene expression regulation. Based on the Shannon entropy, 867 genes with differential H3K79me2 levels during tumorigenesis were identified. Enrichment analyses showed that these genes were involved in 16 common cancer pathways and 11 tumors and were target-regulated by trans-regulatory elements, such as Tp53 and WT1. Then, an open-source computational framework was presented ([36]https://github.com/zlq-imu/Identification-of-potential-LUND-driver -genes). Twelve potential driver genes were extracted from the genes with differential H3K79me2 levels during tumorigenesis. The expression levels of these potential driver genes were significantly increased/decreased in tumor cells, as assayed by RT–qPCR. A risk score model comprising these driver genes was further constructed, and this model was strongly negatively associated with the overall survival of patients in different datasets. The proportional hazards assumption and outlier test indicated that this model could robustly distinguish patients with different survival rates. Immune analyses and responses to immunotherapeutic and chemotherapeutic agents showed that patients in the high and low-risk groups may have distinct tendencies for clinical selection. Finally, the regions with clear H3K79me2 signal changes on these driver genes were accurately identified. Our research may offer potential molecular biomarkers for lung adenocarcinoma treatment. 1. Introduction Lung adenocarcinoma (LUAD) is a cancer that occurs due to abnormal and uncontrolled cell growth in the lung, usually in the peripheral lung [37][1]. As the most common histological type of non-small cell lung cancer, the incidence and mortality of LUAD are increasing in China, accounting for 40% of all lung cancers [38][2]. In recent decades, although many strategies have been developed, such as chemotherapy, targeted agents and immunotherapy, the 5-year survival rate of LUAD remains below 20% [39][3]. Therefore, further understanding the molecular mechanisms of LUAD tumorigenesis and identifying the oncogenic drivers of LUAD have attracted wide attention. Epigenetic disorders are considered markers of cancer development and progression. Abnormalities in epigenetic modifications can be observed in gene promoters, gene coding regions and other functional elements [40][4], [41][5]. Locus-specific changes in histone modifications (HMs) can adversely affect the expression of nearby genes. Global changes in specific HMs can define previously unrecognized subgroups of cancer patients [42][6]. Recent studies have reported that elevated H3K27me3 levels in the promoter region of FTO inhibit FTO expression. Down-regulated FTO expression significantly increases the m^6A modifications on MYC, resulting in the binding of YTHDF1 to promote MYC mRNA translation and LUAD tumor cell glycolysis and growth [43][7]. The loss of H3K4me3 is associated with defects in activation-induced cytidine deaminase-induced DNA breakage and reduces mutation frequencies in LUAD-related genes, such as MALAT1 and SNHG3 [44][8]. Genome-wide loss of histone acetylation can induce tumor suppressor gene silencing and abnormal transcription, and this mechanism has emerged as a promising therapeutic target for numerous cancers [45][9], [46][10]. H3K36me3 inhibits tumor growth by suppressing CXCL1-mediated cell cycle activation in LUAD [47][11]. In addition, high LUAD incidence rates are related to genetic alterations and outdoor pollution [48][12], [49][13], and HMs can establish a link between genetic background and environmental exposure [50][14]. Thus, abnormal HM patterns can serve as clinical tools or predictive biomarkers to assist clinical choices of treatment strategies. As an essential HM, H3K79me2 is associated with transcriptional regulation, maintenance of enhancer-promoter interactions, DNA replication initiation and DNA damage responses [51][15], [52][16]. It also plays important regulatory roles in MLL-rearranged leukemia [53][17], breast cancer [54][18] and colorectal cancer [55][19]. Not surprisingly, the broad roles of H3K79me2 make it increasingly important, and thus, it may become a critical therapeutic target for various cancers. Based on these findings, we hoped to identify the HM with the greatest effect on gene expression regulation in LUAD. Therefore, we first calculated the distribution patterns of 11 HMs shared by LUAD tumor and normal cells and quantitatively assessed the influences of HM signal changes on LUAD-related gene expression. On this basis, the genes with differential H3K79me2 levels during tumorigenesis were screened. An open-source computational framework was presented to extract potential LUAD driver genes (PLDGs) from the genes with differential H3K79me2 levels during tumorigenesis. Then, the relative expression levels of PLDGs in LUAD tumor and normal cells were detected by RT–qPCR. To accelerate clinical applications, the expression values of PLDGs in the TCGA cohort were transformed into a risk score model. The reliability of the model was assessed via the proportional hazards (PH) assumption and outlier test and validated in a GEO cohort. Moreover, the infiltration levels of immune cells and responses to therapeutic agents for LUAD patients in the high- and low-risk groups were evaluated. Finally, by refocusing on the signal distributions of H3K79me2 on the 12 PLDGs, we accurately located the regions where H3K79me2 signals were significantly altered. 2. Materials and methods 2.1. Data Human reference genome annotation data (GRCH38) are available in the UCSC Genome Browser ([56]https://genome.ucsc.edu/). Protein-coding genes were extracted. For genes with multiple transcripts, only one transcript was randomly retained. The genome-wide profiles of 11 HMs and polyA+ RNA-seq data for A549 cells (LUAD cells, tumor) and IMR90 cells (lung fibroblasts cells, normal) were indexed from the ENCODE database ([57]https://www.encodeproject.org/). Their corresponding accession numbers are provided in [58]Supplementary file [59]Table S1. Gene expression levels were normalized using transcripts per million [60][20]. Raw clinical profiles and RNA-seq data for patients with LUAD were retrieved from the TCGA dataset ([61]https://portal.gdc.cancer.gov/). 475 cancer samples and 27 normal samples with survival and clinical information were included. Gene expression levels were standardized by transcripts per million. The neoantigen data of LUAD patients in the TCGA dataset were separated from The Cancer Immunome Atlas ([62]https://tcia.at/). In addition, an independent GEO dataset ([63]GSE30219) containing clinical and microarray data of 307 LUAD patients was indexed at [64]https://www.ncbi.nlm.nih.gov/geo/. The microarray data were background-corrected and quantile-normalized using the robust multiarray average algorithm. 2.2. Formulation of histone modification signals To analyze the distributions of HMs and evaluate their roles, we extracted the +/- 5 kb region around the transcription start site (TSS) and divided the region into 100 bins with a size of 100 bp. The signals of HMs within normal and tumor samples were normalized by Eq. [65](1), as described in our previous studies [66][21]. [MATH: Hlmij=(hlmij×109)/(hlm×100) :MATH] (1) where [MATH: Hlmij :MATH] denotes the l-th HM signal in the i-bin of the j-th gene within the m-th sample, [MATH: hlmij :MATH] describes the read counts of the l-th HM mapped into the i-bin of the j-th gene in the m-th sample, and [MATH: hlm :MATH] is the total read counts of the l-th HM in the m-th sample. 100 bp is the length of the j-th bin. Meanwhile, we quantified the l-th HM signal level in the -5∼+5 kb DNA region flanking the TSS by Eq. [67](2) to explore the impacts of HM combinations on LUAD-related gene expression. [MATH: Hlmj=(hlmj× 109)/(hlm×10000) :MATH] (2) where [MATH: Hlmj :MATH] represents the signal of the l-th HM in the j-th gene of the m-th sample, and [MATH: hlmj :MATH] denotes the l-th HM read counts mapped into the -5∼+5 kb DNA region in the j-th gene within the m-th sample. Then, 10,000 bp describes the length of the DNA region. 2.3. Selection and prediction of differentially expressed genes (DEGs) The DEGs between LUAD tumor and normal cells were identified by the ‘DESeq2’ R package. As a result, 3404 up-regulated DEGs (up-DEGs) and 3477 down-regulated DEGs (down-DEGs) were obtained with an adjusted p < 0.01 and |log2(FC)|>1. To assess the effects of HMs on LUAD-related gene expression changes, the DEGs were randomly split into two sets: two-thirds were used as the training set, and the rest were selected as the validation set. The random forest (RF) algorithm was applied in the training dataset and testing dataset to discriminate up-DEGs from down-DEGs. The signal changes of each HM in the 100 bins, the signal changes of 11 HMs in the DNA regions flanking the TSS (-5 to 5 kb), or the signal changes of 11 HMs in the same bin during tumorigenesis were selected as the information parameters. In this process, the number of decision trees was set to 800, and the number of information parameters for the best split at each node was sqrt(p), where p was the number of information parameters. 10-fold cross validation method was adopted to test the prediction quality. The averaged receiver operating characteristic curve (AUC) was used to measure the impacts of HM signals on LUAD-related gene expression changes. To avoid any potential bias, the general linear model (GLM) and support vector machine (SVM) algorithms with default settings as described in ref. [68][22] were adopted to validate the analytical results. 2.4. Identification of genes with differential H3K79me2 levels during tumorigenesis To recognize genes with differential H3K79me2 levels during tumorigenesis, the following strategies were adopted (as shown in [69]Supplementary file [70]Fig. S1A): Fig. 1. [71]Fig. 1 [72]Open in a new tab Correlation analysis of HM signals and gene expression. (A, C) Up-DEGs, (B, D) Down-DEGs. (A, B) HM signal changes (purple bars) and significant differences (pink bars) during LUAD tumorigenesis. The red lines in the 1st and 2nd circles correspond to the change ratio of HM signal = 1 and –log10(P value) = 2. (C, D) Spearman correlation coefficients between gene expression changes and HM signal changes in each of the 100 bins. (For interpretation of the references to colour