Graphical abstract
graphic file with name ga1.jpg
[35]Open in a new tab
Keywords: Lung adenocarcinoma, Gene expression, H3K79me2, Driver genes
Highlights
* •
The efficacy of H3K79me2 on gene expression regulation is affirmed
in LUAD.
* •
An open-source algorithm for identifying LUAD-related driver genes
is presented.
* •
12 H3K79me2-targeted driver genes with clinical values are verified
by qPCR.
* •
The regions with obvious H3K79me2 signals changes on driver genes
are pinpointed.
Abstract
Lung adenocarcinoma is a malignancy with a low overall survival and a
poor prognosis. Studies have shown that lung adenocarcinoma progression
relates to locus-specific/global changes in histone modifications. To
explore the relationship between histone modification and gene
expression changes, we focused on 11 histone modifications and
quantitatively analyzed their influences on gene expression. We found
that, among the studied histone modifications, H3K79me2 displayed the
greatest impact on gene expression regulation. Based on the Shannon
entropy, 867 genes with differential H3K79me2 levels during
tumorigenesis were identified. Enrichment analyses showed that these
genes were involved in 16 common cancer pathways and 11 tumors and were
target-regulated by trans-regulatory elements, such as Tp53 and WT1.
Then, an open-source computational framework was presented
([36]https://github.com/zlq-imu/Identification-of-potential-LUND-driver
-genes). Twelve potential driver genes were extracted from the genes
with differential H3K79me2 levels during tumorigenesis. The expression
levels of these potential driver genes were significantly
increased/decreased in tumor cells, as assayed by RT–qPCR. A risk score
model comprising these driver genes was further constructed, and this
model was strongly negatively associated with the overall survival of
patients in different datasets. The proportional hazards assumption and
outlier test indicated that this model could robustly distinguish
patients with different survival rates. Immune analyses and responses
to immunotherapeutic and chemotherapeutic agents showed that patients
in the high and low-risk groups may have distinct tendencies for
clinical selection. Finally, the regions with clear H3K79me2 signal
changes on these driver genes were accurately identified. Our research
may offer potential molecular biomarkers for lung adenocarcinoma
treatment.
1. Introduction
Lung adenocarcinoma (LUAD) is a cancer that occurs due to abnormal and
uncontrolled cell growth in the lung, usually in the peripheral lung
[37][1]. As the most common histological type of non-small cell lung
cancer, the incidence and mortality of LUAD are increasing in China,
accounting for 40% of all lung cancers [38][2]. In recent decades,
although many strategies have been developed, such as chemotherapy,
targeted agents and immunotherapy, the 5-year survival rate of LUAD
remains below 20% [39][3]. Therefore, further understanding the
molecular mechanisms of LUAD tumorigenesis and identifying the
oncogenic drivers of LUAD have attracted wide attention.
Epigenetic disorders are considered markers of cancer development and
progression. Abnormalities in epigenetic modifications can be observed
in gene promoters, gene coding regions and other functional elements
[40][4], [41][5]. Locus-specific changes in histone modifications (HMs)
can adversely affect the expression of nearby genes. Global changes in
specific HMs can define previously unrecognized subgroups of cancer
patients [42][6]. Recent studies have reported that elevated H3K27me3
levels in the promoter region of FTO inhibit FTO expression.
Down-regulated FTO expression significantly increases the m^6A
modifications on MYC, resulting in the binding of YTHDF1 to promote MYC
mRNA translation and LUAD tumor cell glycolysis and growth [43][7]. The
loss of H3K4me3 is associated with defects in activation-induced
cytidine deaminase-induced DNA breakage and reduces mutation
frequencies in LUAD-related genes, such as MALAT1 and SNHG3 [44][8].
Genome-wide loss of histone acetylation can induce tumor suppressor
gene silencing and abnormal transcription, and this mechanism has
emerged as a promising therapeutic target for numerous cancers [45][9],
[46][10]. H3K36me3 inhibits tumor growth by suppressing CXCL1-mediated
cell cycle activation in LUAD [47][11]. In addition, high LUAD
incidence rates are related to genetic alterations and outdoor
pollution [48][12], [49][13], and HMs can establish a link between
genetic background and environmental exposure [50][14]. Thus, abnormal
HM patterns can serve as clinical tools or predictive biomarkers to
assist clinical choices of treatment strategies. As an essential HM,
H3K79me2 is associated with transcriptional regulation, maintenance of
enhancer-promoter interactions, DNA replication initiation and DNA
damage responses [51][15], [52][16]. It also plays important regulatory
roles in MLL-rearranged leukemia [53][17], breast cancer [54][18] and
colorectal cancer [55][19]. Not surprisingly, the broad roles of
H3K79me2 make it increasingly important, and thus, it may become a
critical therapeutic target for various cancers.
Based on these findings, we hoped to identify the HM with the greatest
effect on gene expression regulation in LUAD. Therefore, we first
calculated the distribution patterns of 11 HMs shared by LUAD tumor and
normal cells and quantitatively assessed the influences of HM signal
changes on LUAD-related gene expression. On this basis, the genes with
differential H3K79me2 levels during tumorigenesis were screened. An
open-source computational framework was presented to extract potential
LUAD driver genes (PLDGs) from the genes with differential H3K79me2
levels during tumorigenesis. Then, the relative expression levels of
PLDGs in LUAD tumor and normal cells were detected by RT–qPCR. To
accelerate clinical applications, the expression values of PLDGs in the
TCGA cohort were transformed into a risk score model. The reliability
of the model was assessed via the proportional hazards (PH) assumption
and outlier test and validated in a GEO cohort. Moreover, the
infiltration levels of immune cells and responses to therapeutic agents
for LUAD patients in the high- and low-risk groups were evaluated.
Finally, by refocusing on the signal distributions of H3K79me2 on the
12 PLDGs, we accurately located the regions where H3K79me2 signals were
significantly altered.
2. Materials and methods
2.1. Data
Human reference genome annotation data (GRCH38) are available in the
UCSC Genome Browser ([56]https://genome.ucsc.edu/). Protein-coding
genes were extracted. For genes with multiple transcripts, only one
transcript was randomly retained.
The genome-wide profiles of 11 HMs and polyA+ RNA-seq data for A549
cells (LUAD cells, tumor) and IMR90 cells (lung fibroblasts cells,
normal) were indexed from the ENCODE database
([57]https://www.encodeproject.org/). Their corresponding accession
numbers are provided in [58]Supplementary file [59]Table S1. Gene
expression levels were normalized using transcripts per million
[60][20].
Raw clinical profiles and RNA-seq data for patients with LUAD were
retrieved from the TCGA dataset ([61]https://portal.gdc.cancer.gov/).
475 cancer samples and 27 normal samples with survival and clinical
information were included. Gene expression levels were standardized by
transcripts per million. The neoantigen data of LUAD patients in the
TCGA dataset were separated from The Cancer Immunome Atlas
([62]https://tcia.at/). In addition, an independent GEO dataset
([63]GSE30219) containing clinical and microarray data of 307 LUAD
patients was indexed at [64]https://www.ncbi.nlm.nih.gov/geo/. The
microarray data were background-corrected and quantile-normalized using
the robust multiarray average algorithm.
2.2. Formulation of histone modification signals
To analyze the distributions of HMs and evaluate their roles, we
extracted the +/- 5 kb region around the transcription start site (TSS)
and divided the region into 100 bins with a size of 100 bp. The signals
of HMs within normal and tumor samples were normalized by Eq. [65](1),
as described in our previous studies [66][21].
[MATH: Hlmij=(hlmij×109)/(hlm×100) :MATH]
(1)
where
[MATH: Hlmij :MATH]
denotes the l-th HM signal in the i-bin of the j-th gene within the
m-th sample,
[MATH: hlmij :MATH]
describes the read counts of the l-th HM mapped into the i-bin of the
j-th gene in the m-th sample, and
[MATH: hlm :MATH]
is the total read counts of the l-th HM in the m-th sample. 100 bp is
the length of the j-th bin.
Meanwhile, we quantified the l-th HM signal level in the -5∼+5 kb DNA
region flanking the TSS by Eq. [67](2) to explore the impacts of HM
combinations on LUAD-related gene expression.
[MATH: Hlmj=(hlmj×
109)/(hlm×10000) :MATH]
(2)
where
[MATH: Hlmj :MATH]
represents the signal of the l-th HM in the j-th gene of the m-th
sample, and
[MATH: hlmj :MATH]
denotes the l-th HM read counts mapped into the -5∼+5 kb DNA region in
the j-th gene within the m-th sample. Then, 10,000 bp describes the
length of the DNA region.
2.3. Selection and prediction of differentially expressed genes (DEGs)
The DEGs between LUAD tumor and normal cells were identified by the
‘DESeq2’ R package. As a result, 3404 up-regulated DEGs (up-DEGs) and
3477 down-regulated DEGs (down-DEGs) were obtained with an adjusted
p < 0.01 and |log2(FC)|>1.
To assess the effects of HMs on LUAD-related gene expression changes,
the DEGs were randomly split into two sets: two-thirds were used as the
training set, and the rest were selected as the validation set. The
random forest (RF) algorithm was applied in the training dataset and
testing dataset to discriminate up-DEGs from down-DEGs. The signal
changes of each HM in the 100 bins, the signal changes of 11 HMs in the
DNA regions flanking the TSS (-5 to 5 kb), or the signal changes of 11
HMs in the same bin during tumorigenesis were selected as the
information parameters. In this process, the number of decision trees
was set to 800, and the number of information parameters for the best
split at each node was sqrt(p), where p was the number of information
parameters. 10-fold cross validation method was adopted to test the
prediction quality. The averaged receiver operating characteristic
curve (AUC) was used to measure the impacts of HM signals on
LUAD-related gene expression changes. To avoid any potential bias, the
general linear model (GLM) and support vector machine (SVM) algorithms
with default settings as described in ref. [68][22] were adopted to
validate the analytical results.
2.4. Identification of genes with differential H3K79me2 levels during
tumorigenesis
To recognize genes with differential H3K79me2 levels during
tumorigenesis, the following strategies were adopted (as shown in
[69]Supplementary file [70]Fig. S1A):
Fig. 1.
[71]Fig. 1
[72]Open in a new tab
Correlation analysis of HM signals and gene expression. (A, C) Up-DEGs,
(B, D) Down-DEGs. (A, B) HM signal changes (purple bars) and
significant differences (pink bars) during LUAD tumorigenesis. The red
lines in the 1st and 2nd circles correspond to the change ratio of HM
signal = 1 and –log10(P value) = 2. (C, D) Spearman correlation
coefficients between gene expression changes and HM signal changes in
each of the 100 bins. (For interpretation of the references to colour