Abstract
Objective
This study aimed to identify prognostic genes for diffuse large B-cell
lymphoma (DLBCL), using bioinformatic methods.
Methods
Five gene expression data sets were downloaded from the Gene Expression
Omnibus database. Significance analysis of microarrays algorithm was
used to identify differentially expressed genes (DEGs) from two data
sets. Functional enrichment analysis was performed for the DEGs with
the Database for Annotation, Visualization and Integration Discovery
(DAVID). Survival analysis was performed with the Kaplan–Meier method
using function survfit from package survival of R for the other three
data sets. Cox univariate regression analysis was used to further
screen out prognostic genes.
Results
Thirty-one common DEGs were identified in the two data sets, mainly
enriched in the regulation of lymphocyte activation, immune response,
and interleukin-mediated signaling pathway. Combined with 47
DLBCL-related genes acquired by literature retrieval, a total of 78
potential prognostic genes were obtained. Cases from the other three
data sets were used in hierarchical clustering, and the 78 genes could
cluster them into several subtypes with significant differences in
survival curves. Cox univariate regression analysis revealed 45, 33,
and eleven prognostic genes in the three data sets, respectively. Five
common prognostic genes were revealed, including LCP2, TNFRSF9, FUT8,
IRF4, and TLE1, among which LCP2, FUT8, and TLE1 were novel prognostic
genes.
Conclusion
Five prognostic genes of DLBCL were identified in this study. They
could not only be used for molecular subtyping of DLBCL but also be
potential targets for treatment.
Keywords: diffuse large B-cell lymphoma, gene expression profile,
differentially expressed genes, survival analysis, subtype
Introduction
Diffuse large B-cell lymphoma (DLBCL) is one of the most common types
of non-Hodgkin lymphoma, which occurs primarily in older individuals.
It is an aggressive tumor. R-CHOP, an improved form of
cyclophosphamide, doxorubicin, vincristine, and prednisone (CHOP) with
the addition of rituximab, is a standard treatment for DLBCL.
Many subtypes of the lymphoid neoplasms are established based on the
World Health Organization classification system, and DLBCL is the most
common type in Asians.[30]^1 However, classification merely based on
morphology and clinical information is difficult and thus a
considerable percentage of cases are not classified. Gene expression
profiling studies have attempted to distinguish heterogeneous groups of
DLBCL from each other.[31]^2^–[32]^4 For instance, by gene expression
profile, two groupings of germinal center B-cell-like and the activated
B-cell-like were identified as two DLBCL subtypes in the current World
Health Organization classification.[33]^5 The study by Lenz et al[34]^6
provides genetic evidence that the DLBCL subtypes are distinct diseases
that use different oncogenic pathways. Obviously, DNA microarrays
provide a better understanding of the biology of DLBCL and advance the
development of novel diagnostic tools.[35]^7
Meanwhile, many genes with prognostic effect have been reported in
DLBCL, such as BCL2[36]^8 and BCL6.[37]^9 Hu et al[38]^10 suggested
that MYC/BCL2 coexpression, rather than cell-of-origin classification,
is a better predictor of prognosis in patients with DLBCL treated with
R-CHOP. Additionally, Gratzinger et al[39]^11 reported the prognostic
value of vascular endothelial growth factor and vascular endothelial
growth factor receptors in DLBCL patients treated with
anthracycline-based chemotherapy. Besides, Hussain et al[40]^12 found
that X-linked inhibitor of apoptosis expression is a poor prognostic
factor for DLBCL.
Due to the heterogeneity of DLBCL, more works are necessary to advance
molecular subtyping as well as to discover the prognostic genes. In
this study, two gene expression data sets were analyzed to identify
differentially expressed genes (DEGs), which were regarded as potential
prognostic genes for DLBCL, and to ascertain whether these genes would
be used to well distinguish the subtypes of DLBCL in other three
expression profile data sets.
Methods
Gene expression data
All the five gene expression data sets were downloaded from the Gene
Expression Omnibus.
1. The data set of [41]GSE32918[42]^13^,[43]^14 collected gene
expression profiles of 172 DLBCL samples. The platform of Illumina
[44]GPL8432 (Illumina HumanRef-8 WG-DASL v3.0) was used. It
included a total of 294 sequencing data since some samples were
sequenced repeatedly.
2. The data set of [45]GSE10846[46]^15^,[47]^16 included gene
expression profiles of 181 clinical samples from
chemotherapy-treated patients and 233 clinical samples from
rituximab–chemotherapy-treated patients. The platform was
Affymetrix [48]GPL570 (Affymetrix Human Genome U133 Plus 2.0
Array). A total of 416 gene expression data were included.
3. The data set of [49]GSE11318[50]^6 consisted of gene expression
profiles of 203 DLBCL samples, based on the platform of Affymetrix
[51]GPL570.
4. The data set of [52]GSE9327[53]^17 collected gene expression
profiles of 36 DLBCL samples and eight reactive lymph nodes
samples, which were used as controls. The platform of [54]GPL6011
(CNIO Human Oncochip 1.0, 1.2, and 2.0) was used.
5. The data set of [55]GSE30881[56]^18 contained gene expression
profiles of 23 DLBCL samples and ten healthy controls, in order to
investigate the changes in NF-κB pathway activation. The platform
was Affymetrix [57]GPL3738 (Affymetrix Canine Genome 2.0 Array).
Pretreatment of raw data
Probes were mapped to genes according to the annotation files. For a
gene corresponding to more than one probe, the average probe value was
calculated as the gene expression value for the specific gene.[58]^19
Subsequently, log2 conversion and quantile normalization[59]^20 were
applied on the data.
A total of 4,356 and 16,454 unique genes were identified in [60]GSE9327
and [61]GSE30881, respectively. Both [62]GSE10846 and [63]GSE11318 were
obtained using [64]GPL570, and a total of 20,693 unique genes were
acquired. Besides, 18,403 unique genes were identified in [65]GSE32918.
Clinical information
The expression profiles of [66]GSE10846 and [67]GSE11318 provided
clinical information such as age, sex, stage, lactate dehydrogenase
(LDH) level, extranodal versus nodal presentation, treatment, subtype,
survival time, and survival status. [68]GSE32918 described age, sex,
treatment, subtype, survival time, and survival status. According to
these three data sets, we found that “stage” could well separate
samples into different groups with diverse survival time while “age”,
“sex”, and “treatment” could not.
Screening of DEGs
Significance analysis of microarrays algorithm[69]^21 was adopted to
screen out DEGs. It can reduce the false-positive rate in multiple
testing via controlling false discovery rate. Relative difference
(statistic d) is calculated as follows:
[MATH:
d=X<
mn>1′|X2′S+<
mi>s0 :MATH]
(1)
Statistic d measures the relative differences in gene expression
levels, and it is the corrected t.
[MATH: X1′
:MATH]
represents the average expression level of a gene under certain state,
[MATH: X2′
:MATH]
represents the average expression level of a gene under another state,
and s represents the variance of a gene.
Adjusted P-value <0.05 and log |fold change| >1.5 were set as the
threshold to select the DEGs.
Functional enrichment analysis
Gene ontology enrichment analysis and Kyoto Encyclopedia of Genes and
Genomes pathway enrichment analysis were performed for the DEGs with
DAVID[70]^22 to examine the potential altered functions and pathways of
these DEGs. False discovery rate <0.05 was set as the cutoff.
Survival analysis
Kaplan–Meier method (K–M method; product-limit method) is suitable for
analysis with small sample size. The analysis procedure is as follows:
1) Put the samples in ascending order according to the survival time,
rank i=1, 2, …, n. 2) List the number of surviving at the beginning of
each time point (in fact, a short time). 3) Calculate the probability
of death at each time point q and survival probability p (p=1−q). 4)
Calculate the survival rate S(ti) for each time point, which equals to
the product of each survival probability from the starting point to ti.
S(ti)=p[1]×p[2]×p[3] … p[ti]. Finally, plot survival curves with
survival time in abscissa and survival rate in ordinate.
Survival analysis was performed with function survfit from package
survival of R.[71]^23 Difference in survival curves for two groups was
analyzed with log-rank method using function survdiff from package
survival.[72]^24
Screening of risk factors
Cox univariate regression analysis was carried out using function coxph
from package survival to screen out risk factors related to
survival.[73]^25 The formula is as follows:
[MATH: h(t,x)=h0(t)exp(βi×x
i) :MATH]
(2)
h[0](t) is the basic risk function, the risk function when all
covariates X[1], X[2], …, X[m] are 0 or under standard conditions, and
it is generally unknown. h (t, x) represents the risk function when
each covariate X is given a fixed value, and it is proportional to
h[0](t). Therefore, the model is also known as the proportional hazard
model. X[1], X[2], …, X[m] are covariates while β[1], β[2], …, β[m] are
regression coefficients. When the regression coefficient β[i]>0, that
is, the risk ratio >1, it indicates that the covariate is a risk
factor. The greater the covariate is, the shorter the survival time is.
When the regression coefficient β[i]<0, that is, the risk ratio <1, it
indicates that the covariate is a protective factor, so the greater the
covariate is, the longer the survival time is.
Results
Differentially expressed genes and enriched biological functions
According to the aforementioned criteria, a total of 437 DEGs were
identified in DLBCL from the data set [74]GSE9327 and 1,457 DEGs from
the data set [75]GSE30881. Thirty-one overlapping genes were selected
out and functional enrichment analysis was performed for these genes,
which are mainly involved in the regulation of lymphocyte activation,
immune response, and interleukin-mediated signaling pathway ([76]Figure
1), suggesting that the 31 DEGs were closely associated with the
development of DLBCL.
Figure 1.
[77]Figure 1
[78]Open in a new tab
Functional enrichment analysis result for the 31 differentially
expressed genes (DEGs) (top 20 gene ontology [GO] terms ranked by the
significance).
Notes: X-axis represents the adjusted P-value transformed by log2, and
Y-axis denotes the enriched GO terms.
Abbreviation: IL, interleukin.
Moreover, 47 DLBCL-related genes were acquired via literature
retrieval.[79]^2^,[80]^15^,[81]^26^–[82]^31
Survival analysis result
The 31 DEGs and 47 DLBCL-related genes were combined and a total of 78
potential prognostic genes were obtained, which were used to classify
samples with diverse survival time from other three data sets.
1. In the data set of [83]GSE10846, 71 out of the 78 genes were
detected. Using hierarchical clustering, the 71 genes could well
cluster the 416 DLBCL samples into four subtypes ([84]Figure 2A).
The differences in survival curves of the four subtypes were found
to be significant (P=7.65e−11; [85]Figure 2B).
2. In the data set of [86]GSE11318, 71 out of the 78 genes were
detected. Using hierarchical clustering, the 71 genes could well
classify the 203 DLBCL samples into three subtypes ([87]Figure 2C).
The difference in survival curves of the three subtypes was found
to be significant (P=7.5e−05; [88]Figure 2D).
3. In the data set of [89]GSE32918, 69 out of the 78 genes were
detected. Some samples were sequenced repeatedly, and thus average
expression levels were calculated as the final values. Using
hierarchical clustering, the 69 genes could cluster the 172 DLBCL
samples into three subtypes ([90]Figure 2E). The difference in
survival curves of the three subtypes was found to be significant
(P=0.013; [91]Figure 2F).
Figure 2.
[92]Figure 2
[93]Open in a new tab
Subtyping of diffuse large B-cell lymphoma (DLBCL) in three gene data
sets using the 78 predicted and curated DLBCL-related genes.
Notes: (A, C, and E) Hierarchical clustering that denotes the subtypes
of DLBCL clustered by the 78 genes in the gene data sets of
[94]GSE10846, [95]GSE11318, and [96]GSE32918, respectively; (B, D, and
F) Kaplan–Meier survival curves of the subtypes in the gene data sets
of [97]GSE10846, [98]GSE11318, and [99]GSE32918, respectively.
Prognostic genes
The correlation between each gene and the survival of DLBCL patients
was calculated with Cox univariate regression analysis to further
screen out genes with prognostic value. In the data set of
[100]GSE10846, 45 genes were found to have significant prognostic
effect, while in [101]GSE11318, 33 genes had prognostic effect, and in
[102]GSE32918, eleven genes showed prognostic value. Five prognostic
genes were common among the three data sets ([103]Figure 3; [104]Table
1). According to the coefficient, lymphocyte cytosolic protein 2 (LCP2)
and tumor necrosis factor receptor superfamily member 9 (TNFRSF9) might
be related to poor prognosis while fucosyltransferase 8 (FUT8),
interferon regulatory factor 4 (IRF4), and transducin-like enhancer of
split 1 (TLE1) might bring in favorable prognosis.
Figure 3.
Figure 3
[105]Open in a new tab
Venn diagram of the prognostic genes from three gene expression data
sets ([106]GSE10846, [107]GSE11318, and [108]GSE21918).
Table 1.
Five common prognostic genes
Gene names [109]GSE10846
__________________________________________________________________
[110]GSE11318
__________________________________________________________________
[111]GSE32918
__________________________________________________________________
P-value Coefficient P-value Coefficient P-value Coefficient
FUT8 1.07E–05 0.340027 0.010986 0.247543 0.035247 0.276876
IRF4 0.000575 0.249417 0.00936 0.261503 0.039336 0.235775
LCP2 0.027791 −0.18334 0.026033 −0.24189 0.009313 −0.50978
TLE1 0.00152 0.222874 3.28E–05 0.354751 0.001212 0.357821
TNFRSF9 3.65E–08 −0.38752 0.005578 −0.24641 0.045842 −0.23852
[112]Open in a new tab
Discussion
In this study, five gene expression data sets were downloaded from the
Gene Expression Omnibus. Thirty-one common DEGs were identified from
two gene expression data sets, mainly enriching in the regulation of
lymphocyte activation, immune response, and interleukin-mediated
signaling pathway, which were closely associated with the development
of DLBCL. Combined with 47 DLBCL-related genes acquired by literature
retrieval, 78 potential prognostic genes were obtained, which could
successfully cluster the DLBCL samples from another three gene
expression data sets into several subtypes with significant differences
in survival. Prognostic genes were screened out via Cox univariate
regression analysis, and five common genes were acquired, such as LCP2,
TNFRSF9, FUT8, IRF4, and TLE1.
TNFRSF9[113]^32 and IRF4[114]^33 are two known prognostic genes of
DLBCL. TNFRSF9 is a member of the TNF-receptor superfamily that can
induce proliferation in peripheral monocytes. Alizadeh et al[115]^32
indicate that expression levels of LIM domain only 2 (LMO2) and TNFRSF9
powerfully predict the overall survival in patients with DLBCL. TNFRSF9
can also serve as the target to treat DLBCL. The study by Houot et
al[116]^34 demonstrates that anti-CD137 therapy has a potent
antilymphoma activity in a mouse model. IRF4 belongs to the interferon
regulatory factor (IRF) family of transcription factors. Salaverria et
al[117]^35 report that translocations activating IRF4 identify a
subtype of germinal center-derived B-cell lymphoma affecting
predominantly children and young adults. Therefore, it may be a
therapeutic target of DLBCL.[118]^36
LCP2, FUT8, and TLE1 may be novel prognostic genes of DLBCL. LCP2 plays
a positive role in promoting T-cell development and activation as well
as mast cell and platelet function. FUT8 is an enzyme belonging to the
family of fucosyltransferases. It may contribute to the malignancy of
cancer cells and to their invasive and metastatic capabilities.[119]^37
Chen et al[120]^38 found that FUT8 is upregulated during
epithelial–mesenchymal transition via the transactivation of
β-catenin/lymphoid enhancer-binding factor (LEF)-1. Based on these
instances, we speculated that FUT8 might exert a similar role in DLBCL
and thus contributes to the metastasis of DLBCL. TLE1 is a multitasked
transcriptional corepressor that acts through the acute myelogenous
leukemia 1, Wnt, and Notch signaling pathways. Promoter CpG island
hypermethylation-associated inactivation of TLE1 has been observed in
DLBCL.[121]^39 Fraga et al[122]^40 further point out that TLE1
epigenetic inactivation contributes to the development of hematologic
malignancies by disrupting critical differentiation and
growth-suppressing pathways. However, the exact role of TLE1 in DLBCL
remains to be explored. We supposed that more researches may unveil
clinical applications of the three genes.
Overall, five critical genes with prognostic effect were disclosed in
DLBCL via bioinformatic analysis of existing gene expression data. Two
out of the five genes have been reported while the other three are
novel predictors. Further researches on these genes can benefit
molecular subtyping and also provide potential therapeutic targets of
DLBCL.
Highlights
1. A set of 31 common DEGs were identified from two gene expression
data sets.
2. Totally, 78 potential prognostic genes were suggested be used for
subtyping of DLBCL.
3. Five prognostic genes, including three novel ones, were identified
in DLBCL.
Footnotes
Disclosure
The authors report no conflicts of interest in this work.
References