Abstract

   Hepatocellular carcinoma (HCC) is an aggressive form of cancer
   characterized by a high recurrence rate following resection. Studies
   have implicated stromal and immune cells, which form part of the tumor
   microenvironment, as significant contributors to the poor prognoses of
   HCC patients. In the present study, we first downloaded gene expression
   datasets for HCC patients from The Cancer Genome Atlas database and
   categorized the patients into low and high stromal or immune score
   groups. By comparing those groups, we identified differentially
   expressed genes significantly associated with HCC prognosis. The Gene
   Ontology database was then used to perform functional enrichment
   analysis, and the STRING network database was used to construct
   protein-protein interaction networks. Our results show that most of the
   differentially expressed genes were involved in immune processes and
   responses and the plasma membrane. Those results were then validated
   using another a dataset from a HCC cohort in the Gene Expression
   Omnibus database and in 10 pairs of HCC tumor tissue and adjacent
   nontumor tissue. These findings enabled us to identify several tumor
   microenvironment-related genes that associate with HCC prognosis, and
   some those appear to have the potential to serve as HCC biomarkers.

   Keywords: TCGA, GEO, tumor microenvironment, immune scores,
   disease-free survival

INTRODUCTION

   Hepatocellular carcinoma (HCC) is a prevalent malignant tumor and a
   leading cause of cancer-associated death globally [[52]1]. More than
   50% of the world's HCC occurs in China, and the incidence rate has been
   increased yearly [[53]2]. The 5-year survival rate among advanced HCC
   patients is less than 5% due to the disease’s high recurrence and
   metastasis rates [[54]3]. The occurrence and development of HCC is a
   multifactorial, multistage process that involves the hepatocytes
   themselves as well as their microenvironment [[55]4].

   Previous studies have shown that the tumor microenvironment not only
   influences gene expression in HCC, but also the clinical outcomes of
   the patients [[56]5–[57]7]. Immune and stromal cells are vital nontumor
   components of the tumor microenvironment and can be used for diagnostic
   and prognostic evaluation. The immune score is a standard method for
   quantifying T cell and cytotoxic T cell density within a tumor
   microenvironment and provides data that are predictive of patient
   outcomes [[58]8]. Immune and matrix scores are calculated using the
   ESTIMATE (Estimation of STromal and Immune cells in MAlignant Tumor
   tissues using Expression data) algorithm. This method facilitates
   quantification of immune and matrix constituents within tumors [[59]9],
   and there is increasing evidence that an inflammatory environment is
   predictive of clinical outcome [[60]10–[61]13].

   To better understand the impact of stromal and immune cell-related
   genes on prognosis, in the present study we systematically analyzed
   tumor expression profiles and explored tumor microenvironment-related
   genes associated with a poor prognosis and their potential regulatory
   mechanisms. We initially analyzed HCC cohorts in The Cancer Genome
   Atlas database and used ESTIMATE immune scores to predicted genes that
   significantly affect outcomes in HCC patients. We then used the string
   database to perform functional annotation of these genes, after which
   we used qRT-PCR to verify expression of eight genes of interest in
   clinical samples and TCGA database. Finally, we performed a survival
   analysis to verify the impact of the eight genes of interest using a
   different HCC dataset from the Gene Expression Omnibus database
   ([62]GSE14520).

RESULTS

Immune scores are significantly related to disease-free survival in HCC

   Gene expression profiles with pathologic diagnosis and clinical data
   from 319 HCC patients were retrieved from TCGA database. Based on the
   ESTIMATE algorithm, stromal scores ranged from -1,741.56 to 1,195.07,
   and immune scores ranged from -1,209.16 to 2,934.36 ([63]Figure 1A).
   The patients were then divided into high and low score (divided based
   on median score) groups to determine the potential correlation between
   disease-free survival and immune or stromal scores. Kaplan-Meier
   survival curves revealed that patients in the high immune score group
   had a longer median disease-free survival time than those in the low
   score group (p = 0.0081 in log-rank test). ([64]Figure 1C and [65]1D).

Figure 1.

   [66]Figure 1
   [67]Open in a new tab

   Immune scores and stromal scores are associated with HCC disease-free
   survival. (A) TCGA liver cancer expression profile data using ESTIMATE
   method to calculate immune score and matrix score. Box-plot shows that
   the level of Immune scores and stromal scores. (B) Heatmap of the DEGs
   of immune scores of top half (high score) vs. bottom half (low score).
   p<0.05, fold change >1). Genes with higher expression are shown in red,
   lower expression are shown in green, genes with same expression level
   are in black. (C) HCC cases were divided into two groups based on their
   immune scores. Median disease-free survival of the high score group is
   longer than low score group (log-rank test, p<0.05). (D) Similarly, HCC
   cases were divided into two groups based on their stromal scores. The
   median disease-free survival of the low score group is longer than the
   high score group (log-rank test p=0.43), however, it is not
   statistically different.

Correlation between gene expression and immune scores

   We next evaluated the 319 HCC cases from TCGA database to assess the
   relationship between gene expression and immune scores. Using a fold
   change of |logFC |> 1 and significance threshold of P <0.05, we
   selected 1195 differentially expressed genes (DEGs), of which 1144 were
   overexpressed and 51 were underexpressed. A heat map of the DEGs is
   shown in [68]Figure 1B. Each row represents one gene, and each column
   represents one sample. The samples are sorted from left to right based
   on the immune score; the blue group on the left are the samples from
   the high score group, while the pink group on the right are the samples
   from the low score group. The genes were ranked according to the
   P-value of the differential expression analysis from smaller to larger.
   Red indicates overexpression, while green indicates underexpression,
   and the darker the red or green color, the greater the difference in
   expression.

   Functional enrichment analysis was performed to study the function of
   the DEGs. Gene Ontology (GO) analysis revealed that the DEGs were
   involved with “plasma membrane,” “immune process,” “immune response,”
   and “signaling receptor” activities. In addition, Kyoto Encyclopedia of
   Genes and Genomes (KEGG) pathway analysis showed significant
   involvement of the chemokine signaling pathway and cytokine-cytokine
   receptor interaction pathway. [69]Figure 2 shows the top ten enrichment
   results for GO annotation and KEGG pathway enrichment, and
   [70]Supplementary Tables 1–[71]4 list the specific enrichment
   information.

Figure 2.

   [72]Figure 2
   [73]Open in a new tab

   GO term and KEGG pathway analysis for all DEGs. Top 10 GO terms. False
   discovery rate (FDR) of GO analysis was acquired from STRING database.
   p <0.05. (A) biological process, (B) cellular component, (C) molecular
   function, and (D) KEGG pathway.

   We performed a survival analysis and generated Kaplan-Meier survival
   plots to examine the possible impact of each DEG on disease-free
   survival. Of the 1195 DEGs, 214 predicted poor or good disease-free
   survival in the log-rank test ([74]Supplementary File 1 list the total
   214 genes, [75]Figure 3 list 8 of 214 DEGs) (p<0.05).

Figure 3.

   [76]Figure 3
   [77]Open in a new tab

   Correlation of expression of individual DEGs in disease-free survival
   in TCGA. Kaplan-Meier survival curves were generated for selected DEGs
   extracted from the comparison of groups of high (red line) and low
   (blue line) gene expression. p<0.05 in Log-rank test. DFS, disease-free
   survival.

Protein-protein interactions network and functional enrichment analysis from
genes of significant value

   We used the STRING database to generate protein-protein interaction
   (PPI) networks to explore the interactions among the 214 predictive
   DEGs. We detected 7 modules in a network that included 156 nodes and
   1,049 edges ([78]Figure 4A). The two most significant modules were
   selected for subsequent analysis. In module 1 ([79]Figure 4B), 199
   edges involving 24 nodes were formed within the network. The genes,
   with the most connections to the immune response included PRF1, CCR7,
   IL7R, CD3E, GZMB, CXCR3, CCR5, CD4, CD40LG, CD8A, EOMES, LCK, PDCD1LG2,
   SLAMF1, SPN, ZAP70, CD48 and CD5. In module 2 ([80]Figure 4C), critical
   genes related to immune responses, including GZMA, KLRK1, KLRD1, IBTLA,
   CCR2, CD8B, L12RB1, IL18, ITK, SLAMF6 and TRAT1, were located at the
   center of the module.

Figure 4.

   [81]Figure 4
   [82]Open in a new tab

   (A) The whole PPI networks of the 214 predictive DEGs. (B) The module 1
   of the two most significant modules in the whole PPI network. (C) The
   module 2 of the two most significant modules in the whole PPI network.

   Functional enrichment clustering of these genes exhibited a strong
   association with immune responses. Top 10 GO terms included “signaling
   receptor activity,” “immune response,” “immune system process,” and
   “cell periphery, extracellular region.” In addition, KEGG pathway
   analysis showed all the pathways were associated with immune responses
   ([83]Figure 5A–[84]5D). Significant GO terms identified include 5 for
   “molecular function,” 12 for “cellular component” and 30 for
   “biological process.” [85]Supplementary Tables 5–[86]8 lists the
   specific enrichment information.

Figure 5.

   [87]Figure 5
   [88]Open in a new tab

   GO term and KEGG pathway analysis for DEGs significantly associated
   with disease-free survival. Top pathways with FDR < 0.05, -log FDR
   >1.301 are shown: (A) biological process, (B) cellular component, (C)
   molecular function, and (D) KEGG pathway.

GEO database and clinical sample validation

   To determine whether the aforementioned genes are also of prognostic
   significance in other HCC cases, a gene expression dataset
   ([89]GSE14520) from a different HCC cohort (221 cases) were downloaded
   and analyzed. Among the 214 predictive DEGs, 13 were confirmed to be
   significantly associated with clinical prognosis. Of those, eight genes
   (TNFSF8, CD3E, ITK, KLRD1, PRKCQ, TRAF3IP3, PHLDA2, C11orf21) were of
   particular interest because their differential expression had not been
   previously reported in HCC patients ([90]Figure 6). We used qRT-PCR to
   assess expression of these eight genes of interest in 10 pairs of fresh
   HCC and adjacent nontumor tissues. The results showed that levels of
   the transcripts of these eight genes were frequently higher (p<0.05) in
   the corresponding nontumor tissues than to the HCC tissues ([91]Figure
   7).

Figure 6.

   [92]Figure 6
   [93]Open in a new tab

   Validation of DEGs extracted from TCGA database with disease-free
   survival in GEO cohort. Kaplan-Meier survival curves were generated for
   selected DEGs extracted from the comparison of groups of high (red
   line) and low (blue line) gene expression. p<0.05 in Log-rank test.
   DFS, disease-free survival.

Figure 7.

   [94]Figure 7
   [95]Open in a new tab

   Verification of these 8 interested DEGs in clinical samples. Relative
   mRNA levels of 8 genes in 10 HCC samples were frequently overexpressed
   in nontumor tissues compared with matched HCC tissues(p<0.05) by
   qRT-PCR except CD3E.

Correlation between expression of genes of interested and expression of
immune checkpoint gene

   Using TCGA datasets, we found that expression of the eight genes of
   interest correlated positively with the mRNA expression of the immune
   checkpoint gene PDCD1 ([96]Figure 8) (p<0.05). Of the eight genes, the
   strongest correlation was between CD3E and PDCD1.

Figure 8.

   [97]Figure 8
   [98]Open in a new tab

   Correlation between expression of interested DEGs and immune checkpoint
   gene. Pearson correlation of expression and ImmuneScore dataset. The
   all 8 interested genes had significant correlation(p<0.05), especially
   CD3E, ITK and TRAF3IP3. X-axis represented expression level of 8
   interested genes in each sample. Y-axis represented expression level of
   PDCD1 in each sample.

DISCUSSION

   In the present study, we used a dataset from TCGA to identify genes
   related to the tumor microenvironment that influenced disease-free
   survival. By comparing between groups with high or low immune scores
   and performing a GO term functional analysis, we initially extracted
   1195 differentially expressed genes involved in the tumor
   microenvironment. We then carried out a disease-free survival analysis,
   which revealed that 214 of the genes were related to poor disease
   outcomes in HCC patients. Finally, using a validation cohort from the
   GEO database ([99]GSE14520), we detected 13 genes related to tumor
   microenvironment that correlated significantly with prognosis. Of those
   five (GZMA, CD79A, IGJ, CYP3A4, SPP1) are reportedly predictive of
   overall survival or are involved in HCC pathogenesis
   ([100]Supplementary Table 9). The remaining eight genes, including
   TNFSF8, CD3E, ITK, KLRD1, PRKCQ, TRAF3IP3, PHLDA2, and C11orf21, have
   not previously been reported to impact the clinical prognosis of HCC
   patients, and may have the potential to serve as new biomarkers for
   HCC. There was also a strong relationship between these eight genes,
   especially CD3E, ITK and TRAF3IP3, and the immune checkpoint gene
   PDCD1, which suggests HCC patients showing high expression of CD3E, ITK
   and TRAF3IP3 may respond well to immunotherapy.

   PPI network analysis revealed interleukin-7 receptor (IL7R) and killer
   cell lectin-like receptor K1 (KLRK1) to be highly interconnected nodes
   ([101]Figure 4). IL7R plays a vital role in lymphocyte development. For
   example, IL7R-deficiency may contribute to severe combined
   immunodeficiency (SCID). Moreover, it was recently reported that IL7R
   is associated with the risk of HCC [[102]14–[103]16]. KLRK1, also
   called NKG2D (NKG2-D-activating NK receptor), binds noncovalently with
   the DAP10 signaling protein to deliver costimulatory or activating
   signals to T and NK cells. Sheppard et al. suggested NKG2D promotes
   tumor growth in a chronic inflammation model of HCC [[104]17, [105]18].

   Statistics from several large-scale clinical trials of cancer patients
   demonstrate that the number, type, and region of infiltrating
   lymphocytes within tumor tissue are decisive for predicting clinical
   outcomes [[106]8, [107]19–[108]21]. Previous studies of colon cancer
   [[109]22], ovarian cancer [[110]23, [111]24], lung cancer [[112]25],
   and melanoma [[113]26] showed that immune cell activation-related genes
   are upregulated in patients with good prognoses. Pages et al. [[114]27]
   used immunological scoring methods to follow up the survival and
   recurrence in stage I and II colon cancer patients. They found that
   both overall and disease-free survival were significantly better in
   patients with higher immune scores than in those with lower immune
   scores. Patients with an immune score of 4 had more prolonged overall
   survival, and 95% of these patients had no tumor recurrence within 18
   years after surgery. On the other hand, 50% of patients with an immune
   score of 0 had tumor recurrence within 2 years after surgery. In the
   present study, higher immune scores associated with better prognosis in
   HCC patients, which is consistent with that earlier report. These
   findings may offer a different perspective on the complex interaction
   between tumors and their immune environment in HCC. However, there
   remains a need for further studies on these genes, which we anticipate
   will provide additional new insight into the impact of the tumor
   microenvironment in HCC.

MATERIALS AND METHODS

Source of data

   Gene expression profiles, as well as patients' clinical information,
   were obtained from TCGA database
   ([115]https://tcga-data.nci.nih.gov/tcga/). Immune and stromal scores
   were calculated using the ESTIMATE algorithm
   ([116]https://bioinformatics.mdanderson.org/estimate/). The validation
   dataset ([117]GSE14520) was extracted from the GEO database
   ([118]https://www.ncbi.nlm.nih.gov/geo/)

ESTIMATE algorithm

   ESTIMATE (Estimation of STromal and Immune cells in MAlignant Tumour
   tissues using Expression data) is an algorithmic tool. The detail
   algorithm can be seen at [119]Supplementary File 3. Yoshihara and his
   colleagues [[120]28] calculate stromal and immune scores to predict the
   level of infiltrating stromal and immune cells through single-sample
   gene set-enrichment analysis (ssGSEA), and these scores make up the
   basis of the ESTIMATE score to infer tumour purity in tumour tissue.

Clinical samples

   Ten pairs of fresh HCC specimens and adjacent nontumor tissue were
   collected from Southern Medical University, Zhujiang Hospital
   (Guangzhou, China). The human specimens used for validation in this
   study were collected between May and June 2020. Use of these specimens
   was approved by the local ethics committee. The adjacent samples were
   taken at a distance of at least 5 cm from the tumor, and all tissues
   were examined histologically. None of the patients had received
   preoperative chemotherapy or radiotherapy.

RNA extraction and quantitative real-time polymerase chain reaction(qRT-PCR)

   Total RNA was extracted from the tissue specimens using TRIzol
   (Invitrogen), and qRT-PCR was performed with SYBR Green Dye (Takara,
   Dalian, China) according to the manufacturer’s instructions. The primer
   sequences are listed in [121]Supplementary File 2.

Identification of DEGs and Heatmap and clustering analysis

   Data were analyzed using R software (version 3.6.2) and its package
   Limma. |logFC|>1 and p < 0.05 were set as the cut-offs to screen for
   DEGs. Heatmaps and clustering were generated using the R software.

PPI network construction

   The STRING network database ([122]https://string-db.org/) was employed
   for construction of PPI (protein-protein interaction) networks using
   the Cytoscape app. We then selected specific networks with ten or more
   nodes for subsequent analysis, and the degree of connectivity for each
   network node was calculated. To detect densely connected regions, we
   used the Cytoscape plug-in clustering algorithm called Molecular
   COmplex DEtection (MCODE) to identify clusters based on their
   topological features.

Recurrence-free survival curve analysis

   We carried out a Kaplan-Meier analysis using survival R package to
   assess the association between recurrence-free survival among patients
   and the levels of DEG expression. The constructed curves were compared
   using the log-rank test.

Enrichment analysis of DEGs

   We used the STRING network database to perform functional enrichment
   and pathway enrichment analyses of the DEGs. The enrichment content
   included GO, Reactome Pathways, Kyoto Encyclopedia of Genes and Genomes
   (KEGG) Pathways, UniProt Keywords, PFAM Protein Domains, SMART Protein
   Domains, INTERPRO Protein Domains and Features, and other databases.
   The pathway enrichment analysis was based on references from the KEGG