Abstract

   Glioblastoma (GBM) is a common malignant brain tumor of the central
   nervous system with a poor prognosis. In order to identify the
   prognostic signatures of GBM, we screened differentially expressed
   genes (DEGs) that were based on a single-cell RNA sequencing
   (scRNA-seq) dataset. These genes characteristically represent the
   intra-tumor heterogenicity of glioblastoma. Moreover, we performed
   univariate analysis, log-rank test and multivariate Cox regression
   analyses to confirm a gene set that could be related to the overall
   survival (OS) among DEGs. Prognostic associated signatures (PAS) were
   utilized to construct a model for predicting OS in GBM patients. When
   considering either the training or the validation sets, time-dependent
   receiver operating characteristic (ROC) curves all indicated that our
   model displayed an excellent predictive ability. Additionally, we
   analyzed PAS at the single-cell level and found that the PAS score was
   associated with somatic mutations and clinical factors. Three factors,
   which included the PAS score, radiotherapy status, and age, were all
   used to establish a nomogram to predict the 6-month and 1-year survival
   probabilities. In conclusion, we constructed an optimal model that was
   derived from scRNA-seq to better predict the survival probability of
   GBM patients. These genes might also act as potential prognostic
   biomarkers and enable surgeons to develop individually therapeutic
   schedules and improve the prognosis of GBM patients.

   Keywords: glioblastoma, single cell, differentially expressed genes,
   survival analysis, prognostic model

Introduction

   Glioblastoma (GBM) is the most common malignancy among primary central
   nervous system (CNS) tumors, which is classified as a grade IV glioma
   according to the World Health Organization (WHO) [33]^1. The median
   overall survival (OS) of GBM patients remains at only 15 months
   post-diagnosis, with a 5-year disease-free survival probability of 10
   percent [34]^2^-[35]^4. The poor-prognosis is mostly due to a high
   proliferation rate, treatment-resistance to chemotherapy and targeted
   therapies, and aggressive infiltration of cancer cells into the
   surrounding normal brain tissues [36]^5.

   Over the past few decades, microarray analysis and RNA sequencing
   (RNA-seq) from bulk tissue has evolved as an important tool in the
   analysis of differential gene expression across the transcriptome
   [37]^6. With the development of this technology, many public cancer
   databases that are associated with various omics approaches have been
   established to reveal the underlying mechanisms accounting for disease
   occurrence, which includes the Cancer Genome Atlas (TCGA) [38]^7 and
   the Gene Expression Omnibus (GEO). Using these gene expression profiles
   to identify biomarkers associated with prognosis, subsequent studies
   are increasingly being reported. However, GBM is a highly heterogeneous
   cancer wherein differential genes that are screened from bulk tissue
   data are not sufficiently representative of this disease [39]^8.

   Recently, single-cell RNA-seq (scRNA-seq) analysis has gradually
   emerged as a new area of intense research effort with its advances
   helping to explain intra-tumorigenic heterogeneity [40]^9^-[41]^11. In
   addition, tumor tissue that is resected from the core tumor of affected
   patients might include different cell types that harbor distinct
   phenotypic states [42]^12. Single-cell RNA-seq technology enables
   scientists to identify individual cells in heterogeneous cell
   populations, which can overcome the aforementioned limitations
   [43]^13^, [44]^14. Moreover, scRNA-seq analysis enables the discovery
   of significant genes that are truly characteristic of tumor cells
   [45]^15.

   In this current study, we first identified differentially expressed
   genes (DEGs) from a scRNA-seq dataset of tumor cells as compared normal
   cells. These genes characteristically represent the intra-tumor
   heterogenicity of glioblastoma. Then we integrated the dataset with the
   bulk RNA-seq dataset from the TCGA and microarray datasets of GEO to
   obtain a prognosis-associated gene set. Finally, we explored utilizing
   this gene set to construct a more effective model to accurately predict
   the prognosis of GBM patients. This approach has the potential of
   providing guidance for additionally targeted and individualized
   treatment of patients.

Materials and Methods

Data sources

   Raw sequencing data of single-cell datasets in this study was fetched
   from the Gene Expression Omnibus (GEO) database under the accession
   number [46]GSE84465, for which, 3589 cells were included in the
   following analysis for which PAS scored an initial quality check. Raw
   fastq data was mapped to the hg19 human reference genome by HISAT2
   (v2.1.0) [47]^16 (with parameters: -k 10 -rdg 99999999, 99999999 -rfg
   99999999, 99999999 -mp 1, 1 -np 1 -score -min L, 0, -0.1 -no-mixed -no
   -softclip -no -discordant -secondary -seed 12345), and aligned reads
   were quantified by feature counts software [48]^17 to the count level.

   We downloaded the TCGA-GBM RNA-seq and clinical data as a training set
   from the University of California at Santa Cruz (UCSC) Xena Browser.
   The fragments per kilobase of exon per million fragments mapped (FPKM)
   value, was used to normalize gene expression from which, we transformed
   it into log2 (FPKM+1) datasets. The microarray data from the
   [49]GSE16011 was also collected from the GEO database as a validation
   set and was processed by normalizing gene expression signals and log2
   transformation through mas5 algorithm in R package “affy” (v1.64.0). In
   total, there were 154 GBM patients in the TCGA- GBM dataset and 155 GBM
   patients in the [50]GSE16011 dataset with complete clinical data. In
   addition, the MuTect2 somatic mutation data was obtained from the UCSC
   Xena Browser. A workflow of the analysis conducted in this study was
   shown in Figure [51]1.

Figure 1.

   [52]Figure 1
   [53]Open in a new tab

   Workflow of the analytical procedure for constructing and validating
   prognostic signatures in GBM patients.

Cell clustering in single-cell RNA-seq data

   We used the statistical R package “Seurat” (v3.1.1) [54]^18^, [55]^19
   to process single-cell data. First, we normalized the data by following
   the standard pre-processing workflow in Seurat. Then, we calculated a
   subset of features that exhibited high cell-to-cell variation, from
   which we selected the top 500 features by setting the method as
   dispersion. Finally, the Uniform Manifold Approximation and Projection
   (UMAP) algorithm was used to perform dimensionality reduction. The same
   marker profiles of the cell type that were reported in the original
   paper [56]^12 were compared with our current study to define the type
   of each cell.

Differential expression analysis

   Differentially expressed genes (DEGs) were identified with the R
   package “DESeq2” (v1.22.2) [57]^20 by defining neoplastic cells as the
   tumor group, and then astrocytes, oligodendrocytes, oligodendrocyte
   progenitor cells (OPCs) as the normal group. The DESeq2 pipeline was
   used with default settings. We filtered genes with the criteria by a
   log2 fold change >2 and adjusted this to an alpha value of p <0.01.

Functional and pathway enrichment analysis

   To compare the biological processes (BP), cellular components (CC) and
   molecular functions (MF) of DEGs, we used the R package
   “clusterProfiler” (v3.10.1) [58]^21 to conduct gene set enrichment
   analysis (GSEA) of the Gene Ontology (GO). The five most significant
   biological processes in the results were shown with an adjusted alpha
   value of p <0.05. Additionally, the Kyoto Encyclopedia of Genes and
   Genomes (KEGG) pathway enrichment analysis of DEGs was performed with
   the same package. An UpSet plot of the top 10 pathways were drawn from
   this research.

Survival analysis

   The log-rank test, univariate Cox regression analysis and multi-Cox
   regression analysis were respectively used to screen PAS with OS values
   using the R package “survival” (v3.1-8). The threshold with
   significance in all methods was set at an alpha value of p <0.05. In
   addition, Kaplan Meier (KM) survival curves were generated to
   graphically exhibit the prognostic outcomes between high and low risk
   groups that were divided through the median of gene expression levels
   or the PAS score.

Construction of the prognostic model

   Due to the high-dimension features and complexity of the data, we
   performed a least absolute shrinkage and selection operator (LASSO)
   method analysis using the R package “glmnet” (v2.0-18) [59]^22. In
   addition, 10-fold cross-validation was used to establish a best Cox
   proportional hazards model. Then, to quantify the risk of OS for each
   patient, the PAS score was calculated as the sum product of the RNA
   (Expi) expression levels and LASSO coefficients (Li). The efficacy of
   the prognostic model was validated by depicting the areas under the
   curve (AUC) of the ROC by using the R package “survivalROC” (v1.0.3).
   To monitor the OS and to predict the survival probability in GBM
   patients, we merged the PAS score with interesting clinical variables
   and constructed a nomogram using the R package “regplot” (v0.2).

Oncoprint of somatic mutation

   Somatic mutation information was stored in Mutation Annotation Format
   (MAF) form and was visualized using the R package “maftools” (v1.8.10)
   tool [60]^23. All parameters were set to the default settings.

Statistical Methods

   All data analyses were completed using the 'R” statistical software
   package (v3.5.3) and the corresponding fundamental analyses packages.
   An alpha value of p <0.05 was considered statistically significant.

Results

Differentially expressed gene identification from a single-cell RNA-seq
dataset

   We selected 1670 single cells that included 1091 neoplastic cells, 88
   astrocytes, 85 oligodendrocytes, and 406 oligodendrocyte progenitor
   cells (OPCs). These cells were analyzed by the standard workflow of the
   R package “Seurat.” A linear dimensionality reduction method helped
   identify significant components (Figure [61]2A). Additionally, we used
   the Uniform Manifold Approximation and Projection (UMAP) algorithm,
   which is a novel manifold learning technique for dimension reduction to
   visualize the single-cell RNA-seq data based on the significant
   components. We discovered that cells could be clustered into four
   groups (Figure [62]2B). The differential marker genes between clusters
   are illustrated in [63]Table S1, and the top 10 most enriched genes in
   each cluster were exhibited in the heat map (Figure [64]2C). Next, we
   compared the marker genes in our study with those identified in the
   original article [65]^12 and used the same marker genes to name the
   clusters, where results were similar.

Figure 2.

   [66]Figure 2
   [67]Open in a new tab

   Characteristics of single-cell RNA-seq data and DEGs. (A) Jack Straw
   Plot showing the p-value distributions for each PC. (B) Dimension
   reduction analysis of single-cell RNA-seq data by the UMAP algorithm
   that clusters cells into four groups. (C) Heat map expression profiles
   of the top 10 genetic markers in each cluster. (D) PCA of tumorigenic
   and normal cells that can be clearly separated. (E) Volcano plot of
   DEGs with a log2 fold-change >2, and an adjusted alpha value of p
   <0.01. (F) Heat map expression profiles of the 100 most significant
   genes for tumorigenic and normal cells.

   Concrete classification information of cells used in this study are
   illustrated in [68]Table S2. In order to find the DEGs between
   tumorigenic and normal cells, we defined astrocytes, oligodendrocytes
   and OPCs as normal cells. By principal component analysis (PCA),
   tumorigenic and normal cells were separated based on principle
   component 1 (PC1) and PC2 (Figure [69]2D). Then the “DEseq2” package
   identified the DEGs. We excluded genes with the cut-off criteria of
   log2 fold change >2 and adjusted the alpha value to p<0.01. In the
   results, we included the 1367 identified up-regulated genes and the 959
   down-regulated genes in tumor cells that were displayed in the volcano
   plot (Figure [70]2E). The two groups were clearly discriminated by
   these 100 DEGs in the heat map (Figure [71]2F).

GSEA of GO and KEGG pathway enrichment analysis in DEGs

   The 1367 DEGs of the transcriptomic data from single cells were
   selected for gene set enrichment analysis (GSEA) to explore the
   biological effects. We chose the log2 fold change as the reference
   phenotype and the five most significant Gene Ontology (GO) biological
   processes were listed. We observed that up-regulated genes in tumor
   cells were enriched in the cell cycle, cell division process and
   tumor-associated pathways (Figure [72]3A), as for cellular components
   and molecular functions, these genes were enriched in the nuclear and
   DNA binding associated molecular process ([73]Figure S1A and [74]Figure
   S1B). By contrast, up-regulated genes in normal cells were enriched in
   glial cell development and synaptic transmission (Figure [75]3B), as
   for cellular components and molecular functions, these genes were
   enriched in cytoplasm and transmembrane transportation ([76]Figure S1A
   and [77]Figure S1B). These results were consistent with the
   characteristics of both cell types. In addition, it is intriguing to
   note that the KEGG pathway enrichment analysis revealed an association
   of DEGs with lipid and fatty acid metabolism (Figure [78]3C), and
   similar results were shown on enrichment analysis of just up-regulated
   or down-regulated DEGs ([79]Figure S1C and [80]Figure S1D). Precisely
   how an altered associated metabolism contributed to progression of GBM
   remains unknown. However, quite valuable research study outcomes have
   been published that illustrated how fatty acid synthesis was related to
   EGFR signaling in glioma stem cells [81]^24.

Figure 3.

   [82]Figure 3
   [83]Open in a new tab

   GSEA results and KEGG analysis of differentially expressed genes. (A)
   The five most significant biological processes for tumor cells in the
   GSEA of GO. (B) The five most significant biological processes for
   normal cells in the GSEA of GO. (C) Diagram illustrating the top 10
   pathways in the KEGG enrichment analysis of DEGs.

Construction of prognostic-associated signatures in the TCGA-GBM cohort

   To discover prognostic-associated signatures, we combined the 2326 DEGs
   that were defined above with the TCGA-GBM bulk RNA-seq and clinical
   data. There were 154 GBM patients with intact clinical data and 110
   genes of DEGs were not expressed in TCGA bulk RNA-seq data. Before
   constructing a Cox proportional hazards model to better predict OS in
   GBM patients, we used the log-rank test and univariate cox regression
   analysis respectively to preliminarily screen genes with prognostic
   potential.

   In the obtained results, we found that 181 genes had prognostic
   potential in the log-rank test and 279 genes had prognostic potential
   as determined by univariate Cox regression analysis (p < 0.05). We
   generally thought that the more up-regulated genes that were expressed
   in tumor cells of studied patients, then the worse their prognosis
   would be. Similarly, the greater the number of down-regulated genes
   that were expressed in tumor cells of studied patients, the better
   their prognosis would be. Thus, a total of 98 genes that included 89
   up-regulated genes with a hazard ratio (HR) >1, and 9 down-regulated
   genes with a HR < 1 (p < 0.05; both in the log-rank test, and
   univariate cox regression analysis), were selected to construct a Cox
   proportional hazards model (Figure [84]4A).

Figure 4.

   [85]Figure 4
   [86]Open in a new tab

   Establishing a survival model in the GBM cohort. (A) Summary of the
   selected DEGs with a prognostic capacity, in which we only selected
   genes during the area covered by indicated colors. (B) The optimal
   account of genes that corresponded to minimum lambda was 18 in the
   TCGA-GBM cohort. (C) The ROC analysis curves in predicting OS by the
   PAS score in the TCGA-GBM cohort. (D) The KM curves for low- and
   high-risk groups in the TCGA-GBM cohort. (E) The ROC curves for
   predicting OS by the PAS score in the [87]GSE16011 cohort. (F) The KM
   curves for low- and high-risk groups in the [88]GSE16011 cohort.

   By utilizing the least absolute shrinkage and selection operator
   (LASSO) method, we established a best Cox proportional hazards model
   (10-fold cross- validation) with 18 genes (Figure [89]4B). The complete
   information related to the OS of these 18 genes are shown in Table
   [90]1. In addition, Kaplan-Meier (KM) curves of each gene can be seen
   in [91]Figure S2. The prognostic associated signature score (PAS score)
   for each patient was calculated as the sum of the product of the RNA
   expression levels (Expi) and LASSO coefficients (Li). In the TCGA-GBM
   cohort, the areas under the curve (AUC) of the receiver operating
   characteristic (ROC) curves of this model for predicting 1-, 3-, and
   5-year OS were 0.803, 0.876, and 0.985 respectively (Figure [92]4C).

Table 1.

   Statistical analysis of 18 genes associated with survival in the
   TCGA-GBM cohort.
   Variables Log-rank test Univariate cox regression analysis LASSO
   P value HR (95% CI) P value Coefficient
   AGAP2-AS1 0.018 1.24(1.10-1.40) <0.001 0.057478
   CLEC18C 0.012 3.73(1.58-8.80) 0.003 0.2067
   CNPY4 0.046 1.87(1.32-2.65) <0.001 0.054552
   COL22A1 0.015 1.38(1.15-1.65) <0.001 0.090813
   CRNDE 0.010 1.31(1.04-1.66) 0.025 0.002454
   HOXB2 0.001 1.19(1.05-1.34) 0.007 0.014237
   HOXD11 0.049 1.32(1.06-1.64) 0.012 0.019821
   LOXL1 0.011 1.43(1.21-1.69) <0.001 0.034217
   MBLAC1 0.006 1.79(1.18-2.73) 0.007 0.10079
   OSMR-AS1 0.013 2.92(1.69-5.05) <0.001 0.44677
   PCDHB3 0.029 1.24(1.01-1.52) 0.039 0.01033
   PTPRN <0.001 1.42(1.21-1.67) <0.001 0.19405
   RGS14 0.006 1.57(1.18-2.07) 0.002 0.121256
   TCAF2 <0.001 1.85(1.31-2.62) <0.001 0.09673
   TSHZ2 <0.001 1.93(1.34-2.78) <0.001 0.023308
   TSPAN4 0.015 2.11(1.40-3.18) <0.001 0.0621
   BEST3 0.037 0.73(0.59-0.91) 0.006 -0.0173
   FERMT1 0.004 0.81(0.70-0.94) 0.006 -0.02211
   [93]Open in a new tab

   Furthermore, we divided patients into high-risk and low-risk groups
   according to the median of PAS score. KM analysis indicated that
   patients with a high PAS score suffered significantly poorer OS
   outcomes (p < 0.0001; Figure [94]4D). These results lend support to our
   model possessing good sensitivity and specificity.

Validation of PAS in an external GBM cohort

   To validate our PAS, we selected external data from GEO under accession
   number [95]GSE16011 to determine whether it made sense in other
   cohorts. There were 155 GBM patients with complete survival data and
   microarrays. We conducted the same workflows to measure the AUC of the
   ROC curves and divided patients into two groups by the median of the
   PAS score. In the results analyses, the AUC of the ROC curves of this
   model in predicting 1-, 3-, and 5-year OS rates were 0.602, 0.796, and
   0.842 respectively (Figure [96]4E). Further, KM analysis also indicated
   that patients with a high PAS score suffered significantly poorer OS
   outcomes (p < 0.001; Figure [97]4F). All results revealed that these
   PAS were important in GBM patients without an increased level of bias.

PAS levels in a single-cell dataset that is correlated with somatic mutations

   After we identified 18 prognostic-associated signatures, we mapped the
   expression of the signatures to the UMAP-reduction plot, which was done
   in the single-cell dataset that was described above. It showed that two
   genes had lower expression levels in neoplastic cell clusters as
   compared to normal cells, while 16 genes displayed higher expression
   levels in the neoplastic cell cluster (Figure [98]5A and [99]Figure
   S3). Specifically, five genes were involved in cell adhesion and
   migration, which included FERMT1, COL22A1, LOXL1, PCDHB3, and TCAF2.
   Four genes were transcription factor or involved in the regulation of
   transcription, like HOXB2, HOXD11, PTPRN, and TSHZ2. Moreover, we
   explored somatic mutation that was found when comparing the high with
   the low PAS score groups in the TCGA-GBM cohort.

Figure 5.

   [100]Figure 5
   [101]Open in a new tab

   Analysis of 18 prognostic genes in the single-cell and bulk analysis
   dataset. (A) Differential expression levels of each gene in four
   clusters in the single-cell dataset. (B) The somatic mutation landscape
   of TCGA-GBM patients with a high PAS score. (C) The somatic mutational
   landscape of TCGA-GBM patients with a low PAS score.

   The mutational landscape indicated that both groups had different
   mutation events, and the low PAS score group was found to have an
   increased frequency of mutation events (Figures [102]5B and [103]5C).
   In addition, patients with higher mutations might have an increased
   number of neoantigens that will increase the sensitivity of patients to
   chemotherapy, immunotherapy and targeted molecular therapy
   [104]^25^-[105]^27. Thus, these observations might provide a key reason
   to account for why low PAS score patients had improved prognostic
   outcomes.

Development of a nomogram for predicting OS in GBM patients

   For the purpose of establishing a comprehensive model to predict the OS
   probability in GBM patients, we first analyzed an association between
   the PAS score and several clinical factors in the TCGA-GBM dataset. The
   hazard ratios for OS according to the PAS score, age, gender, and
   radiotherapy status were measured by multivariate Cox regression
   analysis (Figure [106]6A). It revealed that the PAS score and
   radiotherapy status were independent prognostic factors (p <0.001).
   Considering that the resistance of the human body to disease weakens
   with age, we also included age into the nomogram that was integrated
   with two independent factors.

Figure 6.

   [107]Figure 6
   [108]Open in a new tab

   Development of a nomogram in predicting the 6-month and 1-year survival
   probabilities in GBM patients. (A) Multivariate Cox regression analysis
   of the PAS score with several clinical factors. (B) Integration of
   three factors to construct a nomogram for the prediction of 6-month and
   1-year OS rates. (C) Calibration curve to validate the predictive
   efficacy of the model for 6 month and 1-year OS rates.

   The final three prognostic factors were utilized to develop a nomogram
   for predicting 6-month and 1-year OS probability (Figure [109]6B). A
   calibration curve was drawn to validate model efficacy (Figure
   [110]6C). Thus, the actual observed OS outcomes were approximately
   consistent with the predicted OS outcomes regardless of the 6-month or
   1-year time frame, which showed an excellent working ability.

Discussion

   Bulk RNA-seq data from the TCGA-GBM dataset fails to accurately reflect
   the status of tumor and intra-tumorigenic heterogeneity [111]^9^,
   [112]^10. This is in part due to the dominant cell cluster masking the
   transcriptomic characteristics of other cellular clusters. With the aid
   of single-cell RNA-seq datasets that were analyzed in this study, we
   compared and precisely discovered the DEGs between glioma cells and
   their original cells. These DEGs better describe the intra-tumorigenic
   heterogeneity of glioma, and provide us with a gene set for further
   analysis.

   Furthermore, these DEGs indicated distinct molecular networks in glioma
   cells as compared to glia cells. GSEA analysis on DEGs revealed that
   up-regulated genes in glioma cells were enriched in the cell cycle, the
   cell division process and other tumor-associated pathways. By contrast,
   for normal cells, up-regulated genes were enriched in glial cell
   development-associated and synaptic function- associated pathways. KEGG
   enrichment analysis showed intriguing results as well, which were
   mostly enriched in pathways associated with lipid metabolism. Recent
   reports have shown that lipid metabolism is critical to glioma cells in
   terms of energy synthesis [113]^28^, [114]^29, and fatty acids were
   shown to regulate the EGFR signaling pathway and to promote survival
   and proliferation of glioblastoma stem cells [115]^24.

   To connect the DEGs with prognosis, we first employed Cox regression
   and log-rank test analytics to preliminarily select genes with
   prognostic predicting potential, following which, we then employed
   variable selection analysis on the TCGA-GBM dataset. The final model
   included 18 genes that performed well on the dataset with a five-year
   OS AUC of 0.985.

   Moreover, we used data from the [116]GSE16011 cohort to test the
   generalization ability of the model, in which we found that accuracy
   was slightly decreased, and yet still achieved a five-year OS AUC of
   0.842. In addition, patients that were divided by the PAS value
   displayed a distinct mutational landscape, which partially explained
   differences on OS. Demographic information that included age, and
   clinical information like chemotherapy were also good prognostic
   prediction factors, and thus we established a model using a nomogram to
   predict the overall survival of patients.

Conclusion

   We discovered a signature with only 18 genes by using a single cell
   RNA-seq dataset, from which we developed a PAS to evaluate patient
   prognoses. These genes characteristically represent the intra-tumor
   heterogenicity of glioblastoma. This signature has acceptable
   performance in terms of both the training and external datasets and
   showed acceptable generalization capabilities. This is a simple model
   with good performance characteristics that can be applied in the
   clinic. PAS is also associated with the somatic mutation profile of
   patients. These genes might act as potential prognostic biomarkers and
   provide useful clinical guidance to clinicians with the intent of
   developing personalized therapeutic schedules.

Supplementary Material

   Supplementary figures.
   [117]Click here for additional data file.^ (1.8MB, pdf)

   Supplementary table 1.
   [118]Click here for additional data file.^ (333.1KB, csv)

   Supplementary table 2.
   [119]Click here for additional data file.^ (48.2KB, csv)

Acknowledgments