Abstract

   Clear cell renal cell carcinoma (ccRCC) is a prevalent malignant tumor
   in the field of urology. The effect of cell heterogeneity on the
   prognosis and reaction to treatment of ccRCC in large populations is
   still unclear. By analyzing public single cell RNA-sequencing and bulk
   RNA-sequencing data with the Scissor algorithm, we have identified
   three distinct prognosis related cancer cell subtypes which play an
   indispensable role on tumor metastasis, immune response and
   proliferation respectively. Besides, regulatory T cells (Tregs) and
   matrix producing cancer associated fibroblasts (matCAFs) were also
   recognized as crucial cell subtypes in the tumor microenvironment
   (TME). Moreover, potential interactions between Scissor + cells and
   other cells in TME were investigated to uncover regulatory mechanisms
   via ‘Cell Chat’ and cell2location algorithm. It is interesting that the
   interferon gamma signaling pathway and p53 signaling pathway contribute
   to the Scissor + transition of Tregs and matCAFs. The distinct
   activated transcription factor patterns were uncovered as well as the
   essential ligand-receptor pairs in the interactions among different
   cell subtypes, such as CXCL12-CXCR4 and COL6A2-SDC4. Then, we developed
   a risk score signature consisting of 10 genes, utilizing a
   101-combination machine learning computational framework, which showed
   promising results in predicting the prognosis of patients. Furthermore,
   our study revealed variations in immune cell infiltration and the
   expression of immune related factors within the tumor microenvironment
   between different risk score groups, as well as the different
   sensitivity to the immunotherapy. In the end, we suggested Rapamycin as
   the additional therapy for the advanced ccRCC. In conclusion, our study
   created a signature to provide opportunities for predicting prognosis
   and improving treatments of ccRCC.

   Keywords: Scissor, Machine learning, Multi-omics, Clear cell renal
   carcinoma

   Subject terms: Cancer, Cell biology

Introduction

   Clear cell renal carcinoma (ccRCC) is the most frequently diagnosed
   histological subtype of RCC and it leads to over 179,000 deaths
   worldwide^[34]1. Though some patients who undergo surgery experience
   increased survival rates, there are still numerous people suffered from
   the metastasis and recurrence of ccRCC. Over the years, only the renal
   tumor cell vaccine and sunitinib have been shown to enhance the
   prognosis, with nearly all other supplementary treatments failing to
   show their advantages^[35]2. Immune checkpoint blockade (ICB) therapy
   and combination regimens have given rise to the hope of these patients
   but the overall response rate still remains less than 58% in most
   cases^[36]3. Therefore, it is pivotal to construct a method assessing
   the prognosis and therapeutic response and continue to discover new
   drugs for ccRCC.

   Extensive heterogeneity, as a significant feature of cancer cells
   driven by genetics, epigenetics and microenvironment, accounts for the
   diverse therapeutic response among patients^[37]4. The use of
   single-cell RNA-sequencing (scRNA-seq) and spatial transcriptome (ST)
   technology has significantly improved our comprehension of tumor
   heterogeneity in recent times^[38]5. In spite of the extensive analysis
   of the heterogeneity in cancer cells and tumor microenvironment
   (TME)^[39]6,[40]7,[41]27, a comprehensive understanding of
   prognosis-related cellular subtypes is still lack due to the costs
   associated with high-throughput scRNA-seq when applied to large
   populations simultaneously. Due to the cost-effectiveness of bulk
   RNA-seq for use in a sizable group, the Scissor technique, which
   combines phenotype-linked bulk RNA-seq information with scRNA-seq data,
   seems capable of pinpointing a distinct cell subset linked to survival
   outlook^[42]8,[43]12. Furthermore, such illustration of prognosis
   related cancer cells offers a novel insight into their special
   biological characteristics and behaviors, which guides the development
   of clinical outcome.

   As an automatic data analysis method, machine learning has been
   employed in clinical diagnosis and therapy^[44]9. Furthermore,
   advancements in next-generation sequencing have led to the creation of
   multiple prognostic and predictive gene signature models using machine
   learning techniques in previous studies^[45]8,[46]10–[47]12,[48]14. To
   gain a better extrapolation potential, researchers incorporated 10
   machine learning methods used in clinical analysis and combined any two
   methods to establish a novel algorithm including 101 kinds of
   combinations^[49]11,[50]13,[51]14,[52]49. Such an integrative procedure
   may help to further promote the clinical diagnosis and therapy.

   This study extensively categorized cancer cells and cells in the TME
   using publicly available scRNA-seq, ST data and bulk RNA-seq data. Then
   the prognosis related cancer cell subtypes and other crucial cell types
   in TME were revealed via the Scissor algorithm and their molecular and
   biological characteristics were further analyzed. Using hdWGCNA, we
   identified co-expressed gene modules linked to prognosis-related tumor
   subtypes and created a gene expression signature for this specific
   group using a variety of machine learning techniques, aiding in disease
   prognosis prediction and treatment planning.

Materials and methods

Data acquisition

   In total, 14 tumor samples scRNA-seq data in three separate public
   datasets were obtained from the NCBI GEO databases at
   [53]http://www.ncbi.nlm.nih.gov/geo/, including [54]GSE171306
   ([55]GSM5222644, [56]GSM5222645), [57]GSE224630 ([58]GSM7028034,
   [59]GSM7028035, [60]GSM7028036, [61]GSM7028037, [62]GSM7028039) and
   [63]GSE159115 ([64]GSM4819725, [65]GSM4819727, [66]GSM4819729,
   [67]GSM4819734, [68]GSM4819736, [69]GSM4819737, [70]GSM4819738). And
   one representative sample (R29_T) was obtained from Zenodo
   [71]https://zenodo.org/record/8063124)^[72]15. Three bulk RNA
   sequencing datasets were acquired from the IMvigor210 immunotherapy
   group, The Cancer Genome Atlas (TCGA), and the International Cancer
   Genome Consortium (ICGC) repository in order to develop and validate
   the accuracy of the predictive model.

scRNA-seq data processing

   scRNA-seq data were processed via the ‘Seurat’ R package (version:
   4.2.2)^[73]16. Doublets were identified and removed using the
   ‘DoubletFinder’ R package (version 2.0.3)^[74]17. Cell quality was
   evaluated based on three criteria: total UMI count exceeding 200,
   number of genes exceeding 200, and percentage of mitochondrial genes
   below 20. To avoid unexpected noise, genes associated with mitochondria
   and ribosome were excluded. Following this, PCA was utilized following
   the normalization and scaling of raw counts using the SCTransform
   feature. The ‘Harmony’ R package (version: 1.2.0) was applied to reduce
   batch effects among individual scRNA-seq data^[75]18. The
   ‘FindNeighbors’ and ‘FindClusters’ functions were utilized for
   clustering analysis. Next, the cells were observed using the uniform
   manifold approximation and projection (UMAP) algorithm. Similarly,
   subclustering analysis was performed. The Seurat functions
   ‘FindAllMarkers’ and ‘FindMarkers’ were utilized to detect
   differentially expressed genes (DEGs) within various clusters. Known
   cell-type marker genes were used to annotate each cell cluster in
   succession. Furthermore, ‘Infercnv’ (version 1.14.2) R package
   ([76]https://github.com/broadinstitute/inferCNV) was applied to confirm
   the identity of cancer cells as previous described^[77]19.

Scissor

   We evaluated the arrangement of clusters in each sample using the
   '[78]Scissor' R package (version 2.0.0). The input data for
   “[79]Scissor” consists of a single-cell expression matrix, a batch
   expression matrix, and a target phenotype. Afterwards, the data is used
   to calculate correlation matrices and cell–cell similarity networks,
   which are then used in a network regularization sparse regression model
   to pinpoint the most relevant cell subsets associated with patients’
   overall survival phenotypes. Based on the estimated regression
   coefficient, the cells were categorized as either Scissor + or Scissor-
   cells, representing positive and negative correlations with the
   prognosis, respectively. Subsequent reliability significance testing
   was performed to assess the suitability of the selected data for
   phenotype cell association. The cells identified by the scissors would
   then undergo further characterization in downstream analysis.

Deconvolution of ST data

   We employed the cell2location deconvolution method^[80]20 on ST data to
   reconstruct the spatial organization of the various cell subtypes
   within the TME. Cell2location was executed using raw counts for both
   spatial and single-cell reference data, with default parameters
   applied.

Functional enrichment analysis

   Seurat’s ‘FindAllMarkers’ function was utilized to identify the
   differentially expressed genes (DEGs) within each cell cluster. A
   cut-off criterion was applied using a |logFC| greater than 0.25 and an
   adjusted p-value less than 0.05. Based on the DEGs, Hallmark pathway
   enrichment analysis was performed to explore common biological pathways
   among the three tumor Scissor + subtypes.

Transcription factor analyses

   Activated transcription factors (TFs) for each cell type were
   identified using the pySCENIC (version 0.12.1)(Van ^[81]21). Following
   the provided tutorials, we downloaded the TF data file
   (hg38*.genes_vs_motifs.rankings.feather) and motif annotation data file
   (motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl) from the CisTarget
   database ([82]https://resources.aertslab.org/cistarget/). A
   co-expression network was constructed from the scRNA-seq expression
   matrix using the GRNBoost2 function. Next, TF-motif enrichment analyses
   were conducted to identify regulons, and the activity levels of each
   regulon were calculated using AUCell functions.

Pseudotime trajectory analysis

   Unsupervised pseudotemporal analysis was conducted using the DDR-Tree
   algorithm and default settings in the ‘Monocle’ package (version
   2.26.0). The ‘plot_pseudotime_heatmap’ function was used to demonstrate
   the temporal expression patterns of module genes along the pseudotime
   trajectories in the three tumor Scissor + subtypes.

Analysis of gene co-expression networks with high dimensions and weighted
connections (hdWGCNA)

   In order to identify potential Scissor + cancer cells related genes
   associated with patients’ prognosis, we conducted a comprehensive
   analysis by “hdWGCNA” package (version 0.2.27)^[83]22. Specifically, 50
   cells were aggregated into per metacells for the tumor cell
   subpopulations with the hdWGCNA function MetacellsByGroups.
   Subsequently, we subsetted the Seurat object for each tumor cell
   subtype and performed the standard hdWGCNA pipeline.

Construction and validation of a 101-combination machine learning framework
based predictive model

   An analysis was performed to compare the bulk RNA-seq genes with the
   top 100 genes in the Scissor + tumor cell-related modules identified by
   hdWGCNA. The overlapping genes were considered to impact the prognosis
   of patients at both bulk and single-cell transcriptome levels. In order
   to develop a reliable prognostic model with strong predictive
   capabilities, we proceeded with the following steps:
    1. A total of 400 patients were randomly chosen from the TCGA-KIRC
       dataset for training purposes, while the remaining patients were
       designated as an internal validation dataset. The ICGC dataset was
       utilized as an external validation set. The analysis utilized ten
       different machine learning algorithms, such as Lasso, Ridge,
       stepwise Cox, CoxBoost, RSF, Enet, plsRcox, SuperPC, GBM, and
       survival-SVM. 101 combinations of 10 algorithms were tested using
       the TCGA-KIRC training dataset to select variables and build models
       in a tenfold cross-validation setup.
    2. Each of the created models was assessed in both the TCGA internal
       validation set and the ICGC external validation set. The C index
       was computed for every model in the training, internal validation,
       and external validation sets, and then the models were ranked
       according to the average C index of internal validation and
       external validation datasets to assess their predictive accuracy.
       Genes were identified using the selected algorithms with a largest
       average C index to establish a final risk signature. Following
       this, a risk score was computed for each sample in order to
       evaluate patient outcomes:

       graphic file with name d33e384.gif

   Then a nomogram was constructed with clinical phenotype parameters by
   ‘rms’ package (version 6.7–1). Further, its time dependent performance
   was evaluated through receiver operating characteristic (ROC) curves
   using the timeROC package (version 0.4) as well as the calibration
   curves in both the TCGA-KIRC dataset and ICGC dataset.

Cell–Cell Communication analysis

   To identify possible cell–cell interactions, the ‘CellChat’ package
   (version 1.6.1) was used with the default settings^[84]23 and the
   selected ligands and receptors (L-R) pairs were analyzed among specific
   cell subclusters.

Deconvolution of tumor immune microenvironment

   The CIBERSORT algorithm^[85]24, using linear support vector regression
   to analyze the expression matrix of 22 human immune cell subtypes, was
   used to study the differences in immune cell makeup between high and
   low-risk groups in the TCGA-KIRC dataset. To uncover the correlation
   between tumor cells and other cells in TME, we applied CIBERSORTx as
   tutorials suggest^[86]25. In brief, we randomly sampled 100 cells in
   each cluster as the reference scRNA-seq data and integrated the
   TCGA-KIRC dataset through the online analyzing tools
   ([87]https://cibersortx.stanford.edu).

Clinical analysis

   Using the ideal risk score threshold identified by the ‘survminer’ R
   software (v0.4.9), the TCGA training, internal validation, ICGC, and
   IMvigor210 datasets were split into high and low-risk categories.
   Following this, an analysis of Kaplan–Meier curves was performed with
   the ‘survminer’ R package and the ‘survival’ R package (version 3.4) to
   determine if there was a significant difference in OS between the high
   and low-risk groups (log-rank test, p < 0.05).

Statistical analysis

   R software (version 4.3.1) was used for statistical analyses and data
   visualizations. Spearman correlation coefficients were utilized to
   evaluate the relationships between two continuous variables. We
   compared quantitative data between subgroups using Wilcox test or
   KruskalWallis test. If not other stated, p values were corrected in
   multi-comparison tests using the Benjamini–Hochberg method A p-value
   less than 0.05 was deemed statistically significant.

Results

Scissor analysis reveals the characteristics and distribution of prognosis
related cells with scRNA-seq data

   To explore the cellular composition and variety of cell types in ccRCC,
   we performed a study analyzing 14 tumor samples in conjunction with
   their publicly accessible single-cell RNA sequencing information. By
   employing UMAP analyses, we were able to distinguish 30 unique cell
   populations (Fig. [88]1A). Based on known lineage markers and copy
   number variation, we grouped these populations into 13 main cell
   populations, which consist of mast cells, dendritic cells, macrophages,
   NK/NKT cells, neutrophils, CD4 + T cells, CD8 + T cells,
   ACKR1 + endothelial cells, ACKR1- endothelial cells, intermediate
   endothelial cells, fibroblasts, and cancer cells (Fig. [89]1B;
   Supplementary Fig. [90]S1). Additionally, the study presents the
   analysis of significant known lineage markers in the 13 clusters, along
   with the expression and distribution of these markers (Fig. [91]1D-E).
   Furthermore, the identity of the inferred cancer cells was confirmed by
   calculating somatic large-scale chromosomal copy number variation (CNV)
   scores using immune cells and stromal cells as normal cell references