Abstract Clear cell renal cell carcinoma (ccRCC) is a prevalent malignant tumor in the field of urology. The effect of cell heterogeneity on the prognosis and reaction to treatment of ccRCC in large populations is still unclear. By analyzing public single cell RNA-sequencing and bulk RNA-sequencing data with the Scissor algorithm, we have identified three distinct prognosis related cancer cell subtypes which play an indispensable role on tumor metastasis, immune response and proliferation respectively. Besides, regulatory T cells (Tregs) and matrix producing cancer associated fibroblasts (matCAFs) were also recognized as crucial cell subtypes in the tumor microenvironment (TME). Moreover, potential interactions between Scissor + cells and other cells in TME were investigated to uncover regulatory mechanisms via ‘Cell Chat’ and cell2location algorithm. It is interesting that the interferon gamma signaling pathway and p53 signaling pathway contribute to the Scissor + transition of Tregs and matCAFs. The distinct activated transcription factor patterns were uncovered as well as the essential ligand-receptor pairs in the interactions among different cell subtypes, such as CXCL12-CXCR4 and COL6A2-SDC4. Then, we developed a risk score signature consisting of 10 genes, utilizing a 101-combination machine learning computational framework, which showed promising results in predicting the prognosis of patients. Furthermore, our study revealed variations in immune cell infiltration and the expression of immune related factors within the tumor microenvironment between different risk score groups, as well as the different sensitivity to the immunotherapy. In the end, we suggested Rapamycin as the additional therapy for the advanced ccRCC. In conclusion, our study created a signature to provide opportunities for predicting prognosis and improving treatments of ccRCC. Keywords: Scissor, Machine learning, Multi-omics, Clear cell renal carcinoma Subject terms: Cancer, Cell biology Introduction Clear cell renal carcinoma (ccRCC) is the most frequently diagnosed histological subtype of RCC and it leads to over 179,000 deaths worldwide^[34]1. Though some patients who undergo surgery experience increased survival rates, there are still numerous people suffered from the metastasis and recurrence of ccRCC. Over the years, only the renal tumor cell vaccine and sunitinib have been shown to enhance the prognosis, with nearly all other supplementary treatments failing to show their advantages^[35]2. Immune checkpoint blockade (ICB) therapy and combination regimens have given rise to the hope of these patients but the overall response rate still remains less than 58% in most cases^[36]3. Therefore, it is pivotal to construct a method assessing the prognosis and therapeutic response and continue to discover new drugs for ccRCC. Extensive heterogeneity, as a significant feature of cancer cells driven by genetics, epigenetics and microenvironment, accounts for the diverse therapeutic response among patients^[37]4. The use of single-cell RNA-sequencing (scRNA-seq) and spatial transcriptome (ST) technology has significantly improved our comprehension of tumor heterogeneity in recent times^[38]5. In spite of the extensive analysis of the heterogeneity in cancer cells and tumor microenvironment (TME)^[39]6,[40]7,[41]27, a comprehensive understanding of prognosis-related cellular subtypes is still lack due to the costs associated with high-throughput scRNA-seq when applied to large populations simultaneously. Due to the cost-effectiveness of bulk RNA-seq for use in a sizable group, the Scissor technique, which combines phenotype-linked bulk RNA-seq information with scRNA-seq data, seems capable of pinpointing a distinct cell subset linked to survival outlook^[42]8,[43]12. Furthermore, such illustration of prognosis related cancer cells offers a novel insight into their special biological characteristics and behaviors, which guides the development of clinical outcome. As an automatic data analysis method, machine learning has been employed in clinical diagnosis and therapy^[44]9. Furthermore, advancements in next-generation sequencing have led to the creation of multiple prognostic and predictive gene signature models using machine learning techniques in previous studies^[45]8,[46]10–[47]12,[48]14. To gain a better extrapolation potential, researchers incorporated 10 machine learning methods used in clinical analysis and combined any two methods to establish a novel algorithm including 101 kinds of combinations^[49]11,[50]13,[51]14,[52]49. Such an integrative procedure may help to further promote the clinical diagnosis and therapy. This study extensively categorized cancer cells and cells in the TME using publicly available scRNA-seq, ST data and bulk RNA-seq data. Then the prognosis related cancer cell subtypes and other crucial cell types in TME were revealed via the Scissor algorithm and their molecular and biological characteristics were further analyzed. Using hdWGCNA, we identified co-expressed gene modules linked to prognosis-related tumor subtypes and created a gene expression signature for this specific group using a variety of machine learning techniques, aiding in disease prognosis prediction and treatment planning. Materials and methods Data acquisition In total, 14 tumor samples scRNA-seq data in three separate public datasets were obtained from the NCBI GEO databases at [53]http://www.ncbi.nlm.nih.gov/geo/, including [54]GSE171306 ([55]GSM5222644, [56]GSM5222645), [57]GSE224630 ([58]GSM7028034, [59]GSM7028035, [60]GSM7028036, [61]GSM7028037, [62]GSM7028039) and [63]GSE159115 ([64]GSM4819725, [65]GSM4819727, [66]GSM4819729, [67]GSM4819734, [68]GSM4819736, [69]GSM4819737, [70]GSM4819738). And one representative sample (R29_T) was obtained from Zenodo [71]https://zenodo.org/record/8063124)^[72]15. Three bulk RNA sequencing datasets were acquired from the IMvigor210 immunotherapy group, The Cancer Genome Atlas (TCGA), and the International Cancer Genome Consortium (ICGC) repository in order to develop and validate the accuracy of the predictive model. scRNA-seq data processing scRNA-seq data were processed via the ‘Seurat’ R package (version: 4.2.2)^[73]16. Doublets were identified and removed using the ‘DoubletFinder’ R package (version 2.0.3)^[74]17. Cell quality was evaluated based on three criteria: total UMI count exceeding 200, number of genes exceeding 200, and percentage of mitochondrial genes below 20. To avoid unexpected noise, genes associated with mitochondria and ribosome were excluded. Following this, PCA was utilized following the normalization and scaling of raw counts using the SCTransform feature. The ‘Harmony’ R package (version: 1.2.0) was applied to reduce batch effects among individual scRNA-seq data^[75]18. The ‘FindNeighbors’ and ‘FindClusters’ functions were utilized for clustering analysis. Next, the cells were observed using the uniform manifold approximation and projection (UMAP) algorithm. Similarly, subclustering analysis was performed. The Seurat functions ‘FindAllMarkers’ and ‘FindMarkers’ were utilized to detect differentially expressed genes (DEGs) within various clusters. Known cell-type marker genes were used to annotate each cell cluster in succession. Furthermore, ‘Infercnv’ (version 1.14.2) R package ([76]https://github.com/broadinstitute/inferCNV) was applied to confirm the identity of cancer cells as previous described^[77]19. Scissor We evaluated the arrangement of clusters in each sample using the '[78]Scissor' R package (version 2.0.0). The input data for “[79]Scissor” consists of a single-cell expression matrix, a batch expression matrix, and a target phenotype. Afterwards, the data is used to calculate correlation matrices and cell–cell similarity networks, which are then used in a network regularization sparse regression model to pinpoint the most relevant cell subsets associated with patients’ overall survival phenotypes. Based on the estimated regression coefficient, the cells were categorized as either Scissor + or Scissor- cells, representing positive and negative correlations with the prognosis, respectively. Subsequent reliability significance testing was performed to assess the suitability of the selected data for phenotype cell association. The cells identified by the scissors would then undergo further characterization in downstream analysis. Deconvolution of ST data We employed the cell2location deconvolution method^[80]20 on ST data to reconstruct the spatial organization of the various cell subtypes within the TME. Cell2location was executed using raw counts for both spatial and single-cell reference data, with default parameters applied. Functional enrichment analysis Seurat’s ‘FindAllMarkers’ function was utilized to identify the differentially expressed genes (DEGs) within each cell cluster. A cut-off criterion was applied using a |logFC| greater than 0.25 and an adjusted p-value less than 0.05. Based on the DEGs, Hallmark pathway enrichment analysis was performed to explore common biological pathways among the three tumor Scissor + subtypes. Transcription factor analyses Activated transcription factors (TFs) for each cell type were identified using the pySCENIC (version 0.12.1)(Van ^[81]21). Following the provided tutorials, we downloaded the TF data file (hg38*.genes_vs_motifs.rankings.feather) and motif annotation data file (motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl) from the CisTarget database ([82]https://resources.aertslab.org/cistarget/). A co-expression network was constructed from the scRNA-seq expression matrix using the GRNBoost2 function. Next, TF-motif enrichment analyses were conducted to identify regulons, and the activity levels of each regulon were calculated using AUCell functions. Pseudotime trajectory analysis Unsupervised pseudotemporal analysis was conducted using the DDR-Tree algorithm and default settings in the ‘Monocle’ package (version 2.26.0). The ‘plot_pseudotime_heatmap’ function was used to demonstrate the temporal expression patterns of module genes along the pseudotime trajectories in the three tumor Scissor + subtypes. Analysis of gene co-expression networks with high dimensions and weighted connections (hdWGCNA) In order to identify potential Scissor + cancer cells related genes associated with patients’ prognosis, we conducted a comprehensive analysis by “hdWGCNA” package (version 0.2.27)^[83]22. Specifically, 50 cells were aggregated into per metacells for the tumor cell subpopulations with the hdWGCNA function MetacellsByGroups. Subsequently, we subsetted the Seurat object for each tumor cell subtype and performed the standard hdWGCNA pipeline. Construction and validation of a 101-combination machine learning framework based predictive model An analysis was performed to compare the bulk RNA-seq genes with the top 100 genes in the Scissor + tumor cell-related modules identified by hdWGCNA. The overlapping genes were considered to impact the prognosis of patients at both bulk and single-cell transcriptome levels. In order to develop a reliable prognostic model with strong predictive capabilities, we proceeded with the following steps: 1. A total of 400 patients were randomly chosen from the TCGA-KIRC dataset for training purposes, while the remaining patients were designated as an internal validation dataset. The ICGC dataset was utilized as an external validation set. The analysis utilized ten different machine learning algorithms, such as Lasso, Ridge, stepwise Cox, CoxBoost, RSF, Enet, plsRcox, SuperPC, GBM, and survival-SVM. 101 combinations of 10 algorithms were tested using the TCGA-KIRC training dataset to select variables and build models in a tenfold cross-validation setup. 2. Each of the created models was assessed in both the TCGA internal validation set and the ICGC external validation set. The C index was computed for every model in the training, internal validation, and external validation sets, and then the models were ranked according to the average C index of internal validation and external validation datasets to assess their predictive accuracy. Genes were identified using the selected algorithms with a largest average C index to establish a final risk signature. Following this, a risk score was computed for each sample in order to evaluate patient outcomes: graphic file with name d33e384.gif Then a nomogram was constructed with clinical phenotype parameters by ‘rms’ package (version 6.7–1). Further, its time dependent performance was evaluated through receiver operating characteristic (ROC) curves using the timeROC package (version 0.4) as well as the calibration curves in both the TCGA-KIRC dataset and ICGC dataset. Cell–Cell Communication analysis To identify possible cell–cell interactions, the ‘CellChat’ package (version 1.6.1) was used with the default settings^[84]23 and the selected ligands and receptors (L-R) pairs were analyzed among specific cell subclusters. Deconvolution of tumor immune microenvironment The CIBERSORT algorithm^[85]24, using linear support vector regression to analyze the expression matrix of 22 human immune cell subtypes, was used to study the differences in immune cell makeup between high and low-risk groups in the TCGA-KIRC dataset. To uncover the correlation between tumor cells and other cells in TME, we applied CIBERSORTx as tutorials suggest^[86]25. In brief, we randomly sampled 100 cells in each cluster as the reference scRNA-seq data and integrated the TCGA-KIRC dataset through the online analyzing tools ([87]https://cibersortx.stanford.edu). Clinical analysis Using the ideal risk score threshold identified by the ‘survminer’ R software (v0.4.9), the TCGA training, internal validation, ICGC, and IMvigor210 datasets were split into high and low-risk categories. Following this, an analysis of Kaplan–Meier curves was performed with the ‘survminer’ R package and the ‘survival’ R package (version 3.4) to determine if there was a significant difference in OS between the high and low-risk groups (log-rank test, p < 0.05). Statistical analysis R software (version 4.3.1) was used for statistical analyses and data visualizations. Spearman correlation coefficients were utilized to evaluate the relationships between two continuous variables. We compared quantitative data between subgroups using Wilcox test or KruskalWallis test. If not other stated, p values were corrected in multi-comparison tests using the Benjamini–Hochberg method A p-value less than 0.05 was deemed statistically significant. Results Scissor analysis reveals the characteristics and distribution of prognosis related cells with scRNA-seq data To explore the cellular composition and variety of cell types in ccRCC, we performed a study analyzing 14 tumor samples in conjunction with their publicly accessible single-cell RNA sequencing information. By employing UMAP analyses, we were able to distinguish 30 unique cell populations (Fig. [88]1A). Based on known lineage markers and copy number variation, we grouped these populations into 13 main cell populations, which consist of mast cells, dendritic cells, macrophages, NK/NKT cells, neutrophils, CD4 + T cells, CD8 + T cells, ACKR1 + endothelial cells, ACKR1- endothelial cells, intermediate endothelial cells, fibroblasts, and cancer cells (Fig. [89]1B; Supplementary Fig. [90]S1). Additionally, the study presents the analysis of significant known lineage markers in the 13 clusters, along with the expression and distribution of these markers (Fig. [91]1D-E). Furthermore, the identity of the inferred cancer cells was confirmed by calculating somatic large-scale chromosomal copy number variation (CNV) scores using immune cells and stromal cells as normal cell references