Abstract

Cancer research aims to identify genes that cause or control disease progression. Although a wide range of gene sets have been published, they usually agree poorly with one another. Furthermore, recent findings from gene-expression cohorts of different cancer types revealed a phenomenon known as positive random bias: randomly chosen gene sets are significantly associated with survival time far more often than expected. In this study, we propose a method based on Brouwer’s fixed-point theorem that employs significantly survival-associated random gene sets and reveals a small fixed-point gene set for cancers with the positive random bias property. These sets significantly correspond to cancer-related pathways with biological relevance for the progression and metastasis of the cancer types they represent. Our findings show that the proposed gene sets are biologically related to each cancer type in The Cancer Genome Atlas that has the positive random bias property, and that using these sets reduces positive random bias significantly more than state-of-the-art methods in this field. The random bias property is removed in 8 of these 17 cancer types, and the number of random gene sets associated with survival time is significantly reduced in the remaining 9 cancers.

Subject terms: Computational biology and bioinformatics, Systems biology, Biomarkers, Diseases

Introduction

According to the American Cancer Society, cancer is the second most common cause of death in the US^[28]1. In addition, cancers are heterogeneous diseases, with comparable diagnoses and identical treatment regimens resulting in vastly different outcomes for patients. On the other hand, early cancer diagnosis and prognosis have substantial effects on patients’ therapeutic targets^[29]2. This has prompted researchers to seek out factors that can aid in predicting the course of cancer disease.
Numerous studies that relied solely on clinical characteristics such as lymph node status and histological grade to classify clinical outcomes found these characteristics to be insufficient. This led to studies that consider genomic data (e.g., gene expression) alongside clinical features. In general, the goal of such studies was to select a preferably small number of genes, known as a signature, and to use their expression profiles to predict a patient’s survival outcome^[30]3. Nonetheless, detecting a gene set that is robust across datasets and accurately predicts a patient’s survival outcome has become a key challenge in cancer research.

In the last two decades, numerous articles have been published on finding survival-related genes in various cancer types, each proposing a gene set that was highly associated with cancer progression and metastasis^[31]4–[32]10. Nevertheless, there was little overlap between the gene sets resulting from studies with different cohorts but similar analytical approaches^[33]11. This lack of overlap indicates that the results depend on the cohorts being studied. In this regard, considering the cancer patient’s survival time is one of the most critical aspects of finding such a gene set^[34]11.

In 2012, Venet et al.^[35]12 examined this point by estimating the association between randomly selected gene sets and breast cancer patient survival time in a Netherlands Cancer Institute (NKI) cohort. One would expect that using the expression of random genes to divide samples into two groups assigns samples to each group essentially at random, so that the groups do not differ significantly in survival time.
In other words, the p-values obtained from statistical tests comparing the survival curves of groups generated by random gene sets should be distributed uniformly, with only about 5% of the p-values falling below 0.05^[36]13. By contrast, Venet’s analysis revealed that in the case of breast cancer, groups generated by many of the random gene sets showed a statistically significant difference in survival time. That is to say, these random gene sets were significantly associated with the patient’s survival time. Additionally, in some cases, the random gene sets were more significantly associated with survival time than some of the published signatures^[37]12. These findings suggest that many of the signatures identified through breast cancer gene expression analysis may not be causal to cancer progression, despite being significantly associated with survival time^[38]13.

Venet et al. explained this finding by pointing to the proliferation signature, which affects a substantial portion of the genome. They suggested that most random gene sets contain some genes from the proliferation signature and are thus associated with proliferation and, indirectly, with survival. They defined the meta-PCNA signature to measure the proliferation rate and introduced a method to remove this signature’s impact on the expression data. They concluded that removing the effects of meta-PCNA genes on the expression of the genes in the NKI breast cancer cohort was an effective way to reduce the association between random genes and survival.

In 2018, Shimoni coined the term “random bias” to describe this concept in cancer^[39]13. Random bias is an unexpected situation in which more (or fewer) than 5% of random gene sets are associated with some clinical attribute, such as survival time, in a statistically significant way.
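To make the 5% expectation concrete, the following minimal sketch (an illustration, not the authors’ code) splits synthetic patients into two groups at random, standing in for a split that carries no survival information, and counts how often a log-rank test is significant. The synthetic data, the sample sizes, and the function names `logrank_p_value`, `ssar_proportion`, and `simulate_null` are all assumptions made for illustration.

```python
import math
import random


def logrank_p_value(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns the chi-square (1 df) p-value."""
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in samples if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for s, _, g in samples if s >= t and g == 0)  # at risk in A
        n_b = sum(1 for s, _, g in samples if s >= t and g == 1)  # at risk in B
        d_a = sum(1 for s, e, g in samples if s == t and e and g == 0)
        d_b = sum(1 for s, e, g in samples if s == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0.0:
        return 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def ssar_proportion(p_values, alpha=0.05):
    """Fraction of p-values below alpha (the SSAR proportion)."""
    return sum(p < alpha for p in p_values) / len(p_values)


def simulate_null(n_patients=60, n_splits=200, seed=0):
    """Random splits of synthetic patients: no real survival association."""
    rng = random.Random(seed)
    times = [rng.expovariate(1.0) for _ in range(n_patients)]
    events = [rng.random() < 0.7 for _ in range(n_patients)]  # some censoring
    p_values = []
    for _ in range(n_splits):
        idx = list(range(n_patients))
        rng.shuffle(idx)
        a, b = idx[: n_patients // 2], idx[n_patients // 2:]
        p_values.append(logrank_p_value(
            [times[i] for i in a], [events[i] for i in a],
            [times[i] for i in b], [events[i] for i in b]))
    return ssar_proportion(p_values)
```

Under these assumptions `simulate_null()` should return a value close to 0.05. Positive random bias corresponds to a proportion well above that level when the splits come from random gene sets rather than from shuffling.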
Shimoni examined The Cancer Genome Atlas (TCGA) data for 34 different cancer types to see whether there was a significant association between random gene sets and survival time. According to his analysis, random bias could be found in a wide variety of cancer types. Seventeen of the 34 datasets exhibited positive random bias, meaning that more than 5% of randomly selected gene sets in these cancers are significantly associated with survival time. Ten cancer types did not exhibit random bias, while seven datasets exhibited negative random bias, that is, fewer than 5% of their random gene sets were significant survival-associated random (SSAR) gene sets.

Shimoni applied Venet’s approach of eliminating the confounding effect of the proliferation signature to TCGA expression data in order to reduce random bias in all types of cancer. His analysis concluded that Venet’s method was ineffective in removing random bias in most cancer types and impractical in the TCGA breast cancer cohort. To solve this problem, Shimoni proposed that dividing samples into small subgroups using an unsupervised clustering method could decrease the proportion of SSAR gene sets in a wide range of cancer types. His results showed that, of the 106 clusters generated for all cancer types that had exhibited positive or negative random bias, random bias was eliminated in only 65. Although random bias was not eliminated in 41 of the 106 clusters, he contends that clustering can effectively eliminate random bias in several TCGA cancer types. Despite producing some promising and valuable insights, the existing research has produced contradictory results, is still limited in scope, and faces several critical theoretical and analytical challenges.
Previous studies have shown that significant survival-associated random gene sets can provide valuable insights into the biology of breast cancer and aid in identifying biologically cancer-related genes^[40]14. Since random bias can be observed in many cancer types, SSAR gene sets may also provide informative results for most cancer types. Building upon these findings, we hypothesize that each cancer type has a fixed-point gene set that is biologically associated with cancer survival time and that can be identified from SSAR gene sets. We further hypothesize that these fixed-point gene sets are responsible for the observed random bias, so that removing their effects from expression data decreases the proportion of significant survival-associated random gene sets.

To identify these fixed-point gene sets, we introduce a novel iterative approach for detecting gene sets that are not only statistically significant but also biologically relevant for cancer research and clinical practice. Specifically, we aim to identify fixed-point gene sets for each TCGA cancer type that exhibits positive random bias. By applying this approach, we aim to eliminate positive random bias and reduce the proportion of SSAR gene sets in a large number of cancer studies. Moreover, we demonstrate that the identified gene sets are highly biologically significant and can be considered signatures for their associated cancer type. This suggests that the proposed approach can provide valuable insights into the underlying biology of cancer and improve the accuracy and reliability of survival analyses in various cancer types.
Materials and methods

Dataset

Loi et al.^[41]15 collected microarray expression data of 17,585 genes from 380 individuals with primary breast tumors. The Rdata file was downloaded from NCBI’s Gene Expression Omnibus (GEO) with accession number [42]GSE6532 ([43]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532).

Netherlands Cancer Institute (NKI) cohort

The NKI cohort, also known as the van de Vijver et al. data set^[44]5, was provided in Venet’s paper^[45]12. This dataset contains microarray expression data of 13,108 genes for 295 breast cancer patients in stages I or II, together with their clinical data.

The Cancer Genome Atlas (TCGA)

The expression data of the 17 TCGA cancer types that exhibit positive random bias (based on Shimoni’s findings) were downloaded from [46]https://portal.gdc.cancer.gov. Following Shimoni’s approach, we used level 3 data normalized with the RNA-Seq by Expectation-Maximization (RSEM) method; each dataset contains RNA-seq expression data for cancer patients together with their survival time and clinical data. We used the standard TCGA study abbreviations for the cancer type names (as defined in [47]https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations).

Method

In this section, we outline our novel approach for identifying the fixed-point gene set for each cancer type that exhibits the positive random bias property. Our method is based on an iterative algorithm that systematically and efficiently identifies the gene set of interest. We explain the steps involved and describe in detail how the method detects biologically relevant gene sets associated with cancer survival time.

Fixed-point gene set identifier method (FPGI)

1. Initialize $X_0 = \emptyset$, $j = 1$.
2. Randomly select a set of genes, $G_j$, with size $m$ from all genes (Fig. [48]1A).
3. Use principal component analysis (PCA) on the gene expression data matrix of $G_j$ to divide samples into two equally sized groups (A and B) based on the median of the first principal component. Test the null hypothesis that there is no difference in survival time between these two groups using the log-rank test. If the p-value is less than 0.05, proceed to step 3a. Otherwise, increase $j$ by one, go back to step 2, and choose another random set ($G_{j+1}$) (Fig. [49]1B).
   a. Set $DEG = G_j$. Use the Significance Analysis of Microarrays (SAM) method to detect differentially expressed genes between groups A and B, and take the first $m$ genes (the most significant genes) as $DEG'$ (Fig. [50]1C).
   b. Compare the sets $DEG'$ and $DEG$. If they are not the same, set $G_j = DEG'$ and go back to step 3.
   c. If $DEG'$ and $DEG$ are the same, set $X_j = X_{j-1} \cup DEG$ and go to step 4.
4. Increase $j$ by one and go back to step 2.
5. Continue the whole process for $j = 1, \ldots, 6000$ and identify $Z_C = X_{6000}$ as the fixed-point gene set of cancer type $C$ that exhibits the positive random bias property.

Figure 1. The figure depicts the FPGI method for identifying a fixed-point gene set in a given cancer type. The method starts with $j = 0$ and an empty set $X_0$, and proceeds iteratively for $j = 1, \ldots, 6000$. At each iteration, a set of genes ($G_j$) is randomly selected from the expression data of all genes (GE all genes) and used to construct a gene expression data matrix (GE random genes) (part A). The samples are then divided into two groups, A and B, using principal component analysis (PCA), and their survival times are compared using the log-rank test and its p-value (part B). If the p-value is less than 0.05, set $DEG = G_j$, and the first 50 differentially expressed genes ($DEG'$) between A and B are identified using the Significance Analysis of Microarrays (SAM) method (part C). If $DEG$ is the same as $DEG'$, it is added to the fixed-point set $X_j$; otherwise, the process is repeated with $DEG'$ as $G_j$ until convergence. In this method, $j$ represents the iteration number. The final result, $X_{6000}$, is considered the fixed-point set of cancer type $C$.

Why does the FPGI method converge?

This iterative method tries to identify gene sets associated with positive random bias in cancer samples. It combines PCA, which divides samples into two groups, with SAM, which identifies differentially expressed genes between the two groups resulting from PCA. The process is repeated until a fixed set of genes is obtained.

The primary contributor to the method’s convergence is the application of SAM at each iteration. SAM is designed to identify genes with statistically significant differences in expression between two sample groups, which steers each iteration toward gene sets that are statistically significant and likely biologically relevant. The second critical factor is the use of PCA to refine the search for relevant gene sets. PCA is an efficient method for identifying sample subpopulations with potentially distinct gene expression profiles. By dividing the samples into two groups based on the gene set identified at each iteration, FPGI can identify subpopulations of samples with comparable gene expression profiles and narrow the search to the most relevant gene sets.
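The inner loop of steps 3 to 3c can be read as a plain fixed-point iteration with an iteration cap. The sketch below is a hypothetical skeleton under that reading: `refine` is a placeholder for the PCA split, log-rank test, and SAM ranking (none of which are implemented here), and `toy_refine` is an invented stand-in used only to exercise the loop.

```python
def iterate_to_fixed_point(start_genes, refine, max_iter=100):
    """Apply `refine` repeatedly until the gene set stops changing.

    `refine` stands in for steps 3 and 3a: it returns the next gene set,
    or None when the PCA split is not significantly survival-associated.
    Returns (gene_set, n_iterations) on convergence, otherwise None.
    """
    current = frozenset(start_genes)
    for iteration in range(1, max_iter + 1):
        candidate = refine(current)
        if candidate is None:        # p-value >= 0.05: abandon this start set
            return None
        candidate = frozenset(candidate)
        if candidate == current:     # DEG' equals DEG: fixed point reached
            return current, iteration
        current = candidate          # step 3b: continue from DEG'
    return None                      # no fixed point within max_iter


def toy_refine(genes, n_genes, m):
    """Invented stand-in for SAM ranking: keep the m highest-indexed genes
    among the current set and each gene's right-hand neighbour."""
    pool = set(genes) | {g + 1 for g in genes if g + 1 < n_genes}
    return sorted(pool, reverse=True)[:m]
```

With the toy ranking, `iterate_to_fixed_point({2, 7, 11}, lambda s: toy_refine(s, n_genes=20, m=3))` drifts upward and stops at the three highest-indexed genes. The explicit `max_iter` cap is a design choice of this sketch: it guards against a ranking step that cycles between gene sets instead of settling.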
In addition, the stopping criterion of this method, which requires that the gene set identified in an iteration be identical to the gene set identified in the previous iteration, ensures that the method does not continue to iterate forever. To investigate this, we tested our algorithm on different gene set sizes, including 5, 50, 100, and 200 genes. Our results show that in all scenarios the algorithm converges quickly (within at most 20 iterations). Together, the combination of SAM and PCA, the identification of statistically significant gene sets at each iteration, the refinement of the search based on the identified gene set, and the stopping criterion ensure that the FPGI method converges on a fixed set of biologically significant genes, making it a reliable and robust technique for identifying relevant gene sets in cancer samples.

Scoring function

Our scoring function, denoted $w$, maps the fixed-point gene set $Z_C$ to the set of natural numbers, $w: Z_C \to \mathbb{N}$. Specifically, given a gene $g$, we define its score $w(g)$ as the number of times $g$ appears in the gene set $X_j$, where $X_j$ is computed in step 3c for $j = 1, \ldots, 6000$. Therefore, $w(g)$ measures the significance of each gene in $Z_C$ by quantifying its frequency of occurrence across iterations.

In the Results section, we discuss how the fixed-point set $Z_C$ is associated with survival time and plays a key role in the phenomenon of positive random bias for the vast majority of cancer types exhibiting this property. Although the results presented there pertain solely to BRCA, the same conclusions hold for the other 16 types of cancer analyzed.
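As a small illustration of the scoring function, the sketch below counts how many converged gene sets contain each gene, treating $X_{6000}$ as a multiset. The function names and the toy gene labels are assumptions for illustration; `top_fraction` selects a leading share of genes by score, which is how a top-10% subset could be taken.

```python
from collections import Counter


def score_genes(converged_sets):
    """w(g): number of converged gene sets (step 3c) that contain gene g."""
    scores = Counter()
    for gene_set in converged_sets:
        scores.update(set(gene_set))   # each set contributes at most 1 per gene
    return scores


def top_fraction(scores, fraction=0.10):
    """Genes with the highest scores, keeping at least one gene."""
    k = max(1, round(len(scores) * fraction))
    return [gene for gene, _ in scores.most_common(k)]


# Toy example with placeholder gene labels (not real results).
toy_sets = [{"A", "B"}, {"A", "C"}, {"A", "B"}]
toy_scores = score_genes(toy_sets)
fixed_point_set = set(toy_scores)      # Z_C: distinct genes that ever converged
```

Here the keys of the counter play the role of $Z_C$ and the counts play the role of $w(g)$.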
Results

Frequency of genes in fixed-point set

To ensure comprehensive coverage of the search space and the inclusion of all available genes in the dataset, we generated the union of 6,000 random gene sets, each containing 50 genes. This union resulted in a multiset of 300,000 genes. For the BRCA dataset, the union contained each of the 18,275 genes of the dataset at least once, and some genes were repeated up to 34 times. Thus, our iterative method, which starts from a random gene set and repeats 6,000 times, covers all genes in the dataset, giving each gene a chance to be chosen in a random gene set. By contrast, the fixed-point set of BRCA ($Z_{BRCA}$) consists of only 295 genes, with maximum and minimum scores of 397 and 52, respectively. Compared to the original dataset, this is a very small set of genes. These results show that, for each cancer type, our method starts from all available genes in the dataset and eventually settles on a small subset of them.

The scores of the 50 top genes in $Z_{BRCA}$ are shown in Fig. [53]2. We chose this number of genes for the figure because we mainly report results for a random gene set size of 50 throughout the paper, even though we also analyzed our method using random gene sets of size 100 and 200. The scores of the top 50 genes in $Z_{BRCA}$ ranged from 346 to 397. The frequencies of these genes in the random gene sets and their scores in $Z_{BRCA}$ are plotted as blue and orange bars, respectively. Similar plots for other cancer types are available in Supplementary Information File [54]1, and they follow the same pattern.

Figure 2. Frequency of genes in all random gene sets versus scores of the fixed-point gene set. The orange bars represent the scores of the 50 top-scoring genes in the breast cancer (BRCA) fixed-point set.
In random gene sets, the frequency of these genes is represented by blue bars.

Overall, we observe that the $Z_C$ set is much smaller than the original dataset. This suggests that the genes in the $Z_C$ set may be important for cancer progression. To investigate this claim, we analyze the biological relevance of $Z_C$ to cancer $C$ in the following sections.

Protein-protein interaction (PPI) network and pathway enrichment analysis

In the first step of our analysis, to determine functional interactions between the proteins encoded by the genes of the resulting $Z_C$, we used the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database [57]https://string-db.org^[58]16. The PPI network was constructed using active interaction sources such as text mining, experiments, databases, neighborhood, gene fusion, co-occurrence, and co-expression, and the species was restricted to “Homo sapiens”. The nodes in the network represent proteins, while the edges reflect interactions. In STRING, each protein-protein interaction is annotated with one or more scores that indicate confidence. Each score lies between 0 and 1, with 1 representing the highest level of confidence. To obtain more reliable findings, we required a minimum interaction confidence score of 0.9.

The $Z_{BRCA}$ gene-based network consisted of 113 non-isolated nodes and 810 edges. The constructed network has a PPI enrichment p-value of less than 1.0e-16, indicating that the interactions between genes were not random. In other words, the $Z_{BRCA}$ genes interact more frequently than would be expected for a random collection of proteins with the same size and degree distribution (in this case, the expected number of edges is 84). This enrichment indicates that the proteins are biologically related as a group. As illustrated in Fig. [59]3, the network, reduced to its non-isolated nodes, contains some distinct, dense modules. Similar results were observed for other cancer types and are available in Supplementary Information File [60]2. These results indicate that the genes in the $Z_C$ sets are highly associated with one another.

Figure 3. STRING protein-protein interaction analysis of the fixed-point gene set of the BRCA dataset. The network contained 113 nodes and 810 edges (vs. 84 expected edges); enrichment p-value less than 1.0e-16. The figure was constructed using the STRING database (version 11.5; [63]https://string-db.org/).

In the second step, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses were conducted using the Database for Annotation, Visualization and Integrated Discovery (DAVID) ([64]https://david.ncifcrf.gov) to find the significant pathways corresponding to $Z_C$ for each of the 17 cancer types^[65]17,[66]18. Supplementary Information Fig. [67]1 shows the significant pathways common to at least two types of cancer. In this figure, the names of the pathways are listed in the first column, and the pathways associated with each cancer type are depicted in the second column using different colors. As presented in Supplementary Information Table [68]1, most of the significantly enriched pathways of $Z_C$ were highly associated with cancer $C$.

Association of fixed-point set with disease

The Genetic Association Database (GAD) tool on the DAVID Functional Annotation server ([69]https://david.ncifcrf.gov) was used to investigate the association between $Z_C$ genes and disease. GAD is a database of published genetic association studies that enables the investigation of complex common human genetic diseases^[70]17,[71]18. Table [72]1 lists the enriched disease and the disease class for each of the 17 cancer types.
The table suggests that the genes in $Z_C$ are associated with cancer $C$: the top disease class is cancer for 14 of the 17 cancer types. For instance, in the case of BRCA, the top-level disease class and disease assigned by GAD were cancer and breast cancer, with p-values of 2.8e-4 and 6.5e-8, respectively. As noted by Ansar et al.^[73]14, these findings indicate how our method can detect meaningful information in SSAR gene sets.

Table 1. Enriched disease and disease class obtained from fixed-point sets using the Genetic Association Database (GAD).

Dataset | GAD disease class | Class p-value | GAD disease | Disease p-value
--- | --- | --- | --- | ---
ACC | Cancer | 2.2e-3 | Plasma HDL cholesterol (HDL-C) levels | 9.9e-6
BLCA | Cancer | 9.2e-6 | Urinary bladder neoplasms | 1.8e-2
BRCA | Cancer | 2.8e-4 | Breast cancer | 6.5e-8
GBMLGG | Cancer | 2.8e-4 | Schizophrenia | 2.0e-6
HNSC | Cardiovascular | 1.6e-5 | Cardiomyopathy, Dilated; DCM - Dilated cardiomyopathy | 3.7e-6
KIPAN | Cancer | 7.5e-4 | Type 2 Diabetes; edema; rosiglitazone | 7.5e-5
KIRC | Cancer | 2.5e-6 | Chronic renal failure; Kidney failure, Chronic | 1.9e-3
KIRP | Cancer | 3.1e-5 | Type 2 Diabetes; edema; rosiglitazone | 9.6e-6
LGG | Pharmacogenomic | 2.3e-5 | Several psychiatric disorders | 2.0e-6
LIHC | Cancer | 9.0e-5 | Liver cancer | 6.1e-2
LUAD | Cancer | 2.4e-17 | Lung cancer | 2.0e-6
LUSC | Cancer | 9.3e-4 | Lung Diseases; Resp distress syndrome neonatal | 2.1e-6
MESO | Cancer | 1.1e-3 | Lung cancer | 2.5e-2
PAAD | Cancer | 1.5e-2 | Type 2 Diabetes; edema; rosiglitazone | 8.2e-4
THYM | Immune | 1.6e-2 | Pulmonary disease, Mycobacterium malmoense | 1.5e-4
UCEC | Cancer | 2.0e-4 | Dermatitis, Atopic | 1.3e-6
UVM | Cancer | 1.3e-2 | Uveitis | 3.8e-2

Association of top scoring genes of fixed-point set with cancer C

In the previous sections, the biological significance of $Z_C$ with respect to its corresponding cancer type was investigated through pathway enrichment, PPI network analysis, and disease class association, and the obtained fixed-point gene sets were found to be significantly associated with cancer progression and metastasis. However, the most significant advantage of our method is that the high-scoring genes of $Z_C$ have been reported as cancer driver genes in many studies. In ACC, the CD68 gene, the highest-scoring gene in $Z_{ACC}$, has been identified as a prognostic biomarker for adrenocortical carcinoma^[75]19. As another example, the TBX2 and TBX3 genes, with scores of 1276 and 1226 in the fixed-point gene set of BLCA, were excellent markers for predicting progression to muscle-invasive bladder cancer in patients with primary pTaG1/2 bladder cancer^[76]20. In^[77]21, it was proposed that the C6orf97 gene, the highest-scoring gene of $Z_{BRCA}$, might play important roles not only in carcinogenesis but also in the progression of breast cancer toward a more aggressive phenotype. In 2019, Yeng et al.^[78]22 indicated that ACTA1, a gene from $Z_{HNSC}$ with a score of 245, was a biomarker of head and neck squamous cell carcinoma. As another instance, silencing of ANK2, which has a score of 358 in $Z_{PAAD}$, decreased the proliferation of pancreatic tumor cells and reduced their tumorigenicity in vitro and in vivo^[79]23. The highest-scoring genes across all $Z_C$ sets are depicted in Supplementary Information Fig. [80]2, and published papers that have investigated their association with cancer are listed in Supplementary Information Table [81]2.
Random bias correction

The random bias phenomenon, as described by Venet et al. and Shimoni, suggests that many of the signatures identified in numerous analyses of cancer types might not be causal to cancer progression, despite their significant association with survival time. Consequently, random bias is a confounding property that must not be ignored^[82]12,[83]13. Venet et al. proposed that random bias is caused by the activity of proliferation genes (the meta-PCNA gene signature), which has a substantial impact on the expression data and therefore significantly influences every random set selected from the data. They hypothesized that by removing the effect of meta-PCNA genes from the expression data, the random bias in the NKI breast cancer dataset could be eliminated^[84]12. However, Shimoni demonstrated that removing the impact of meta-PCNA genes could not effectively reduce the proportion of SSAR gene sets in TCGA cancer types, so Venet’s strategy might depend on the platform or data^[85]13.

In this paper, we claim that each cancer type requires its own specific set of genes whose impact, once removed from the expression data, reduces the proportion of SSAR gene sets. To investigate this claim, we used the fixed-point gene set ($Z_C$) of each cancer type and demonstrated that removing the influence of these genes from the expression data dramatically reduces the proportion of SSAR gene sets in the vast majority of cancer types. To accomplish this, we selected the 10% highest-scoring genes of $Z_C$ and then removed their impact from the expression data in a manner similar to Venet’s approach^[86]12.
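The text does not spell out how the impact of a gene set is removed. One plausible reading, and an assumption of this sketch rather than a confirmed detail of the method, is to summarize the signature per sample (here by the median expression of its genes) and to replace each gene’s expression by the residuals of a least-squares fit on that summary. All function names are hypothetical.

```python
import statistics


def signature_score(expression, signature_genes):
    """Per-sample median expression of the signature genes.

    `expression` maps gene name -> list of values, one per sample.
    """
    n_samples = len(next(iter(expression.values())))
    return [statistics.median([expression[g][i] for g in signature_genes])
            for i in range(n_samples)]


def residualize(values, covariate):
    """Residuals of the simple least-squares fit values ~ a + b * covariate."""
    mean_x = sum(covariate) / len(covariate)
    mean_y = sum(values) / len(values)
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    if sxx == 0.0:                       # constant covariate: only center
        return [y - mean_y for y in values]
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, values))
    slope = sxy / sxx
    return [y - mean_y - slope * (x - mean_x)
            for x, y in zip(covariate, values)]


def remove_signature_effect(expression, signature_genes):
    """Expression matrix with the signature score regressed out of each gene."""
    score = signature_score(expression, signature_genes)
    return {gene: residualize(values, score)
            for gene, values in expression.items()}
```

After this adjustment every gene is linearly uncorrelated with the signature score, so any survival association that random gene sets inherited through that score should weaken. Whether the median is the right summary for a fixed-point gene set is left open here.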
The results of this analysis are given in Table [87]2, where each row is a cancer type and the columns give the proportion of significant survival-associated random gene sets in percent (SSAR%), the proportion after removing the effect of meta-PCNA (PCNA-SSAR%), and the proportion after removing the effect of the corresponding $Z_C$ ($Z_C$-SSAR%). As shown in Table [88]2, in 14 out of 17 cancer types the proportion of significant random gene sets (positive random bias) was reduced significantly more by using the selected genes from $Z_C$ than by using meta-PCNA.

Table 2. The proportion of significant survival-associated random gene sets after removing the fixed-point set and the meta-PCNA signature.

Dataset | SSAR% | PCNA-SSAR% | $Z_C$-SSAR%
--- | --- | --- | ---
ACC | 71 | 40 | 5
BLCA | 50 | 44 | 7
BRCA | 21 | 18 | 5
GBMLGG | 99 | 85 | 25
HNSC | 26 | 27 | 20
KIPAN | 64 | 26 | 27
KIRC | 82 | 68 | 26
KIRP | 57 | 17 | 5
LGG | 80 | 64 | 19
LIHC | 32 | 5 | 7
LUAD | 49 | 18 | 11
LUSC | 14 | 10 | 6
MESO | 53 | 20 | 15
PAAD | 45 | 7 | 13
THYM | 16 | 18 | 7
UCEC | 58 | 41 | 27
UVM | 51 | 45 | 7

“SSAR%” is the proportion of significant survival-associated random gene sets, “PCNA-SSAR%” is the proportion of significant survival-associated random gene sets after removing the effect of the meta-PCNA signature from their expression data, and “$Z_C$-SSAR%” is the proportion of significant survival-associated random gene sets after removing the effect of the $Z_C$ gene set from the expression data.

From another point of view, Shimoni used PhenoClust, an unsupervised clustering method, to reduce the effect of random bias in TCGA cancer types^[90]13,[91]24. In Shimoni’s approach, the samples of each cancer type are divided into sub-clusters, the association between each random gene set and the survival time of each cluster’s samples is determined, and the proportion of SSAR gene sets is calculated for each cluster.
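Returning briefly to Table 2, the 14-of-17 comparison can be re-derived from the tabulated percentages. The short sketch below transcribes the table and counts the cancer types in which the $Z_C$ column is lower than the meta-PCNA column; it compares the reported values only and says nothing about statistical significance. The 7% cutoff in `types_near_nominal` is an assumption of this sketch, chosen because it happens to match the 8 cancer types for which the abstract reports that random bias is removed.

```python
# Values transcribed from Table 2: (SSAR%, PCNA-SSAR%, Z_C-SSAR%).
TABLE_2 = {
    "ACC": (71, 40, 5), "BLCA": (50, 44, 7), "BRCA": (21, 18, 5),
    "GBMLGG": (99, 85, 25), "HNSC": (26, 27, 20), "KIPAN": (64, 26, 27),
    "KIRC": (82, 68, 26), "KIRP": (57, 17, 5), "LGG": (80, 64, 19),
    "LIHC": (32, 5, 7), "LUAD": (49, 18, 11), "LUSC": (14, 10, 6),
    "MESO": (53, 20, 15), "PAAD": (45, 7, 13), "THYM": (16, 18, 7),
    "UCEC": (58, 41, 27), "UVM": (51, 45, 7),
}


def count_where_fixed_point_lower(table):
    """Number of cancer types with Z_C-SSAR% below PCNA-SSAR%."""
    return sum(1 for _, pcna, zc in table.values() if zc < pcna)


def types_near_nominal(table, cutoff=7):
    """Cancer types whose Z_C-SSAR% is at or below an assumed cutoff."""
    return sorted(c for c, (_, _, zc) in table.items() if zc <= cutoff)


n_lower = count_where_fixed_point_lower(TABLE_2)      # 14
near_nominal = types_near_nominal(TABLE_2)            # 8 cancer types
```

The three exceptions to the comparison are KIPAN, LIHC, and PAAD, where removing meta-PCNA leaves a lower proportion than removing the $Z_C$ genes.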
Samples of all 17 cancer types with the positive random bias property were divided into 92 clusters^[92]13. In Fig. [93]4, the proportion of SSAR gene sets in each cluster is represented by colored dots, and the short vertical gray lines show the proportion of SSAR gene sets after removing the effect of the corresponding $Z_C$.

Our evaluation of the proposed method rests on the assumption that a method removes the random bias property perfectly if only 5% (a proportion of 0.05) of the random sets remain significant after it is applied. We compared the performance of our proposed method, FPGI, with that of Shimoni’s clustering method using the equation presented in Supplementary Information File [94]3, where $SSAR_{CL(i)}$ denotes the proportion of significant random gene sets in cluster $i$ of cancer $C$, and $N_C$ represents the number of clusters generated for the corresponding cancer type. To demonstrate the effectiveness of our method, we evaluated the distance from this 0.05 target of the proportion of SSAR gene sets obtained after excluding the effect of the 10% highest-scoring genes of $Z_C$ from the expression data, denoted by $D_{Z_C}$.

The results of this analysis are presented in the table in Supplementary Information File [95]3. This table lists the cancer types with the positive random bias property in the first column, $AD_C$ (the value obtained for Shimoni’s clustering) for each cancer type in the second column, and $D_{Z_C}$ in the third column. Our results demonstrate that, in 9 out of 17 cancer types, $D_{Z_C}$ is less than $AD_C$, which indicates that our proposed method outperforms Shimoni’s clustering method. In 5 of the 8 remaining cancer types, the results of the two methods are comparable^[96]12,[97]13.
Although our method shows better results only in comparison with specific clusters and not with all clusters, we believe that this comparison still provides valuable insight into the performance of our proposed method. Moreover, our proposed method offers an alternative explanation of the same problem and yields a significant set of genes for further exploration.

Figure 4. Each horizontal line represents a TCGA cancer type with the positive random bias property. Each dot along the x-axis represents the proportion of significant survival-associated random gene sets in one cluster. The short vertical gray line marks the proportion of significant survival-associated random gene sets after removing the effect of the $Z_C$ genes from the expression data.

Reproducibility

In cancer research, identifying a reliable gene set that is independent of the dataset and can accurately predict a patient’s survival outcome has become a major obstacle. Numerous articles on discovering survival-related genes in various cancer types have been published over recent decades, each proposing a gene set and asserting that the proposed set is significantly associated with cancer progression and metastasis. However, there was little overlap between the gene sets resulting from studies with different cohorts but similar analytic methods. In this paper, we introduced a set of significant survival-related genes and showed that employing them reduces the proportion of SSAR gene sets. To demonstrate that the identified sets are largely robust across different cancer cohorts, we evaluated our results on three distinct breast cancer cohorts (NKI, TCGA, and LOI)^5,15. For the NKI, TCGA, and LOI cohorts, the resulting $Z_C$ sets contained 364, 295, and 426 genes, respectively.
Each pair of these three gene sets has more than 30% of its genes in common, and 37 genes lie in the intersection of all three $Z_C$ sets, including genes critical for breast cancer such as AURKA and AURKB^25,26. In addition, the PPI network built on these 37 genes was very dense, and the pathways corresponding to these genes were significantly associated with breast cancer (see Supplementary Information Fig. 3). We conclude that the survival-associated gene sets identified by our proposed method overlap substantially across different cohorts.

Discussion

It has previously been shown, in a variety of cancers, that the expression of random gene sets is significantly related to survival time. Venet et al.^12 first discovered this phenomenon in a microarray-measured expression dataset in breast cancer. As it turns out, this pattern can be seen in nearly all of the RNAseq-derived gene expression data in the TCGA data matrix. Venet et al.^12 hypothesized that the random bias observed in gene expression data is caused by the activity of the proliferation signature, which affects a substantial portion of the human genome, and that removing the effect of this signature from the expression data removes the random bias property. However, as reported by Shimoni^13, the proliferation signature alone is insufficient to eliminate this bias in most cancers. While we agree with the general assumption of Venet et al., we believe that the specific set of genes that contributes to positive random bias varies widely between cancer types. To address this issue, we propose the existence of a fixed-point gene set for each type of cancer, which exerts a strong influence on a large number of genes in the genome and is strongly associated with survival time. This fixed-point gene set can induce survival prediction ability in randomly selected gene sets derived from expression data.
To identify these gene sets, we developed an iterative method. Its iterative nature is a key strength that enables us to identify gene sets that are not the result of chance or noise, but instead represent significant differences in gene expression between groups. This is achieved by repeatedly dividing samples into two groups based on the expression of a gene set, and using the resulting differential expression to identify a new, refined gene set. The iterative process enables us to focus on the most biologically relevant genes for a given cancer type, and to exclude genes that may be false positives or irrelevant to the disease. By applying this method to multiple cancer types, we can build a more comprehensive understanding of the underlying molecular mechanisms driving cancer development and progression. Furthermore, because our method is based on statistical significance, we can be confident that the gene sets we identify reflect genuine differences in gene expression between cancer types. This in turn gives us greater confidence in the biological relevance of the genes we identify, and increases the potential for these genes to be used as diagnostic or therapeutic targets in the future. We analyzed a wide range of cancer types, and to validate the biological relevance of the identified gene sets, we conducted protein-protein interaction (PPI) network and pathway analyses. The PPI network analysis helped us identify key biological pathways and processes involved in cancer development and progression, and how the fixed-point set is related to these pathways. We also compared our gene set with previously published cancer signatures and confirmed that it was highly correlated with known cancer pathways. In addition, we evaluated the association of our fixed-point set with the cancer disease class and related cancer types using the Genetic Association Database (GAD).
The results showed that our gene set was highly associated with the cancer disease class and related cancer types, providing further evidence of the biological relevance of the identified genes. To ensure that our findings were not dataset-specific, we reapplied our method to other independent datasets. The results showed that the identified fixed-point set of genes was consistently present across different datasets, further validating our approach and increasing confidence in our results. Overall, our study has identified a set of highly cancer-related genes using an iterative approach that eliminates false positives and ensures the biological relevance of the identified genes. The validation of our gene set through various approaches and its consistency across multiple cancer types and datasets further supports its potential as a diagnostic or therapeutic target for cancer treatment.

The FPGI method, which identifies this fixed-point gene set, is inspired by Brouwer’s fixed-point theorem. Brouwer’s fixed-point theorem states that a continuous function f that maps a compact, convex set X into itself always has a point z in X such that $f(z) = z$. In other words, X contains a fixed point that does not move under the function f. If f is additionally a contraction, that is, it reduces distances between points in X by a constant factor between 0 and 1, then the fixed point is unique and repeated application of f converges to it (the Banach fixed-point theorem). In this work, we drew inspiration from these results by considering the set of all subsets of size m of all genes to be the set X, the symmetric difference of sets to be the metric on X, and the composition of the SAM and PCA methods to be the function on X. To create this method, we first use a technique for detecting a significant relationship between gene expression data and cancer sample survival time in order to divide the group into two equal-sized subgroups with different survival dynamics.
Second, we use a method to find genes whose expression differs significantly between these two subgroups. We used PCA to estimate the association between a randomly chosen set of genes and survival time. Specifically, we calculated the median of the first principal component, and based on its values the patients were divided into two equal-sized groups (A and B). The use of PCA to estimate the association between a gene set and survival time has been previously validated in the literature, and we chose this approach because, based on Venet’s results, PCA has been shown to reveal stronger outcome associations than other methods such as k-means and hierarchical clustering^27–33. Moreover, PCA allows a more efficient identification of genes that are relevant to patient survival, as it reduces the number of variables needed to analyze the data and highlights the most important ones. In summary, we used the first principal component of each gene set as a prognostic score. In addition, there are many well-known methods for detecting differentially expressed genes between two groups. Recent studies have shown that our method of choice, SAM, is a stronger approach^2. As mentioned in the Results section, in 8 out of 17 cancer types removing the effect of the fixed-point set removed the positive random bias, bringing the proportion of significant random gene sets close to the 5% significance level, and in the other 9 datasets the proportion of SSAR gene sets was dramatically reduced. For instance, for GBMLGG it was reduced from 99% to 25% (Table 2).
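The two-step update described in this section (a median split on the first principal component, followed by a differential-expression ranking) can be sketched as a fixed-point iteration on gene sets. This is a minimal illustration under stated assumptions, not the authors' implementation: a Welch t-statistic stands in for SAM, the survival-association test used to choose starting sets is omitted, the data are synthetic, and all function names are hypothetical.

```python
import numpy as np


def first_pc_scores(block):
    """Sample scores on the first principal component of a genes x samples block."""
    centered = block - block.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    return s[0] * vt[0]


def median_split(scores):
    """Boolean mask: True for samples above the median score (group A)."""
    return scores > np.median(scores)


def welch_t(expr, in_a):
    """Per-gene Welch t-statistic between groups A and B (stand-in for SAM)."""
    a, b = expr[:, in_a], expr[:, ~in_a]
    se2 = a.var(axis=1, ddof=1) / a.shape[1] + b.var(axis=1, ddof=1) / b.shape[1]
    return (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(se2 + 1e-12)


def refine(expr, genes, m):
    """One update: split samples using `genes`, then keep the m genes
    that differ most between the two groups."""
    in_a = median_split(first_pc_scores(expr[sorted(genes)]))
    ranking = np.argsort(-np.abs(welch_t(expr, in_a)), kind="stable")
    return frozenset(int(g) for g in ranking[:m])


def iterate_to_fixed_point(expr, start, m, max_iter=50):
    """Apply `refine` until the gene set stops changing or max_iter is reached."""
    current = frozenset(start)
    for step in range(1, max_iter + 1):
        updated = refine(expr, current, m)
        if updated == current:
            return current, step
        current = updated
    return current, max_iter


# Synthetic data: 200 genes x 40 samples; the first 10 genes are shifted
# in half of the samples, which plants a strong two-group signal.
rng = np.random.default_rng(0)
n_genes, n_samples, m = 200, 40, 10
expr = rng.normal(size=(n_genes, n_samples))
expr[:m, : n_samples // 2] += 4.0

fixed, steps = iterate_to_fixed_point(expr, range(m), m)
print(sorted(fixed), steps)
```

In this toy setting the planted genes form a set that the update maps to itself. A single arbitrary starting set can also settle on a noise-driven set, which is one reason a starting set restricted to survival-associated random gene sets, as the paper describes, matters in practice.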
In general, as explained in previous sections, various studies have proposed sets of genes that are significantly associated with cancer progression and metastasis; however, the fact that many random gene sets may show a similar association undermines their specificity. We observed that our suggested sets are causal in the sense that altering their expression or activity will influence survival: when the effect of these genes’ expression is removed from the dataset, the association of random gene sets with survival is eliminated. Since the choice of association method and of the method for detecting significant genes is crucial and has a substantial impact on our results, alternative methods can be examined in future research to enhance our findings.

To assess the robustness of our method, we conducted additional experiments by running our algorithm on random gene sets of size 100 and 200 for BRCA. Our method identified nearly the same fixed-point set regardless of the size of the random gene sets. Specifically, we found a high degree of overlap between the fixed-point sets obtained for sizes 50, 100, and 200, which indicates that the method is robust to the choice of random gene set size. The details of this analysis are provided in Supplementary Information File 4, which contains a figure and three tables comparing random gene sets of size 50, 100, and 200 genes for the fixed-point analysis. The tables in Supplementary Information File 4 report the ten highest-scoring genes of $Z_{BRCA}$ identified from random gene sets of sizes 50, 100, and 200, respectively.
In addition, the figure in Supplementary Information File 4 illustrates the proportion of significant survival-associated random gene sets before and after removing $Z_C$.

Conclusion

Our study introduces a novel method that uses significant survival-associated random gene sets to identify a fixed-point gene set for cancers with positive random bias. The aim of this method is to identify a small set of genes specific to each cancer type that significantly affects survival time, referred to as the fixed-point gene set. This gene set remains stable across different random gene sets and represents a core biological process underlying cancer progression for each specific cancer type. We expect our algorithm to converge to a similar fixed-point gene set that consistently affects survival time in different random samples from the same cancer type. Our approach combines Principal Component Analysis (PCA) and Significance Analysis of Microarrays (SAM) to reduce noise and approach the fixed-point gene set in each iteration. The empirical results on multiple cancer types demonstrate that our method effectively eliminates the random bias and improves the accuracy of survival prediction in gene expression data. Our study also highlights the biological significance of the $Z_C$ genes and their association with cancer-related pathways. Using multiple studies, we show that the highest-scoring $Z_C$ genes are strongly associated with the progression and metastasis of their respective cancer types. Removing the effect of the 10% highest-scoring genes of $Z_C$ from the expression data drastically reduces the proportion of significant survival-associated random gene sets and, in some cases, eliminates the positive random bias phenomenon.
Supplementary Information

Supplementary Figures 1 to 3 (tiff), Supplementary Information 1 to 4 (pdf), and Supplementary Tables 1 and 2 (pdf). The online version contains supplementary material available at 10.1038/s41598-023-35588-5.

Author contributions

M.M. analyzed the data and wrote the paper. R.A. and C.E. supervised the research. All authors read and approved the final manuscript.

Data availability

The datasets and code can be found in the GitHub repository https://github.com/maryammagy/FPGI.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References