Abstract

   Cancer research aims to identify genes that cause or control disease
   progression. Although a wide range of gene sets have been published,
   they are usually in poor agreement with one another. Furthermore,
   recent findings from gene-expression cohorts of different cancer
   types showed that randomly chosen gene sets are significantly
   associated with survival time far more often than expected, a
   phenomenon known as positive random bias. In this study, we propose a
   method based on Brouwer’s
   fixed-point theorem that employs significantly survival-associated
   random gene sets and reveals a small fixed-point gene set for cancers
   with a positive random bias property. These sets significantly
   correspond to cancer-related pathways with biological relevance for the
   progression and metastasis of the cancer types they represent. Our
   findings show that our proposed significant gene sets are biologically
   related to each cancer type available in The Cancer Genome Atlas with
   the positive random bias property, and by using these sets, positive
   random bias is significantly more reduced in comparison with
   state-of-the-art methods in this field. The random bias property is
   removed in 8 of these 17 cancer types, and the number of random sets of
   genes associated with survival time is significantly reduced in the
   remaining 9 cancers.

   Subject terms: Computational biology and bioinformatics, Systems
   biology, Biomarkers, Diseases

Introduction

   According to the American Cancer Society, cancer is the second most
   common cause of death in the US^[28]1. In addition, cancers are
   heterogeneous diseases, with comparable diagnoses and identical
   treatment regimens resulting in vastly different outcomes for patients.
   On the other hand, early cancer diagnosis and prognosis have
   substantial effects on patients’ therapeutic targets^[29]2. This has
   prompted researchers to seek out factors that can aid in predicting the
   course of cancer disease. The findings of numerous studies that relied
   solely on clinical characteristics such as lymph node status and
   histological grade to classify clinical outcomes demonstrated that
   these characteristics were insufficient. This has led to the
   development of studies considering genomic data (e.g., gene expression)
   alongside clinical features. In general, the goal of such studies was
   to select a preferably small number of genes, known as the signature,
   and to utilize them in predicting a patient’s survival outcome using
   gene expression profiles^[30]3.

   Nonetheless, detecting a robust gene set across various datasets that
   can accurately predict a patient’s survival outcome has become a key
   challenge in cancer research. In the last two decades, numerous
   articles have been published on finding survival-related genes in
   various cancer types, each proposing a gene set that was highly
   associated with cancer progression and metastasis^[31]4–[32]10.
   Nevertheless, there was little overlap between the resulting gene sets
   from studies with different cohorts but similar analytical
   approaches^[33]11. Therefore, the lack of similarities between the
   reported gene sets in these studies indicates that the results depend
   on the cohorts being studied. As a result, identifying a robust gene
   set across multiple datasets that accurately predicts a patient’s
   outcome has become a formidable challenge in cancer research. In this
   regard, considering the cancer patient’s survival time is one of the
   most critical aspects of finding such a gene set^[34]11.

   In 2012, Venet et al.^[35]12 argued this point and conducted a study to
   estimate the association between randomly selected gene sets and breast
   cancer patient survival time in a Netherlands Cancer Institute (NKI)
   cohort. As one might expect, using the expression of random genes to
   divide samples into two distinct groups results in groups that are not
   significantly different in terms of survival time, and samples are
   assigned to each group randomly. In other words, the p-values obtained
   from statistical tests comparing survival curves of the groups
   generated by random gene sets should be distributed uniformly, with
   only 5% of the p-values falling below 0.05^[36]13. By contrast, Venet’s
   analysis revealed that in the case of breast cancer, groups generated
   by many of the random gene sets showed a statistically significant
   difference in survival time. That is to say, these random gene sets
   were significantly associated with the patient’s survival time.
   Additionally, in some cases, the random gene sets were more
   significantly associated with survival time than some of the published
   signatures^[37]12. These findings suggest that many of the signatures
   identified through breast cancer gene expression analysis may not be
   causal to cancer progression, despite being significantly associated
   with survival time^[38]13. Venet et al. justified this issue by
   pointing to the operation of the proliferation signature, which
   considerably impacts a substantial portion of the genome. They
   suggested that most random gene sets contain some genes from the
   proliferation signature and, thus, are associated with the
   proliferation signature and, indirectly, survival. They defined the
   meta-PCNA signature to determine the proliferation rate and introduced
   a method to remove this signature’s impact on the expression data. They
   concluded that removing the effects of meta-PCNA genes on the
   expression of the genes in the NKI breast cancer cohort was an
   effective way to reduce the association between random genes and
   survival.

   In 2018, Shimoni coined the term “random bias” to describe this concept
   in cancer^[39]13. Random bias is an unexpected situation in which more
   (or fewer) than 5% of random gene sets are associated with some clinical
   attribute, such as survival time, in a statistically significant way.
   Shimoni examined The Cancer Genome Atlas (TCGA) data for 34 different
   cancer types to see if there was a significant association between
   random gene sets and survival time. According to his analysis, random
   bias could be found in a wide variety of cancer types. Shimoni’s
   findings revealed that 17 out of the 34 datasets exhibited positive
   random bias, indicating that more than 5% of randomly selected gene
   sets in these cancers are significantly associated with survival time.
   Ten of the cancer types did not exhibit random bias, while seven of
   the datasets exhibited negative random bias, that is, fewer than 5%
   significant survival-associated random (SSAR) gene sets. Shimoni
   utilized Venet’s approach to eliminating the confounder effect of the
   proliferation signature from TCGA expression data to reduce the effect
   of random bias in all types of cancer. His analysis concluded that
   Venet’s methods were ineffective in removing random bias in most cancer
   types and impractical in the TCGA breast cancer cohort. To solve this
   problem, Shimoni proposed that dividing samples into small subgroups
   using an unsupervised clustering method could decrease the proportion
   of SSAR gene sets in a wide range of cancer types. Shimoni’s results
   showed that random bias was eliminated in only 65 of the 106 clusters
   generated for the cancer types that had exhibited positive or negative
   random bias. Although random bias persisted in the remaining 41
   clusters, he contended that clustering can effectively eliminate
   random bias in several TCGA cancer types. Despite producing some
   promising and
   valuable insights, the existing research has produced contradictory
   results, is still limited in scope, and faces several critical
   theoretical and analytical challenges.
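
   The definition of random bias given above can be made concrete with a
   small calculation. The sketch below is illustrative only: it assumes
   that, in the absence of random bias, each random gene set is
   independently significant with probability 0.05, and the function
   names and the 0.05 decision threshold are hypothetical choices rather
   than the procedure used in the cited studies.

```python
from math import exp, lgamma, log


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), evaluated in log space for stability."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)


def classify_random_bias(n_significant, n_sets, alpha=0.05, test_level=0.05):
    """Label a cohort as 'positive', 'negative', or 'none'.

    Assumption (not from the source): under the null, every random gene
    set is significant with probability `alpha`, independently of the rest.
    """
    # Probability of seeing at least this many significant sets by chance.
    p_too_many = sum(binomial_pmf(k, n_sets, alpha)
                     for k in range(n_significant, n_sets + 1))
    # Probability of seeing at most this many significant sets by chance.
    p_too_few = sum(binomial_pmf(k, n_sets, alpha)
                    for k in range(0, n_significant + 1))
    if p_too_many < test_level:
        return "positive"
    if p_too_few < test_level:
        return "negative"
    return "none"
```

   Random gene sets drawn from the same cohort share samples and are
   therefore not truly independent, so the binomial model should be read
   as a rough guide rather than an exact test.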

   Previous studies have shown that significant survival-associated random
   gene sets can provide valuable insights into the biology of breast
   cancer and aid in identifying biologically cancer-related genes^[40]14.
   Notably, since random bias can be observed in many cancer types, it is
   possible that SSAR gene sets may also provide informative results for
   most cancer types. Building upon these findings, we assert that each
   cancer type has a fixed-point gene set that is biologically associated
   with cancer survival time which can be identified by SSAR gene sets.
   Additionally, these fixed-point gene sets are responsible for the
   observed random bias, and by removing their effects from expression
   data, it is possible to decrease the proportion of significant
   survival-associated random gene sets. To identify these fixed-point
   gene sets, we introduce a novel iterative approach for detecting gene
   sets that are not only statistically significant but also biologically
   relevant for cancer research and clinical practice. Specifically, we
   aim to identify fixed-point gene sets for each TCGA cancer type that
   exhibit positive random bias. By applying this approach, we aim to
   eliminate positive random bias and reduce the proportion of SSAR gene
   sets in a large number of cancer studies. Moreover, we demonstrate that
   the identified gene sets are highly biologically significant and can be
   considered as signatures for their associated cancer type. This
   suggests that the proposed approach can provide valuable insights into
   the underlying biology of cancer and improve the accuracy and
   reliability of survival analyses in various cancer types. Overall, our
   approach provides a systematic and rigorous method for detecting
   biologically relevant gene sets associated with cancer survival time
   and can have important implications for cancer research and clinical
   practice.

Materials and methods

Dataset

   Loi et al.^[41]15 collected microarray expression data of 17,585 genes
   from 380 individuals with primary breast tumors. The Rdata file was
   downloaded from NCBI’s Gene Expression Omnibus (GEO) with accession
   number [42]GSE6532
   ([43]https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=gse6532).

Netherlands Cancer Institute (NKI) cohort

   The NKI, also known as the van de Vijver et al. data set^[44]5, was
   provided in Venet’s paper^[45]12. This dataset contains microarray
   expression data of 13,108 genes for 295 breast cancer patients in
   stages I or II and their clinical data.

The Cancer Genome Atlas (TCGA)

   The expression data of TCGA cancer types (17 cases) that exhibit
   positive random bias (based on Shimoni’s finding) were downloaded from
   the [46]https://portal.gdc.cancer.gov site. Following Shimoni’s
   approach, we used level 3 data normalized with the RNA-Seq by
   Expectation-Maximization (RSEM) method; each dataset contains RNA-seq
   expression data for cancer patients together with their survival time
   and clinical data.
   We used standard TCGA study abbreviations for the cancer type names (as
   defined in
   [47]https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations).

Method

   In this section, we outline our novel approach for identifying the
   fixed-point gene set for each cancer type that exhibits the positive
   random bias property. Our method is based on an iterative algorithm to
   systematically and efficiently identify the gene set of interest. We
   explain the steps involved in our approach and provide a detailed
   description of how our method works to detect biologically relevant
   gene sets associated with cancer survival time.

Fixed-point gene set identifier method (FPGI)

    1. Initialize $X_0 = \emptyset$, $j = 1$.
    2. Randomly select a set of genes, $G_j$, with size m from all genes
       (Fig. [48]1A).
    3. Use principal component analysis (PCA) on the gene expression data
       matrix of $G_j$ to divide samples into two equally sized groups (A
       and B) based on the median of the first principal component. Test
       the null hypothesis that there is no difference in survival time
       between these two groups using the log-rank test. If the p-value is
       less than 0.05, proceed to the next step. Otherwise, go back to
       step 2, increase j by one, and choose another random set
       ($G_{j+1}$) (Fig. [49]1B).
         a. Set $DEG = G_j$. Use the Significance Analysis of Microarrays
            (SAM) method to detect differentially expressed genes between
            groups A and B, and consider the first m genes (most
            significant genes) as $DEG'$ (Fig. [50]1C).
         b. Compare the sets $DEG'$ and $DEG$. If they are not the same,
            set $G_j = DEG'$ and go back to step 3.
         c. If $DEG'$ and $DEG$ are the same, set
            $X_j = X_{j-1} \cup DEG$ and go to step 4.
    4. Increase j by one and go back to step 2.
    5. Continue the whole process for $j = 1, \ldots, 6000$ and identify
       $Z_C = X_{6000}$ as the fixed-point gene set of cancer type C that
       exhibits the positive random bias property.
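
   The sample split and survival comparison in step 3 can be sketched as
   follows. This is a minimal, standard-library illustration rather than
   the authors' implementation: it assumes that one score per sample (for
   example, the first principal component computed by any PCA routine,
   not shown) is already available, and the function names are
   hypothetical.

```python
from math import erfc, sqrt


def median_split(scores):
    """Split sample indices into two halves by their score (lower, upper)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    half = len(order) // 2
    return order[:half], order[half:]


def logrank_p_value(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns the chi-square (1 df) p-value.

    `events_*` hold 1 for an observed event (death) and 0 for censoring.
    """
    samples = [(t, e, 0) for t, e in zip(times_a, events_a)]
    samples += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in samples if e})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk = [s for s in samples if s[0] >= t]
        n = len(at_risk)                                   # total at risk
        n_a = sum(1 for s in at_risk if s[2] == 0)         # at risk in group A
        d = sum(1 for s in at_risk if s[0] == t and s[1])  # events at time t
        d_a = sum(1 for s in at_risk
                  if s[0] == t and s[1] and s[2] == 0)     # events in group A
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi_square = observed_minus_expected ** 2 / variance
    # Survival function of a chi-square variable with one degree of freedom.
    return erfc(sqrt(chi_square / 2))
```

   A random gene set would then count as significantly
   survival-associated when the returned p-value for its two groups is
   below 0.05.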

Figure 1.


   The figure depicts the FPGI method for identifying a fixed-point gene
   set in a given cancer type. The method starts with $j = 0$ and an
   empty set $X_0$ and proceeds iteratively for $j = 1, \ldots, 6000$. At
   each iteration, a set of genes ($G_j$) is randomly selected from the
   expression data of all genes (GE all genes) and used to construct a
   gene expression data matrix (GE random genes) (part A). The samples
   are then divided into two groups, A and B, using principal component
   analysis (PCA), and their survival times are compared using the
   log-rank test (part B). If the p-value is less than 0.05, $DEG$ is
   set to $G_j$ and the first 50 differentially expressed genes ($DEG'$)
   between A and B are identified using the Significance Analysis of
   Microarrays (SAM) method (part C). If $DEG$ is the same as $DEG'$, it
   is added to the fixed-point set $X_j$; otherwise, the process is
   repeated with $DEG'$ as $G_j$ until convergence. In this method, j
   represents the iteration number. The final result, $X_{6000}$, is
   considered the fixed-point set of cancer type C.

Why does the FPGI method converge?

   This iterative method tries to identify gene sets associated with
   positive random bias in cancer samples. This method combines Principal
   Component Analysis (PCA), which divides samples into two groups, with
   Significance Analysis of Microarrays (SAM), which identifies
   differentially expressed genes between the two groups of samples
   resulting from PCA.
   This iterative process is repeated until a fixed set of genes is
   obtained. Utilizing the SAM method at each iteration ensures the
   convergence of our method, which helps to identify gene sets that are
   statistically significant and likely biologically relevant. In fact,
   the application of the SAM method during each iteration is the primary
   contributor to the method’s convergence. SAM is designed to identify
   gene sets with statistically significant differences in expression
   between two sample groups. Using PCA to refine the search for relevant
   gene sets is the second critical factor contributing to the convergence
   of the FPGI method. PCA is an efficient method for identifying sample
   subpopulations with potentially distinct gene expression profiles. By
   dividing the samples into two groups based on the identified gene set
   at each iteration, FPGI can identify subpopulations of samples with
   comparable gene expression profiles and narrow our search to the most
   relevant gene sets.

   In addition, the stopping criterion of this method, which requires that
   the gene set identified in each iteration be identical to the gene set
   identified in a previous iteration, ensures that the method does not
   continue to iterate forever. To investigate this, we tested our
   algorithm on different gene set sizes, including 5, 50, 100, and 200
   genes. Our results show that for all scenarios, the algorithm converges
   quickly (with a maximum of 20 iterations).

   Together, the use of SAM and PCA to identify statistically significant
   gene sets at each iteration, the refinement of the search based on
   the identified gene set, and the stopping criterion ensure that the
   FPGI method converges on a fixed set of biologically significant
   genes, making it a reliable and robust technique for identifying
   relevant gene sets in cancer samples.
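
   The iteration and its stopping criterion can be sketched as below.
   The sketch is a toy under stated assumptions, not the published
   implementation: the median of the mean expression stands in for the
   PCA split, the absolute difference in group means stands in for SAM,
   and the gene names and values are invented. Because a deterministic
   update on a finite family of gene sets must eventually repeat a set,
   the loop also stops when it revisits an earlier set.

```python
def iterate_to_fixed_point(start_set, refine, max_iterations=100):
    """Apply `refine` until the gene set stops changing.

    Returns (gene_set, n_iterations, converged). `converged` is False if
    an earlier set recurs (a cycle) or the iteration budget runs out.
    """
    current = frozenset(start_set)
    seen = {current}
    for iteration in range(1, max_iterations + 1):
        updated = frozenset(refine(current))
        if updated == current:
            return current, iteration, True
        if updated in seen:
            return updated, iteration, False
        seen.add(updated)
        current = updated
    return current, max_iterations, False


def make_toy_refine(expression, m):
    """Build a stand-in for one PCA + SAM round on a gene -> values table."""
    genes = sorted(expression)
    n_samples = len(next(iter(expression.values())))

    def refine(gene_set):
        # Stand-in for PCA: score each sample by mean expression of the set.
        scores = [sum(expression[g][s] for g in gene_set) / len(gene_set)
                  for s in range(n_samples)]
        order = sorted(range(n_samples), key=lambda s: scores[s])
        group_a, group_b = order[:n_samples // 2], order[n_samples // 2:]

        # Stand-in for SAM: rank genes by absolute difference in group means.
        def gap(gene):
            mean_a = sum(expression[gene][s] for s in group_a) / len(group_a)
            mean_b = sum(expression[gene][s] for s in group_b) / len(group_b)
            return abs(mean_a - mean_b)

        return sorted(genes, key=lambda g: (-gap(g), g))[:m]

    return refine


# Invented toy data: d1 and d2 carry a strong signal, n1..n3 are weak noise.
expression = {
    "d1": [0, 0, 0, 10, 10, 10],
    "d2": [1, 0, 1, 9, 10, 9],
    "n1": [5, 5, 5, 6, 6, 6],
    "n2": [3, 4, 3, 4, 3, 4],
    "n3": [2, 1, 2, 1, 2, 2],
}
refine = make_toy_refine(expression, m=2)
fixed_set, n_iterations, converged = iterate_to_fixed_point({"n1", "n2"}, refine)
```

   In this toy, a starting set that is only weakly correlated with the
   dominant signal is pulled onto the two strongly varying genes, which
   mirrors the behaviour described in this section.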

Scoring function

   Our scoring function, denoted as w, maps from the fixed-point gene set
   $Z_C$ to the set of natural numbers $\mathbb{N}$. Specifically, given
   a gene g, we define its score w(g) as the number of times g appears
   in the gene set $X_j$, where $X_j$ is computed in step 3 part c for
   $j = 1, \ldots, 6000$. Therefore, w(g) provides a measure of the
   significance of each gene in $Z_C$ by quantifying its frequency of
   occurrence across iterations.
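
   One way to read this definition is sketched below. The sketch
   assumes that w(g) counts the converged gene sets (the DEG sets added
   in step 3 part c) that contain g; this interpretation and the
   function name are assumptions, since the accumulated union itself
   does not record multiplicity.

```python
from collections import Counter


def score_genes(converged_sets):
    """Return w(g): the number of converged gene sets that contain gene g."""
    scores = Counter()
    for gene_set in converged_sets:
        # Each converged set contributes at most one count per gene.
        scores.update(set(gene_set))
    return scores


# Hypothetical converged sets from three successful iterations.
example_scores = score_genes([{"A", "B"}, {"B", "C"}, {"B"}])
top_gene, top_score = example_scores.most_common(1)[0]
```

   The keys of the returned counter form the fixed-point set, and
   sorting them by count gives the ranking used to pick top-scoring
   genes.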

   In the Results section, we discuss how the fixed-point set $Z_C$ is
   associated with survival time and plays a key role in the phenomenon
   of positive random bias for the vast majority of cancer types
   exhibiting this property. Although the results presented
   in this section pertain solely to BRCA, the same conclusions hold true
   for the other 16 types of cancer analyzed.

Results

Frequency of genes in fixed-point set

   To ensure comprehensive coverage of the search space and the inclusion
   of all available genes in the dataset, we generated the union of 6,000
   random gene sets, each containing 50 genes. This union resulted in a
   multiset of 300,000 genes. For the BRCA dataset, the union set
   contained all of the 18,275 genes of the dataset at least once, and
   some genes were repeated up to 34 times. Thus, our iterative method,
   which starts from a random gene set and repeats 6000 times, covers all
   genes in the dataset, giving each gene a chance to be chosen in a
   random gene set.
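
   The coverage figures above agree with a short back-of-the-envelope
   calculation. The sketch assumes that every random set is a uniform
   draw of 50 distinct genes and that the 6,000 draws are independent;
   the variable names are illustrative.

```python
from math import exp, log1p

n_genes = 18275   # genes in the BRCA dataset (from the text)
set_size = 50     # genes per random set
n_sets = 6000     # number of random sets

# Size of the multiset formed by all random sets.
total_draws = n_sets * set_size

# Expected number of times a given gene appears across all sets.
mean_count = total_draws / n_genes

# A given gene is absent from one set with probability 1 - set_size / n_genes,
# so it is absent from all sets with that probability raised to n_sets.
p_never_drawn = exp(n_sets * log1p(-set_size / n_genes))

# Expected number of genes that are never drawn.
expected_missed = n_genes * p_never_drawn
```

   Under these assumptions each gene is expected to appear about 16
   times, and the expected number of genes that never appear is far
   below one, which is consistent with every gene being covered at least
   once.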

   On the other hand, the fixed-point set of BRCA ($Z_{\text{BRCA}}$)
   consists of only 295 genes, with maximum and minimum frequencies of
   397 and 52, respectively. Compared to the original dataset, this is a
   very small set of genes. These results demonstrate that our method
   covers the entire search space, starting from all available genes in
   the dataset for each cancer type, and eventually settles on a small
   subset of the genes of the corresponding cancer type.

   The scoring values for the 50 top genes in $Z_{\text{BRCA}}$ are shown
   in Fig. [53]2. We choose this number of genes for the figure because
   we mainly report our results using a random gene set size of 50
   throughout most of the paper, even though we analyzed our method using
   random gene sets of size 50, 100, and 200. It is noteworthy that the
   scores of the top 50 genes in $Z_{\text{BRCA}}$ ranged from 346 to
   397. The frequencies of these genes in the random gene sets and in
   $Z_{\text{BRCA}}$ are plotted in blue and orange bars, respectively.
   Similar plots for other cancer types are available in Supplementary
   Information File [54]1, and they follow the same pattern.

Figure 2.


   Frequency of genes in all random gene sets versus scores of the
   fixed-point gene set. The orange bars represent the scores of the 50
   top-scoring genes in the BRCA fixed-point set. The blue bars represent
   the frequency of these genes in the random gene sets.

   Overall, we observe that the $Z_C$ set is significantly smaller than
   the original dataset. This suggests that the genes in the $Z_C$ set
   may be important for cancer progression. To investigate this claim,
   we analyze the biological relevance of $Z_C$ to cancer C in the
   following sections.

Protein-protein interaction (PPI) network and pathway enrichment analysis

   In the first step of our analysis, to determine functional
   interactions between the proteins encoded by the genes of the
   resulting $Z_C$, we used the Search Tool for Retrieval of Interacting
   Genes (STRING) database [57]https://string-db.org^[58]16. The PPI
   network was constructed using active interaction sources such as text
   mining, experiments, databases, neighborhood, gene fusion,
   co-occurrence, and co-expression, and the species was restricted to
   “Homo sapiens”. The nodes in the network represented the proteins,
   while the edges reflected the interactions. In STRING, each
   protein-protein interaction is annotated with one or more ‘scores’,
   which are indicators of confidence. Each score lies between 0 and 1,
   with 1 representing the highest level of confidence. To obtain more
   reliable findings, we used a confidence score threshold of 0.9 for
   the interactions. The $Z_{\text{BRCA}}$ gene-based network consisted
   of 113 non-isolated nodes and 810 edges. The constructed network has a
   PPI enrichment p-value less than 1.0e−16, indicating that interactions
   between genes were not random. This suggested that the
   $Z_{\text{BRCA}}$ genes interacted more frequently than would be
   predicted for a random collection of proteins with the same size and
   degree distribution (in this case, the expected number of edges is
   84). This enrichment indicated that the proteins as a group are
   biologically related. As illustrated in Fig. [59]3, the network, which
   was reduced to non-isolated nodes, contained some distinct, dense
   modules. Similar results were observed for other cancer types, which
   are available in Supplementary Information File [60]2. These results
   indicated that the genes in the $Z_C$ sets were highly associated with
   one another.

Figure 3.


   STRING protein-protein interaction analysis of the fixed-point gene set
   of the BRCA dataset. The network contained 113 nodes and 810 edges (vs.
   84 expected edges); enrichment p-value less than 1.0e−16. The figure
   was constructed using the STRING database (version 11.5;
   [63]https://string-db.org/).

   Accordingly, in the second step, Kyoto Encyclopedia of Genes and
   Genomes (KEGG) pathway enrichment analyses were conducted using The
   Database for Annotation, Visualization and Integrated Discovery (DAVID)
   ([64]https://david.ncifcrf.gov) to find corresponding significant
   pathways of $Z_C$ for each of the 17 cancer types^[65]17,[66]18.
   Supplementary Information Fig. [67]1 shows the significant pathways
   common in at least two types of cancer. In this figure, the names of
   the pathways are listed in the first column, and the pathways
   associated with each cancer type are depicted in the second column
   using different colors. As presented in Supplementary Information
   Table [68]1, most of the significantly enriched pathways of $Z_C$
   were highly associated with cancer C.
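
   The idea behind pathway over-representation analysis can be
   illustrated with a plain hypergeometric tail probability. This is a
   simplified sketch: DAVID reports a modified Fisher exact p-value (the
   EASE score), which is more conservative than the value computed here,
   and the function name and example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) if n_selected genes were drawn at random.

    n_background: genes in the background list
    n_pathway:    background genes annotated to the pathway
    n_selected:   genes in the tested set (for example, a fixed-point set)
    n_overlap:    tested genes that fall in the pathway
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(comb(n_pathway, k)
               * comb(n_background - n_pathway, n_selected - k)
               for k in range(n_overlap, upper + 1))
    # Python integers are exact, so the ratio is only rounded once.
    return tail / total


# Toy numbers: 10 background genes, 4 in the pathway, 3 selected, 2 overlap.
toy_p = hypergeom_enrichment_p(10, 4, 3, 2)
```

   A small p-value indicates that the tested set contains more pathway
   genes than a random set of the same size would be expected to
   contain.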

Association of fixed-point set with disease

   The Genetic Association Database (GAD) tool on the DAVID Functional
   Annotation server ([69]https://david.ncifcrf.gov) was utilized to
   investigate the association between $Z_C$ genes and disease. GAD is a
   database of published genetic association studies that enables the
   investigation of complex common human genetic
   diseases^[70]17,[71]18. Table [72]1 lists the enriched disease and
   the disease class for each of the 17 cancer types. The table
   suggested that the genes in $Z_C$ were associated with cancer C. For
   instance, in the case of BRCA, the top-level disease class and disease
   assigned by GAD were cancer and breast cancer, with p-values of 2.8e−4
   and 6.5e−8, respectively. As noted by Ansar et al.^[73]14, these
   findings indicated how our method can detect meaningful information in
   SSAR gene sets.

Table 1.

   Enriched disease and disease class obtained from fixed-point sets using
   the Genetic Association Database (GAD).

   Dataset  GAD disease class  Class p-value  GAD disease                                           p-value
   ACC      Cancer             2.2e−3         Plasma HDL cholesterol (HDL-C) levels                 9.9e−6
   BLCA     Cancer             9.2e−6         Urinary bladder neoplasms                             1.8e−2
   BRCA     Cancer             2.8e−4         Breast cancer                                         6.5e−8
   GBMLGG   Cancer             2.8e−4         Schizophrenia                                         2.0e−6
   HNSC     Cardiovascular     1.6e−5         Cardiomyopathy, Dilated|DCM - Dilated cardiomyopathy  3.7e−6
   KIPAN    Cancer             7.5e−4         Type 2 Diabetes| edema | rosiglitazone                7.5e−5
   KIRC     Cancer             2.5e−6         Chronic renal failure|Kidney failure, Chronic         1.9e−3
   KIRP     Cancer             3.1e−5         Type 2 Diabetes| edema | rosiglitazone                9.6e−6
   LGG      Pharmacogenomic    2.3e−5         Several psychiatric disorders                         2.0e−6
   LIHC     Cancer             9.0e−5         Liver cancer                                          6.1e−2
   LUAD     Cancer             2.4e−17        Lung cancer                                           2.0e−6
   LUSC     Cancer             9.3e−4         Lung Diseases|Resp distress syndrome neonatal         2.1e−6
   MESO     Cancer             1.1e−3         Lung cancer                                           2.5e−2
   PAAD     Cancer             1.5e−2         Type 2 Diabetes| edema | rosiglitazone                8.2e−4
   THYM     Immune             1.6e−2         Pulmonary disease, Mycobacterium malmoense            1.5e−4
   UCEC     Cancer             2.0e−4         Dermatitis, Atopic                                    1.3e−6
   UVM      Cancer             1.3e−2         Uveitis                                               3.8e−2

Association of top scoring genes of fixed-point set with cancer C

   Through pathway enrichment, PPI network analysis, and disease class
   association, the biological significance of $Z_C$ with respect to its
   corresponding cancer type was investigated in the previous sections,
   and it was determined that the obtained fixed-point gene sets were
   significantly associated with cancer progression and metastasis.
   Moreover, the most significant advantage of our method was that the
   high-scoring genes of $Z_C$ had been shown in many studies to be
   cancer driver genes. In ACC, the CD68 gene, the highest-scoring gene
   in $Z_{\text{ACC}}$, has been identified as a prognostic biomarker for
   adrenocortical carcinoma^[75]19. As another example, the TBX2 and TBX3
   genes, with scores of 1276 and 1226 in the fixed-point gene set of
   BLCA, were excellent markers for predicting progression to
   muscle-invasive bladder cancer in patients with primary pTaG1/2
   bladder cancer^[76]20. In^[77]21, it has been proposed that the
   C6orf97 gene, the highest-scoring gene of $Z_{\text{BRCA}}$, might
   play important roles not only in carcinogenesis but also in the
   progression of breast cancer patients toward a more aggressive
   phenotype. In 2019, Yeng et al.^[78]22 indicated that ACTA1, a gene
   from $Z_{\text{HNSC}}$ with a score of 245, was a biomarker of head
   and neck squamous cell carcinoma. As another instance, silencing of
   ANK2, with a score of 358 in $Z_{\text{PAAD}}$, decreased the
   proliferation of pancreatic tumor cells and reduced their
   tumorigenicity in vitro and in vivo^[79]23. The highest-scoring genes
   across all $Z_C$ sets are depicted in Supplementary Information Fig.
   [80]2, and published papers that have investigated their association
   with cancer are available in Supplementary Information Table [81]2.

Random bias correction

   The random bias phenomenon, as described by Venet et al. and Shimoni,
   suggests that many of the signatures identified in analyses of cancer
   types might not be causal for cancer progression, despite their
   significant association with survival time. Consequently, random bias
   is a confounding property that must not be ignored^[82]12,[83]13.
   Venet et al. proposed that random bias is caused by the activity of
   proliferation genes (the meta-PCNA gene signature), which has a
   substantial impact on the expression data and therefore influences
   every random gene set selected from the data. They hypothesized that
   removing the effect of the meta-PCNA genes from the expression data
   could eliminate the random bias in the NKI breast cancer
   dataset^[84]12. However, Shimoni demonstrated that removing the impact
   of the meta-PCNA genes did not effectively reduce the proportion of
   SSAR gene sets in TCGA cancer types, so Venet’s strategy might depend
   on the platform or the data^[85]13. In this paper, we claim that each
   cancer type requires its own specific set of genes whose effect must
   be removed from the expression data to reduce the proportion of SSAR
   gene sets. To investigate this claim, we used the fixed-point gene
   set (
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   ) of each cancer type and demonstrated that removing the influence of
   these genes from expression data could dramatically reduce the
   proportion of SSAR gene sets in the vast majority of cancer types. To
   accomplish this, we selected 10% of the highest scoring genes of
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   and then removed their impact from the expression data, following
   Venet’s approach^[86]12. The results of this analysis are shown in
   Table [87]2, where each row corresponds to a cancer type and the
   columns give, in order, the percentage of significant random gene sets
   in the original data (SSAR%), the percentage after removing the effect
   of meta-PCNA (PCNA-SSAR%), and the percentage after removing the effect
   of the corresponding
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   (
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   -SSAR%). As shown in Table [88]2, in 14 out of 17 cancer types the
   proportion of SSAR gene sets (positive random bias) was reduced
   further by removing the selected genes of
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   than by removing meta-PCNA.
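
   The text above says only that the effect of the selected genes was
   removed "similar to Venet’s approach", without spelling out the
   procedure. The sketch below shows one common way to do this, under
   stated assumptions: the signature is summarized per sample by the
   median expression of its genes, and every gene is replaced by the
   residual of a linear regression on that score. The function name, the
   simulated data, and the choice of the median are illustrative
   assumptions, not the authors' implementation.

```python
import numpy as np

def remove_signature_effect(expr, signature_idx):
    """Regress a signature score out of every gene and return the residuals.

    expr: array of shape (n_genes, n_samples).
    signature_idx: row indices of the signature genes (hypothetical input,
    e.g. the top 10% of a fixed-point gene set by score).
    """
    expr = np.asarray(expr, dtype=float)
    # Per-sample summary score: median expression of the signature genes
    # (an assumption; other summaries are possible).
    score = np.median(expr[signature_idx, :], axis=0)
    score_c = score - score.mean()
    expr_c = expr - expr.mean(axis=1, keepdims=True)
    # Least-squares slope of each gene on the centered score.
    slopes = expr_c @ score_c / (score_c @ score_c)
    # Residuals: the part of each gene not linearly explained by the score.
    return expr_c - np.outer(slopes, score_c)

# Simulated data only: 200 genes, 40 samples, one shared latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=40)
loadings = rng.normal(size=(200, 1))
expr = loadings * latent + rng.normal(size=(200, 40))

top_genes = np.arange(20)  # stand-in for the selected signature genes
adjusted = remove_signature_effect(expr, top_genes)
```

   After this adjustment, every gene is linearly uncorrelated with the
   signature score, so any survival association that random gene sets
   inherit from that score is removed by construction.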

Table 2.

   The proportion of significant survival associated random gene sets
   after removing the fixed-point set and meta-PCNA signature.
   Dataset  SSAR%  PCNA-SSAR%  Z_C-SSAR%
   ACC      71     40          5
   BLCA     50     44          7
   BRCA     21     18          5
   GBMLGG   99     85          25
   HNSC     26     27          20
   KIPAN    64     26          27
   KIRC     82     68          26
   KIRP     57     17          5
   LGG      80     64          19
   LIHC     32     5           7
   LUAD     49     18          11
   LUSC     14     10          6
   MESO     53     20          15
   PAAD     45     7           13
   THYM     16     18          7
   UCEC     58     41          27
   UVM      51     45          7

   “SSAR%” is the proportion of significant survival-associated random
   gene sets, “PCNA-SSAR%” is the proportion of significant
   survival-associated random gene sets after removing the effect of the
   meta-PCNA signature from the expression data, and “
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   -SSAR%” is the proportion of significant survival-associated random
   gene sets after removing the effect of the
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   gene set from the expression data.
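
   As a quick arithmetic check of the "14 out of 17" statement, the
   values of Table 2 can be compared column by column. The snippet below
   only re-reads the numbers printed in the table; the variable names are
   illustrative.

```python
# (SSAR%, PCNA-SSAR%, Z_C-SSAR%) copied from Table 2.
table2 = {
    "ACC": (71, 40, 5),     "BLCA": (50, 44, 7),   "BRCA": (21, 18, 5),
    "GBMLGG": (99, 85, 25), "HNSC": (26, 27, 20),  "KIPAN": (64, 26, 27),
    "KIRC": (82, 68, 26),   "KIRP": (57, 17, 5),   "LGG": (80, 64, 19),
    "LIHC": (32, 5, 7),     "LUAD": (49, 18, 11),  "LUSC": (14, 10, 6),
    "MESO": (53, 20, 15),   "PAAD": (45, 7, 13),   "THYM": (16, 18, 7),
    "UCEC": (58, 41, 27),   "UVM": (51, 45, 7),
}

# Cancer types where removing the fixed-point genes leaves fewer
# significant random gene sets than removing meta-PCNA.
better_with_zc = [c for c, (_, pcna, zc) in table2.items() if zc < pcna]
exceptions = sorted(set(table2) - set(better_with_zc))

print(len(better_with_zc), "of", len(table2))  # prints: 14 of 17
print(exceptions)                              # prints: ['KIPAN', 'LIHC', 'PAAD']
```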

   From another point of view, Shimoni used PhenoClust, an unsupervised
   clustering method, to reduce the effect of random bias in TCGA cancer
   types^[90]13,[91]24. In Shimoni’s approach, the samples of each cancer
   type were divided into sub-clusters, the association between each
   random gene set and the survival time of each cluster’s samples was
   determined, and the proportion of SSAR gene sets was calculated for
   each cluster. The samples of all 17 cancer types with the positive
   random bias property were divided into 92 clusters^[92]13. In
   Fig. [93]4, the proportion of SSAR gene sets in each cluster is
   represented by a colored dot, and the short vertical gray lines show
   the proportion of SSAR gene sets after removing the effect of the
   corresponding
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   . We evaluated the proposed method under the assumption that a method
   removes the random bias property perfectly if only a fraction of 0.05
   of the random sets remains significant afterwards. We compared the
   performance of our proposed method, FPGI, with that of Shimoni’s
   clustering method using the equation presented in Supplementary
   Information File [94]3, where
   [MATH: <mrow><mi>S</mi><mi>S</mi><mi>A</mi><msub><mi>R</mi><mtext>CL</mtext></msub><mrow><mo stretchy="false">(</mo><mi>i</mi><mo stretchy="false">)</mo></mrow></mrow> :MATH]
   denotes the proportion of significant random gene sets in cluster i of
   cancer C, and
   [MATH: <msub><mi>N</mi><mtext>C</mtext></msub> :MATH]
   represents the number of clusters generated for the corresponding
   cancer type. To demonstrate the effectiveness of our method, we
   evaluated the distance between this 0.05 target and the proportion of
   SSAR gene sets after excluding the effect of 10% of the highest
   scoring genes of
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   from the expression data, denoted by
   [MATH: <mrow><mi>D</mi><msub><mi>Z</mi><mtext>C</mtext></msub></mrow>
   :MATH]
   . The results of this analysis are presented in the table of
   Supplementary Information File [95]3. This table lists the cancer
   types with the positive random bias property in the first column,
   [MATH: <mrow><mi>A</mi><msub><mi>D</mi><mtext>C</mtext></msub></mrow>
   :MATH]
   for each cancer type in the second column, and
   [MATH: <mrow><mi>D</mi><msub><mi>Z</mi><mtext>C</mtext></msub></mrow>
   :MATH]
   in the third column. Our results demonstrate that, in 9 out of 17
   cancer types,
   [MATH: <mrow><mi>D</mi><msub><mi>Z</mi><mtext>C</mtext></msub></mrow>
   :MATH]
   is less than
   [MATH: <mrow><mi>A</mi><msub><mi>D</mi><mtext>C</mtext></msub></mrow>
   :MATH]
   , which indicates that our proposed method outperforms Shimoni’s
   clustering method in these cancer types. In 5 out of the 8 remaining
   cancer types, the results of the two methods are
   comparable^[96]12,[97]13. While our method shows better results than
   specific clusters rather than all clusters, we believe that this
   comparison still provides valuable insight into the performance of
   our proposed method. Moreover, our proposed method offers an
   alternative explanation for the same problem and provides a
   significant set of genes for further exploration.
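
   The comparison equation itself is given only in Supplementary
   Information File 3 and is not reproduced here. The sketch below
   therefore rests on an assumption: that the cluster-based quantity is
   the mean absolute distance of the per-cluster SSAR proportions from
   the 0.05 target, and that the fixed-point quantity is the absolute
   distance of the single post-removal proportion from the same target.
   The function names and the input values are made up for illustration
   and are not results from the paper.

```python
TARGET = 0.05  # fraction of random sets expected to be significant by chance

def average_cluster_distance(cluster_ssar, target=TARGET):
    """Assumed form of the cluster-based distance: mean |SSAR_CL(i) - target|."""
    return sum(abs(p - target) for p in cluster_ssar) / len(cluster_ssar)

def fixed_point_distance(zc_ssar, target=TARGET):
    """Assumed form of the fixed-point distance: |Z_C-SSAR - target|."""
    return abs(zc_ssar - target)

# Hypothetical proportions for one cancer type with four clusters.
clusters = [0.10, 0.30, 0.05, 0.20]
ad_c = average_cluster_distance(clusters)
dz_c = fixed_point_distance(0.07)

# Under these assumptions, the smaller distance indicates the method
# that comes closer to the chance level.
fpgi_closer_to_target = dz_c < ad_c
```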

Figure 4.


   Each horizontal line represents a TCGA cancer type with positive random
   bias property. Each dot along the x-axis represents the proportion of
   significant survival-associated random gene sets in one cluster. The
   short vertical gray line marks the proportion of significant
   survival-associated random gene sets after removing the effect of
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   genes from expression data.

Reproducibility

   In cancer research, identifying a reliable gene set that is
   independent of the dataset and can accurately predict a patient’s
   survival outcome has become a major challenge. Numerous articles on
   discovering survival-related genes in various cancer types have been
   published over recent decades, each proposing a gene set that the
   authors assert is significantly associated with cancer progression
   and metastasis. However, there is little overlap between the gene sets
   resulting from studies with different cohorts but similar analytic
   methods. In this paper, we introduced a set of significant
   survival-related genes and showed that the proportion of SSAR gene
   sets can be reduced by employing them. To examine whether the
   resulting sets are robust across different cancer cohorts, we
   evaluated our method on three distinct breast cancer cohorts (NKI,
   TCGA, and LOI)^[99]5,[100]15. For
   the NKI, TCGA, and LOI cohorts, the resulting
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   sets contained 364, 295, and 426 genes, respectively. Each pair of
   these three gene sets shares more than 30% of its genes, and 37 genes
   lie in the intersection of all three
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   sets, including genes critical for breast cancer such as AURKA and
   AURKB^[101]25,[102]26. In addition, the PPI network built on these 37
   genes was very dense, and the pathways corresponding to these genes
   were significantly associated with breast cancer (see Supplementary
   Information Fig. [103]3). We conclude that the gene sets identified by
   our proposed method show substantial overlap across different
   cohorts.
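
   The overlap figures above can be computed with plain set operations.
   The sketch below uses tiny toy sets rather than the actual 364, 295,
   and 426 gene sets, and it assumes that overlap is measured relative to
   the smaller set of each pair, which the text does not specify. Apart
   from AURKA and AURKB, the gene names are placeholders.

```python
from itertools import combinations

def overlap_fraction(a, b):
    """Shared genes as a fraction of the smaller set (an assumed definition)."""
    return len(a & b) / min(len(a), len(b))

# Toy stand-ins for the per-cohort fixed-point gene sets.
cohort_sets = {
    "NKI": {"AURKA", "AURKB", "G1", "G2"},
    "TCGA": {"AURKA", "AURKB", "G2", "G3"},
    "LOI": {"AURKA", "AURKB", "G1", "G3"},
}

# Overlap for every pair of cohorts.
pairwise = {
    (x, y): overlap_fraction(cohort_sets[x], cohort_sets[y])
    for x, y in combinations(cohort_sets, 2)
}

# Genes present in all cohorts.
core = set.intersection(*cohort_sets.values())
```

   With real gene sets, the same two quantities correspond to the
   pairwise overlap percentages and the common core reported above.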

Discussion

   It has previously been shown that gene expression data from random gene
   sets have a significant relation with cancer survival time in a variety
   of cancers. Venet et al.^[104]12 first discovered this phenomenon in a
   microarray-measured expression dataset in breast cancer. As it turns
   out, this pattern can be seen in nearly all of the RNA-seq-derived gene
   expression data in TCGA. Venet et al.^[105]12 hypothesized that the
   random bias observed in gene expression data is caused by the activity
   of the proliferation signature, which affects a substantial portion of
   the human genome, and that removing the effect of this signature from
   the expression data removes the random bias property. However, as
   reported by Shimoni^[106]13, the proliferation signature alone is
   insufficient to eliminate this bias in most cancers. While we agree
   with the general assumption of Venet et al., we believe that the
   specific set of genes that contributes to positive random bias varies
   widely between cancer types.

   To address this issue, we propose the existence of a fixed-point gene
   set for each type of cancer, which exerts a strong influence on a large
   number of genes in the genome and is strongly associated with survival
   time. This fixed-point gene set can induce survival prediction ability
   in randomly selected gene sets derived from expression data. To
   identify these gene sets, we developed an innovative iterative
   method.

   The iterative nature of our method is a key strength that enables us to
   identify gene sets that are not the result of chance or noise, but
   instead represent significant differences in gene expression between
   groups. This is achieved by repeatedly dividing samples into two groups
   based on the differential expression of a gene set, and using this
   information to identify a new, refined gene set. The iterative process
   enables us to focus on the most biologically relevant genes for a given
   cancer type, and exclude genes that may be false positives or
   irrelevant to the disease.

   By applying this method to multiple cancer types, we can build a more
   comprehensive understanding of the underlying molecular mechanisms
   driving cancer development and progression. Furthermore, because our
   method is based on statistical significance, we can be confident that
   the gene sets we identify reflect genuine differences in gene
   expression between cancer types. This in turn gives us greater
   confidence in the biological relevance of the genes we identify, and
   increases the potential for these genes to be used as diagnostic or
   therapeutic targets in the future.

   In our study, we have analyzed a wide range of cancer types, and in
   order to validate the biological relevance of the identified gene set,
   we conducted protein-protein interaction (PPI) network and pathway
   analyses. The PPI network analysis helped us identify key biological
   pathways and processes involved in cancer development and progression
   and how the fixed-point set was related to these pathways. We also
   compared our gene set with previously published cancer signatures and
   confirmed that our identified gene set was highly correlated with the
   known cancer pathways.

   In addition, we evaluated the association of our fixed-point set with
   cancer disease class and related cancer types using the Genetic
   Association Database (GAD). The results showed that our gene set was
   highly associated with cancer disease class and related cancer types,
   providing further evidence of the biological relevance of our
   identified genes.

   To ensure that our findings were not dataset-specific, we reapplied
   our method to other independent datasets. The results showed that our
   identified fixed-point set of genes was consistently present across
   different datasets, further validating our approach and increasing the
   confidence in our results.

   Overall, our study has identified a set of highly cancer-related genes
   using an iterative approach that eliminates false positives and ensures
   the biological relevance of the identified genes. The validation of our
   gene set through various approaches and its consistency across multiple
   cancer types and datasets further supports its potential as a
   diagnostic or therapeutic target for cancer treatment.

   The FPGI method, which we use to identify this fixed-point gene set,
   is inspired by Brouwer’s fixed-point theorem. Brouwer’s fixed-point
   theorem states that every continuous function f that maps a compact,
   convex set X into itself has a point z in X such that
   [MATH: <mrow><mi>f</mi><mo stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo><mo>=</mo><mi>z</mi></mrow> :MATH]
   . In other words, X contains a fixed point that does not move as a
   result of the function f. If f is additionally a contraction, that is,
   it reduces distances between points in X by a constant factor between
   0 and 1, the Banach fixed-point theorem further guarantees that the
   fixed point is unique and that repeated application of f converges to
   it. In this work, we drew inspiration from these results by letting
   the set of all subsets of size m of the genes play the role of X, the
   size of the symmetric difference of two sets play the role of the
   metric on X, and the composition of the SAM and PCA methods play the
   role of the continuous function on X.
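
   The analogy can be made concrete with a generic fixed-point iteration
   over sets. The sketch below is only an illustration: the helper names
   are hypothetical, and the toy map stands in for the SAM and PCA
   composition. Because a family of finite subsets is not a convex set,
   neither theorem guarantees convergence here, so the loop is capped and
   stops when the symmetric difference between successive sets is empty.

```python
def symmetric_difference_size(a, b):
    """Number of elements that belong to exactly one of the two sets."""
    return len(a ^ b)

def iterate_to_fixed_point(f, start, max_iter=100):
    """Apply f repeatedly until the set stops changing or max_iter is hit."""
    current = frozenset(start)
    for step in range(1, max_iter + 1):
        nxt = frozenset(f(current))
        if symmetric_difference_size(current, nxt) == 0:
            return current, step
        current = nxt
    raise RuntimeError("no fixed point reached within max_iter")

def toy_refine(genes, m=3):
    """Toy stand-in for one refinement step: keep the m smallest ids among
    the current set and each element's predecessor."""
    candidates = set(genes) | {g - 1 for g in genes if g > 0}
    return sorted(candidates)[:m]

fixed_set, steps = iterate_to_fixed_point(toy_refine, {5, 7, 9})
```

   In this toy example the iteration drifts toward the smallest ids and
   then stops changing, which mirrors the idea of a gene set that is
   mapped onto itself by the refinement step.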

   To create this method, we first use a technique for detecting a
   significant relationship between gene expression data and cancer sample
   survival time in order to divide the group into two equal-sized
   subgroups with different survival dynamics. Second, we use a method to
   find genes whose expression significantly differs between these two
   subgroups.

   We used PCA to estimate the association between a randomly chosen set
   of genes and survival time. Specifically, we calculated the first
   principal component of the gene set and divided the patients at its
   median into two equal-sized groups (A and B). The use of the first
   principal component as a prognostic score has been previously
   validated in the literature, and we chose this approach because,
   based on Venet’s results, PCA reveals stronger outcome associations
   than other methods such as k-means and hierarchical
   clustering^[107]27–[108]33. Moreover, PCA allows a more efficient
   identification of genes that are relevant to patient survival, as it
   reduces the number of variables needed to analyze the data and
   highlights the most important ones. In addition, there are many
   well-known methods to detect differentially expressed genes between
   two groups. Recent studies showed that our method of choice, SAM, is
   a stronger approach^[109]2.
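
   The two steps described above can be sketched as a single refinement
   step. This is a simplified illustration on simulated data, not the
   authors' code: the first principal component is obtained from an SVD,
   samples are split at its median, and genes are ranked by a SAM-style
   statistic with a fixed offset s0. In the real SAM procedure s0 is
   estimated from the data and significance is assessed by permutations;
   both are omitted here. The function names, the value of s0, and the
   rule of keeping the top m genes are all assumptions.

```python
import numpy as np

def pca_median_split(expr, gene_idx):
    """Split samples at the median of the first principal component.

    expr has shape (n_genes, n_samples); returns a boolean mask (True = group A).
    """
    sub = expr[gene_idx, :]
    centered = sub - sub.mean(axis=1, keepdims=True)
    # Rows of vt hold per-sample coordinates along each component, up to scale.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pc1 = vt[0]
    return pc1 > np.median(pc1)

def sam_like_statistic(expr, in_a, s0=0.1):
    """Per-gene d = (mean_A - mean_B) / (s + s0), with a pooled standard error s."""
    a, b = expr[:, in_a], expr[:, ~in_a]
    n_a, n_b = a.shape[1], b.shape[1]
    diff = a.mean(axis=1) - b.mean(axis=1)
    ss = ((a - a.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) \
        + ((b - b.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    s = np.sqrt((1.0 / n_a + 1.0 / n_b) * ss / (n_a + n_b - 2))
    return diff / (s + s0)

# Simulated expression: 300 genes, 40 samples, the first 30 genes share a factor.
rng = np.random.default_rng(1)
n_genes, n_samples, m = 300, 40, 50
expr = rng.normal(size=(n_genes, n_samples))
expr[:30] += 2.0 * rng.normal(size=n_samples)

random_set = rng.choice(n_genes, size=m, replace=False)  # random starting gene set
in_a = pca_median_split(expr, random_set)                # groups A and B
d = sam_like_statistic(expr, in_a)                       # differential expression
refined_set = np.argsort(-np.abs(d))[:m]                 # assumed rule: keep top m genes
```

   Feeding `refined_set` back into `pca_median_split` and repeating gives
   an iteration of the kind described in this section.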

   As mentioned in the Results section, in 8 out of 17 cancer types,
   removing the effect of the fixed-point gene set removed the positive
   random bias, reducing the proportion of significant random gene sets
   to approximately the 5% significance level, and in the other 9 datasets
   the proportion of SSAR gene sets was dramatically reduced. For
   instance, for GBMLGG it was reduced from 99% to 25% (Table 2). In
   general, as explained in previous sections, various studies have
   proposed sets of genes that are significantly associated with cancer
   progression and metastasis; however, the fact that many random gene
   sets may show a similar association undermines their specificity. Our
   results suggest that our proposed sets are causal in the sense that,
   once the effect of these genes’ expression is removed from a dataset,
   the association of such random sets with survival is eliminated or
   strongly reduced. Since the choice of the association method and of
   the method for detecting significant genes is crucial and has a
   substantial impact on our results, alternative methods could be
   examined in future research to improve our findings.

   To assess the robustness of our method, we conducted additional
   experiments by running our algorithm on random gene sets of size 100
   and 200 for BRCA. We found a high degree of overlap between the
   fixed-point sets obtained for sizes 50, 100, and 200, which indicates
   that our method identifies nearly the same fixed-point gene set
   regardless of the random gene set size. The details of this analysis
   are provided in Supplementary Information File [110]4, which contains
   a figure and three tables comparing random gene sets of 50, 100, and
   200 genes for the fixed-point analysis. The tables in Supplementary
   Information File [111]4 report the top ten highest-scoring genes of
   [MATH: <msub><mi>Z</mi><mrow><mi
   mathvariant="italic">BRCA</mi></mrow></msub> :MATH]
   identified by random gene sets of sizes 50, 100, and 200,
   respectively. In addition, the figure in Supplementary Information File
   [112]4 illustrates the proportion of significant survival-associated
   random gene sets before and after removing
   [MATH: <msub><mi>Z</mi><mi>C</mi></msub> :MATH]
   .

Conclusion

   Our study introduces a novel method that utilizes significant
   survival-associated random gene sets to identify a fixed-point gene set
   for cancers with positive random bias. The aim of this method is to
   identify a small set of genes specific to each cancer type that
   significantly affects survival time, referred to as the fixed-point
   gene set. This gene set remains stable across different random gene
   sets and serves as a core biological process underlying cancer
   progression for each specific cancer type. We expect our algorithm to
   converge to a similar fixed-point gene set that consistently affects
   survival time in different random samples from the same cancer type.
   Our approach combines Principal Component Analysis (PCA) and
   Significance Analysis of Microarrays (SAM) methods to reduce noise and
   approach the fixed-point gene set in each iteration. The empirical
   results on multiple cancer types demonstrate that our method
   substantially reduces, and in some cancer types eliminates, the random
   bias and improves the accuracy of survival prediction in gene
   expression data.

   Our study also highlights the biological significance of the
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   genes and their association with cancer-related pathways. By using
   multiple studies, we show that the highest-scoring
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   genes are strongly associated with the progression and metastasis of
   their respective cancer types. Removing the effect of 10% of the
   highest-scoring genes of
   [MATH: <msub><mi>Z</mi><mtext>C</mtext></msub> :MATH]
   from the expression data drastically reduces the proportion of random
   significant survival-associated gene sets and, in some cases,
   eliminates the positive random bias phenomenon.

Supplementary Information

   [113]Supplementary Figure 1.^ (15.4MB, tiff)
   [114]Supplementary Figure 2.^ (7.7MB, tiff)
   [115]Supplementary Figure 3.^ (14.6MB, tiff)
   [116]Supplementary Information 1.^ (2.5MB, pdf)
   [117]Supplementary Information 2.^ (14.3MB, pdf)
   [118]Supplementary Information 3.^ (93.2KB, pdf)
   [119]Supplementary Information 4.^ (308.9KB, pdf)
   [120]Supplementary Table 1.^ (83.6KB, pdf)
   [121]Supplementary Table 2.^ (87.1KB, pdf)

Author contributions

   M.M. analyzed data and wrote the paper. R.A. and C.E. supervised the
   research. All authors read and approved the final manuscript.

Data availability

   The datasets and codes can be found in the GitHub repository
   [122]https://github.com/maryammagy/FPGI.

Competing interests

   The authors declare no competing interests.

Footnotes

   Publisher's note

   Springer Nature remains neutral with regard to jurisdictional claims in
   published maps and institutional affiliations.

Supplementary Information

   The online version contains supplementary material available at
   10.1038/s41598-023-35588-5.

References