Abstract A novel method is developed for predicting the stage of a cancer tissue based on the consistency level between the co-expression patterns in the given sample and samples in a specific stage. The basis for the prediction method is that cancer samples of the same stage share common functionalities as reflected by the co-expression patterns, which are distinct from samples in the other stages. Test results reveal that our prediction results are as good or potentially better than manually annotated stages by cancer pathologists. This new co-expression-based capability enables us to study how functionalities of cancer samples change as they evolve from early to the advanced stage. New and exciting results are discovered through such functional analyses, which offer new insights about what functions tend to be lost at what stage compared to the control tissues and similarly what new functions emerge as a cancer advances. To the best of our knowledge, this new capability represents the first computational method for accurately staging a cancer sample. The R source code used in this study is available at GitHub ([32]https://github.com/yxchspring/CECS). Subject terms: Cancer genomics, Cancer models Introduction We present a computational approach to stage accurately cancer tissues based on their RNA-seq data. The stage of a cancer is a key parameter for clinically characterizing the cancer. As a cancer advances, the disease generally evolves from a localized issue to a whole-body problem^[33]1–[34]3, not just in term of whether a cancer is metastasized or not, as cancer tends to persistently release certain molecules such as protons, cytokines and polyamines^[35]4–[36]6 as well as “consume” certain molecules like sodium and iron, leading to substantial alterations of their blood concentrations over time. For some molecular species, such changes will trigger highly damaging responses by different organs throughout the body. Cachexia, i.e., loss of muscle cells throughout the body, is one consequence of such responses towards the advanced stage of a cancer^[37]7–[38]10 Intracellularly, considerable changes take place in metabolisms as a cancer evolves, giving rise to gradual and extensive metabolic reprogramming in cancer^[39]11–[40]14. Hence, cancers detected at different stages require distinct treatment plans. Therefore, accurate staging of a cancer is vitally important to the cancer patient and his/her physician. Somewhat surprisingly, the clinical practice of cancer staging has not changed much in the past 40 years^[41]15–[42]17 as it is still done predominantly based on the morphology and the size of a cancer tissue, examined manually by cancer pathologists under microscope, assisted by limited protein biomarkers. One would intuitively expect that cancer staging nowadays should have been done in a more objective manner based on molecular data, knowing that cancer tissue omic data, particularly gene-expression data are easily obtainable in a financially viable manner. However, the reality is: while gene-expression data represent the easiest to get and the most informative omic data for studying cancer tissues, they have not been widely used for cancer staging outside of laboratory studies^[43]18–[44]20. Published work is mostly on transcriptomic biomarkers for cancer prognostic prediction^[45]21–[46]27 rather than cancer staging. A key challenge in achieving this goal comes from the reality that scientists have yet to identify genes whose (differential) expression patterns in cancer vs. controls are specifically associated with individual stages of a cancer type, and hence can be used for cancer-stage prediction. Our own analyses have discovered that co-expression patterns are considerably more informative than differential expressions of individual genes for cancer staging. Here we present a co-expression based cancer staging method. To the best of our knowledge, there are no published studies that predict cancer stages using co-expression patterns of cancer tissues. A technical challenge in applying co-expression data for cancer staging is: how to derive co-expression information of genes in individual tissue samples since it generally requires multiple samples to infer such information while cancer staging needs to be done on individual tissues. Fortunately, Chen and co-workers have recently published a statistical method for inference of co-expressed genes in a single sample through comparing the co-expressed genes in a set of reference samples and those in the reference set plus the current sample^[47]28. Specifically, the approach assesses if the co-expression patterns among the reference samples are enhanced or weakened by including the sample into the reference set, namely an expanded set. A pair of genes in the new sample is considered as having the same co-expression pattern in the reference set and the expanded set if its co-expression level in the latter is not statistically lower than in the former. Hence, when applied to all gene pairs, a set of co-expressed genes can be derived for the given sample with respect to the reference set. This method has been applied to solving a variety of co-expression analysis problems and found to be highly effective^[48]28. We have adapted and applied this approach to cancer tissue staging. Specifically, we assume that some samples for each stage of a cancer type are available, along with their genome-scale transcriptomic data, from which co-expression patterns can be derived reliably for each stage of the cancer type. Then a new sample is assigned to a stage if the sample’s co-expression pattern is most consistent with the co-expression patterns of the stage of the reference samples within a specified level of difference. We have applied this staging approach to eight cancer types in the TCGA database for stage prediction, representing all the cancer types that has at least ten cancer samples in each of the four stages. The consistency levels range from 71 to 95% across the eight cancer types we studied. The reason we have applied our method only to the TCGA data is that the data are collected from cancer tissue samples, rather than cell lines^[49]29,[50]30, with the highest data quality compared to other databases. An important application of this methodology is to elucidate the functional differences between cancer samples at different stages, hence providing important and useful information regarding cancer evolution from early to the advanced stage. To do this, we have developed a new method for assessing the statistical significance of pathways enriched by a set of gene pairs rather than a set of genes as commonly done. By applying this method, we have examined what normal functions tend to disappear at what stage and what new functions may emerge at what stage of a cancer type. This functional analysis results have revealed novel understanding about cancer evolution, hence providing concrete examples for a profound postulation made by Otto Warburg 50 ago: “the highly differentiated cells are now transformed into fermenting anaerobes, which have lost all their body functions and retain only the now useless function of growth”^[51]31–[52]33. Results Gene expression data of eight cancer types, namely BRCA, COAD, HNSC, KIRC, KIRP, LUAD, STAD and THCA, are extracted from the TCGA database. Our cancer-stage prediction is conducted and assessed on these samples. The detailed information about these cancer data are given in the Methods section. Identification of co-expressed genes For each cancer type, edgeR in the R package is used to identify the differentially expressed genes (DEGs) using [MATH: |log(FC)|>2.5 :MATH] and p value < 0.05 as the cutoffs. Pearson correlation coefficient (PCC) is used to calculate the co-expression level between two genes. A pair of genes (x, y) is deemed to be co-expressed (CEGs) if [MATH: PCCx,y>2.5 :MATH] with p value < 0.05 (see Methods). Table [53]1 summarizes the numbers of DEGs and CEGs for each cancer type at each stage. Table 1. The numbers of DEGs and CEGs in each of the eight cancer types. Stage #DEGs and #CEGs BRCA COAD HNSC KIRC KIRP LUAD STAD THCA Stage 1 vs. control #DEGs Up 1,255 1,718 347 1,110 640 2,004 673 893 Down 512 729 975 594 735 129 670 228 #CEGs Up 61,638 385,089 1,428 56,850 1,113 130,324 3,717 3,651 Down 1,690 11,763 26,804 920 2,564 121 15,338 600 Stage 2 vs. control #DEGs Up 1,564 1,581 547 1,399 700 1,268 662 662 Down 527 681 943 560 784 190 713 488 #CEGs Up 14,410 143,619 634 62,299 14,452 3,102 1,892 7,835 Down 981 6,472 25,765 9,450 13,173 258 9,888 9,125 Stage 3 vs. control #DEGs Up 1,040 1,607 597 1,159 955 1,269 597 903 Down 553 588 949 739 640 217 744 289 #CEGs Up 3,925 104,838 1,308 8,426 5,864 2,847 572 5,109 Down 1,912 6,200 16,024 2,306 784 1,007 7,295 2,074 Stage 4 vs. control #DEGs Up 818 1,035 798 1,325 657 1,097 504 932 Down 721 685 872 727 727 156 939 535 #CEGs Up 7,386 4,597 542 18,161 9,068 14,427 952 3,449 Down 15,550 8,372 15,892 5,343 24,119 923 13,429 3,142 [54]Open in a new tab An algorithm for representing cancer samples as co-expression networks We have developed an algorithm for representing the gene-expression data of cancer tissue samples of a given cancer type as four stage-specific co-expression networks, one for each stage, and their perturbed networks when a new sample is added to the sample set of each stage. The level of perturbation due to inclusion of the new sample to each of the four co-expression networks, in general, will be significantly different between the network where the new sample intrinsically belongs and the three other networks. This serves as the basis of our cancer staging algorithm. A co-expression network is built over samples in each stage of a given cancer type, consisting of only gene pairs that are highly co-expressed, where each gene pair is represented as an edge connecting two nodes denoting the two genes. When a new sample is added to the sample set of each stage, the co-expression levels of some gene pairs may change. Chen and co-authors have made the following observation^[55]28: if two genes are co-expressed over a sample set, then adding a new sample to the set should not change their co-expression level significantly if their expression levels in the new sample are linearly consistent with those in the sample set; otherwise the co-expression level will decrease or remain at a low level. In addition, we have noted that cancer samples in the same stage tend to have a large collection of stage-specific co-expressed genes, used to execute the biological functions specific to the stage. By integrating these two insights, we have the following key observation: for a given co-expression network of a specific stage, adding a new sample that “intrinsically” belongs to the stage should not alter significantly the structure of the co-expression network; in contrast, when a sample is added to the sample set of a different stage, it will affect the co-expression levels of some gene pairs, hence altering the structure of the co-expression network. Our algorithm follows. Step 1: Identification of DEGs for co-expression analyses. To ensure that the numbers of DEGs are approximately the same across different stages to avoid sample-size related bias, we have selected n DEGs with the largest variance for each stage, where n is the smallest number of DEGs in a stage across the four for the given cancer type. Step 2: Construction of co-expression networks. Samples of each stage are divided into three groups: 30% as the reference, 40% for training, and 30% for testing. A co-expression network is constructed over the reference set for each of the four stages: each DEG is defined as a node and a pair of co-expressed DEGs above a PCC-based threshold (see METHODS) as an edge linking the two genes. Step 3: Construction of a perturbed network over each sample set plus a new sample. For each co-expression network N built at Step 2 and a new sample s, calculate the PCC value for each co-expressed gene pair in N over the expanded sample set. If the relationship between the new PCC and the threshold is reversed compared to the original PCC, remove it from N if PCC > threshold; and otherwise add the edge to N. Step 4: Data preparation for cancer-stage classifier training. For each new sample considered for cancer staging, represent each of its four perturbed networks as a one-dimensional vector: each pair of co-expressed genes in a co-expression network is given a fixed location in the vector, containing the PCC value or a 0.0 if the gene pair is removed in the perturbed network, hence allowing direct comparisons among such PCC-based vectors. The detailed process of our algorithm is shown in Fig. [56]1. Figure 1. [57]Figure 1 [58]Open in a new tab An illustration of our algorithm. (A) Identification of DEGs between cancer versus control tissues at each stage. (B) Construction of co-expression networks for samples in each of the four stages with the DEGs obtained from step A. (C) Construction of perturbed networks over samples in each stage plus a new sample denoted by T[i]. (D) Representation of each perturbed network as a feature vector needed for training, giving rise to four feature vectors concatenated into a long vector, which will be fed into a trainer as the feature vector for sample T[i]. A 4-way classifier for cancer staging A machine learning-based classifier is trained to predict the stage, 1 through 4, for a given cancer sample based on the PCC vectors defined in Sect. 2.1. Intuitively, if a new sample belongs to a specific stage, its perturbed network should be largely the same as the corresponding co-expression network; otherwise, the perturbed network may lose most of the stage-specific co-expressed genes, i.e., edges, from the original co-expression network. We have used the following six machine-learning methods: Naive Bayes, treebag, C5.0, random forests (RF), random ferns (RFerns), and weighted subspace random forests (WSRF), respectively, to train the classifier. Cancer stage prediction Using the above cancer-staging algorithm, we have predicted stages for all the test samples of the eight cancer types. For each cancer type, we have randomly selected 30% of the samples from each stage and used them to derive the co-expression patterns; 40% for classification model training; and the remaining 30% for testing. Three-fold cross-validation with 100 repeats is used when training a classifier for each of the six machine learning methods. This process is iterated 10 times, and the average of the staging accuracy is used as the final evaluation results. Table [59]2 summarizes the prediction results by C5.0, and prediction results by other methods are summarized in Supplementary Tables [60]S1(1–5). We note that most of the machine learning methods give comparable results except for Naive Bayes and random Ferns, whose performances are poorer than the others as detailed in the Table [61]S1. Table 2. Prediction performance of cancer stages using C5.0. Stage Measure BRCA COAD HNSC KIRC KIRP LUAD STAD THCA 1 Sensitivity 0.7852 0.9227 0.6714 0.9519 0.9353 0.8795 0.24 0.9753 Specificity 0.983 0.9927 0.8783 0.9737 0.9 0.9227 0.9678 0.9172 2 Sensitivity 0.9409 0.9755 0.51 0.8938 0.6 0.74 0.8 0.9333 Specificity 0.9737 0.9641 0.9393 0.977 0.9928 0.9377 0.9257 0.9836 3 Sensitivity 0.7932 0.9237 0.4696 0.9639 0.7714 0.6667 0.7955 0.8212 Specificity 0.9722 0.9817 0.926 0.9933 0.9508 0.9688 0.7983 0.9422 4 Sensitivity 0.72 0.9667 0.8675 0.9292 0.8 0.4714 0.7273 0.4188 Specificity 0.922 0.9894 0.886 0.9809 0.9465 0.8965 0.889 0.9692 All Accuracy 0.8768 0.9504 0.7283 0.9452 0.8707 0.7933 0.7078 0.8772 Kappa 0.7957 0.9293 0.546 0.9175 0.7436 0.6807 0.5696 0.7928 [62]Open in a new tab To understand what might be the reasons for the inconsistent predictions by our method compared to the annotated stages in TCGA by pathologists, we have examined the prediction results for HNSC and STAD, the two cancer types with the worst overall prediction performance (Table [63]2). Tables [64]3 and [65]4 list, for each stage, the numbers of samples correctly predicted and of predicted to earlier or later stages of HNSC and STAD, respectively. Table 3. The confusion matrix for predicted vs. annotated stage of HNSC. Predicted/annotated Stage 1 Stage 2 Stage 3 Stage 4 Stage 1 4 5 2 1 Stage 2 2 14 5 5 Stage 3 1 0 16 9 Stage 4 0 1 0 62 [66]Open in a new tab Table 4. The confusion matrix for predicted vs. annotated stage of STAD. Predicted/actual Stage 1 Stage 2 Stage 3 Stage 4 Stage 1 8 1 2 0 Stage 2 0 19 3 0 Stage 3 4 11 37 2 Stage 4 3 1 2 9 [67]Open in a new tab Since there is no ground truth for the actual stages of the cancer samples under consideration that can be used to assess the quality of the two staging methods, we have compared the distributions of the number of DEGs across samples at different “stages” by the two methods, as shown in Fig. [68]2. We see from the boxplots that our predicted stages give rise to boxplots with somewhat higher level of regularity compared to that of the annotated stages, hence providing one piece of evidence that our predicted stages, which is based on molecular information, might be more intuitively meaningful. Figure 2. [69]Figure 2 [70]Open in a new tab The distributions of the number of DEGs across samples at different “stages” by the manually annotated stages and our predicted stages. (a) The distribution of the numbers of DEGs (y-axis) in each annotated stage of HNSC. (b) The distribution of the numbers of DEGs in each predicted stage of HNSC. Figure [71]3 shows the similar information for STAD to that in Fig. [72]2. Analysis results on other cancer types are given in Supplementary Tables [73]S2(1–6) and Figures [74]S1(1–6). Overall, we consider that our predicted stages are probably as scientifically justified as the manually annotated stages by cancer pathologists or better. Figure 3. [75]Figure 3 [76]Open in a new tab The distributions of the number of DEGs across samples at different “stages” by the manually annotated stages and our predicted stages. (a) The distribution of the numbers of DEGs in each annotated stage of STAD. (b) The distribution of the numbers of DEGs in each predicted stage of STAD. Pathways enriched by co-expressed genes We have conducted pathway enrichment analyses over co-expressed genes in each stage of each of the eight cancers against the GO Biological Processes using our new scoring scheme (see “Methods”). Table [77]5 summarizes the numbers of the enriched pathways by co-expressed genes, with the pathway names given in Supplementary Tables [78]S3-1 (controls), [79]S3-2 (up-regulated), and [80]S3-3 (down-regulated), and information about pathways, hence functions, that disappear at each stage as well as new pathways that emerge at each stage in cancer versus controls, hence providing footprint information of cancer evolution. Table 5. The numbers of pathways enriched by co-expressed genes in controls and at each stage. Stage #CEPs BRCA COAD HNSC KIRC KIRP LUAD STAD THCA Stage 1 vs. control Control 223 178 126 195 138 48 84 44 Up 120 13 74 22 4 46 53 82 Down 71 102 92 34 66 6 66 2 Stage 2 vs. control Control 323 150 136 181 109 56 102 137 Up 166 25 39 135 66 54 18 16 Down 83 115 104 2 117 5 65 164 Stage 3 vs. control Control 251 102 111 201 99 53 102 57 Up 143 25 27 265 81 41 35 85 Down 73 81 114 17 17 9 74 20 Stage 4 vs. control Control 326 131 111 226 125 100 205 119 Up 135 38 27 313 99 110 34 56 Down 109 70 108 33 66 5 100 58 [81]Open in a new tab ^#CEPs is for the number of co-expressed gene pairs; Up is for the number of CEPs by up-regulated genes; and Down is similarly for down-regulated genes. We have also calculated the numbers of enriched pathways by co-expressed genes in controls, which (I) remain enriched throughout all stages of the cancer samples of each type; and (II) disappear by each stage of cancer samples, which do not appear again in a later stage, and in total for each cancer type. And we have also calculated (III) the number of new pathways that are not present in controls but present in earlier stages (1 and 2) or advanced stages (3 and 4). All these are shown in Table [82]6 (I), (II) and (III), and the detailed pathways in cancer are listed in Supplementary Tables [83]S4-1, [84]S4-2 and [85]S4-3. From these tables, we conclude: * (i) it is somewhat surprising to see from Table [86]S4-1 that different sets of functions remain unchanged throughout the development of a cancer type across the eight cancer types. For example, for BRCA, it is cell cycle and cell division activities that represent the predominant class of functions that remain unchanged throughout stages 1–4. And this is the only type of cancer with this or similar property. For COAD, it is three classes of functions, namely cellular stress, immune responses and tissue repair that remain unchanged throughout the evolution of the cancer. For HNSC, it is the combination of two functional classes: tissue repair and cellular stress that remain unchanged throughout its evolution. For KIRC, no functional activities remain unchanged throughout its evolution. For KIRP, it is some developmental activities that remain unchanged. For LUAD, it is a few cell division activities that remain unchanged. And for STAD, it is predominantly immune responses that remain changed. * (ii) from Table [87]S4-2, we see the following: (1) pathway disappearance in cancer predominantly take place at stage 1 for six cancer types or stage 4 for two cancer types; and (2) most of the lost pathways tend to be cancer specific or at most shared by 2–3 cancer types except for a few, namely neutral lipid metabolic process (shared by 6 cancer types), triglyceride metabolic process (shared by 5), acylglycerol metabolic process (by 5), response to drug (by 4), regulation of lipid localization (by 4), regulation of hormone levels (by 4), and organic anion transport (by 4), indicating that they may have negative effects on cancer development, hence selected for removal. The detailed list of the lost pathways by multiple cancer types is given in Table [88]S5. * (iii) from Table [89]S4-3, we note that different cancer types tend to have different sets of emerging functions in cancer tissues vs. controls, which generally fall into the following classes: development and proliferation, immune related, stress related, migration related, metabolisms, tissue repair, and neural functions. Table 6. The number of enriched pathways in normal controls. #CEPs BRCA COAD HNSC KIRC KIRP LUAD STAD THCA Total 442 274 168 355 261 133 264 257 (I) 69 29 78 0 3 5 34 9 (II) Stage 1 91 78 20 103 68 21 21 16 Stage 2 64 31 7 7 17 3 20 20 Stage 3 16 8 1 9 9 0 19 14 Stage 4 56 11 0 11 9 7 85 40 Total 227 128 28 130 103 31 145 90 (III) 1–2 106 52 89 29 118 48 50 99 3,4 166 66 40 163 111 36 69 74 [90]Open in a new tab Total on the second row is the number of pathways enriched by CEPs in control samples for each cancer type while Total under (II) is for the number of unique pathways enriched by CEPs across all cancer samples of each type. For BRCA, two classes of new functions account for the majority of the new functions, hence considered as predominant: development and proliferation and metabolisms in both early (stages 1 and 2) and advanced (stages 3–4) cancers. For COAD, the two predominant functional classes are development and proliferation and stress related in both early and advanced cancers. For HNSC, the new functions in early-stage cancer tissues are development and proliferation and immune related; and for the advanced tissues, only the former remains to be predominant. For KIRC, no single class of functions stands out in the early stage; and immune related and development and proliferation stand out. For KIRP, development and proliferation and metabolisms stand out in both the early and advanced cancer tissues. For LUAD, development and proliferation and stress related functions stand out in the early stage; and the latter changes to neural activities in the advanced stage. For STAD, tissue repair and immune related functions stand out in both early and advanced stages. In addition, development and proliferation become one of three standout functional classes with the other two in the advanced stage cancer tissues. For THCA, immune and tissue repair stand out in the early stage; and the former changes to development and proliferation in the advanced stage. Among these functions, development and proliferation related functions become increasingly predominant as a cancer advances from early to the advanced stage for virtually all cancer types. Similarly, the percentages of the following functions also increase as a cancer advances: stress, immune, and migration related. Discussion Our preliminary analyses strongly indicate that differential expressions of individual genes do not have adequate information for accurate cancer staging, and conserved co-expression patterns across cancer samples of the same stage do as we have demonstrated through here. This represents a key technical contribution to the research of cancer biology. We anticipate that a similar technique could be used for various similar problems such as cancer grading, classification of primary cancers that have metastasized vs. that have not. Our prediction results are generally consistent with those assigned manually by cancer pathologists. In cases where our predictions are inconsistent with the manual annotation, further studies are needed as there are no clear indication of which “predictions” are more accurate between the two, although from one specific angle, our predictions seem to be biologically more meaningful. This should not be surprising since our prediction is based on functional commonalities shared by most of the cancer tissue samples of a specific stage. We anticipate that systematic applications of this new tool could lead to improved and biologically more meaningful staging schemes for different cancer types. For example, by studying how the overall functionality of cancer samples changes as a cancer advances, one could possibly identify key “jumps” in changes in the total functionality, which can be used to distinguish distinct phases of the evolution for specific cancer types, compared to the current staging schemes, which are largely based on sizes and morphology of tumors. Cancer staging based on such molecular functions could lead to improved treatment plans that can target at key functional hubs or weakest points in cancer metabolic networks at distinct phases. Otto Warburg speculated fifty some years ago about cancer evolution as: “the highly differentiated cells are now transformed into fermenting anaerobes, which have lost all their body functions and retain only the now useless function of growth”^[91]31–[92]33. Since then, very little has been established regarding what specific functions are lost as a cancer evolves. We consider that a scientific contribution made by this study is: we have provided some information along this direction, although our study is clearly primitive. A further study is planned to elucidate detailed functionalities of cancer at individual stage and of different types. Both functionalities shared by all or most of the cancer types and specific to individual cancer types are of great interests. Our co-expression based functional identification will prove to be a highly effective tool for conducting such studies. Regarding the predominant new functions in cancer vs. controls as revealed by our analyses, it is understandable why development and proliferation represents a predominant one across a majority of the cancer types under study as cancer proliferation, unlike normal developmental processes, may require segments from multiple developmental programs, which might be activated possibly by different signals for different reasons such as the need for tissue repair, to have the cell-cycle genes activated and form a somewhat coordinated cell cycle process in support of continuous cell proliferation. Other emerging functions, such as immune, tissue repair, metabolisms and/or neural activities, tend to be less conserved across different cancer types. Hence it is natural to ask: are new functions in each cancer type relevant to or even possibly dictate the clinical behaviors of different cancer types such as more vs. less malignant cancers? Clearly, further and more in-depth analyses are clearly needed to address this question. Conclusion A new algorithm for cancer staging is developed based on co-expression patterns unique to specific cancer stages, along with a new method for assessing statistical significance of pathways enriched by co-expressed genes. Our test results have shown that our staging results are comparable with (or superior to) manual staging results by human pathologists. Highly exciting new insights are gained through our analyses of new pathways in cancer vs. controls as well as pathways that disappear gradually throughout the evolution of individual cancer types. We anticipate that the co-expression based analyses will prove to be an important direction for functional studies in cancer research. Data and methods Data 14 cancer types were initially selected since this set of cancers has been used in our previous studies^[93]34–[94]37 as they each have sufficiently large number of samples in TCGA, namely: BLCA, BRCA, COAD, ESCA, HNSC, KICH, KIRC, KIRP, LIHC, LUAD, LUSC, PRAD, STAD, and THCA. Here, we further require that each cancer type have at least ten samples for each stage, which leaves only eight cancer types: BRCA, COAD, HNSC, KIRC, KIRP, LUAD, STAD, THCA. Table [95]7 gives the detailed information for each of the eight cancer types. Table 7. The number of tissue samples for eight cancer types. Cancer type Control Stage 1 Sage 2 Stage 3 Stage 4 BRCA 113 182 624 249 20 COAD 41 75 179 131 64 HNSC 44 25 70 78 261 KIRC 72 266 57 123 82 KIRP 32 172 22 51 15 LUAD 59 278 121 84 26 STAD 32 53 111 150 38 THCA 58 286 52 113 57 [96]Open in a new tab Calculation of co-expressed genes For a given set of cancer tissues and their transcriptomic data, we calculate the Pearson correlation coefficient [MATH: (ρ) :MATH] between each pair of expressed genes across the samples as follows: [MATH: ρX,Y=EXY-E(X)E(Y)E(X2-E2(X))E(Y2-E2(Y)) :MATH] where E(X) is the expected value of expression levels of gene x across all samples. A pair of genes is deemed to be co-expressed if [MATH: ρX,Y>0.7 :MATH] with p value < 0.05, where the p value is calculated as follows: [MATH: t=ρ×< msqrt>n-2 1-ρ2 :MATH] with n being the number of samples. Pathway enrichment We have developed a new scoring scheme to assess the statistical significance of a pathway enriched by a set of co-expressed DEGs at a specific stage of a cancer type. For a pathway with n gene pairs containing k co-expressed gene pairs over a given set of cancer samples, the following hypergeometric distribution^[97]38 is used to calculate the statistical significance of this pathway enriched by the k gene pairs where N gene pairs are differentially expressed in cancer vs. controls, of which K pairs of genes are co-expressed: [MATH: PX=k=KkN-K n-kNn :MATH] Supplementary information [98]Supplementary file1^ (10.8KB, xlsx) [99]Supplementary file2^ (24.7KB, docx) [100]Supplementary file3^ (17.5KB, docx) [101]Supplementary file4^ (118.1KB, docx) [102]Supplementary file5^ (74.5KB, xlsx) [103]Supplementary file6^ (47KB, xlsx) [104]Supplementary file7^ (41.5KB, xlsx) [105]Supplementary file8^ (27KB, xlsx) [106]Supplementary file9^ (128.2KB, xlsx) [107]Supplementary file10^ (102.6KB, xlsx) Acknowledgements