Abstract

It is difficult for existing methods to quantify and track the constant evolution of cancers because of the high heterogeneity of mutations. However, structural variations associated with nucleotide number changes show repeatable patterns in localized regions of the genome. Here we introduce SPKMG, which generalizes nucleotide number based properties of genes, in statistical terms, at the genome-wide scale. It is measured from the normalized amount of aligned NGS reads in the exonic regions of a gene. SPKMG values are calculated within OncoTrack. Being continuous numeric variables, SPKMG values provide a statistical metric to track DNA-level changes. We show that SPKMG measures of cancer DNA show a normative pattern at the genome-wide scale. The analysis leads to the discovery of core cancer genes and also provides novel dynamic insights into the stage of cancer, including cancer development, progression, and metastasis. This technique will allow exome data to also be used for quantitative LOH/CNV analysis for tracking tumour progression and evolution with higher efficiency.

Precision medicine is based on the concept of treatments tailored to patients’ molecular profiles^1. SNPs (Single Nucleotide Polymorphisms) are the most frequent genetic variations in humans. For some monogenic diseases, certain SNPs are directly linked to the cause^2 of the disease, whereas for many polygenic diseases SNPs may contribute to an often complex signature linked to the disease phenotype. With the goal of advancing precision medicine for cancer, most research has focused on identifying SNPs for the stratification of patients at baseline and, increasingly, when the patient's disease relapses.
The SNP profile in cancer is, however, extremely diverse, displaying inter-patient, intra-tumour, inter-metastatic, and intra-metastatic heterogeneity^3, and is further compounded by genome mosaicism^4. This mutational heterogeneity in tumours makes it difficult to develop therapies that will benefit many patients^5. Additionally, one of the major challenges in anti-cancer therapeutics is the evolution of tumours to develop new mutations, making previous treatments ineffective^6. Furthermore, the rapid development of novel cancer immunotherapies calls for a more holistic approach to neo-antigen biomarker and mutational load analysis than targeted sequence analysis^7,^8. Therefore, if robust, reproducible patterns based on tumour evolution can be identified within patient populations, this would propel efforts to develop targeted therapies and precision medicine in cancer, benefiting a larger group.

In addition to SNPs, CNVs (Copy Number Variations) and LOH (Loss of Heterozygosity) are present in the genome. In CNVs, portions of the genome are deleted, amplified, translocated (intra-chromosome or inter-chromosome), or inverted^9. Theoretically, the number of copies of a section of DNA throughout the human genome is 2 (bi-allelic). In reality, the number of copies of a localized genomic region can be 0, 1, 2, 3, 4 or more (any non-negative integer). Copy numbers 0 and 1 represent deletion and LOH, respectively. Copy numbers of 3 and higher signify amplification. Thiagalingam et al., in their paper on LOH, found that each of the five chromosomes analysed was affected by LOH at high frequencies, ranging from 47% to 78%^10. The experiment showed that there are repeatable patterns of LOH in regions within DNA. Ni et al. showed that CNVs also exhibit localized reproducible patterns in cancer^11. There is also evidence for the heritability of some localized CNVs and LOHs in familial tumours^12.
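The copy-number categories described above can be written down as a small lookup. The following is only an illustrative sketch: the function names are hypothetical, the label "diploid" for the expected two copies is our own shorthand, and the depth ratio assumes an idealized sample in which read depth scales linearly with copy number (no tumour impurity or capture bias).

```python
def classify_copy_number(copies: int) -> str:
    """Map an integer copy number to the categories used in the text."""
    if copies < 0:
        raise ValueError("copy number cannot be negative")
    if copies == 0:
        return "deletion"       # both copies lost
    if copies == 1:
        return "LOH"            # one copy lost
    if copies == 2:
        return "diploid"        # expected bi-allelic state (our label)
    return "amplification"      # 3 or more copies


def expected_depth_ratio(copies: int) -> float:
    """Idealized read depth relative to a two-copy region (assumption)."""
    if copies < 0:
        raise ValueError("copy number cannot be negative")
    return copies / 2


for n in range(5):
    print(n, classify_copy_number(n), expected_depth_ratio(n))
```

Under this idealization, a deleted region contributes no aligned reads and a three-copy region contributes 1.5 times the expected depth, which is the kind of change in nucleotide count that a depth-based measure can pick up.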
While LOHs/CNVs show some patterns at the population level, the phenotypic associations of CNVs, particularly with respect to disease, are not trivial and are not well understood. Some genome-wide association studies failed to identify a substantial association of overall CNV burden, or of specific CNVs, with disease^13,^14. On the other hand, others have found a higher overall CNV burden in tumour samples compared to healthy samples^15,^16. While one study^14 concluded that common CNVs are unlikely to contribute much to disease heritability for many diseases, another found that the proportion of rare CNVs was higher in tumour samples than in healthy samples^17. Additionally, despite the uncertainty in association patterns, locus-specific associations with CNVs have been described for many diseases, including cancer. For example, homozygous deletion of GSTM1 has been found to correlate with cancer prognosis^18.

SNPs are binary, categorical, and heterogeneous. CNVs, when looked at as absolute sequence alterations, are categorical and heterogeneous too. The boundaries (breakpoints) of CNVs can vary over a population, making phenotypic association difficult because ‘consensus’ CNV loci need to be identified^19. While the disease association of CNVs may be directly linked to the change in copy number (as in the case of deletions or amplifications), it is more closely related to the length of the CNV and the number of genes covered^20. This feature of CNVs implies that it may be more useful to study phenotypic associations at the localized gene level than over a large genomic region, where one CNV may span genic, multi-genic, and non-genic regions. Localized CNV polymorphisms have been associated with many cancers^21,^22. There is also evidence that localized CNVs have an impact on pharmacodynamics due to altered metabolism of compounds^23.
Such CNVs affect the efficacy and toxicity of drugs through the regulation of proteins involved in metabolism^23. Some CNVs have been associated with high drug toxicity and an increased incidence of adverse events due to an increase in copy number^23. Although all this evidence indicates that LOHs/CNVs have inherent characteristics specific to a disease, there is insufficient evidence to conclusively define locus-disease associations.

Since CNVs generally result in a change in nucleotide number at specific locations, they present an opportunity for the quantification of these DNA-level variations. This will, however, exclude regions with copy-number-neutral variations. We hypothesized that the disease-specific concordance for localized CNVs is a disease characteristic. We therefore hypothesized that nucleotide-level loss (or gain) in the coding regions of DNA with copy number 0, 1, 3, …, N will show a cancer-specific homogeneous statistical pattern. Moreover, this gene-specific measure will have a phenotype association. In other words, our hypothesis generalizes the principle of localized LOHs/CNVs for cancer through the quantification of DNA, based on the change in nucleotide number (loss or gain) in the coding regions of the DNA.

DNA pileup is a technique that was originally used by SAMtools^24 for measuring the depth of coverage of NGS data, and it is used in many tools such as GATK^25 and XHMM^26. We explored the concept of pileup in the context of LOHs/CNVs at a quantitative level, but with a different approach. Any LOH, CNV or structural variation in an exon, be it a deletion, amplification, or translocation, will have a quantitative effect on the protein product. The nucleotide count within the exonic regions of a gene will decrease in the case of a deletion and increase in the case of an amplification. In the case of a translocation, it will decrease in one gene and increase in another.
Using an adaptation of the pileup concept, we quantified DNA content variation within a gene at the genome-wide scale and named this measure SPKMG. We formulated SPKMG (Sequence Per Kilobase of exon, per Megabase of the mappable Genome) measures for each gene. Unlike mutations, which are binary, the SPKMG measure for a gene is a real number, i.e., both numeric and continuous. A very high SPKMG value signifies amplification of a portion of a gene; a low SPKMG value signifies deletion of a part of the gene; and an SPKMG of zero means the gene has been entirely deleted. Since SPKMG values are real numbers, we are able to employ statistical techniques to uncover the structure and functional relationships of these numbers for a population. SPKMG values are calculated within software that we have named OncoTrack.

We hypothesized that if nucleotide count alterations due to structural variations in the genome are truly associated with a tumour state, then we would be able to capture these and find meaningful phenotypic associations through SPKMG. For this, we looked at the SPKMG measure from two perspectives: case/control comparative analysis, and control-free analyses using correlation and mutual information. Our goal was to assess the usefulness of SPKMG as an improved way of looking for population-level patterns for tracking and predicting molecular changes in tumours.

In this paper we explore the efficacy of SPKMG as a more efficient quantification technique for precision medicine and for tracking cancer progression. To establish our theory, we used a total of 69 exome datasets comprising 11 non-BRCA1/2 familial breast cancer (BC) samples^27, 19 Esophageal Squamous Cell Carcinoma (ESCC) tumour samples (T) with 19 matching ESCC germline normal (N) samples^28, and 20 healthy (control) samples (H)^27,^29. The OncoTrack analysis pipeline architecture is shown in Fig. 1.

Figure 1. Architecture of OncoTrack pipeline.
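To make the idea behind the SPKMG definition more concrete, the sketch below computes an SPKMG-like value for a gene. The exact formula used inside OncoTrack is not given in this section, so the normalization here is an assumption made by analogy with RPKM: aligned bases inside a gene's exons, divided by the exon length in kilobases and by the total aligned bases in megabases. The names `GeneCoverage` and `spkmg_like` are hypothetical and are not part of OncoTrack.

```python
from dataclasses import dataclass


@dataclass
class GeneCoverage:
    gene: str
    exon_length_bp: int  # summed length of the gene's exons, in bases
    aligned_bases: int   # bases of aligned reads that fall inside those exons


def spkmg_like(cov: GeneCoverage, total_aligned_bases: int) -> float:
    """SPKMG-like value under the RPKM-style normalization assumed here."""
    if cov.exon_length_bp <= 0:
        raise ValueError("exon length must be positive")
    if total_aligned_bases <= 0:
        raise ValueError("total aligned bases must be positive")
    exon_kb = cov.exon_length_bp / 1_000             # per kilobase of exon
    aligned_mb = total_aligned_bases / 1_000_000     # per megabase aligned
    return cov.aligned_bases / (exon_kb * aligned_mb)


# Hypothetical numbers, for illustration only.
total = 50_000_000
normal = GeneCoverage("GENE_A", exon_length_bp=2_000, aligned_bases=400_000)
half = GeneCoverage("GENE_A", exon_length_bp=2_000, aligned_bases=200_000)
lost = GeneCoverage("GENE_A", exon_length_bp=2_000, aligned_bases=0)

print(spkmg_like(normal, total))  # 4000.0
print(spkmg_like(half, total))    # 2000.0
print(spkmg_like(lost, total))    # 0.0
```

Whatever the precise normalization, the behaviour matches the description in the text: the value is continuous, it falls when part of a gene is deleted, and it reaches zero when no reads align to the gene's exons.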
Results

Non-BRCA1/BRCA2 Familial Breast Cancer Population

The SPKMG was computed for the breast cancer patients’ data. These values are given in Supplementary Table 1. We compared the SPKMG values for the 11 breast cancer patients with the reference population of 20 healthy samples (Supplementary Table 2). Figure 2(a,b) shows the MDS plot, constructed using the edgeR^30,^31 MDS algorithm, and the hierarchical clustering of SPKMG for all 31 samples. The breast cancer patients cluster closely compared to the healthy individuals (Fig. 2). This is in line with the observations of Thiagalingam et al.^10 and Ni et al.^11 that there are reproducible patterns in LOHs and CNVs across cancer patients. This shows that SPKMG follows a quantitative and normative pattern that has the potential to be used to define quantitative properties of a cancer population. To further examine the properties of SPKMG, we compared the values between the healthy samples and the breast cancer patients, as well as within the breast cancer patients.

Figure 2. Clustering of breast cancer and healthy individuals. (a) MDS plot of SPKMG values of 11 breast cancer and 20 healthy samples. Distances were calculated using log fold changes. Marker 1 and Marker 2 in this figure are the top two leading log fold change dimensions in the data. (b) SPKMG heatmap and hierarchical clustering of the breast cancer and healthy samples.

Breast Cancer Comparative Analysis (Case/Control)

We performed the comparative analysis of the breast cancer population and the control population (Supplementary Table 2) using edgeR. We selected the most statistically significant genes, with p-values ≤ 1.0e-20 (or adjusted p-values ≤ 1.0e-17). This gave us 28 genes with adjusted p-values ranging from 6.3e-63 to 5.5e-18 (Table 1).
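The selection step above, filtering genes by an adjusted p-value cutoff, can be sketched as follows. The text does not state which multiple-testing adjustment was applied; the sketch assumes the Benjamini-Hochberg procedure, which is the default adjustment in edgeR's `topTags`. The function names and the example p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(genes, pvalues, adj_cutoff=1.0e-17):
    """Keep genes whose adjusted p-value is at or below the cutoff."""
    adjusted = bh_adjust(pvalues)
    return [g for g, a in zip(genes, adjusted) if a <= adj_cutoff]


# Hypothetical gene labels and raw p-values, for illustration only.
genes = ["GENE_A", "GENE_B", "GENE_C"]
raw_p = [1e-25, 1e-3, 1e-21]
print(select_genes(genes, raw_p))  # ['GENE_A', 'GENE_C']
```

Because the adjustment scales each raw p-value by the number of genes tested divided by its rank, an adjusted cutoff of 1.0e-17 over a genome-wide gene list corresponds to a considerably smaller raw p-value, which is consistent with the two thresholds quoted in the text.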
An analysis of the COSMIC^32 database showed that all these 28 genes have been discovered in multiple cancer studies, with study counts ranging from 6 for KIR3DL1 to 43 for FLG. All these genes have been widely researched with the number of PubMed citations ranging from 8 for SPRRD2 to 92 for FLG (in the References section of