Abstract

   Existing methods struggle to quantify and track the constant
   evolution of cancers because of the high heterogeneity of mutations.
   However, structural variations associated with nucleotide number
   changes show repeatable patterns in localized regions of the genome.
   Here we introduce SPKMG, which generalizes nucleotide number based
   properties of genes, in statistical terms, at the genome-wide scale. It
   is measured from the normalized amount of aligned NGS reads in exonic
   regions of a gene. SPKMG values are calculated within OncoTrack.
   Because SPKMG values are continuous numeric variables, they provide a
   statistical metric to track DNA-level changes. We show that SPKMG
   measures of cancer DNA
   show a normative pattern at the genome-wide scale. The analysis leads
   to the discovery of core cancer genes and also provides novel dynamic
   insights into the stage of cancer, including cancer development,
   progression, and metastasis. This technique will allow exome data to
   also be used for quantitative LOH/CNV analysis for tracking tumour
   progression and evolution with a higher efficiency.
     __________________________________________________________________

   Precision medicine is based on the concept of treatments tailored to
   patients’ molecular profiles[30]^1. SNPs (Single Nucleotide
   Polymorphisms) are the most frequent genetic variations in humans. For
   some monogenic diseases, some SNPs are directly linked to the
   cause[31]^2 of the disease; whereas, for many polygenic diseases, SNPs
   may eventually contribute to an often complex signature linked to the
   disease phenotype. With the goal of advancing precision medicine for
   cancer, most research has been focused on identifying SNPs for the
   stratification of patients at baseline and, increasingly, when the
   patient’s disease relapses. The SNP profile in cancer is, however,
   extremely diverse, displaying inter-patient, intra-tumour,
   inter-metastatic, and intra-metastatic heterogeneity[32]^3, and is
   further compounded by genome mosaicism[33]^4. This mutational
   heterogeneity in tumours makes it difficult to develop therapies which
   will benefit many patients[34]^5. Additionally, one of the major
   challenges in anti-cancer therapeutics is the evolution of tumours to
   develop new mutations, making the previous treatments
   ineffective[35]^6. In addition, the rapid development of novel cancer
   immunotherapies is calling for a more holistic approach to neo-antigen
   biomarker and mutational load analysis than targeted sequence
   analysis[36]^7,[37]^8. Therefore, if robust, reproducible patterns,
   based on tumour evolution, can be identified within patient
   populations, this would propel the efforts for developing targeted
   therapies and precision medicine in cancer, benefiting a larger group.

   In addition to SNPs, CNVs (Copy Number Variations) and LOH (Loss of
   Heterozygosity) are present in the genome. In CNVs, portions of the
   genome are deleted, amplified, translocated (intra-chromosome or
   inter-chromosome), or inverted[38]^9. In theory, every section of DNA
   in the human genome is present in 2 copies (bi-allelic). In reality,
   the copy number of a localized genomic region can be 0, 1, 2, 3, 4 or
   more (any non-negative integer). Copy numbers 0 and 1 represent
   deletion and LOH, respectively; copy numbers of 3 and higher signify
   amplification. Thiagalingam
   et al. in their paper on LOH found that each of the five chromosomes
   analysed was affected by LOH with high frequencies, ranging from 47% to
   78%[39]^10. The experiment showed that there are repeatable patterns in
   LOH in regions within DNA. Ni et al. showed that CNVs also exhibit
   localized reproducible patterns in cancer[40]^11. There is also
   evidence for the heritability of some localized CNVs and LOHs in
   familial tumours[41]^12.
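
   The copy number categories described above can be illustrated with a
   minimal sketch. The function name and the label strings are
   hypothetical and chosen only for illustration; they are not part of
   OncoTrack.

```python
def classify_copy_number(copies: int) -> str:
    """Map an integer copy number to the category described in the text.

    Labels are illustrative assumptions, not OncoTrack output.
    """
    if copies < 0:
        raise ValueError("copy number cannot be negative")
    if copies == 0:
        return "deletion"
    if copies == 1:
        return "LOH"
    if copies == 2:
        return "normal (bi-allelic)"
    # 3 or more copies of the region
    return "amplification"


# Categories for copy numbers 0 through 4
labels = [classify_copy_number(n) for n in range(5)]
```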

   While LOHs/CNVs show some patterns at a population level, the phenotypic
   associations of CNVs, particularly with respect to disease association,
   are not trivial, and are not well understood. Some studies failed to
   identify substantial association of overall CNV burden or specific CNVs
   with disease in genome-wide association studies[42]^13,[43]^14. On the
   other hand, others have found higher overall CNV burden in tumour
   samples compared to healthy[44]^15,[45]^16. While one study[46]^14
   concluded that common CNVs are unlikely to contribute much to disease
   heritability for many diseases, another found that the proportion of
   rare CNVs was higher in the tumour samples compared to healthy
   samples[47]^17. Additionally, despite the uncertainty in association
   patterns, locus specific associations with CNVs have been described for
   many diseases, including cancer. For example, homozygous deletion of
   GSTM1 is found to be correlated with cancer prognosis[48]^18.

   SNPs are binary, categorical, and heterogeneous. CNVs, when looked at
   as absolute sequence alterations, are categorical and heterogeneous too.
   The boundaries (breakpoints) of CNVs can vary over a population, making
   phenotypic association difficult as ‘consensus’ CNV loci need to be
   identified[49]^19. While disease association of CNVs may be directly
   associated with change in copy number (as in the case of deletions or
   amplifications), it is more related to the length of the CNV and the
   number of genes covered[50]^20. This feature of CNVs implies that it
   may be more useful to study phenotypic associations at localized gene
   level, rather than for the large genomic region, where one CNV may span
   genic, multi-genic, and non-genic regions. Localized CNV polymorphisms
   have been associated with many cancers[51]^21,[52]^22. There is also
   evidence that localized CNVs have an impact on pharmacodynamics due to
   altered metabolism of compounds[53]^23. Such CNVs affect the efficacy
   and toxicity of drugs through regulation of proteins involved in
   metabolism[54]^23. Some CNVs have been associated with high drug
   toxicity and increased incidence of adverse events due to increase in
   copy number[55]^23.

   Although all this evidence indicates that LOHs/CNVs have inherent
   characteristics specific to a disease, there is insufficient evidence
   to conclusively define locus-disease associations. Since CNVs generally
   result in a change in nucleotide number at specific locations, this
   presents an opportunity for the quantification of these DNA level
   variations. This will, however, exclude regions with copy-number-
   neutral variations. We hypothesized that the disease-specific
   concordance for localized CNVs is a disease characteristic. We
   therefore hypothesized that nucleotide level loss (or gain) in the
   coding regions of DNA with copy number 0, 1, 3, …, N, will show a
   cancer specific homogeneous statistical pattern. Moreover, this gene
   specific measure will have phenotype association. In other words, our
   hypothesis generalizes the principle of localized LOHs/CNVs for cancer,
   through the quantification of DNA, based on the change in nucleotide
   number (loss or gain) in the coding regions of the DNA.

   DNA pileup is a technique that was originally used by SAMtools[56]^24
   for measuring the depth of coverage of NGS data, and is used in many
   tools such as GATK[57]^25 and XHMM[58]^26. We explored the concept of
   pileup in the context of LOHs/CNVs at a quantitative level, but with a
   different approach. Any LOH, CNV or structural variation in an exon, be
   it a deletion, amplification, or translocation, will have a
   quantitative effect on the protein product. The nucleotide count within
   the exonic regions of a gene will decrease in case of a deletion, and
   increase in case of an amplification. In the case of translocation, it
   will decrease in one gene and increase in another.

   Using an adaptation of the pileup concept, we quantified DNA content
   variation within a gene at the genome-wide scale and named this SPKMG.
   We formulated SPKMG (Sequence Per Kilobase of exon, per Megabase of the
   mappable Genome) measures for each gene. Unlike mutations, which are
   binary, the SPKMG measure for a gene is a real number – i.e., both
   numeric and continuous. A very high SPKMG value signifies amplification
   of a portion of a gene; a low SPKMG value signifies deletion of a part
   of the gene; whereas, zero SPKMG means the gene has been entirely
   deleted. Since SPKMG values are real numbers, we can employ
   statistical techniques to uncover the structure and functional
   relationships of these numbers across a population. SPKMG values are
   calculated within software that we have named OncoTrack.
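
   As a rough illustration of the measure, the sketch below assumes an
   RPKM-style normalization: the aligned sequence count in a gene's
   exons is divided by the exon length in kilobases and by the mappable
   genome size in megabases. The exact formula used by OncoTrack is not
   given in this section, so the function name, its parameters, and the
   choice of denominators are all assumptions.

```python
def spkmg(reads_in_exons: float, exon_length_bp: float,
          mappable_genome_bp: float) -> float:
    """RPKM-style normalization (assumed form, not the OncoTrack code).

    reads_in_exons     -- aligned sequence count in the gene's exons
    exon_length_bp     -- total exonic length of the gene, in bases
    mappable_genome_bp -- mappable genome size, in bases (assumption)
    """
    if exon_length_bp <= 0 or mappable_genome_bp <= 0:
        raise ValueError("lengths must be positive")
    exon_kb = exon_length_bp / 1_000
    genome_mb = mappable_genome_bp / 1_000_000
    return reads_in_exons / (exon_kb * genome_mb)


# Hypothetical gene: 500 sequences over 2 kb of exon, 50 Mb mappable
example_value = spkmg(500, 2_000, 50_000_000)

# A gene with no aligned sequences yields zero (entire deletion)
deleted_value = spkmg(0, 2_000, 50_000_000)
```

   Under this assumed form the value is a continuous real number, zero
   when no sequence aligns to the gene and growing with the amount of
   aligned sequence, which matches the qualitative behaviour described
   above.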

   We hypothesized that if nucleotide count alterations due to structural
   variations in the genome are truly associated with a tumour state, then
   we will be able to capture these and find some meaningful phenotypic
   associations through SPKMG. For this, we looked at the SPKMG measure
   from two perspectives: case/control comparative analysis, and
   control-free analyses using correlation and mutual information. Our
   goal was to assess the usefulness of SPKMG as an improved way of
   looking for population level patterns for tracking and predicting
   molecular changes in tumours. In this paper we explore the efficacy of
   SPKMG as a more efficient quantification technique for precision
   medicine and for tracking cancer progression.
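
   The two control-free measures can be sketched for a pair of genes as
   follows. This is a minimal illustration, not the OncoTrack
   implementation: the equal-width binning used to discretize the
   values for mutual information, the number of bins, and the example
   values are all assumptions.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation between two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def discretize(values, bins=3):
    """Assign each value to one of `bins` equal-width bins (assumed)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constants
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=3):
    """Mutual information in bits, estimated from binned values."""
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(
        (c / n) * math.log2((c / n) / ((px[i] / n) * (py[j] / n)))
        for (i, j), c in pxy.items()
    )


# Hypothetical values for two genes across six samples
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
r = pearson(gene_a, gene_b)
mi = mutual_information(gene_a, gene_b)
```

   Correlation captures linear co-variation only, whereas mutual
   information also responds to non-linear dependence, which is one
   reason to compute both.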

   To establish our theory, we used a total of 69 exome datasets,
   comprising 11 non-BRCA1/2 familial breast cancer (BC) samples[59]^27,
   19 Esophageal Squamous Cell Carcinoma (ESCC) tumour samples (T) with 19
   matching ESCC germline normal (N) samples[60]^28, and 20 healthy
   (control) samples (H)[61]^27,[62]^29. The OncoTrack analysis pipeline
   architecture is shown in [63]Fig. 1.

Figure 1. Architecture of OncoTrack pipeline.

Results

Non-BRCA1/BRCA2 Familial Breast Cancer Population

   SPKMG values were computed for the breast cancer patients’ data and
   are given in [66]Supplementary Table 1. We compared the SPKMG values
   for the 11 breast cancer patients with the reference population of 20
   healthy samples ([67]Supplementary Table 2). [68]Figure 2(a,b) shows
   the MDS plot, constructed using the edgeR[69]^30,[70]^31 MDS algorithm,
   and the hierarchical clustering of SPKMG for all 31 samples. The
   breast cancer patients cluster more closely than the healthy
   individuals ([71]Fig. 2). This is in line with the observations of
   Thiagalingam et al.[72]^10 and Ni et al.[73]^11 that there are
   reproducible patterns in LOHs and CNVs across cancer patients. This
   shows that SPKMG
   follows a quantitative and normative pattern that has the potential to
   be used to define quantitative properties of a cancer population. To
   further examine the properties of SPKMG, we compared the values between
   the healthy samples and breast cancer patients, as well as within the
   breast cancer patients.

Figure 2. Clustering of Breast cancer and healthy individuals.

   (a) MDS plot of SPKMG values of 11 breast cancer and 20 healthy
   samples. Distances were calculated using log fold changes. Marker 1 and
   Marker 2 in this figure are the top two leading log fold change
   dimensions in the data. (b) SPKMG heatmap and hierarchical clustering
   of the breast cancer and healthy samples.
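
   The log fold change distance behind panel (a) can be sketched as
   below. This is a plain Python approximation of the leading log fold
   change distance used by the edgeR MDS plot (the root mean square of
   the largest log2 fold changes between two samples), not a call into
   edgeR itself. The pseudo-count of 0.5 and the example values are
   assumptions made for illustration.

```python
import math


def leading_logfc_distance(sample_a, sample_b, top=500, pseudo=0.5):
    """Root-mean-square of the `top` largest log2 fold changes.

    sample_a, sample_b -- per-gene values for two samples, same order
    pseudo             -- small offset to avoid log of zero (assumed)
    """
    if not sample_a or len(sample_a) != len(sample_b):
        raise ValueError("samples must be non-empty and equal length")
    logfc = [
        math.log2((a + pseudo) / (b + pseudo))
        for a, b in zip(sample_a, sample_b)
    ]
    # Keep only the genes that differ most between the two samples
    largest = sorted((v * v for v in logfc), reverse=True)[:top]
    return math.sqrt(sum(largest) / len(largest))


# Hypothetical per-gene values for two samples
s1 = [1.5, 3.5, 7.5]
s2 = [1.5, 1.5, 1.5]
d = leading_logfc_distance(s1, s2, top=2)
```

   The pairwise distances computed this way would then be passed to a
   multidimensional scaling routine to obtain the two plotted
   dimensions.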

Breast Cancer Comparative Analysis (Case/Control)

   We performed a comparative analysis of the breast cancer population
   and the control population ([76]Supplementary Table 2) using edgeR. We
   selected the most statistically significant genes, with
   p-values ≤ 1.0e-20 (or adjusted p-values ≤ 1.0e-17). This gave us 28
   genes with adjusted p-values ranging from 6.3e-63 to 5.5e-18
   ([77]Table 1). An analysis of the COSMIC[78]^32 database showed that
   all these 28 genes have been discovered in multiple cancer studies,
   with study counts ranging from 6 for KIR3DL1 to 43 for FLG. All these
   genes have been widely researched with the number of PubMed citations
   ranging from 8 for SPRRD2 to 92 for FLG (in the References section of