Abstract
We applied two state-of-the-art, knowledge-independent data-mining
methods – Dynamic Quantum Clustering (DQC) and t-Distributed Stochastic
Neighbor Embedding (t-SNE) – to data from The Cancer Genome Atlas
(TCGA). We showed that the RNA expression patterns for a mixture of
2,016 samples from five tumor types can sort the tumors into groups
enriched for relevant annotations including tumor type, gender, tumor
stage, and ethnicity. DQC feature selection analysis discovered 48 core
biomarker transcripts that clustered tumors by tumor type. When these
transcripts were removed, the geometry of tumor relationships changed,
but it was still possible to classify the tumors using the RNA
expression profiles of the remaining transcripts. We continued to
remove the top biomarkers for several iterations and performed cluster
analysis. Even though the most informative transcripts were removed
from the cluster analysis, the sorting ability of remaining transcripts
remained strong after each iteration. Further, in some iterations we detected a repeating pattern of biological function that was not detectable when the core biomarker transcripts were present. This suggests
the existence of a “background classification” potential in which the
pattern of gene expression after continued removal of “biomarker”
transcripts could still classify tumors in agreement with the tumor
type.
Introduction
Dozens of public genomic data repositories relevant to human biology
have emerged to support biomedical science. These repositories include
The Cancer Genome Atlas (TCGA^1), the Genotype-Tissue Expression (GTEx) project^2,3, and the Database of Genotypes and Phenotypes (dbGaP^4). These databases are deep and diverse,
containing data ranging from DNA sequence to epigenetic state to
dynamic gene expression output. TCGA, for example, contains multiple
measurements for over 14,000 tumors of multiple types. GTEx contains
the gene expression patterns and matching DNA sequences for over 11,000
human tissue samples. dbGaP contains over 2.4 million molecular assays
relevant to human disease and phenotypes. Clearly, there is massive
opportunity to mine these and future databases for biological insight.
A powerful quantitative measurement of genome information flow within a
biological sample is the steady state RNA-based gene expression
profile; this profile is captured in a gene expression vector
representing the number of RNA molecules produced by tens of thousands
of genes in the specimen. A researcher can aggregate profiles from
multiple samples retrieved under varied biological contexts into m × n
gene expression matrices (GEMs) where rows (m) are gene or RNA
transcript identifiers and columns (n) are samples. One can think of a
GEM as a compendium of molecular snapshots from different points of
anatomy, developmental stage, and environment. While a liver gene
vector, for example, has the same genes as a neighboring pancreas gene
vector, their gene expression intensities vary in a manner reflective
of their underlying biology.
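As a concrete illustration of this layout, the following short sketch (Python with pandas; toy values and hypothetical gene identifiers, not TCGA data) builds a miniature GEM with transcripts as rows and samples as columns.

import pandas as pd

gem = pd.DataFrame(
    {
        "liver_sample_1":    [120, 0, 1500],
        "pancreas_sample_1": [3, 880, 1400],
    },
    index=["ALB", "INS", "GAPDH"],   # hypothetical gene identifiers
)

# Each column is one sample's expression vector; the genes (rows) are shared,
# but their intensities differ with the underlying biology of each tissue.
print(gem)
print(gem.shape)   # (m genes, n samples)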
GEMs can be normalized to reduce technical variation between biological
samples^3,5,6. Correlation analysis can be performed to identify genes whose expression patterns are synchronized across samples using software such as WGCNA^7 and KINC^8. Genes with similar expression patterns are predicted to share a common biological purpose.
However, the brute force interrogation of all samples in a GEM may not
be the best approach to detect context-specific gene interactions;
overrepresentation of one biological condition might drown out rare
gene interactions. One approach to addressing variation between samples
involves sorting the gene expression vectors into sample clusters of
similar contexts. Thus, liver expression profiles might produce one
sample cluster, as will pancreas profiles, etc. Correlation analysis
can be performed in sorted sample clusters. Sample sorting is becoming
increasingly relevant given the rapid growth of samples in databases
that are being interrogated for biological function. If a sample
sorting technique can robustly sort mislabeled or outlier samples into alternate groups, then noise should be reduced for each group, possibly making it easier to identify meaningful biomarkers and pathways.
Two approaches exist for clustering GEMs into sample groups of similar
global gene expression patterns: knowledge-dependent and
knowledge-independent methods. Both can increase the probability of
detecting gene expression patterns relevant to the sorted biological
contexts. Knowledge-dependent methods sort the GEM into sub-GEMs based
upon annotations associated with the samples. Thus, one could prepare a mixed-condition GEM from a data repository like TCGA and then sort the GEM into tumor types prior to analysis. This
approach, while logical – the metadata associated with TCGA is well
curated – weakens if the sample label is assigned incorrectly or is
representative of multiple subgroups. This can happen, for example,
when a tumor type is incorrectly annotated or when tumors are related
by molecular architecture as opposed to tissue of origin^9.
A less-biased, knowledge-independent approach uses clustering methods
to sort datasets into groups based only upon their global expression
pattern. For example, k-means clustering of global gene expression
profiles sorts samples into a pre-defined number of groups and has been
shown to improve gene-gene interaction detection^10–12. Sample
metadata is assigned after sorting to provide conditional context to
the clusters. One sample cluster might be enriched for a specific
biological context label, such as a tumor type, and any genetic
relationships from that cluster can be associated with that label. A
drawback to k-means clustering is that the number of clusters must be specified beforehand, resulting in bias in the sample grouping if an inappropriate number of distinct clusters is chosen.
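A minimal sketch of this kind of knowledge-independent sample clustering, assuming scikit-learn and a random placeholder matrix in place of a real GEM, is shown below; note that k must be supplied up front, which is the drawback just described.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
gem = rng.random((500, 60))          # placeholder GEM: 500 genes x 60 samples

X = gem.T                            # cluster samples, so samples become rows
k = 5                                # k must be chosen beforehand (the drawback noted above)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Sample metadata (e.g. tumor type) would be overlaid on these labels afterwards
# to give each cluster its biological context.
print(labels)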
There are emerging knowledge-independent clustering methods that can be
applied to GEMs and that do not introduce as much bias. One method,
t-distributed stochastic neighbor embedding (t-SNE^13,14) – like most knowledge-independent sample clustering approaches – relies upon strongly reducing the dimensionality of the gene expression space prior to sample comparison. Two common algorithms used to perform this task are principal component analysis (PCA^15) and singular value decomposition (SVD^15). Typically, t-SNE then projects the high-dimensional data into two or three dimensions. It should be noted that t-SNE has been applied to GTEx^16 and TCGA^17 datasets, where the TCGA study used t-SNE as part of an integrated omics sorting workflow called MEREDITH^17.
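A hedged sketch of that workflow (strong dimensionality reduction followed by a two-dimensional embedding) follows; it uses scikit-learn's PCA and TSNE with illustrative parameters and a placeholder matrix, not the implementation or settings used in this study.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
gem = rng.random((2000, 300))        # placeholder GEM: 2,000 genes x 300 samples

X = gem.T                                          # samples as rows
X_reduced = PCA(n_components=50).fit_transform(X)  # strong dimensionality reduction
embedding = TSNE(n_components=2, perplexity=30,
                 init="random", random_state=0).fit_transform(X_reduced)

print(embedding.shape)               # (300 samples, 2 dimensions)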
Dynamic Quantum Clustering (DQC^18), unlike other clustering approaches, does not need to use strong dimensionality reduction to analyze high-dimensional data; however, it is common in a DQC analysis to use modest SVD-based dimensionality reduction to speed up the analysis. DQC begins by replacing each column of a GEM (which can be thought of as a vector in an m-dimensional Euclidean space) by an analytic function in m variables; specifically, each column of the GEM is replaced by a Gaussian function centered on the m-dimensional location specified by the corresponding gene expression vector. The sum of these functions is then used to create a potential function, V, that is a proxy for the density of the data in feature space^19. By construction, the local minima and saddle points of this potential function represent regions of higher local density of the data. This potential function is then used to create a Hamiltonian operator, as shown in Equation (1):
    H = -\frac{1}{2m}\nabla^{2} + V        (1)

Each individual Gaussian, ψ(x), is then evolved according to the corresponding Heisenberg equation of motion^20 in Equation (2):

    \psi(x,\delta t) = e^{-i\,\delta t\,H}\,\psi(x)        (2)
The center of this time-dependent Gaussian will move a short distance,
implementing a modified, operator form of gradient descent that moves
the original center towards the nearest local minimum of V. Due to the
non-local effect of quantum evolution, there are important differences
that allow DQC to avoid the difficulties associated with simple
gradient descent for many points in high dimension. In particular,
choosing a low value for the mass parameter, m, exploits quantum
tunneling to avoid getting trapped in small fluctuations, avoiding many
of the issues related to working in high dimension.
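To make the mechanics of Equations (1) and (2) concrete, the following is a schematic one-dimensional sketch, not the authors' implementation: it builds the sum-of-Gaussians wavefunction from a few toy points, derives a potential whose minima track the data density, and evolves one Gaussian a short time step under H. The grid, sigma, mass, and time step are arbitrary illustrative choices.

import numpy as np
from scipy.linalg import expm

points = np.array([-2.2, -2.0, -1.8, -1.3, 1.5, 1.6, 1.7])   # toy 1-D "samples"
sigma, mass, dt = 0.5, 0.2, 0.1                               # illustrative parameters

x = np.linspace(-5, 5, 400)
dx = x[1] - x[0]

# Sum of Gaussians centred on the data points (the wavefunction described above)
psi = np.exp(-(x[:, None] - points[None, :])**2 / (2 * sigma**2)).sum(axis=1)

# Potential built from psi; its minima sit where the local data density is high,
# so it acts as the density proxy V described in the text.
lap_psi = np.gradient(np.gradient(psi, dx), dx)
V = 0.5 * sigma**2 * lap_psi / psi
V -= V.min()

# Hamiltonian H = -(1/2m) d^2/dx^2 + V as a finite-difference matrix (Equation 1)
lap = (np.diag(np.ones(len(x) - 1), -1) - 2 * np.eye(len(x))
       + np.diag(np.ones(len(x) - 1), 1)) / dx**2
H = -lap / (2 * mass) + np.diag(V)

# Evolve the Gaussian of one sample by exp(-i dt H) (Equation 2) and track its
# centre, which drifts toward the nearest minimum of V (a region of high density).
phi0 = np.exp(-(x - points[3])**2 / (2 * sigma**2))
phi_t = expm(-1j * dt * H) @ phi0
centre = float(np.sum(x * np.abs(phi_t)**2) / np.sum(np.abs(phi_t)**2))
print(f"centre moved from {points[3]:.2f} to {centre:.2f}")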
Another major difference between a DQC analysis in m-dimensions and
other data-mining methods is that the entire analysis is encoded as an
m-dimensional animation. This visual presentation of the computation
provides a detailed record of each computational stage, showing how
clusters and other structures form. A benefit of the DQC approach is
that it avoids introducing bias; there is no need to invent hypotheses
to test, assume the number or type of clusters that exist, or invoke
prior sample knowledge. Selected frames from DQC animations are shown
in this report and the full animations are available for viewing in
Supplementary Videos 1 and 2.
This study analyzed a mixed tumor type GEM from TCGA containing the
RNAseq expression profiles of 2,016 tumors from five tumor types: lower
grade glioma (LGG), glioblastoma multiforme (GBM), ovarian serous
cystadenocarcinoma (OV), urothelial bladder cancer (BLCA), and
papillary thyroid carcinoma (THCA). We discuss the clustering and
biomarker discovery potential using t-SNE and DQC approaches. In
addition, we examine the effect of salient biomarker removal and the
ability of both techniques to continue to classify the tumors into
meaningful groups in the absence of transcripts with high classification
potential. When applied to deeply sequenced tumors, our approach can be
used to detect biomarker combinations that sort tumor types without
prior knowledge of where the tumor was initiated.
Results
Tumor Separation via DQC and t-SNE Analyses
We first examined the clustering potential of DQC^18 and compared it to an approach that performs strong dimensional reduction, t-SNE^13. To do this, we constructed a mixed tumor GEM consisting of 2,016 samples from TCGA^1. Specifically, we mixed tumor
expression profiles from five TCGA labeled sample groups: bladder
cancer (BLCA; n = 427), glioblastoma (GBM; n = 174), lower grade glioma
(LGG; n = 534), ovarian carcinoma (OV; n = 309), and thyroid cancer
(THCA; n = 572). It should be noted that some of the groups contained small numbers of non-tumor samples carrying the same label as the tumor type (BLCA = 19/427; GBM = 5/174; LGG = 0/534; OV = 0/309; THCA = 59/572).
This GEM was used as the input for both the DQC and t-SNE analyses
(Fig. 1) discussed in the following sections.
Figure 1.
Tumor Classification Potential Revealed by t-Distributed Stochastic
Neighbor Embedding (t-SNE) and Dynamic Quantum Clustering (DQC). Each
point represents a tumor sample. (A) The full 2,016-tumor gene expression matrix (73,599 transcripts) embedded using t-SNE (left) or
evolved by DQC (right); (B) The full 2,016-tumor gene expression matrix
(48 “core transcripts”) embedded using t-SNE (left) or evolved by DQC
(right). Numbers denote consensus clusters identified by HDBSCAN. We
circled the brain tumor “arch” – discussed in the text – that is only
seen in the DQC analysis; (C) The full 2,016-tumor gene expression
matrix (all transcripts minus 48 “core transcripts”) embedded using
t-SNE (left) or evolved by DQC (right); (D) The full 2,016-tumor gene
expression matrix (48 random transcripts) embedded using t-SNE (left)
or evolved by DQC (right). Full DQC animation of 48 “core transcripts”
can be seen in Supplemental Video 2.
We began by using DQC to produce animations showing the “quantum
evolution” of the 2,016 tumor samples for the SVD-decomposition of the
GEM containing all 73,599 transcripts (Fig. 1A). This initial
analysis was done in both 50 and 60 SVD-dimensions. These values were
chosen by the requirement that restricting to the first 50 or 60 SVD
eigenvalues would approximate the original mixed GEM to better than the
1.5% level (as measured by the change in matrix norm defined as the
square root of the sum of the squares of every entry in the matrix).
Since the overall pattern of data separation is evident in all dimensions after this DQC evolution, we only show plots for the first twelve SVD-dimensions (frames of dimensions 1–3 in Fig. 1). This
DQC analysis produced clusters with structure in which BLCA, OV, and
THCA samples formed extended flat shapes populated by many distinct
sub-clusters. Note, however, that the overall shape of the region where
these clusters are located only showed a slight change during DQC
evolution. In contrast, the GBM and LGG samples showed obvious change
with DQC evolution, first forming complex filamentary structures that
then separate into two somewhat distorted clusters.
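The dimension choice described above (keeping enough SVD components that the truncation error, measured in the matrix norm, stays below 1.5%) can be sketched as follows; a small random placeholder stands in for the real 2,016 × 73,599 GEM.

import numpy as np

rng = np.random.default_rng(2)
gem = rng.random((200, 5000))                    # placeholder for the real GEM

U, s, Vt = np.linalg.svd(gem, full_matrices=False)
sq = s ** 2
frob = np.sqrt(sq.sum())                         # matrix norm of the full GEM

# rel_err[N-1] = relative error when only the first N singular values are kept
rel_err = np.sqrt(np.maximum(frob**2 - np.cumsum(sq), 0.0)) / frob
N = int(np.argmax(rel_err < 0.015)) + 1
print(f"keep {N} SVD dimensions (relative error {rel_err[N-1]:.4f})")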
After the initial DQC-evolution, we used DQC based feature selection
(see Materials and Methods for details) to discover 48 “core
transcripts” (Supplemental Table 1) that produced a very similar view of the data. We then subjected a subset GEM containing just the
columns corresponding to these 48 transcripts to DQC evolution
(Fig. 1B) in the full 48 dimensions. Because these 48 “core
transcripts” do a very good job of representing the important structure
in the data, the DQC results are very similar to the DQC evolution of
the full GEM. As before, BLCA, OV, and THCA samples formed planes in
three dimensions occupied by many distinct clusters. However, evolution
of the GEM restricted to the “core transcripts” revealed that the GBM
and LGG samples formed a “brain arch” with GBM samples distributed at
one extreme and LGG along the rest of the structure (circled in Fig. 1B). This strong separation of GBM from LGG is evidence that –
at least for glial cell tumors – the 48 “core transcripts” are
providing information about tumor type and not just tissue of origin.
It should be noted that only the DQC analysis revealed a strong
separation of GBM and LGG tumors; the t-SNE analysis divided these
tumors into several clusters, but none of these sub-clusters were
highly enriched for a specific tumor type.
The geometry visible in the previous plot changes markedly when the
TCGA matrix is analyzed with the 48 core transcripts removed (Fig. 1C). In this case the shape of the data is entirely
different, in that it exhibits four clearly separated lobes. Still, DQC
evolution of this dataset showed that all five tumor types could be
easily separated from one another. This surprising result led to
further study described later in this report. Finally, to test the
efficacy of DQC-based feature selection, we showed that repeated
selection of 48 random transcripts did not separate the tumors into
tissue of origin by t-SNE or DQC (representative result in Fig. 1D).
Sample clustering of the full TCGA GEM was repeated using the t-SNE-HDBSCAN pipeline (Fig. 1A–D). Like DQC, t-SNE segregated the tumors into five groups using all 73,599 transcripts (Fig. 1A), the 48 core transcripts (Fig. 1B), or all transcripts minus the 48 core transcripts (Fig. 1C). Forty-eight random transcripts did not cluster the tumors via t-SNE (Fig. 1D; one of 20 runs with similar results is shown). The 48-core-transcript subset segregated the five tumor types into nine clusters in two dimensions as identified using the HDBSCAN/Cluster Ensembles consensus clustering approach (Fig. 1B). Consensus clusters are labeled with numbers in Fig. 1B and discussed below. In contrast with DQC, the 48 core transcripts failed to produce t-SNE embeddings that cleanly separated the five tumor types, failed to cleanly separate GBM from LGG, and failed to reveal the “brain arch” (Fig. 1B), although in general the t-SNE embeddings produced clusters that segregated more distinctly than those from DQC.
The “Brain Arch” Tumor Substructure
DQC analysis revealed an interesting substructure between LGG and GBM
brain tumors. To examine this structure in more detail, we dissected
the arch samples into seven groups by k-means clustering (Fig. [79]2A).
Groups 1 through 5 are primarily LGG tumors while groups 6 and 7 are
GBM. Interestingly, visualizing the expression levels across the “brain arch” shows a trend of epithelial, thyroid, and other genes turned on at the left-hand leg of the arch (LGG-enriched group 1) that tend to be off at the right-hand leg (GBM-enriched group 7). There also appear to be immune response, extracellular matrix, and differentiation genes turned on at the bottom of the right leg of the arch (GBM-enriched group 7) and off at the bottom of the left leg (LGG group 1).
Figure 2.
LGG and GBM Samples Create Complex “Brain Arch” Structure in DQC
Evolutions. (A) The 48 “core transcript” subset of LGG (green) and GBM
(blue) samples produces an arch in DQC. Segmentation of samples within
the arch was performed by k-means clustering at frame 56 of the DQC
evolution of the 48 transcripts. Each arch group was labeled with a
number; (B) Average gene expression heatmap of all 48 transcripts along
the arch for each position numbered to the left. Yellow is high
expression and red is low expression. Locus names (Supplemental Table 1) are shown on the far right.
Annotating Tumor Consensus Clusters
Tumors that cluster by gene expression pattern would be expected to
show enrichment for tumor type label and other attributes. Enrichment
analysis was performed on the consensus clusters labeled in Fig. 1B (p < 0.001). TCGA patient attributes for age, race, ethnicity, gender, and tumor stage were available for the majority of the 2,016 tumor samples. t-SNE embeddings of the 48 core transcripts with patient attribute values labeled by color are shown in Fig. 3 and listed in Fig. 4.
Figure 3.
t-SNE Embeddings of 48 Transcripts Color-coded for Patient Attributes.
(A) Age: red = “age at or above 40”, blue = “age below 40”; (B)
Race/ethnicity: blue = “American Indian or Alaska native”,
red = “Asian”, green = “Black or African American”, purple = “native
Hawaiian or Pacific islander”, pink = “Hispanic or Latino”,
yellow = “white”; (C) Gender: green = “male”, purple = “female”; (D)
Tumor stage: blue = “stage I”, green = “stage II”, yellow = “stage
III”, red = “stage IV”, purple = “stage IV-A/B/C”. Gray in all cases
indicates missing data. Note that the embedding of the samples used here is different from that in Fig. 1B but the cluster identities are identical.
Figure 4.
t-SNE Embedding-derived Patient Attribute Enrichment. Consensus cluster
identification and attribute enrichment results from 1000 t-SNE
embeddings of the 48 transcripts. Gray boxes indicate significant
enrichment (p < 0.001).
All clusters were statistically enriched (p < 0.001) for at least one
tumor type label. Clusters 6, 7, and 8 were enriched for both GBM and
LGG tumor type labels. Thyroid tumor clusters were enriched for the
female gender. One of the bladder tumor clusters and two of the mixed
brain tumor clusters were enriched for the male gender. The bladder and
ovarian tumor clusters were uniformly enriched for age above 40 years,
while several of the thyroid and brain tumor clusters were enriched for
age below 40. The age enrichment corresponded roughly to stage
enrichment: both thyroid clusters were enriched for tumors in their
earliest stage while the mixed BLCA-OV clusters 2, 3, and 4 (where data
was available) were enriched for stage IV tumors. Little enrichment was
observed based on race or ethnicity although thyroid-enriched cluster 0
was also enriched for “Asian” as a race. The mixed GBM-LGG tumor
cluster 6 was enriched for the annotation term “white”. It is worth
noting that we did not attempt to correct for bias that may exist
within the TCGA database itself with regard to sample collection.
Tumor Classification Potential by DQC Based Transcript Selection and Removal
As can be seen in Fig. 1C, the initial reduced GEM (i.e. the one
obtained from the original GEM by removing the 48 core transcripts) can
still be used to successfully segregate the tumors into five groups via
either t-SNE or DQC. Moreover, the DQC analysis once again successfully
separated the GBM tumors from the LGG tumors. So, we see that – at
least for the case of glial cell tumors – information about tumor type
and not just tissue of origin is being encoded in the reduced GEM.
Still, brain tumors aside, the substructure observed with both DQC and t-SNE distinguished the other tumor types. Why would classification occur in the
absence of the core biomarkers? It is well known that hundreds to
thousands of genes might be differentially expressed between tumor
types. In the case of GBM and LGG, it has previously been reported that
2275 genes are differentially expressed between these tumor
types^21. Therefore, one would expect several thousand transcripts to have a combined sorting ability, as the gene set remaining after removal of the core biomarkers likely still contains differentially expressed genes. However, we were curious about how deeper biomarker removal would affect sample sorting. Thus, we performed successive rounds of DQC-based feature (i.e. important transcript) selection and removal.
Since this part of the analysis was only focused on how systematic
removal of features would affect the ability to separate tumors, we
decided to expedite the analysis by limiting DQC-based feature
selection to the first 12 SVD dimensions to identify the next layer of
important transcripts. For these 12 eigenvectors, we plotted the sorted
absolute values of their components and set a threshold based upon
breaks in the plot. These thresholds steadily decreased as we repeated
the process of important transcript selection and removal. At each
iteration, we checked that the new set of important transcripts alone
could still separate the glial cell tumors by diagnosis (Fig. 5) and not just tissue of origin. Furthermore, we tested the gene set for the significant enrichment of biological annotation terms (Fig. 6; Supplemental Table 2), removed the important transcripts, and then
repeated the transcript selection process. We repeated this procedure a
total of 18 times, thus identifying multiple layers of important
transcripts corresponding to relaxed threshold magnitudes for the
SVD-eigenvectors.
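A schematic skeleton of this iterate-select-remove procedure is sketched below. The selection rule shown is a hypothetical stand-in for the DQC-based feature selection described in Methods (it scores transcripts by the magnitude of their loadings on the first 12 SVD components and keeps the top fraction), and the data are a random placeholder.

import numpy as np

def select_important_transcripts(gem, n_svd=12, quantile=0.999):
    """Hypothetical stand-in: score transcripts by their largest |loading| on the
    first n_svd SVD components and keep those above a high quantile ("gap")."""
    U, s, Vt = np.linalg.svd(gem, full_matrices=False)
    loadings = np.abs(Vt[:n_svd, :]).max(axis=0)
    threshold = np.quantile(loadings, quantile)
    return np.where(loadings >= threshold)[0]

rng = np.random.default_rng(3)
gem = rng.random((200, 5000))              # placeholder: samples x transcripts
remaining = np.arange(gem.shape[1])        # transcript indices still in play

layers = []
for iteration in range(18):
    idx = select_important_transcripts(gem[:, remaining])
    layers.append(remaining[idx])          # record this layer of "important" transcripts
    remaining = np.delete(remaining, idx)  # remove them and repeat on what is left
    # (in the study, each layer was also checked for tumor separation by DQC/t-SNE
    #  and for pathway enrichment before the next round)

print([len(layer) for layer in layers])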
Figure 5.
Layer Extraction by DQC Based Feature Selection. Eighteen iterations of
DQC based feature selection identified transcripts in windows of
decreasing information content. The first nine iterations are shown
with a representative t-SNE embedding on the left and a DQC snapshot in
the first three dimensions on the right for each iteration. Iteration 1
is the same data as shown in Fig. 1B. The number of transcripts identified with SVD-entropy from iteration 1 to 18: 48, 57, 55, 22, 22, 73, 54, 23, 146, 87, 41, 34, 413, 279, 209, 119, 80, 2102. Iterations in red show repeating patterns in Fig. 7.
Figure 6.
Cluster Coherence Increases with Sample Size and Functional Enrichment
Decreases with Decreasing DQC Feature Selection Threshold. (A) Percent
of clusters enriched for at least one tumor type increases with sample
size; (B) The number of Reactome pathways at each level of feature
selection corrected for transcript count.
Remaining transcripts after iterative removal of important transcripts were studied by both DQC and t-SNE (iterations 1–9 shown in Fig. 5). All iterations revealed varying capacity of the remaining transcripts to segregate the five tumor types. The number of transcripts in each interval also varied greatly. The numbers of transcripts identified at each level – from iteration 1 to 18 – are 48 (core transcripts from the full GEM described above), 57, 55, 22, 22, 73, 55, 23, 146, 87, 41, 34, 413, 280, 209, 119, 80, 2101. The iterations with the fewest transcripts, iterations 4 and 5, identified 22 RNA transcripts each; the largest layer, at 2101 transcripts, still had significant tumor “classification potential” despite this population of transcripts being the least informative as measured by the absolute values of the components of the SVD-eigenvectors. We performed DQC evolutions for iterations 1–9 but, due to size constraints, only a single example DQC evolution, for iteration 9, can be found in Supplemental Video 1.
In this analysis, we noticed a trend where, for both DQC and t-SNE analyses, layers with larger transcript population sizes generally exhibit greater capacity for tumor segregation (Fig. 6A). In contrast, performing biochemical pathway enrichment analysis on the transcripts at each level, we see a decrease in biological function (i.e. enriched Reactome pathways; Fig. 6B) with each successive level. While the first core transcript iteration contains a complex set of tumor type and tissue relevant pathways, iterations 2, 3, 4, 6, 9, and 13 repeat identical pathways (Fig. 7).
Figure 7.
Biological Enrichment at Each Layer of SVD-Entropy Threshold
Relaxation. Specific Reactome pathway enrichment (FDR < 0.01) at each
level of feature selection (01–18) is shown as a black box. The total
number of shared functions across all magnifications is shown.
“Background” Tumor Classification Potential
To test if the effect of classification potential was simply due to the
number of transcripts in the GEM, random samplings of transcripts from the TCGA matrix were taken to determine the tumor classification potential of random GEMs of increasing size (Fig. 8). Random samplings of 200 transcripts or fewer either produced indistinct clusters or clusters of mixed tumor type identity. Larger random
samples of transcripts, when t-SNE embedded and HDBSCAN clustered, were
generally better able to segregate the five tumor types. We term this
random classification potential the “background classification
potential” as opposed to the more specific biomarker classification
potential seen in early iterations of DQC based feature selection.
Figure 8.
Background Tumor Classification Potential Revealed by Random Transcript
Subsets of Sufficient Size. t-SNE embeddings of random subsets of the
tumor gene expression matrix are presented at various subset sizes. Top
row, left to right: subsets of size 25, 50, 75, 100, 125; middle row:
subsets of size 150, 175, 200, 225, 250; bottom row: subsets of size
300, 400, 500, 1000. Sample colors are as follows: BLCA, light blue;
GBM, dark blue; LGG, green; OV, red; THCA, magenta.
Discussion
The motivation for this study was to identify better methods for
segregating samples into biologically-relevant groups based upon
quantitative dimensions (i.e. steady state RNA expression). We expected
that these dimensions would confer biological classification potential
which in this study was the separation of tumor types without prior
knowledge of the sample origin. To achieve our goal, we tested two
approaches, DQC and t-SNE, both of which grouped samples into clusters
that made biological sense based on sample annotation enrichment.
The biological relevance of the tumor clusters was evidenced by
enrichment of annotations relative to all tumors. The consensus
clustering technique applied to one thousand t-SNE embeddings revealed nine tumor subpopulations (Figs 3 and 4). For example, we observed in cluster 3 that BLCA and OV tumors were similar enough to co-cluster and be enriched for late onset (≥40 years), whereas cluster 4 was late onset but restricted to OV tumors. The unusual
co-clustering of disparate tumor types in cluster 3 warrants further
investigation. Further, we identified four clusters (0, 5, 6, 8) that
were detected in younger patients (≤40 years) and two additional
clusters of tumors found in older patients (1, 2). We also observed an
enrichment in clusters 0 and 5 for THCA and female attributes. It
should be noted that 59/572 (10.3%) of samples labeled THCA were
actually “Solid Tissue Normal” samples. These clusters make sense as
thyroid cancer is 2.9 times more likely to occur in females^22.
Clusters 1, 6, 8 showed a male bias and clusters 6 and 8 are associated
with brain cancer; this finding corroborates the male bias seen in
brain tumor incidence^23. None of the brain tumor samples were
annotated as normal. With a deeper attribute dataset, we believe this
approach would reveal higher quality sample groups and even more
insight for the exploration of tumor biology.
While both t-SNE and DQC performed well in clustering the tumors, there
were some interesting differences. DQC revealed interesting patterns
that were not observed with t-SNE. For example, the DQC “brain arch”
driven by the 48 core transcripts revealed an interesting substructure connecting LGG and GBM brain tumors (Fig. 2). When we clustered the samples across the DQC arch from LGG to GBM, we found that the expression levels of the 48 core transcripts varied across the “brain arch” and exhibited a trend where epithelial, thyroid, and other genes were often up-regulated in samples at the bottom of the left-hand leg of the arch (LGG group 1) and often down-regulated in samples at the bottom of the right-hand leg (GBM group 7). There also appear to be immune response, extracellular matrix, and differentiation genes turned on at the right-hand leg of the arch (GBM group 7) and off at the bottom of the left-hand leg (LGG group 1). These data suggest that the 48 core biomarkers have excellent classification potential for all tumors. Furthermore, a subset of these genes function differently in GBM and LGG tumors, which have very different survival times of 14.6 months (GBM^24) versus 7 years (LGG^25).
This is not the first study to identify biomarkers and classify TCGA
tumor types. Martinez et al. identified eight transcriptional
superclusters using unsupervised hierarchical clustering of expression
profiles between tumor sub-types across twelve TCGA tumor
types^26. In contrast to our approach, their study analyzed the
top 1500 genes and used prior knowledge of tumor type in their analysis
whereas our study input the full GEM and sorted tumors without using
tumor type knowledge in the sorting process. Li et al. examined TCGA RNAseq data using a classification strategy in which they classified 9096 tumor samples from 31 tumor types using GA/KNN as the classification engine^27. That classification study used prior knowledge of tumor
type in the classification process whereas our study was blind to the
sample labels. In Hoadley et al., a cluster-of-cluster (COCA) technique
was used on RNAseq profiles (and other tumor molecular measurements) to
determine a molecular taxonomy of 12 cancer types, discovering 11
molecular signature-based subtypes^9. Their study showed that tissue of origin labels were not always indicative of the tumor’s molecular basis. In contrast to our study, they only used the 6000 most
variable genes in their analysis. It is also typical to use many
differentially expressed genes to characterize tumor type or subtype.
Ceccarelli et al., for example, used 2275 differentially expressed
genes to build “molecular profiles” of LGG and GBM tumors^21 and
Verhaak et al. identified 840 genes predictive of subtype in the case
of GBM tumors alone^28. We are aware that the 48 core transcripts
used in this analysis are probably a subset of such a larger population
of classifier genes.
Our results confirm there are two means of classifying samples: (1) using a small number of transcripts of high significance or (2) using many transcripts of low relative significance. By examining the functions of genes in the first iteration of DQC-based transcript selection (Fig. 7; Supplemental Table 2), it was clear that the first 48 core transcripts show more relevance to the tumor phenotype and thus are more likely to be mechanistic in tumor progression. We also demonstrated the existence of a “background classification” effect in which a random sample of transcripts – on the order of 200 – recapitulates the classification potential of the transcripts identified by DQC-based feature selection (Fig. 8); this effect supports the convention of using sets of hundreds or thousands of differentially expressed genes to characterize tumor types. It is possible, then, that relaxing the transcript significance threshold used in DQC-based feature selection may not identify additional transcripts of interest so much as reveal the high classification potential arising from the aggregate small effects of many transcripts.
While the “background classification” effect suggests that a
sufficiently large set of random gene expression vectors has
classification potential, it seems unlikely that all these genes would
be involved in tumor specific biology. We assumed that through
successive levels of DQC-based transcript selection we would remove
differentially expressed genes and detect essentially random genes
without collective function that merely contain the background
classification potential. In fact, we did find that as we continued
this process, the number of pathways we detected decreased after
correcting for the number of transcripts found in the pathway
(Fig. 6B). However, functional enrichment analysis, as defined by Reactome biochemical pathway enrichment in a gene group relative to all genes in the genome^29, revealed a repeating pathway enrichment
pattern for iterations 2, 3, 4, 6, and 7. The pathways that appear in
these iterations appear to control protein synthesis and may be a
signal for general cell growth and proliferation processes. It is
interesting to note that the threshold used in the DQC-based feature
selection process for the first seven levels did not decrease
significantly. Rather, the significant drop in threshold was only seen
in higher iterations. The drop-off in the number of observed reaction
pathways appears to coincide with the drop-off of the threshold that
had to be used in the selection process. A future experiment could
determine if these repeating genes and pathways are present in normal tissue GEMs (e.g. GTEx datasets^2), which would imply tissue specificity as opposed to a tumor type property.
In conclusion, we describe and contrast two very different sample
clustering algorithms: DQC and t-SNE. We implemented dimensionality
reduction to identify the important transcripts and all subsequent DQC
analysis was done without further dimensionality reduction. t-SNE
analysis involved further embedding into two spatial dimensions. Both
techniques are effective at clustering samples to detect substructures
in a GEM. The fact that DQC worked in the full 48-dimension space of
core transcripts is likely the reason it reveals more subtle aspects of
the data. Unexpectedly, we discovered the confounding effect whereby many random transcripts together can classify and sort samples. We also repeatedly
showed that this ability is not merely identifying tissue of origin as
opposed to tumor type. This is an early but intriguing concept that
should be addressed if a researcher seeks cause-and-effect as opposed
to tumor type-associated biomarkers. Finally, while we applied these
techniques to tumor data, the same approach can be applied beyond
genomics contexts, a fact that has been previously shown for DQC^19.
Methods
Gene Expression Matrix (GEM) Preparation
RNAseq profiles, RSEM-processed at the transcript level, for five public tumor types were downloaded on April 1, 2016 from the TCGA Data portal at https://gdc-portal.nci.nih.gov. A total of 2,016
datasets were obtained comprised of the types as labeled by TCGA: BLCA
(n = 427), GBM (n = 174), LGG (n = 534), OV (n = 309), and THCA
(n = 572). Each expression profile was merged into a single gene
expression matrix (GEM) with 73,599 transcripts labeled with knowngene5
UC-Santa Cruz genome database gene model identifiers. It is important
to address potential batch effects – technical and biological variation
between samples of the same group – in a high throughput genomics
study^6. The TCGA Batch Effects webserver (http://bioinformatics.mdanderson.org/tcgambatch/) was queried to
gain insight on batch effects present in each cancer subtype. It was
found that the Dispersion Separability Criterion (DSC) scores for the
RNAseqv2 isoform data for each cancer subtype indicate a mild presence
of batch effects in the data used in our study (p < 0.0005). The DSC
scores for BLCA, GBM, LGG, OV, and THCA are 0.310, 0.000, 0.298, 0.089,
and 0.252, indicating a low ratio of dispersion between vs. within
batches for these cancer types. Lauss et al. found that performing a
quantile normalization on colon cancer RNAseqv2 data from TCGA helped
to reduce batch effects present in this data^6. We performed a
similar quantile normalization on the GEM used in our study.
Furthermore, we did not detect any outliers using a Kolmogorov–Smirnov test as performed previously^8. The TCGA GEM was quantile normalized and
randomly sorted to create a single tumor GEM for input into the DQC and
t-SNE pipelines. A small fraction of the subtype-labeled samples (83 out of 2,016, or 4.12%; 19 BLCA, 59 THCA, 5 GBM) included “Solid Tissue Normal” samples. These and other clinical annotations associated with each TCGA sample are included in Supplemental Table 3.
DQC-Based Important Transcript Selection
As DQC evolution proceeds it becomes apparent that sample data points
separate well in some dimensions and not in others (see sample DQC
tumor evolution animation in Supplemental Video 2). This separation
allows for a novel form of feature selection where we select a subset
of the RNA transcripts that play the most important role in the
evolution of the data.
To describe the DQC selection process it is convenient to consider the
transpose of the original GEM, so that the rows are the tumor samples
and the columns are the RNA transcript labels. The SVD-decomposition of
this matrix rewrites the m × n GEM (m samples, n RNA transcripts) as
Equation (3):

    GEM_{ij} = \sum_{l} U_{il} S_{ll} V^{t}_{lj}        (3)
where U is an m × m matrix, S is an m × m diagonal real matrix with non-vanishing entries only along the diagonal, arranged in decreasing value, and V^t is an m × n matrix.
The rows of V^t are unit vectors in the n-dimensional space of transcripts and define a new coordinate system best adapted to plotting
the data. As the corresponding eigenvalue of S goes down, the variance
of the data in that direction drops as well. In general, the rows of
V^t are linear combinations of the original features that correspond to
columns in GEM. It is common to dimensionally reduce the matrix GEM by
choosing an integer N that is smaller than the number of rows in GEM
and defining Equation (4):

    GEM^{N}_{ij} = \sum_{l=1}^{N} U_{il} S_{ll} V^{t}_{lj}        (4)
where the sum over l goes from 1 to N. The square root of the sum of the squares of the left-out terms in S_ll provides an upper bound on the absolute error one makes by using the dimensionally reduced matrix instead of the original matrix. It is very convenient, when doing DQC evolution, to replace the dimensionally reduced matrix GEM^N by U^N, an m × N matrix made up of the first N columns of U.
Note that while the dimensionally reduced matrix U^N has fewer columns
than the original GEM it does not depend on fewer features, since each
row of V^t depends upon many more than N features.
DQC based feature selection simply examines those rows of V^t that
correspond to directions where the DQC evolved data clearly separates.
For each of these vectors, we plot the absolute value of the eigenvector components, sorted in decreasing order. These plots tend
to show a close group of larger values, followed by a gap, followed by
smaller values. We select those features corresponding to large values
of the components of the row of V^t in question. We combined features
identified in this way to arrive at our set of selected features.
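A minimal sketch of this gap-based selection, assuming a simple "largest drop" rule in place of the visual inspection described above, might look as follows; the choice of useful SVD directions and the placeholder data are illustrative.

import numpy as np

def features_from_component(v_row, min_keep=1):
    """Select feature indices whose |component| lies above the largest drop
    ("gap") in the sorted values of one row of V^t."""
    order = np.argsort(-np.abs(v_row))            # indices, decreasing |component|
    sorted_vals = np.abs(v_row)[order]
    gaps = sorted_vals[:-1] - sorted_vals[1:]     # drop between consecutive values
    cut = int(np.argmax(gaps[min_keep - 1:])) + min_keep
    return order[:cut]

rng = np.random.default_rng(4)
gem = rng.random((200, 5000))                     # placeholder: samples x transcripts
U, s, Vt = np.linalg.svd(gem, full_matrices=False)

# Suppose DQC evolution showed clear separation along SVD directions 0, 2, and 5
# (illustrative choices); pool the features selected from each of those rows of V^t.
useful_rows = [0, 2, 5]
selected = sorted(set(np.concatenate([features_from_component(Vt[r]) for r in useful_rows])))
print(len(selected), selected[:10])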
We should emphasize that this simple feature selection technique
requires prior DQC evolution of the data to identify the useful
directions of the SVD-decomposition. It should be noted that no theorem
guarantees this approach will produce an exhaustive list of the most important transcripts, and that no information about tumor types was
used at any stage of the selection process.
t-SNE Analysis and Consensus Cluster Detection
The full or partial GEM was evaluated separately by a dimensionality
reduction (embedding), clustering, consensus, and enrichment pipeline.
Embedding was performed using t-Distributed Stochastic Neighbor
Embedding (t-SNE^13) using the Python implementation from https://github.com/DmitryUlyanov/Multicore-TSNE. For each t-SNE
run, one thousand two-dimensional randomly initialized embeddings were
created. Each embedding was clustered individually using Hierarchical
Density-Based Spatial Clustering of Applications with Noise
(HDBSCAN^30). A consensus of nine clusters was determined over the
whole set of clustered embeddings using the Cluster Ensembles
method^31. Label enrichment of these consensus clusters for
patient attributes associated with the tumor samples was evaluated
using a Chi-squared test (p < 0.001).
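A hedged sketch of a single pass of this pipeline is shown below. It uses scikit-learn's TSNE as a stand-in for the cited Multicore-TSNE implementation, the hdbscan Python package, and scipy's chi-squared test; the Cluster Ensembles consensus step over the 1,000 embeddings is only indicated by a comment, and the data and parameters are illustrative.

import numpy as np
import hdbscan
from sklearn.manifold import TSNE
from scipy.stats import chi2_contingency

# Placeholder data with five artificial "tumor types" standing in for the GEM.
rng = np.random.default_rng(5)
types = ["BLCA", "GBM", "LGG", "OV", "THCA"]
centers = {t: rng.random(2000) for t in types}
tumor_type = rng.choice(types, size=300)
gem = np.stack([centers[t] + 0.1 * rng.standard_normal(2000) for t in tumor_type])

# 1) One randomly initialized two-dimensional embedding (1,000 were used in the study).
emb = TSNE(n_components=2, init="random", random_state=0).fit_transform(gem)

# 2) Density-based clustering of the embedding.
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(emb)

# 3) In the study, labels from all embeddings were combined into consensus
#    clusters with the Cluster Ensembles method (omitted here for brevity).

# 4) Chi-squared test of tumor-type enrichment for the largest cluster vs. the rest.
cluster_ids, counts = np.unique(labels[labels >= 0], return_counts=True)
cluster_id = int(cluster_ids[np.argmax(counts)])
in_cluster = labels == cluster_id
table = np.array([[np.sum(in_cluster & (tumor_type == t)),
                   np.sum(~in_cluster & (tumor_type == t))] for t in types])
chi2, p, dof, _ = chi2_contingency(table)
print(f"cluster {cluster_id}: chi2 = {chi2:.1f}, p = {p:.3g}")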
Background Classification Potential
Repeated samples, varying in size, of random subsets of transcripts
were extracted from the TCGA matrix and embedded using t-SNE. Twenty
random sample subsets of size 25, 50, 75, 100, 125, 150, 175, and 200, ten samples of subsets of size 225, 250, 300, 400, and 500, and five subsets of size 1000 were evaluated. These were treated with the same
pipeline as above where the samples were embedded, clustered with
HDBSCAN, consensus clusters were assigned by Cluster Ensembles, and
then each of these consensus clusters was evaluated for tumor type
enrichment. The percentage of clusters enriched for at least one tumor
type was recorded.
Functional Enrichment Analysis
Functional enrichment of the core and subsequent iterations of
transcripts was performed using an in-house Perl script modeled after
the online DAVID tool at https://david.ncifcrf.gov. Tested attributes include human transcripts mapped to terms from InterPro^32, PFAM^33, the Gene Ontology (GO)^34, the Kyoto Encyclopedia of Genes and Genomes (KEGG)^35, Reactome^36, and MIM^37. Terms that were present in a gene list more often than in the genomic background were considered enriched (FDR < 0.01). The full enrichment list is shown in Supplemental Table 2.
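The enrichment logic can be sketched as follows (this is not the in-house Perl script): each term is tested with a hypergeometric test against the genomic background and the resulting p-values are adjusted with a Benjamini-Hochberg procedure; gene identifiers and terms in the example are hypothetical.

import numpy as np
from scipy.stats import hypergeom

def enrich(gene_list, background, term_to_genes, fdr=0.01):
    """Hypergeometric enrichment of annotation terms with BH FDR control."""
    M, n = len(background), len(gene_list)
    terms, pvals = [], []
    for term, genes in term_to_genes.items():
        K = len(genes & background)                  # background genes carrying the term
        k = len(genes & gene_list)                   # listed genes carrying the term
        if K == 0:
            continue
        terms.append(term)
        pvals.append(hypergeom.sf(k - 1, M, K, n))   # P(X >= k)
    pvals = np.array(pvals)
    m = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order] * m / np.arange(1, m + 1)  # Benjamini-Hochberg adjustment
    adj = np.empty(m)
    adj[order] = np.minimum(np.minimum.accumulate(ranked[::-1])[::-1], 1.0)
    return [(terms[i], float(pvals[i]), float(adj[i])) for i in order if adj[i] < fdr]

# Toy example with hypothetical gene identifiers and terms.
background = {f"g{i}" for i in range(1000)}
term_to_genes = {"pathway_A": {f"g{i}" for i in range(30)},
                 "pathway_B": {f"g{i}" for i in range(500, 540)}}
gene_list = {f"g{i}" for i in range(25)} | {"g800", "g900"}
print(enrich(gene_list, background, term_to_genes))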
Electronic supplementary material
Supplemental Video 1 (631 KB, mp4)
Supplemental Video 2 (3.6 MB, mp4)
Supplemental Table 1 (19.5 KB, xlsx)
Supplemental Table 2 (117.8 KB, xlsx)
Supplemental Table 3 (480 KB, xlsx)
Acknowledgements