Abstract

   Knowledge of similarities among diseases can contribute to uncovering
   common genetic mechanisms. Based on ranked gene lists, a couple of
   similarity measures were proposed in the literature. Notice that they
   may suffer from the determination of cutoff or heavy computational
   load, we propose a novel similarity score SimSIP among diseases based
   on gene ranks. Simulation studies under various scenarios demonstrate
   that SimSIP has better performance than existing rank-based similarity
   measures. Application of SimSIP in gene expression data of 18 cancer
   types from The Cancer Genome Atlas shows that SimSIP is superior in
   clarifying the genetic relationships among diseases and demonstrates
   the tendency to cluster the histologically or anatomically related
   cancers together, which is analogous to the pan-cancer studies.
   Moreover, SimSIP with simpler form and faster computation is more
   robust for higher levels of noise than existing methods and provides a
   basis for future studies on genetic relationships among diseases. In
   addition, a measure MAG is developed to gauge the magnitude of
   association of anindividual gene with diseases. By using MAG the genes
   and biological processes significantly associated with colorectal
   cancer are detected.

   Keywords: Gene expression, Similarity score, Genetic relationships,
   Disease network, Gene ranks

Introduction

   Exploring the common genetic basis of complex human diseases is often
   useful for understanding the disease relationships drawing on their
   genetic mechanisms. A similarity measure is a central component in
   detecting the common genetic basis among different diseases. Several
   common metrics have been proposed to measure similarities between
   diseases, such as Pearson, Spearman correlation coefficient, Euclidean
   distance, Manhattan distance, and Jaccard correlation coefficient
   ([32]Antosh et al., 2013; [33]Dennis et al., 2003; [34]Serra, Romualdi
   & Fogolari, 2016; [35]Shi, Yi & Ma, 2014). However, with the
   technological developments in molecular biology, large-scale gene
   expression profiling datasets produced from diverse technological
   platforms necessitate new and adaptive similarity measures to reveal
   meaningful genetic relationships across multiple platform types.
   Ranking genes according to their contribution to each disease can
   convert heterogeneous data into platform-independent rank lists, and a
   recent study ([36]Serra, Romualdi & Fogolari, 2016) suggested that the
   discrimination ability of the similarity/distance measures based on
   ranked gene lists perform well or better than traditional measures
   (such as Euclidean distance and correlation coefficient), particularly
   for gene expression datasets produced with different biotechnologies
   (microarray, RNA-seq, etc.). Therefore, the genetic overlaps among
   diseases can be detected by integrating multi-platform datasets using
   similarity measures based on ranked gene lists, which can help us gain
   further insight on understanding disease etiology.

   A review of related studies in similarity measures based on ranked gene
   lists showed that most of them depend on either a fixed cutoff position
   to consider overlaps between the top part of ranked gene lists or
   variational cutoffs to select the one that generates the most
   significant results ([37]Chen, Zhou & Sun, 2015; [38]Dennis et al.,
   2003; [39]Von Mering et al., 2006). To avoid the uncertainty of results
   owing to arbitrariness in cutoff settings, global rank-based similarity
   measures have been developed. For example, the algorithm of GOrilla is
   a tool to discover and visualize the enrichment of GO terms that ranks
   genes by fold-change ([40]Eden et al., 2009). The algorithm of gene set
   enrichment analysis (GSEA) allows all genes to contribute to
   overlapping signals in proportion to their degree of differential
   expression and can detect the weak signals that would be discarded by
   “threshold” approaches ([41]Efron & Tibshirani, 2007; [42]Subramanian
   et al., 2005). The algorithm of CORaL estimates the significant set
   size using the overlaps between sections of the ranked gene lists by
   maximizing statistical likelihood ([43]Antosh et al., 2013). The
   algorithm of R2KS particularly emphasizes finding the same items near
   the top of the ranked gene list ([44]Ni & Vingron, 2012). [45]Plaisier
   et al. (2010) presented a popular “threshold-free” approach in
   neuroscience and biomedicine named rank-rank hypergeometric overlap
   (RRHO), which identifies and visualizes regions of significant overlap
   between two ranked gene lists and determines the statistical
   significance of enrichment by hypergeometric distribution.

   Moreover, a similarity measure OrderedList that focuses on evaluating
   whether there is significant overlap between two ranked gene lists was
   proposed ([46]Yang et al., 2006). The OrderedList is the weighted sum
   of the number of overlapping genes with an exponentially decaying
   weight, where a parameter β is introduced to determine how deep to go
   in the ranked gene lists. There are many studies on statistics-based
   improvement from similarity measures OrderedList. For example,
   [47]Serra, Romualdi & Fogolari (2016) introduced similarity measure
   FES[β] (fraction enrichment sum) and set the exponentially decaying
   parameter β as 0, 0.001, and 0.01. [48]Chen, Zhou & Sun (2015) adopted
   the default series of β values in the R package OrderedList and
   proposed a robust statistic minimum p value:
   [MATH: <munder><mrow><mo
   form="prefix">min</mo></mrow><mi>β</mi></munder><mo>⁡</mo><mrow><msub><
   mi>p</mi><mi>β</mi></msub></mrow> :MATH]
   , where parameter β can be calculated by setting a minimum weight and a
   default series of positions. To avoid arbitrariness and manual
   intervention of parameter β selection, [49]Chen, Zhou & Sun (2015)
   finally found a parameter-free similarity measure WeiSumE* with good
   performance to detect overlaps among ranked gene lists in simulations,
   which is the weighted sum that normalizes the number of overlapping
   genes on the top genes of two ranked gene lists by its expectation.
   Generally, existing rank-based similarity measures are mostly based on
   the overlaps among ranked gene lists and may lead to information loss
   owing to fixed cutoff positions in the ranked gene lists, the
   uncertainty of results because of the arbitrariness of the cutoff
   position, or heavy computational burden.

   In this study, we propose MAG based on the transformation of gene ranks
   to measure the magnitude of association of the individual gene with two
   diseases. By exploring the summation of MAG over all genes, we develop
   a novel similarity score SimSIP with simpler form and light computation
   burden to gauge genetic overlap among diseases based on gene ranks
   instead of intersections between the top part of ranked gene lists. To
   show the superiority of SimSIP, we firstly conduct a series of
   simulation studies to demonstrate the performance of SimSIP compared to
   some existing similarity measures based on ranked gene lists (WeiSumE*
   ([50]Chen, Zhou & Sun, 2015), OrderedList ([51]Yang et al., 2006),
   FES[0.001], FES[0.01]([52]Serra, Romualdi & Fogolari, 2016)) and
   Euclidean distance EucD under various scenarios. Secondly, we apply
   SimSIP to analyze the gene expression data of cancers in The Cancer
   Genome Atlas database and find that it sheds light on the genetic
   relationships among cancers. Thirdly, we arrange the significantly
   similar cancer pairs among the 18 cancer types detected by SimSIP into
   a disease network in which the tendency to cluster the histologically
   or anatomically related cancers provides basic support for pan-cancer
   studies. Finally, for the most significantly similar cancer pair, colon
   adenocarcinoma (COAD) and rectum adenocarcinoma (READ), found by
   SimSIP, we use MAG to measure the magnitude of association of each gene
   both with COAD and READ and find the important oncogenes of colorectal
   cancer which are associated with COAD and READ and regulated in the
   same pattern. Moreover, biological processes highly associated with
   colorectal cancer are detected.

Materials and Methods

MAG and SimSIP

   Let us assume that there are n genes for disease 1 and disease 2. For
   gene i, 1 ≤ i ≤ n, let a[i] be its rank among the n genes for disease
   1, and b[i] is defined similarly for disease 2. Now we intend to gauge
   the genetic similarity or genetic relationship between these two
   diseases based on gene rank lists
   [MATH: <mo stretchy="false"
   fence="false">{</mo><mrow><msub><mi>a</mi><mi>i</mi></msub></mrow><msub
   sup><mo stretchy="false"
   fence="false">}</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>n</m
   i></msubsup> :MATH]
   and
   [MATH: <mo stretchy="false"
   fence="false">{</mo><mrow><msub><mi>b</mi><mi>i</mi></msub></mrow><msub
   sup><mo stretchy="false"
   fence="false">}</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>n</m
   i></msubsup> :MATH]
   . The gene rank usually represents the strength of association with the
   disease, in the sense of small rank meaning strong association and big
   rank meaning weak association. For example, the rank can be assigned
   for each gene by the magnitude of p value reported in the study of
   detecting differentially expressed genes, which is the routine work in
   the literature. To facilitate the construction of the similarity score,
   we transform the ranks a[i] and b[i] to their reciprocal 1/a[i] and
   1/b[i], whose values fall in the interval (0, 1] and can be treated
   directly as the magnitude of association of gene i with diseases 1 and
   2, respectively. Further, we employ the geometric mean
   [MATH:
   <msqrt><mn>1</mn><mrow><mo>/</mo></mrow><mrow><msub><mi>a</mi><mi>i</mi
   ></msub></mrow><mo>⋅</mo><mn>1</mn><mrow><mo>/</mo></mrow><mrow><msub><
   mi>b</mi><mi>i</mi></msub></mrow></msqrt> :MATH]
   of 1/a[i] and 1/b[i] and call it MAG, which is a compromise between
   1/a[i] and 1/b[i] as the magnitude of association of gene i with both
   diseases 1 and 2. Intuitively, a small value of MAG means a weak
   association of gene i with these two diseases, a big value means a
   strong association of gene i with these two diseases. Note the value of
   MAG is between n^−1 and 1.

   Now let us explore the summation
   [MATH: <msubsup><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>n</mi></msubsup><msqrt><mn>1</mn><mrow><mo>/</mo></mrow><mrow><msub
   ><mi>a</mi><mi>i</mi></msub></mrow><mo>⋅</mo><mn>1</mn><mrow><mo>/</mo>
   </mrow><mrow><msub><mi>b</mi><mi>i</mi></msub></mrow></msqrt> :MATH]
   of MAG over all genes, that is, the total magnitude of association for
   all n genes with diseases 1 and 2. Note that this summation is actually
   the inner product of vectors
   [MATH: <mo stretchy="false"
   fence="false">{</mo><mn>1</mn><mrow><mo>/</mo></mrow><msqrt><mrow><msub
   ><mi>a</mi><mi>i</mi></msub></mrow></msqrt><msubsup><mo
   stretchy="false"
   fence="false">}</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>n</m
   i></msubsup> :MATH]
   and
   [MATH: <mo stretchy="false"
   fence="false">{</mo><mn>1</mn><mrow><mo>/</mo></mrow><msqrt><mrow><msub
   ><mi>b</mi><mi>i</mi></msub></mrow></msqrt><msubsup><mo
   stretchy="false"
   fence="false">}</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>n</m
   i></msubsup> :MATH]
   with positive components. It is easy to check that the length of each
   of these two vectors is
   [MATH: <msqrt><msubsup><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>n</mi></msubsup><mn>1</mn><mrow><mo>/</mo></mrow><mi>i</mi></msqrt>
   :MATH]
   , which depends only on n and is independent of genes’ concrete ranks.
   So the summation is proportional to the Angle cosine between the two
   vectors mentioned above. Recall that the inner product in algebra or
   geometry theory is sometimes called the scalar product or dot product
   and is the projection of one vector on another in geometry space, which
   is a symmetric measure of closeness of two vectors. So the similarity
   score by inner product
   [MATH:
   <mi>S</mi><mi>i</mi><mi>m</mi><mi>S</mi><mi>I</mi><mi>P</mi><mrow></mro
   w><mo>=</mo><msubsup><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>n</mi></msubsup><msqrt><mn>1</mn><mrow><mo>/</mo></mrow><mrow><msub
   ><mi>a</mi><mi>i</mi></msub></mrow><mo>⋅</mo><mn>1</mn><mrow><mo>/</mo>
   </mrow><mrow><msub><mi>b</mi><mi>i</mi></msub></mrow></msqrt> :MATH]

   is an appropriate candidate for measuring similarity between
   contributions of all n genes to diseases 1 and 2, and it can be further
   taken as a similarity score between two diseases derived from the
   corresponding gene rank lists. Considering one extreme case in which
   a[i] = b[i] for every gene i, which means the rank of every gene for
   disease 1 is exactly equal to the rank of same gene for disease 2,
   these two diseases are the most similar in terms of these n genes and
   SimSIP attains its maximum
   [MATH: <msubsup><mo
   movablelimits="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>n</mi></msubsup><mn>1</mn><mrow><mo>/</mo></mrow><mi>i</mi> :MATH]
   . Considering the other extreme case in which the gene ranking top for
   disease 1 would always rank bottom for disease 2, and gene ranking
   bottom for disease 1 would always rank top for disease 2, these two
   diseases are the most dissimilar in term of these n genes and SimSIP
   attains its minimum. So a big value of SimSIP would imply that two
   diseases are similar.

Assessment of significance

   It is easily observed from the expression of SimSIP that the more
   identically ranked genes associated with diseases 1 and 2 there are,
   the more similar two diseases are, and the larger SimSIP is. Therefore,
   an observed SimSIP larger than expected under the null hypothesis that
   two diseases have non-overlapping genes means significant. When we say
   that two diseases have no overlapping genes in the genetic perspective,
   we mean that the two gene rank lists are two random shufflings of 1∼ n.
   Based on the expression of SimSIP, we describe a procedure to generate
   the empirical distribution of SimSIP under the null hypothesis, which
   is used to evaluate the significance of the similarity score. Without
   loss of generality, we fix the first gene rank list as
   [MATH: <mo stretchy="false" fence="false">{</mo><mi>i</mi><msubsup><mo
   stretchy="false"
   fence="false">}</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>n</m
   i></msubsup> :MATH]
   and random permutate 1∼ n as
   [MATH: <mo stretchy="false"
   fence="false">{</mo><mrow><msub><mi>b</mi><mi>i</mi></msub></mrow><msub
   sup><mo stretchy="false"
   fence="false">}</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mi>n</m
   i></msubsup> :MATH]
   for S times, then we obtain the corresponding SimSIP^s, s = 1,2,…,S.
   The p value of SimSIP is computed as
   [MATH: <mi>p</mi><mo>=</mo><mstyle displaystyle="true"
   scriptlevel="0"><mrow><mfrac><mrow><munderover><mo
   movablelimits="false">∑</mo><mrow><mi>s</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>S</mi></munderover><mi>I</mi><mo
   stretchy="false">(</mo><mi>S</mi><mi>i</mi><mi>m</mi><mi>S</mi><mi>I</m
   i><mrow><msup><mi>P</mi><mi>s</mi></msup></mrow><mo>≥</mo><mi>S</mi><mi
   >i</mi><mi>m</mi><mi>S</mi><mi>I</mi><mi>P</mi><mo
   stretchy="false">)</mo></mrow><mi>S</mi></mfrac></mrow><mo>,</mo></msty
   le> :MATH]
   (1)

   where I(·) is the indicator function assigning the value 1 or 0 relying
   on whether the condition within brackets is met. Let p^r denote the p
   value and r = 1,2,…, R for R replications, and the power under the
   alternative hypothesis or the type I error under the null hypothesis
   for a given significance level α is
   [MATH:
   <mi>P</mi><mi>o</mi><mi>w</mi><mi>e</mi><mi>r</mi><mo>=</mo><mstyle
   displaystyle="true" scriptlevel="0"><mrow><mfrac><mrow><munderover><mo
   movablelimits="false">∑</mo><mrow><mi>r</mi><mo>=</mo><mn>1</mn></mrow>
   <mi>R</mi></munderover><mi>I</mi><mo
   stretchy="false">(</mo><mrow><msup><mi>p</mi><mi>r</mi></msup></mrow><m
   o>≤</mo><mi>α</mi><mo
   stretchy="false">)</mo></mrow><mi>R</mi></mfrac></mrow><mo>.</mo></msty
   le> :MATH]
   (2)

   Regarding MAG, we can employ a similar procedure to evaluate the
   significance of an observed MAG. Under the null hypothesis that two
   diseases having non-overlapping genes, the two gene ranks a[i] and b[i]
   involved in are randomly drawn from {1,2,…, n} with replacement, and it
   is not difficult to get the distribution of MAG under the null
   hypothesis. The significance of the observed MAG can then be
   calculated.

Results

Simulation study

Parameter setting

   We carry out simulations to evaluate the performance of SimSIP and the
   existing rank-based methods WeiSumE* ([53]Chen, Zhou & Sun, 2015),
   OrderedList ([54]Yang et al., 2006), FES[0.001], and FES[0.01]
   ([55]Serra, Romualdi & Fogolari, 2016), which are weighted sum of the
   number of overlapping genes between the top part of ranked gene lists,
   and Euclidean distance EucD between the original values, which is a
   method without ranking.

   As done in [56]Chen, Zhou & Sun (2015), we set the number of genes n =
   6,000, 11,000, and 25,000 and randomly choose two sets of d genes from
   the n genes as the associated genes with diseases 1 and 2,
   respectively. We fix the d = 1, 5, 10, and 20 in the simulation. The
   number o of overlapping associated genes in the two sets is taken as 0
   for evaluating the type I error rate and positive for evaluating the
   power. For convenience, genes 1 to d are assumed to be associated with
   disease 1 and genes d − o + 1 to 2d − o are associated with disease 2,
   where d ≥ o. As performed by [57]Chen, Zhou & Sun (2015), two normal
   distributions N(0,σ^2) and N(1,σ^2) are employed to generate the rank
   of each gene. Specifically, N(1,σ^2) is for the associated genes and
   N(0,σ^2) for the remaining n − d non-associated genes, in either
   disease 1 or 2. The gene rank is from the value of normal distribution.
   The variance σ^2 in the normal distribution is taken as 0.01, 0.05,
   0.1, 0.5 when o = 0; 0.07, 0.09, 0.11, 0.13 when o = 1; 0.1, 0.15, 0.2,
   0.25 when o = 5; 0.2, 0.25, 0.3, 0.35 when o = 10; and 0.3, 0.35, 0.4,
   0.45 when o = 20. The variance σ^2 plays an important role in
   discerning the associated genes. The number of replications R is set to
   1,000 in the computation of powers/type I error rates, and the nominal
   significance level is set to 0.05. We set S = 1,000 in the simulation
   study and S = 10 millions in real data analysis of The Cancer Genome
   Atlas (TCGA). As an illustration, the empirical distribution of SimSIP
   for total genes number n = 6,000, 11,000, 25,000 are given in [58]Figs.
   S1–[59]S3 and the empirical distribution of MAG for n = 6,000, 11,000,
   25,000 are given in [60]Figs. S4–[61]S6.

Type I error

   We firstly show the simulation results under the null model in
   evaluating the type I error rates of SimSIP and five existing methods
   under various scenarios. For the null setting, the empirical p values
   of existing rank-based methods (WeiSumE*, OrderedList, FES[0.001],
   FES[0.01]) are computed by permutating ranked gene lists from the value
   of the normal distribution, and the null distribution of EucD is
   obtained by permutating the value of the normal distribution generated.
   All empirical sizes shown in [62]Table S1 are around the significance
   0.05 and are well controlled.

Power comparison

   In the second set of simulations, the performance of SimSIP is compared
   with five other existing measures in terms of power. The results show
   that the powers of six measures are all monotonic decreasing for higher
   noise (variance σ^2) and monotonic increasing for a bigger ratio of o
   to d. The findings are shown in [63]Fig. 1. Firstly, for the same total
   gene number n, the smaller the number of associated genes d becomes,
   the more obvious the advantage of the SimSIP is. For example, in the
   scenarios of d = 1 (in [64]Figs. 1A, [65]1E and [66]1I), the power of
   SimSIP has almost a 10% increase compared with that of WeiSumE*
   ([67]Chen, Zhou & Sun, 2015) and OrderedList ([68]Yang et al., 2006),
   is more than two times the powers of FES[0.01] ([69]Serra, Romualdi &
   Fogolari, 2016), and almost more than 8 times the powers of FES[0.001]
   ([70]Serra, Romualdi & Fogolari, 2016) and EucD. The scenario of d = 5
   is analogy to the scenario of d = 1. However, in the scenarios of
   increasing d up to 10 and 20, the advantage of SimSIP is no longer
   obvious. Secondly, for the same d, the larger the number of genes (n)
   becomes, the more obvious the advantages of SimSIP is. For example, in
   the scenarios of n = 6,000 and n = 11,000, the similarity measure
   WeiSumE* is the most powerful when the number of associated genes d =
   20; however, SimSIP becomes superior when compared with the other five
   measures when n increases to 25,000. Thirdly, when the number of
   associated genes d is relatively large, SimSIP gradually manifests
   certain advantages with the increase in noise (variance σ^2), which can
   be observed in [71]Figs. 1C, [72]1D, [73]1G, [74]1H, [75]1K and [76]1L
   (in the scenarios of d = 10 and d = 20). When increasing the number of
   overlapping associated genes o up to 5, SimSIP always maintains the
   superiority compared with the other five methods, and its power has
   obvious advantages for the larger ratio of o to d with a fixed n and
   for the larger n with a fixed d (see [77]Fig. 2). When the number of
   overlapping associated genes o increases from 5 to 10 or 20, the
   advantages of SimSIP are much more significant (see [78]Figs. S7 and
   [79]S8).

Figure 1. Powers of EucD, FES[0.001], FES[0.01], OrderedList, SimSIP and
WeiSumE^* when o = 1 with 12 scenarios.

   [80]Figure 1
   [81]Open in a new tab

   (A) n = 6000, d = 1; (B) n = 6000, d = 5; (C) n = 6000, d = 10; (D) n =
   6000, d = 20; (E) n = 11000, d = 1; (F) n = 11000, d = 5; (G) n =
   11000, d = 10; (H) n = 11000, d = 20; (I) n = 25000, d = 1; (J) n =
   25000, d = 5; (K) n = 25000, d = 10; (L) n = 25000, d = 20. The
   arrangement of variance σ^2 on x axis is a series (0.07, 0.09, 0.11,
   0.13) on which the power of six measures can be ranged from 0.1 to
   0.95.

Figure 2. Powers of EucD, FES[0.001], FES[0.01], OrderedList, SimSIP and
WeiSumE^* when o = 5 with nine scenarios.

   [82]Figure 2
   [83]Open in a new tab

   (A) n = 6000, d = 5; (B) n = 6000, d = 10; (C) n = 6000, d = 20; (D) n
   = 11000, d = 5; (E) n = 11000, d = 10; (F) n = 11000, d = 20; (G) n =
   25000, d = 5; (H) n = 25000, d = 10; (I) n = 25000, d = 20. The
   arrangement of variance σ^2 on x axis is a series (0.1, 0.15, 0.2,
   0.25) on which the power of six measures can be ranged from 0.1 to
   0.95.

   As shown in [84]Figs. 1 and [85]2 and [86]Figs. S7 and [87]S8, the
   similarity measure WeiSumE* performs better than other measures when
   there are less but significant overlap between two diseases with lower
   noise (variance σ^2), and our method SimSIP performs better in all the
   other scenarios. Especially for n = 25,000, which approximates the
   number of genes in the full human genome, the performance of SimSIP is
   always the best. Thus, compared with other measures, SimSIP is more
   appropriate for detecting genetic similarity between longer gene lists
   and works well when more overlapping genes occur among diseases. Most
   importantly, our method SimSIP is more robust for higher noise than the
   other five methods. Therefore, SimSIP is more suitable for the study of
   human diseases than the existing methods, especially for the study of
   cancers in which there are more genetic overlaps.

   Furthermore, the similarity measure OrderedList has a higher power with
   fewer overlaps (for o = 1) and FES[0.01] has a higher power with more
   overlaps (for o > 1). Their performances are superior to those of
   similarity measure FES[0.001] and Euclidean distance EucD, which always
   have low power in all scenarios. We also find that the rank-based
   measures always perform better than Euclidean distance EucD, which uses
   the value of the normal distribution rather than gene ranks. These
   results demonstrate that the rank of a gene can provide additional
   information on its contribution to each disease upon converting real
   data to ranks. In general, compared with existing similarity measures,
   SimSIP performs better in almost every simulation, which indicates that
   it is feasible to construct similarity measure by replacing overlaps
   between top ranked gene sets with gene ranks.

TCGA data analysis

   The Cancer Genome Atlas is a combined effort by multiple research
   institutes, in which tumor and normal samples from more than 11,000
   patients are publicly available, comprising 37 types of (epi)genetic
   and clinical data for 33 types of cancer. We download gene expression
   datasets (whose gene expression profiles were determined experimentally
   using the Illumina HiSeq 2000 RNA Sequencing platform) of 33 types of
   cancer using the UCSC Xena functional genome browser
   ([88]https://xenabrowser.net/datapages/). Expression data of 20,530
   genes are available for each cancer, and 18 types of cancer are
   selected (see details in [89]Table 1) based on the criteria that the
   sample size of the control group is not smaller than 5. We rank all
   genes by their p values derived from the R package LIMMA for
   differential expression analysis and apply our proposed method to gauge
   the genetic similarity among the 18 types of cancer in TCGA. In this
   section, we choose WeiSumE* ([90]Chen, Zhou & Sun, 2015), which
   performs only second to SimSIP under all simulation scenarios, to
   compare with SimSIP.

Table 1. Sample sizes of control group and case group per cancer type in
TCGA.

   Abbreviation          Cancer type[91]^a            n[0]^[92]b n[1]
   BLCA         Bladder urothelial carcinoma          19         407
   BRCA         Breast invasive carcinoma             114        1,097
   CHOL         Cholangiocarcinoma                    9          36
   COAD         Colon adenocarcinoma                  41         286
   ESCA         Esophageal carcinoma                  11         184
   GBM          Glioblastoma multiforme               5          154
   HNSC         Head and neck squamous cell carcinoma 44         520
   KICH         Kidney chromophobe                    25         66
   KIRC         Kidney renal clear cell carcinoma     72         533
   KIRP         Kidney renal papillary cell carcinoma 32         290
   LIHC         Liver hepatocellular carcinoma        50         371
   LUAD         Lung adenocarcinoma                   59         515
   LUSC         Lung squamous cell carcinoma          51         502
   PRAD         Prostate adenocarcinoma               52         497
   READ         Rectum adenocarcinoma                 10         94
   STAD         Stomach adenocarcinoma                35         415
   THCA         Thyroid carcinoma                     59         505
   UCEC         Uterine corpus endometrial carcinoma  24         176
   [93]Open in a new tab

   Notes:
   ^a

   From 33 types of cancer in TCGA, 18 types of cancer with control group
   sample size being five or more are selected.
   ^b

   n[0] is sample size of control group and n[1] is sample size of case
   group for a given cancer.

   For the (18 × 17)/2 = 153 cancer pairs among 18 cancers, we compute
   their similarity scores by using SimSIP and WeiSumE* respectively and
   standardize them by normalization method
   [MATH: <mstyle displaystyle="true"
   scriptlevel="0"><mrow><mfrac><mrow><mi>T</mi><mo>−</mo><mrow><msub><mi>
   T</mi><mrow><mi>m</mi><mi>i</mi><mi>n</mi></mrow></msub></mrow></mrow><
   mrow><mrow><msub><mi>T</mi><mrow><mi>m</mi><mi>a</mi><mi>x</mi></mrow><
   /msub></mrow><mo>−</mo><mrow><msub><mi>T</mi><mrow><mi>m</mi><mi>i</mi>
   <mi>n</mi></mrow></msub></mrow></mrow></mfrac></mrow></mstyle> :MATH]
   . The normalized similarity scores about SimSIP are in [0.11, 0.77]
   with mean 0.24 and variance 0.015, and that about WeiSumE* are in
   [0.064, 0.53] with mean 0.13 and variance 0.005. Obviously, the
   normalized similarity scores about SimSIP are generally bigger and more
   spread than those about WeiSumE*. So SimSIP is more discriminative than
   WeiSumE* in quantifying the relationships among diseases.

Significant cancer pairs in TCGA

   To illustrate the application of SimSIP and WeiSumE* in exploring
   significant relationships among 18 cancers, we compute the empirical p
   values of SimSIP and WeiSumE* based on the null distribution (shown in
   [94]Figs. S9 and [95]S10) for the 153 cancer pairs among 18 cancers.
   Given the nominal significance level of 0.05, the Bonferroni adjustment
   of empirical p values 0.05/153 is employed to detect significant
   similar cancer pairs. A total of 91 significant similar cancer pairs
   are detected by SimSIP, 82 pairs by WeiSumE*, with 81 pairs in their
   intersection (see [96]Table S2; [97]Fig. S11). For the 81 cancer pairs
   detected both with SimSIP and WeiSumE* ([98]Chen, Zhou & Sun, 2015),
   [99]Fig. 3 displays the empirical p values of the two methods for the
   same cancer pair. Clearly, the empirical p values of SimSIP are
   generally smaller than that of WeiSumE* (except for cancer pair KIRP
   and LUAD with index 63 in [100]Fig. 3), which is consistent with the
   results given in the simulation.

Figure 3. The empirical p values of SimSIP (black dot) and WeiSumE* (red
square) for 81 cancer pairs found with both SimSIP and WeiSumE*.

   [101]Figure 3
   [102]Open in a new tab

   As shown in [103]Fig. S11, SimSIP finds more possibly significant
   cancer pairs which include almost all the significant cancer pairs
   found with WeiSumE*. Among the 10 significant cancer pairs (shown in
   red in [104]Table S2) found with SimSIP, not WeiSumE*, five cancer
   pairs can be explained in pan-cancer studies: For cancer pair COAD and
   UCEC, the diversity of high antigen-specific TCR repertoires correlates
   with the improved prognostic progression-free interval in COAD and UCEC
   ([105]Thorsson et al., 2018); the expression of TMEM173 in tumor
   tissues is significantly upregulated and hypomethylated in cancer pair
   COAD and THCA but significantly downregulated and hypermethylated in
   cancer pair LUSC and PRAD ([106]An et al., 2019). Furthermore, there
   are high mutations rates for TBK1 in cancer pair COAD and UCEC, and the
   expression of TMEM173 is positively associated with the infiltration of
   immune cells in cancer pair BRCA and THCA ([107]An et al., 2019). A
   high IRF3 expression yields a poor prediction of prognosis in patients
   with KIRC and PRAD ([108]An et al., 2019). In addition, cancer pair
   COAD and UCEC depends on components of the EGFR pathway at similar
   frequencies ([109]Wormald, Milla & O’Connor, 2013).

Genetic similarity among 18 types of cancer

   Depicting the 91 significant similar cancer pairs found with SimSIP in
   a disease network by using software Cytoscape
   ([110]https://cytoscape.org/) as in [111]Fig. 4, the following
   observations could be addressed:

Figure 4. The diseases network of 18 cancer types.

   [112]Figure 4
   [113]Open in a new tab

   Taking Bonferroni adjustment of empirical p values given nominal
   significance 0.05 as a threshold, we select 91 significantly similar
   cancer pairs found with SimSIP to demonstrate the genetic relationships
   among cancers by software Cytoscape ([114]https://cytoscape.org/). The
   width, length and color of the edge in the network are determined by
   the magnitude of SimSIP, the size and color of the vertex in the
   network is determined by the degree of vertex (the number of cancers
   which is significantly related to this cancer).
    1. The vertex HNSC has the largest degree, 15, which shows that the
       HNSC has significant similarity with the other 15 cancers, except
       for KICH and KIRC. The degrees of vertices such as BLCA and UCEC
       reach 14; READ, ESCA, LUSC, and HNSC reach 13; and LUAD and LIHC
       reach 12. There are close intrinsic genetic relationships among the
       18 cancers. The seven types of cancers ESCA, LUSC, HNSC, BLCA,
       BRCA, STAD, and UCEC ( their vertices with deeper color and bigger
       size) with a higher degree are closer to each other and tend to
       form a pivotal hub of the disease network of 18 cancer types.
    2. Cancers originating from the same organ or tissue tend to
       co-cluster, such as cancer pair READ and COAD or KIRP and KIPC with
       the most similar relationship. In addition, cancers with proximity
       also tend to group together, such as LIHC and CHOL with a highly
       significant relationship. These provide evidence that tumors with
       closer physical distance in human organs have similar sources of
       endoderm development or exposure to a common cancer-causing
       environmental factor ([115]Sell & Dunsford, 1989). Further, for the
       three types of kidney tumors, KICH, KIRP, and KIPC, the similarity
       between KIRP and KIPC is more significant than KICH and KIPC or
       KIRP and KIPC, which may be explained by the fact that KIRC and
       KIRP are cancers of the proximal tubule segments, whereas KICH is a
       cancer of the distal tubule segments ([116]Chen et al., 2016;
       [117]Davis et al., 2014; [118]Lee, Chou & Knepper, 2015). Compared
       to other cancer pairs from the same tissue, the degree of
       similarity between the two types of lung tumor LUSC and LUAD is not
       so significant, which may be due to the derivation of cell types of
       the two types of lung tumors: LUSC originates from squamous
       epithelial cells in the respiratory tract and alveoli, whereas LUAD
       originates from a large number of glandular or alveolar cells
       ([119]Li et al., 2015; [120]Mainardi et al., 2014; [121]Sutherland
       et al., 2014).
    3. There are significant similarities among gastrointestinal tumors
       (READ, COAD, STAD, and ESCA), which is consistent with the results
       of integrative clustering across data types in the miRNA, mRNA, and
       RPPA platforms ([122]Hoadley et al., 2018). In the disease network
       ([123]Fig. 4), squamous cell carcinomas (BLCA, ESCA, HNSC, and
       LUSC) are also co-clustered, and the similarity score of the cancer
       pair ESCA and LUSC is the top three and has more significant
       similarity. [124]Hoadley et al. (2018) reported a similar discovery
       based on a multi-platform dataset (including miRNA, mRNA, RPPA,
       aneuploidy, and DNA methylation data) in TCGA, and [125]Abrams et
       al. (2018) also suggested that regardless of the tissue types of
       squamous cell carcinoma, potential similarities were detected among
       the transcription factor expression profiles of BLCA, ESCA, HNSC,
       and LUSC. In addition, the gynecologic tumors UCEC and BRCA are
       also close to each other in the network, which is consistent with
       the results of previous studies ([126]Hoadley et al., 2018).
    4. Finally, similar to previous studies ([127]Abrams et al., 2018;
       [128]Hoadley et al., 2018), three types of cancer, PRAD, THCA, and
       GBM, are relatively independent in the network, the relationships
       among them and other cancers are relatively weak. In addition,
       instead of squamous cancers being clustered together,
       adenocarcinomas (PRAD, COAD, LUAD, STAD, and READ) appear to be
       scattered around the edge of the network.

   From the above observations of the disease network ([129]Fig. 4)
   obtained with SimSIP, we find them to be analogous to those of previous
   pan-cancer studies ([130]Abrams et al., 2018; [131]Hoadley et al.,
   2018): histologically or anatomically related cancers tend to cluster
   together, which provide basic support for analyses of
   pan-gastrointestinal, pan-squamous, pan-kidney, and pan-gynecological
   cancers.

Associated genes with colorectal cancer

   For the most significantly similar cancer pair COAD and READ found with
   SimSIP, we try to explore the common underlying genetic mechanisms by
   sorting MAG values of genes in descending order. The significance of
   the difference of expression level of the top 5 genes in COAD and READ
   are shown in [132]Table S3, which are obtained from the TCGA data
   mining website UALCAN ([133]http://ualcan.path.uab.edu/). [134]Table S3
   provides evidence that the top 5 genes are associated with COAD and
   READ and they are regulated in the same pattern (either up-regulated or
   down-regulated). It is worth mentioning that, for the top 500 genes
   associated with COAD and READ and the top 500 genes associated with
   KICH and STAD, which are the most similar and the least similar cancer
   pairs found with SimSIP respectively, the proportion of genes regulated
   in the same pattern is 1 and 0.648 respectively. Furthermore, we
   compute the correlation coefficient between the logarithmic
   transformation of SimSIP and the proportion of genes regulated in the
   same pattern in the top N (N = 500, 1,000, 1,500, 2,000) genes
   associated with each of 153 cancer pairs (shown in [135]Fig. S12).
   Obviously, there is a significant relationship between the degree of
   overlap between diseases and the proportion of genes regulated in the
   same pattern. These findings suggest that the associated genes of
   highly overlapping cancers may be regulated in the same pattern.
   Through gene annotation by using Metascape ([136]http://metascape.org/)
   and literature review, among the top 5 genes, one gene (CDH3) is
   associated with multiple cancers (shown in [137]Fig. S13), and two
   genes (ETV4, KRT24) are associated with colorectal cancer; the
   remaining genes can be considered as candidate susceptibility genes of
   colorectal cancer ([138]Table S4). CDH3, the top 1 gene detected by
   MAG, is located in a region on the long arm of chromosome 16.
   [139]Paredes et al. (2012) advised that genetic or epigenetic changes
   in this gene or changes in its protein expression often lead to tissue
   disorders, cellular dedifferentiation, and enhanced invasiveness of
   tumor cells. This gene is associated with intestinal infections
   ([140]Van Marck et al., 2011) and colon cancer ([141]Van Marck et al.,
   2011; [142]Sun et al., 2011). In addition, its over-expression is also
   associated with tumor progression and low survival in non-small-cell
   lung cancer ([143]Imai et al., 2018) and in the loss of heterozygous
   events for breast and prostate cancer ([144]Royo et al., 2016;
   [145]Sousa et al., 2014; [146]Vieira et al., 2017). Moreover, CDH3 is
   significantly associated with liver cancer, gastric cancer, bladder
   cancer, and cervical adenocarcinoma ([147]Bauer et al., 2014;
   [148]Paredes et al., 2012; [149]Sun et al., 2011; [150]Van Marck et
   al., 2011). Searching for CDH3 in the CancerMine database
   ([151]https://www.mycancergenome.org/) reveals that 12 cancers are
   associated with the over-expression of CDH3. This gene is therefore
   very important in the genetic mechanism of cancer. The top 3 gene ETV4
   is strongly linked to metastasis of colorectal cancer ([152]Dumortier
   et al., 2018) and enriched in pathway transcriptional misregulation in
   cancer ([153]Table S4). The top 5 gene KRT24 is over-expressed in
   patients with colorectal cancer and is a susceptibility gene for early
   onset of colorectal cancer ([154]Hong et al., 2007).

Associated signaling pathway with colorectal cancer

   Calculating empirical p values of MAG for each gene by its null
   distribution (described in [155]Fig. S14), 1,838 genes significantly
   associated with colorectal cancer are obtained. By DAVID
   ([156]https://david.ncifcrf.gov/tools.jsp), these 1,838 genes
   associated with colorectal cancer are clustered in diverse functional
   categories.

   As shown in [157]Table 2, of 21 functional categories, 3 are associated
   with cancer (cAMP signaling pathway ([158]Akgul, 2009; [159]Myklebust
   et al., 1999), fatty acid degradation ([160]Currie et al., 2013), and
   nitrogen metabolism ([161]Sanchez-Vega et al., 2018)) and 11 appear in
   colorectal cancer studies, including calcium signaling pathway
   ([162]Dallol et al., 2003), circadian entrainment ([163]Lévi et al.,
   2010), ribosome biogenesis in eukaryotes ([164]Lafontaine, 2015),
   cGMP-PKG signaling pathway ([165]Li et al., 2013), dopaminergic synapse
   ([166]Xu et al., 2010), retrograde endocannabinoid signaling
   ([167]Proto et al., 2012), cholinergic synapse ([168]Frucht et al.,
   1999), insulin secretion ([169]Giovannucci, 2001), gastric acid
   secretion ([170]Morton, Prendergast & Barrett, 2011), morphine
   addiction ([171]Jin et al., 2012), and nicotine addiction
   ([172]Shureiqi et al., 2003; [173]Yang & Frucht, 2001). The remaining
   signaling pathways can be considered as candidates for studies on
   biological processes associated with colorectal cancer. In detail, the
   circadian entrainment pathway closely interacts with the cell division
   cycle and pharmacological pathways in the treatment of metastatic
   colorectal cancer and accelerates or slows down cancer growth through
   modifications of host and tumor circadian clocks, which drives 24 h
   changes in drug metabolism, cellular proliferation and apoptosis, cell
   cycle events, DNA repair, and angiogenesis ([174]Lévi et al., 2007,
   [175]2010). For the insulin secretion pathway, because insulin and
   insulin-like growth factor axes are major determinants of cell
   proliferation and apoptosis, an increase in their circulating
   concentrations is associated with a high risk of colonic neoplasia.
   However, the dietary pattern with high saturated fatty acid intake can
   stimulate insulin resistance or secretion ([176]Giovannucci, 2001), and
   cellular proliferation requires fatty acids to synthesize cell
   membranes and signaling molecules ([177]Currie et al., 2013). In
   addition, the growth of colon tumor cells is selectively inhibited by
   nonsteroidal anti-inflammatory drugs that activate the cGMP/PKG pathway
   to suppress Wnt/β-catenin signaling ([178]Li et al., 2013). Up to 15%
   of colorectal cancers are distinguished by DNA microsatellite
   instability and manifested by the presence of DNA replication errors
   ([179]Jass et al., 1998).

Table 2. The result of pathway enrichment analysis of colorectal cancer.

      ID                 KEGG pathway                Benjamini
   hsa04713 Circadian entrainment                   2.30E−04
   hsa04024 cAMP signaling pathway                  2.97E−04
   hsa03008 Ribosome biogenesis in eukaryotes       3.68E−04
   hsa04080 Neuroactive ligand-receptor interaction 7.92E−04
   hsa04020 Calcium signaling pathway               0.003328211
   hsa04723 Retrograde endocannabinoid signaling    0.006342379
   hsa04724 Glutamatergic synapse                   0.006620968
   hsa04911 Insulin secretion                       0.010800326
   hsa04728 Dopaminergic synapse                    0.011092804
   hsa04725 Cholinergic synapse                     0.017964832
   hsa04022 cGMP-PKG signaling pathway              0.021342644
   hsa04970 Salivary secretion                      0.023476401
   hsa04261 Adrenergic signaling in cardiomyocytes  0.024338123
   hsa00071 Fatty acid degradation                  0.029045641
   hsa05032 Morphine addiction                      0.036794268
   hsa04978 Mineral absorption                      0.038043761
   hsa04971 Gastric acid secretion                  0.045576
   hsa04726 Serotonergic synapse                    0.045974199
   hsa00910 Nitrogen metabolism                     0.047420936
   hsa05033 Nicotine addiction                      0.047658725
   hsa05031 Amphetamine addiction                   0.048006097
   [180]Open in a new tab

Discussion

   SimSIP is a novel similarity score that measures the genetic
   relationships among diseases by (1) introducing a suitable
   transformation of gene ranks that converts gene ranks into the
   magnitude of association of gene with the disease; and (2) comparing
   the similarity between gene rank lists in geometric space instead of
   comparing the overlaps between the top part of ranked gene lists. In
   this study, three additional tasks are also fulfilled: constructing a
   disease network of 18 cancers and offering some support for pan-cancer
   studies; developing a measure MAG to gauge the magnitude of association
   of an individual gene with two diseases (note that MAG can be
   generalized to multiple diseases); and finding some important genes and
   signaling pathways associated with colorectal cancer.

   Extensive simulations show that SimSIP has better performance than
   existing methods in scenarios with larger numbers of overlapping
   associated genes (o) and larger number of genes (n), whereas in
   scenarios with smaller o and smaller n, such as o = 1, n = 6,000 or
   11,000, WeiSumE* performs better. It is desirable that SimSIP can
   clearly identify the differences among diseases with more overlapping
   associated genes. Thus, in real data application, we use SimSIP to
   measure genetic similarities among cancers based on the differential
   expression analyses of multiple datasets in TCGA with the R package
   LIMMA. The results show that SimSIP can find more significant cancer
   pairs than WeiSumE*, such as cancer pair COAD and UCEC, COAD and THCA,
   LUSC and PRAD, BRCA and THCA, or KIRC and PRAD demonstrating the
   usefulness of SimSIP. Furthermore, for the 81 cancer pairs found with
   both SimSIP and WeiSumE*, the p value of SimSIP is smaller than
   WeiSumE*, therefore, the SimSIP may be more powerful than WeiSumE* in
   detecting the significant cancer pairs.

   Overall, SimSIP has a simpler form and faster computation (time
   consumption for six measures is shown in [181]Table S5), is more robust
   for higher levels of noise, and is more suitable for the study of human
   diseases than existing methods, especially for the study of cancers in
   which there are more genetic overlaps. In order to make this conclusion
   considerably stronger, we extend the range of the simulations to cover
   bigger number of overlapping associated genes, such as o = 50, 100 and
   200. As shown in [182]Figs. S15–[183]S17, SimSIP still performs well or
   better in the scenarios with bigger ratio of o to d when o = 50, 100
   and 200, which is analogous to previous simulations.

Conclusions

   This article proposes a similarity score based on the list of gene
   ranks to measure the genetic relationships among diseases from gene
   expression data. Our method SimSIP gives a new perspective to detect
   the genetic relationships among diseases that does not depend on a
   threshold as fraction enrichment ([184]Serra, Romualdi & Fogolari,
   2016) nor on the weighted sum of overlaps between ranked gene lists as
   OrderedList ([185]Yang et al., 2006).

   Similar to other rank-based measures, our method relies on the
   correctness and scientific quality of the gene ranking. If gene ranking
   does not reflect the contribution of the gene to each disease, the
   rank-based measure is not necessarily superior to the commonly used
   measures. There are many common methods for ranking genes in practice,
   such as by the magnitude of p values of t-test, of marginal regression
   analysis and of some methods for detecting differentially expressed
   genes (such as LIMMA, edgeR and DESeq2 et al.). Although p values
   derived from the R package LIMMA for differential expression analysis
   were used in this paper to rank the gene, different researchers can
   choose different ranking methods based on their specific needs. In
   summary, in contrast to existing measures that are all based on the
   number of overlapping genes in top ranked gene lists among diseases, we
   creatively describe the genetic relationships among diseases from the
   spatial similarity between the transformation of gene ranks, which
   provides a new research direction for studies of similarity measures to
   reveal genetic relationships among diseases.

Supplemental Information

   Supplemental Information 1. Figures S1-S17.
   [186]Click here for additional data file.^ (2.6MB, docx)
   DOI: 10.7717/peerj.10576/supp-1
   Supplemental Information 2. Tables S1-S5.
   [187]Click here for additional data file.^ (39.5KB, docx)
   DOI: 10.7717/peerj.10576/supp-2
   Supplemental Information 3. The function of MAG.
   [188]Click here for additional data file.^ (215B, r)
   DOI: 10.7717/peerj.10576/supp-3
   Supplemental Information 4. The function of SimSIP.
   [189]Click here for additional data file.^ (143B, r)
   DOI: 10.7717/peerj.10576/supp-4

Acknowledgments