Abstract

Background

   Alternative polyadenylation (APA) has emerged as a pervasive mechanism
   that contributes to the transcriptome complexity and dynamics of gene
   regulation. The current tsunami of whole genome poly(A) site data from
   various conditions generated by 3′ end sequencing provides a valuable
   data source for the study of APA-related gene expression. Cluster
   analysis is a powerful technique for investigating the association
   structure among genes, however, conventional gene clustering methods
   are not suitable for APA-related data as they fail to consider the
   information of poly(A) sites (e.g., location, abundance, number, etc.)
   within each gene or measure the association among poly(A) sites between
   two genes.

Results

   Here we proposed a computational framework, named PASCCA, for
   clustering genes from replicated or unreplicated poly(A) site data
   using canonical correlation analysis (CCA). PASCCA incorporates
   multiple layers of gene expression data from both the poly(A) site
   level and gene level and takes into account the number of replicates
   and the variability within each experimental group. Moreover, PASCCA
   characterizes poly(A) sites in various ways including the abundance and
   relative usage, which can exploit the advantages of 3′ end deep
   sequencing in quantifying APA sites. Using both real and synthetic
   poly(A) site data sets, the cluster analysis demonstrates that PASCCA
   outperforms other widely-used distance measures under five performance
   metrics including connectivity, the Dunn index, average distance,
   average distance between means, and the biological homogeneity index.
   We also used PASCCA to infer APA-specific gene modules from recently
   published poly(A) site data of rice and discovered some distinct
   functional gene modules. We have made PASCCA an easy-to-use R package
   for APA-related gene expression analyses, including the
   characterization of poly(A) sites, quantification of association
   between genes, and clustering of genes.

Conclusions

   By providing a better treatment of the noise inherent in repeated
   measurements and taking into account multiple layers of poly(A) site
   data, PASCCA could be a general tool for clustering and analyzing
   APA-specific gene expression data. PASCCA could be used to elucidate
   the dynamic interplay of genes and their APA sites among various
   biological conditions from emerging 3′ end sequencing data to address
   the complex biological phenomenon.

Electronic supplementary material

   The online version of this article (10.1186/s12864-019-5433-7) contains
   supplementary material, which is available to authorized users.

   Keywords: Alternative polyadenylation, Cluster analysis, Gene
   expression, Canonical correlation analysis, Network inference

Background

   Messenger RNA (mRNA) polyadenylation is an essential cellular process
   in eukaryotes, which consists of cleavage at the 3′ end of pre-mRNA and
   an addition of a tract of adenosines [poly(A) tail]. As one of the key
   post-transcriptional events, polyadenylation plays important roles in
   many aspects of mRNA biogenesis and functions, such as mRNA stability,
   localization, and translation [[37]1, [38]2]. Accumulating genomic
   studies have indicated that most eukaryotic genes (more than 70% of
   genes in plants or mammals) can undergo alternative polyadenylation
   (APA) [[39]3–[40]7], leading to mRNAs with variable 3′ ends and/or
   different coding potentials [[41]8, [42]9]. APA is now emerging as a
   pervasive mechanism that contributes to dynamics of gene regulation and
   links to important cellular fates. For example, APA can be regulated in
   a tissue- and/or developmental stage- specific manner. Global 3’ UTR
   shortening was observed in testis, proliferating cells, and cancer
   cells [[43]3, [44]10, [45]11]. APA is also associated with flowering
   time in plants [[46]12] and oncogene activation in human cancer cells
   [[47]11]. Recent whole genome poly(A) site data from various conditions
   generated by 3′ end sequencing [[48]7, [49]13–[50]16] have stimulated
   interests in elucidating the dynamics of APA and its implications for
   regulation of gene expression, which can be a potential data source for
   the study of APA-related gene expression. Surprisingly, however, as
   data continue to accumulate, there is no general method or tool to
   analyze gene expression regarding APA regulation in different tissue
   types, developmental stages, or disease states.

   Clustering is one of the most frequently used analyses on genomic data,
   which has been demonstrated to be a powerful technique for
   investigating the association structure among genes as well as
   underlying molecular mechanisms of gene clusters [[51]17, [52]18]. The
   conventional cluster analysis is to apply widely used clustering
   algorithms on gene expression data, such as correlation or Euclidean
   distance based hierarchical clustering, K-means clustering, and Self
   Organizing Map [[53]17, [54]19, [55]20]. However, traditional methods
   for clustering gene expression data are not suitable for APA-related
   gene expression analysis. First, in conventional gene cluster analyses,
   a single value, such as the raw count or FPKM (fragments per kilobase
   per million mapped fragments) [[56]21], is used to represent gene
   expression level, while this is not applicable for the case of poly(A)
   site data as one gene can have multiple poly(A) sites. A common
   approach for analyzing gene expression from poly(A) site data is
   summing up the abundance of poly(A) sites within each gene and then
   applying popular clustering algorithms [[57]22–[58]24]. Although this
   is a simple and direct way, it would overlook the information of
   poly(A) sites (e.g., location, abundance, number, etc.) within each
   gene. Consequently, for example, the difference between two genes with
   different number of poly(A) sites but the same overall abundance was
   not considered in previous studies. As such, it is necessary to take
   into account the number, abundance, even the location of all poly(A)
   sites within each gene. Second, the result of a cluster analysis
   heavily depends on the cluster algorithm, especially the similarity
   measure between genes [[59]17]. Distance measures such as correlation
   coefficients, Minkowski distance, and mutual information [[60]17] have
   been widely employed in traditional cluster analyses, while such
   metrics are not able to measure the association among poly(A) sites
   between two genes. It is important but still challenging to design a
   measure to involve multiple layers of gene expression data from both
   the poly(A) site level and gene level. Third, although the regulation
   of APA across different physiological or pathological conditions has
   been well studied in recent years [[61]7–[62]9, [63]25, [64]26],
   cluster analysis using poly(A) site data has not been extensively
   studied in the field of APA. Most previous studies on APA focused on
   the analyses of 3’ UTR lengthening or shortening across various tissues
   or development stages [[65]7, [66]23, [67]26–[68]28], while the
   analysis of gene expression is scarce. Recent advances in deep 3′ end
   sequencing have provided multiple layers of transcriptome complexity
   detailing individual poly(A) sites within each gene rather than just
   overall gene expression [[69]6, [70]7, [71]15, [72]24, [73]25, [74]29],
   placing new demands on the methods applied to identify potential gene
   modules associated with specific APA regulation.

   The reliability of the biological conclusion drawn from genomic studies
   heavily depends on the quality of the biological data used, while in
   most cases, biological experiments are often subject to various
   potential sources of variance. To reduce the inherent noise as well as
   produce reproducible and statistically significant results, a common
   approach is to conduct repeated measurements (replicates). Replication
   is important for statistics analysis as it can not only enhance the
   precision of estimated quantities but also provide information about
   the random fluctuation or the uncertainty of the derived estimate
   [[75]30]. As the cost of deep sequencing is declining, growing genomic
   data are being generated with repeated measurements. Conventional
   clustering algorithms such as k-means or hierarchical clustering are
   not ideal to deal with repeated data as they ignore the specific
   experimental design under which the biological data were collected. In
   most gene expression analyses, gene expression levels of different
   replicates are first averaged and then analyzed with conventional
   clustering algorithms, which fails to employ the information concerning
   the variability among replicates. Considering variability in gene
   expression analysis would help to increase the detection power [[76]31]
   and yield clusters with higher accuracy and stability [[77]32]. With
   this in mind, several clustering methods or distance measures have been
   proposed for summarizing repeated measurements, such as confidence
   interval inferential methodology [[78]30], the multivariate correlation
   coefficient method [[79]33, [80]34], and infinite mixture model-based
   approach [[81]32]. However, these methods are not applicable for the
   APA-related gene expression data because each individual gene contains
   multi-layer information about poly(A) site usage and it cannot be
   treated as an independent feature. Recently, several methods or tools,
   such as RseqNet [[82]35] and SpliceNet [[83]36], were proposed to infer
   co-expression network from multi-layer genomic data taking into account
   the expression difference among exons and isoforms. However, these
   methods fail to take into consideration the variance among multiple
   replicates and are not specialized for APA analyses. Whole genome
   poly(A) site data with replicates across various tissues and/or
   developmental states are being generated [[84]7, [85]13, [86]14],
   demanding computationally efficient methods to take advantage of these
   new data sets. Incorporating both repeated measurements and APA
   knowledge into the analysis of gene expression regulation would lead to
   more statistically significant and biologically relevant insights in
   the field of APA.

   Here we proposed a computational framework, named PASCCA, for
   clustering genes from poly(A) site data using canonical correlation
   analysis (CCA). PASCCA is intended to leverage the merit of existing
   poly(A) site data for APA-related gene expression analyses, which has
   the following advantages. First, PASCCA incorporates detailed
   information about APA sites within each gene, which can quantify the
   overall association of APA sites across various conditions between each
   pair of genes. Second, PASCCA takes into account both the number of
   replicates and the variability within each experimental group, which is
   capable of fully exploring the similarity between repeated measures.
   Third, PASCCA characterizes poly(A) sites in various ways including the
   abundance and relative usage, which can exploit the advantages of 3′
   end deep sequencing in quantifying APA sites. Moreover, PASCCA provides
   a correlation measure rather than a clustering method, which could be
   easily used as a similarity metric for various clustering methods, gene
   network inference methods, or other potential circumstances. We have
   made PASCCA an easy-to-use R package for analyses of APA-related gene
   expression. Using both real and synthetic poly(A) site data sets, the
   cluster analysis demonstrates that PASCCA performs better than other
   widely-used distance measures under several performance metrics
   including connectivity, the Dunn index, average distance, average
   distance between means, and the biological homogeneity index. We also
   used PASCCA to infer APA-specific gene modules from a recently
   published poly(A) site data set of rice [[87]7] and discovered some
   distinct functional gene modules. By providing a better treatment of
   the noise inherent in repeated measurements and taking into account
   multiple layers of poly(A) site data, PASCCA could be a general tool
   for clustering and analyzing APA-specific gene expression data.

Results

Overview of PASCCA

   PASCCA consists of a general pipeline for analyzing poly(A) site data
   (Fig. [88]1). First, poly(A) site data are pre-processed for further
   APA-specific gene expression analyses. Poly(A) sites with low
   abundance, sites located in intergenic regions, or genes that possess
   single poly(A) site are removed. The retained poly(A) sites are
   subjected to DEXseq [[89]37] to identify poly(A) sites with
   differential usage among experiments and sites that are not
   differentially used in at least one pair of experiments are discarded.
   Next, different quantification methods can be used to characterize each
   poly(A) site. In addition to using the abundance to represent each
   poly(A) site, we included the relative usage as another metric to
   quantify poly(A) sites, which has been reported critical in the
   determination of poly(A) site choice among different conditions
   [[90]5]. After quantifying poly(A) sites, the data are then subjected
   to a weighting scheme based on canonical correlation analysis to obtain
   the correlation between each gene pair. As the core step of PASCCA,
   this weighting scheme incorporates detailed information about poly(A)
   sites within each gene and takes into account both the number of
   replicates and the variability within each experiment. The output of
   this step is a similarity matrix which can be used for downstream
   analyses, such as clustering and network inference. Both real and
   synthetic poly(A) site data sets were tested and various performance
   indexes were employed for comprehensive performance evaluation of
   PASCCA.

Fig. 1.

   [91]Fig. 1
   [92]Open in a new tab

   General pipeline of PASCCA

Evaluation of PASCCA on real poly(A) site data set in rice

   We adopted a replicated poly(A) site data set from rice to evaluate
   PASCCA, which consists of 14 tissues each with two or three repeated
   measurements [[93]7]. First we identified 4564 genes with at least one
   differentially used poly(A) site using DEXseq [[94]37], and 14,107
   poly(A) sites in these genes were obtained for further analysis. The
   weight matrix obtained from PASCCA was used as the distance matrix and
   compared with other correlation-based distance metrics, including
   Pearson’s correlation coefficient (PCC) and CCA. Since no priori
   knowledge of the exact number of clusters was available for the real
   rice poly(A) site data, variable number of clusters ranging from 5 to
   20 was set for performance evaluation. Under each specific number of
   clusters, the performance of each distance measure was assessed by
   calculating various performance metrics based on the hierarchical
   clustering method. PASCCA shows the best performance among all distance
   measures regardless of performance metrics employed (Fig. [95]2). The
   performance of PASCCA is consistently higher than PCC and CCA in terms
   of the internal validation measures, CON (connectivity) and DUNN (the
   Dunn index) (Fig. [96]2a and b), indicating that the variance within
   clusters derived from PASCCA is much smaller than that from PCC and
   CCA. Considering the stability validation, PASCCA is apparently
   superior to PCC and has slight advantages over CCA (Fig. [97]2c and d).
   PASCCA also provides the most biologically relevant clustering
   partitions as measured by the biological homogeneity index (BHI) (Fig.
   [98]2e), reflecting the increased biological homogeneity of clusters
   obtained from PASCCA. Generally, PCC provides the worst results, which
   may be due to that PCC fails to incorporate detailed information of
   poly(A) sites within each gene. Next, instead of choosing variable
   number of clusters, the best number of clusters for each distance
   measure was estimated by the Silhouette criterion [[99]17, [100]38].
   Still, PASCCA shows overall better performance than PCC and CCA (Fig.
   [101]2f), demonstrating that clusters identified from PASCCA are more
   physically stable and compact.

Fig. 2.

   [102]Fig. 2
   [103]Open in a new tab

   Evaluation of PASCCA on real poly(A) site data in rice using
   hierarchical clustering. Standardized cluster validation scores for
   various performance indexes with increasing number of clusters were
   calculated, including CON (a), DUNN (b), AD (c), ADM (d), and BHI (e).
   Without knowing the true number of clusters in a given data set,
   variable number of clusters ranging from 5 to 20 was set. Comparison of
   performances with the estimated number of clusters for each method was
   shown in (f). Larger score indicates better performance. CON,
   connectivity; DUNN, Dunn index; AD, average distance; ADM, average
   distance between means; BHI, biological homogeneity index

Evaluation of PASCCA on synthetic poly(A) site data sets

   To further demonstrate the superiority of PASCCA on repeated data, we
   analyzed synthetic data sets with replicates (see Methods). We applied
   PASCCA to three different kinds of data sets with variable number of
   experiments, genes, and repeated measurements. We need to point out
   that, there is no real gene in the synthetic data sets, therefore the
   index of BHI was not considered in the simulation study. In the first
   simulation study, we tested synthetic data sets with different number
   of experiments. Given a specific number of experiments ranging from
   four to twelve, ten synthetic data sets each with 500 genes that
   possess multiple poly(A) sites and three replicates for each experiment
   were generated. For each run of clustering, we set the number of
   clusters varying from 5 to 20. After clustering ten synthetic data sets
   of a given number of experiments, we obtained a total of 160 validation
   scores for each performance metric under one distance. Then the mean
   and standard deviation of the 160 validation scores were calculated. In
   almost all cases, PASCCA presents the best results, followed by CCA
   (Fig. [104]3). Considering the internal metrics (CON and DUNN), PASCCA
   outperforms CCA and PCC (Fig. [105]3a and b), reflecting higher
   compactness, connectedness, and separation of cluster partitions
   obtained from PASCCA. Particularly, PCC provides better performance
   than CCA regarding the CON metric (Fig. [106]3a) whereas CCA
   outperforms PCC regarding the DUNN metric (Fig. [107]3b), which
   reflects that PCC generates cluster partitions with higher
   connectedness while CCA generates cluster partitions with higher
   separation. When considering the AD (average distance) metric, PASCCA
   has a slight advantage over CCA but provides far better performance
   than PCC (Fig. [108]3c), reflecting the smaller average distance
   between observations in the same cluster obtained from PASCCA or CCA
   than that from PCC. Regarding the ADM (average distance between means)
   metric, again, PASCCA has the best performance, followed by CCA, and
   PCC provides the worst results (Fig. [109]3d).

Fig. 3.

   [110]Fig. 3
   [111]Open in a new tab

   Validation scores on synthetic data sets with different number of
   experiments using hierarchical clustering. Standardized cluster
   validation scores for various cluster validation measures across a
   range of different number of clusters were calculated, including CON
   (a), DUNN (b), AD (c), and ADM (d). For each trial with a fixed number
   of experiments, ten data sets were randomly selected from the whole
   synthetic data set. The best number of clusters was estimated for each
   trial. The mean validation scores for trials performed on the 10 random
   data sets were plotted. The standard deviation is depicted as an error
   bar

   In the second simulation study, we tested synthetic data sets with
   variable number of genes to assess the effect of data size on
   clustering. Given a restricted number of genes ranging from 500 to 4500
   with an increment of 500, ten data sets each with 14 experiments and
   three replicates for each experiment were randomly generated. Similar
   to the scenario on different number of experiments, we obtained the
   mean and standard deviation for each performance metric under each
   distance measure. Again, PASCCA provides the best results regardless of
   performance metrics or number of genes (Additional file [112]1: Figure
   S1). The variance within clusters obtained from PASCCA is much smaller
   than that from PCC and CCA, which is reflected by metrics of CON and
   DUNN (Additional file [113]1: Figures S1a and b). According to metrics
   of AD and ADM, PASCCA also provides more stable results than PCC and
   CCA (Additional file [114]1: Figure S1c and d).

   In the third evaluation scenario, we generated synthetic data sets that
   contain 500 genes and 14 experiments with two to 15 replicates for each
   experiment. Regarding CON and AD metrics, PASCCA presents consistently
   higher performance than CCA and PCC, whereas CCA and PCC provides the
   worst results according to CON and AD, respectively (Additional file
   [115]1: Figure S2a and c). Interestingly, regarding the AD metric, the
   performance of CCA is decreased with the increase of the number of
   replicates while the performance of PASCCA is high and stable
   (Additional file [116]1: Figure S2c), demonstrating the importance of
   considering replicates in clustering. Considering the DUNN and ADM
   metrics, PASCCA performs slightly worse or equally to CCA when the
   number of replicates is low, while PASCCA outperforms CCA with the
   increase of the number of replicates (Additional file [117]1: Figure
   S2b and d). Overall, PASCCA stands out as the best distance, while PCC
   provides the worst performance.

Characterization of poly(A) sites by relative abundance

   A previous study [[118]5] used the relative proportion of reads rather
   than the number of reads of poly(A) sites to determine the poly(A) site
   choice between two conditions and found a large number of Arabidopsis
   genes were altered in the oxt6 mutant. Here we used the relative
   abundance of the poly(A) site as another metric to characterize poly(A)
   sites. Given a gene with n poly(A) sites in one experiment, the
   relative abundance for poly(A) site p is
   [MATH: <mfrac><mrow><mi>a</mi><mfenced close=")"
   open="("><mi>p</mi></mfenced></mrow><mrow><msub><mo>∑</mo><mi>n</mi></m
   sub><mi>a</mi><mfenced close=")"
   open="("><mi>i</mi></mfenced></mrow></mfrac><mo>,</mo><mi>i</mi><mo>=</
   mo><mn>1</mn><mo>.</mo><mo>.</mo><mi>n</mi><mspace
   width="0.25em"></mspace> :MATH]
   , where a(p) is the abundance of poly(A) site p. Using the real poly(A)
   site data set represented by the relative abundance, we obtained
   weights for all gene pairs using PASCCA. First, we conducted the
   cluster analysis to evaluate the performance of PASCCA. Again, PASCCA
   is superior to CCA and PCC regardless of performance metrics
   (Fig. [119]4a-e). Considering the internal validation metrics, PASCCA
   apparently outperforms CCA and PCC (Fig. [120]4a and b), which is
   similar to the result using the abundance of poly(A) sites (Fig.
   [121]2a and b). Regarding the stability validation metrics, PASCCA has
   slight advantages over CCA using the AD metric whereas they have
   comparable performance according to the ADM metric (Fig. [122]4c and
   d). Still, both PASCCA and CCA clearly outperform PCC. In terms of the
   BHI metric, PASCCA presents the best results, followed by PCC, while
   CCA provides the worst results (Fig. [123]4e). Obviously, regardless of
   ways to characterize poly(A) sites, PASCCA generally outperforms PCC
   and CCA (Figs.[124]2 and [125]4). According to the BHI metric, both
   ways present the best performance when the number of clusters is 12
   (Figs. [126]2e and [127]4e). In the case with 12 clusters,
   distributions of numbers of genes in each cluster obtained from both
   ways are similar (Fig. [128]4f). Surprisingly, however, less than 30%
   of genes in clusters from both ways are overlapped (Additional file
   [129]1: Figure S3). For example, for the largest cluster that has ~ 700
   genes from both ways, only 195 genes are overlapped. These results
   suggest that different ways used to characterize poly(A) sites may
   contribute considerably to the clustering results, therefore, it is
   critical to choose the way for representing poly(A) sites and to
   carefully inspect the clustering results according to the respective
   biological questions.

Fig. 4.

   [130]Fig. 4
   [131]Open in a new tab

   Cluster analyses of real poly(A) site data set based on relative
   abundance. Standardized cluster validation scores for various
   performance indexes with the increasing of the number of clusters were
   calculated, including CON (a), DUNN(b), AD (c), ADM (d), and BHI (e).
   Larger scores indicate better performance. Without knowing the true
   number of clusters in a given data set, variable number of clusters
   ranging from 5 to 20 was set. (f) Number of genes in clusters obtained
   from PASCCA using poly(A) site data set characterized by abundance or
   relative abundance

Distinct gene modules identified by network inference integrating PASCCA

   Network inference has become a critical step towards understanding
   complex biological phenomena. Next, we demonstrated the use of PASCCA
   in constructing APA-specific gene networks. First weights for all gene
   pairs were obtained from PASCCA and CCA, respectively. Only gene pairs
   with statistically significant weights were retained. The weight
   matrices from both methods were further used as adjacency matrices for
   WGCNA [[132]39], a popular R package for weighted correlation network
   analysis, to infer network modules. For comparison, we also obtained
   network modules based on gene expression levels that were obtained by
   summing up reads of all poly(A) sites in each gene (hereinafter
   referred to as genePCC). Each module obtained from WGCNA can be
   considered as a co-expression network. Using WGCNA, nine, eight, and 15
   modules were obtained using PASCCA, CCA, and genePCC, respectively
   (Additional file [133]1: Figure S4a). Although PASCCA and CCA obtained
   similar number of modules, the number of genes in these modules varied
   widely. Particularly, among the eight modules obtained from CCA, the
   vast majority of genes (61%, 2768) were found in one module. In
   contrast, genes are more evenly distributed in modules obtained from
   PASCCA (Additional file [134]1: Figure S4a). It is possible that CCA
   failed to distinguish small modules from large ones and consequently
   produces an overbalanced module with large number of genes. We also
   found that ~ 60% of genes from each module obtained from PASCCA are
   overlapped with the largest module obtained from CCA (Additional file
   [135]1: Figure S4b), indicating that PASCCA is capable of segmenting a
   large group of genes by incorporating information such as the variance
   among replicates. Among the three methods, the highest number of
   modules (15) were obtained by genePCC. Similar to CCA, the numbers of
   genes in modules from genePCC are also very unevenly distributed,
   ranging from 65 to 1261.

   In order to evaluate the performance of PASCCA in the network
   construction, we also calculated various metrics for assessing the
   modularity and community structure in a network. Generally, PASCCA has
   better performance than CCA and genePCC regardless of network metrics
   employed (Fig. [136]5a). The increased density of modules obtained from
   PASCCA is reflected in a higher ACC (average clustering coefficient)
   score of 0.77, compared to 0.67 in CCA and 0.68 in genePCC. According
   to the BHI metric, modules generated from PASCCA are more biologically
   meaningful than those from CCA or genePCC. Particularly, genePCC has
   much lower score of MD (module degree) metric (0.15) than PASCCA (0.57)
   or CCA (0.55), reflecting that there are much denser connections
   between nodes within modules but much sparser connections between nodes
   in different modules obtained from PASCCA or CCA than from genePCC.

Fig. 5.

   [137]Fig. 5
   [138]Open in a new tab

   Analyses of gene modules. a Validation scores of network metrics for
   gene modules identified by genePCC, CCA, and PASCCA. Larger scores
   indicate better performance. Bars correspond to the mean validation
   scores for all modules identified by the corresponding method. The
   standard deviation is depicted as an error bar. b Gene ontology result
   for modules obtained from PASCCA. c KEGG pathway enrichment analysis
   for modules obtained from PASCCA

   Next, we examined the relationship between tissues and modules
   identified by PASCCA according to the correlation between each pair of
   module and tissue. These modules can be largely divided into two groups
   (Additional file [139]1: Figure S5): one group is of high correlation
   with tissues of dry seed, endosperm, imbibed seed, and embryo; the
   other group is highly correlated with tissues of shoot, leaf, stem, and
   mature pollen. This result reflects that different developmental stages
   of the same tissues have similar distribution of module correlations,
   which is consistent with the previous result from rice that different
   developmental stages of the same tissues have similar expression
   patterns of transcript isoforms [[140]7]. We also performed GO (gene
   ontology) analysis and KEGG (Kyoto encyclopedia of genes and genomes)
   pathway enrichment analysis for genes or hub genes from these modules.
   GO analysis revealed distinct functions associated with different
   modules (Fig. [141]5b). For example, the pink module is uniquely
   enriched in the biological process of embryo development; the red
   module is exclusively enriched in the biological process of anatomical
   structure morphogenesis. Green, brown and magenta modules are
   over-represented in processes such as carbohydrate metabolic process,
   biosynthetic process, photosynthesis, and lipid metabolic process,
   which play critical roles in controlling plant growth, development, and
   crop yield [[142]40, [143]41]. We also performed KEGG pathway
   enrichment analysis for hub genes identified from each individual
   modules to investigate their functional importance. Hub genes for a
   module were defined as genes with correlation values with the
   respective module larger than 0.7. Hub genes for all modules obtained
   from PASCCA were provided in Additional file [144]2. Turquoise and pink
   modules are over-represented in pathways of cysteine and methionine
   metabolism and sulfur relay system, which were associated with pathways
   of glutathione (GSH) metabolism and biosynthesis of its precursor (Fig.
   [145]5c). GSH has been found to be critical for plant cold acclimation
   and chilling tolerance through reducing the accumulation of reactive
   oxygen species [[146]42, [147]43]. It is possible that the APA-mediated
   GSH metabolism may be crucial in the adaptation of rice to extreme
   temperate climates. The blue module is exclusively enriched in the
   pathway of phosphatidylinositol signaling system, which is the main
   signaling pathway of plant disease resistance. The green, pink and
   magenta modules are over-represented in plant growth-related pathways,
   including pentose phosphate pathway, porphyrin and chlorophyll
   metabolism, carbon metabolism, carbon fixation in photosynthetic
   organisms and circadian rhythm - plant. These findings indicated that
   these APA-mediated gene modules may play an important role in the
   growth process of rice.

Discussion

   Cluster analysis has been enormously successful in the past decades in
   detecting patterns or relationships between genes to reveal the
   underlying molecular mechanism [[148]17, [149]19, [150]20, [151]44].
   Inference of isoform or protein networks has also attracted
   considerable attention recently, and several tools now exist for
   network construction involving single or multiple layers of biological
   information [[152]35, [153]36, [154]45–[155]48]. Despite of the
   availability of various clustering and network inference methods,
   integrating appropriate distance measures for different biological data
   sets or research purposes is one of the primary issues [[156]44].
   Provided that most clustering methods use a distance matrix as the
   input, choosing a distance measure employed by the clustering method is
   a non-trivial task which can significantly affect the final clustering
   performance [[157]17]. Recent genomic studies have uncovered widespread
   occurrences of APA and found a large number of genes with APA sites
   [[158]8, [159]24, [160]49, [161]50], however, methods of cluster
   analysis or network inference for APA-related gene expression data are
   still scarce, placing demands on developing new methods complementary
   to traditional APA analyses to gain deeper insights into the underlying
   biological system. Here, we resorted to the method of canonical
   correlation analysis and assigned the weight for each gene pair that
   represents the strength of direct interaction between genes.
   Especially, our model enhances weights of direct interactions by
   incorporating various poly(A) site information and repeated
   measurements of biological experiments.

   In this study we have proposed PASCCA, a new pipeline for clustering
   and network inference from APA-related gene expression data
   with/without repeated measurements. The weight matrix from the
   weighting scheme can be an alternative to the similarity or distance
   metric for downstream cluster analysis, network inference, and other
   possible purposes. CCA, a traditional statistical method for
   investigating the relationships between two sets of variables, have
   recently been employed in genomic studies to estimate the correlations
   from gene expression data, however, the use of CCA is still fairly
   limited [[162]35, [163]51]. As one kind of correlation coefficient, CCA
   is useful in cluster analysis to determine the correlation between
   genes. However, when CCA was applied for clustering gene expression,
   replicates of each treatment group were simply averaged regardless the
   underlying variance. This is, unfortunately, not fully applicable on
   the current situation that genomic data with multiple biological
   replicates are increasingly generated to produce reproducible and
   statistically significant results. To meet these specific needs, PASCCA
   seamlessly integrated the concept of shrinkage and CCA to improve the
   estimation of correlation for data with replicates. Shrinkage concept
   has been widely employed in previous studies to overcome limitations of
   Pearson’s correlation coefficient [[164]33, [165]52–[166]54]. By
   incorporating shrinkage concept, PASCCA is capable of taking into
   account both the number of repeated measurements and the variance
   within each experiment, which can better cluster replicated biological
   data and highlight new information pertaining to gene expression
   patterns. Using one of the most popular cluster methods, hierarchical
   clustering, we comprehensively compared PASCCA with two correlation
   based distance measures including CCA and PCC, based on diverse cluster
   and network validation metrics. Results demonstrated that PASCCA can
   generate clusters and networks with higher quality using both synthetic
   and real poly(A) site data sets. Moreover, we test different synthetic
   data sets with variable number of experiments, genes, and repeated
   measurements, results showed that PASCCA has higher performance and
   better robustness than other distances investigated. One of the major
   reasons for the superiority of PASCCA is that the shrinkage factor
   introduced in the CCA benefits an optimal estimate of the error in
   replicates and thus can better quantify the relationship between genes,
   even for data with small number of replicates which is the normal case
   in many genomic studies.

   Another compelling feature of PASCCA is, perhaps, incorporating
   detailed information about APA sites to quantify the association
   between genes. Previously, to measure the gene expression level from 3′
   end sequencing, overall gene expressions were determined by summing up
   all poly(A) sites within each gene [[167]22–[168]24]. Although the
   recent high-throughput 3′ end sequencing has made detailing APA sites
   cost-effective, the true merit of 3′ sequencing in quantifying APA
   sites has not been fully explored in most poly(A) studies to date.
   Unlike traditional distances such as the correlation coefficient and
   Euclidean distance that are based on overall gene expressions without
   considering biological details within each gene, PASCCA is capable of
   inferring the multivariable (APA sites) dependency between two genes.
   By comparing the clustering results from PASCCA with other two
   correlations including PCC and CCA, we demonstrated that PASCCA can
   significantly improve the clustering performance by quantifying
   abundance difference in APA sites. More importantly, PASCCA provides an
   advantage for full exploitation of poly(A) sites by incorporating
   different metrics in quantifying APA sites before conducting the
   cluster analysis or network inference. Consequently, the performance of
   PASCCA may be affected by the quantification metric used. To assess the
   influence of different quantification metrics, we conducted two cluster
   analyses on the same real poly(A) site data set but using two different
   metrics, abundance and relative abundance. Experimental results using
   different quantification schemes vary greatly. Numerous studies have
   emphasized the importance of relative usage instead of the abundance of
   poly(A) site in determining poly(A) site choice among different
   conditions [[169]5, [170]24]. For practical application purpose, we
   suggest using both quantification metrics for weighting gene pairs
   which could be complimentary to each other.

   Given the importance of APA in regulating gene expression, the lack of
   methodology for quantifying the correlation in gene pairs is one of the
   big hurdles in the construction of APA-specific biological networks.
   PASCCA also contributes considerably to providing a correlation measure
   rather than a clustering method, which could be easily used as a
   similarity metric for downstream cluster analyses or network inference.
   Using the latest real poly(A) site data set, we adopted WGCNA [[171]39]
   to infer APA-specific gene expression networks based on the weight
   matrix calculated from PASCCA in order to demonstrate the biological
   importance of PASCCA and its implications on APA studies. We discovered
   nine distinct gene modules across 14 different tissues and
   developmental stages of rice. GO analysis suggests that some gene
   modules are strongly involved in biological processes relevant to plant
   growth processes including lipid metabolic process, photosynthesis,
   biosynthetic process, and carbohydrate metabolic process (Fig.
   [172]5b). Similarly, KEGG enrichment analysis showed that these modules
   were significantly enriched in plant growth-related pathways, such as
   the pentose phosphate pathway, porphyrin and chlorophyll metabolism,
   carbon fixation in photosynthetic organisms (Fig. [173]5c). These
   findings indicated that gene modules inferred from PASCCA may play an
   important role in the growth process of rice. In addition, we found
   some gene modules were over-represented in pathways of GSH metabolism
   and biosynthesis of its precursor (Fig. [174]5c) which may be crucial
   for plant chilling tolerance and cold acclimation [[175]42, [176]43],
   suggesting that genes involved in these pathways may be functionally
   important in the adaptation of rice to extreme temperate climates.
   These results showed the potential of incorporating PASCCA to yield
   important gene modules and to lead to testable hypotheses in biology.

Conclusions

   We proposed a computational framework, called PASCCA, for analyses of
   APA-related gene expression, including the characterization of poly(A)
   sites, quantification of association between genes with/without
   repeated measurements, clustering of APA-related genes to infer
   significant APA specific gene modules, and the evaluation of clustering
   performance with a variety of indexes. PASCCA incorporates multiple
   layers of gene expression data from both the poly(A) site level and
   gene level and takes into account both the number of replicates and the
   variability within each experimental group. PASCCA could be a general
   tool for clustering and analyzing APA-specific gene expression data,
   which is useful in elucidating the dynamic interplay of genes and their
   APA sites among various biological conditions from emerging 3′ end
   sequencing data to address the complex biological phenomenon.

Methods

Real and synthetic poly(A) site data sets

   We used both real and synthetic data sets to evaluate PASCCA. The real
   poly(A) site data set which consists of 14 tissues each with two or
   three repeated measurements in rice was collected from the previous
   study [[177]7]. Fu et al. focused on the identification of tissue
   specific poly(A) sites among different tissues but did not conduct any
   cluster analysis to infer APA-specific gene modules. This data set
   contains a total of 68,220 poly(A) sites dispersed in 28,032 genes,
   which is the largest poly(A) site data set in plants to date. To obtain
   poly(A) sites with high confidence, we discarded poly(A) sites that are
   supported by less than five reads.

   It is noteworthy that data sets with a certain number of conditions
   (tissues, developmental states, etc.) and repeated measurements are
   required for fully evaluating PASCCA, unfortunately, very few publicly
   available poly(A) site data sets meet both criteria. Although there are
   plenty of gene expression data with repeated measurements from
   microarray or RNA-seq, repeated poly(A) site data sets are still rare.
   To overcome these limitations, we used a two-step process to generate
   synthetic data that have the same distribution of abundance of poly(A)
   sites derived from the rice data. Given a gene g with k poly(A) sites,
   let the abundance of poly(A) site i (i = 1. . k) be a(i), then the true
   frequency of poly(A) site i is p(i) = a(i)/ ∑ a(k). For each
   experiment, we generated the simulated abundance of each poly(A) site
   in each gene according to the binomial distribution with probability
   being p(i) and size being ∑a(k). To simulate replicates for each
   experiment, we added random noises based on the normal distribution
   using the R function rnorm with both the mean and standard deviation
   derived from the true data set. To evaluate the performance of PASCCA,
   data sets with different number of experiments (tissues), repeated
   measurements, and genes were randomly selected from the synthetic data
   set for evaluation.

Performance evaluation

   For the performance evaluation, our primary interest lies in the
   comparison of the distance measure provided in PASCCA with other
   distance measures rather than on the assessment of clustering methods.
   Because that distance measures are normally employed with a clustering
   method but not as a single entity, we applied one of the most popular
   clustering methods, hierarchical clustering (HC) [[178]55] and compared
   PASCCA with two distance measures, including PCC and CCA [[179]35]. The
   reason for choosing PCC and CCA for comparison is that they are both
   the correlation based distances used in biological data which are the
   same kind of distance as PASCCA. PCC is one of the most popular
   distances in cluster analysis of gene expression data. Hong et al.
   [[180]35] proposed a CCA-based method and developed a package called
   RSeqNet for clustering genes by taking into account the expression
   difference among exons. Although RSeqNet was not initially developed
   for poly(A) site data, it can be used for calculating the correlation
   between genes using the processed and formatted poly(A) site data. It
   should be noted that PCC is not capable of incorporating the poly(A)
   site information of genes, therefore, we summed up the abundance of all
   poly(A) sites within a gene as the expression level for that gene
   before applying PCC for clustering.

   There is no priori knowledge of what genes should be clustered together
   according to the poly(A) site data. We then used several performance
   metrics that do not require the class label to quantitatively assess
   the overall performance of PASCCA, which cover three main types of
   cluster validation measures including internal, stability, and
   biological [[181]56]. The internal validation evaluates the quality of
   the clustering based on intrinsic information in the data, using only
   the data set and the clustering partition as input. For internal
   validation, we used measures that reflect the compactness,
   connectedness, and separation of the cluster partitions, including the
   connectivity (CON) and the Dunn index (DUNN) [[182]44, [183]56]. The
   CON metric measures the extent of observations that are placed in the
   same cluster as their nearest neighbours in the data space; the DUNN
   metric reflects non-linear combinations of the compactness and
   separation [[184]56]. The stability validation measures the stability
   of the clustering by comparing the clustering result between the full
   data and the perturbed data. For stability validation, we used two
   indexes including the average distance (AD) and the average distance
   between means (ADM) [[185]44, [186]56]. The AD metric measures the
   average distance between observations clustered in the same cluster
   using the full data and the data with a single column removed; the ADM
   metric computes the average distance between cluster centers [[187]56].
   Biological validation measures the quality of the clustering by
   investigating the biological significance of clusters. We used the BHI
   (biological homogeneity index) to measure how biologically homogeneous
   a gene clustering is [[188]56]. We adopted the R package clValid
   [[189]56] to calculate validation scores for these metrics. As
   different metrics have different ranges of value, validation scores
   were normalized between 0 and 1 for a more intuitive comparison. The
   larger score indicates better performance.

   To assess network modules identified by WGCNA [[190]39] using different
   distance metrics, we employed several additional network metrics,
   including the eigenvector centrality (EC), module degree (MD), and
   average clustering coefficient (ACC) [[191]57–[192]60]. EC measures the
   influence of a node in a network. MD is a measure of the quality of the
   network module partition. ACC measures the density of triangles in a
   network.

Weighting scheme based on canonical correlation analysis

   We designed a weighting scheme based on CCA to quantify the correlation
   between each gene pair. We embedded the shrinkage correlation
   coefficient [[193]33] into the CCA framework to infer the correlation
   between two genes by incorporating detailed layers of all poly(A)
   sites. Assuming that we have each gene measured across T experiments
   with R(t) replicates for the t^th experiment, then
   [MATH:
   <mi>R</mi><mo>=</mo><munderover><mo>∑</mo><mrow><mi>t</mi><mo>=</mo><mn
   >1</mn></mrow><mi>T</mi></munderover><mi>R</mi><mfenced close=")"
   open="("><mi>t</mi></mfenced> :MATH]
   , where R is the total number of replicates of all experiments. Given a
   gene G with K poly(A) sites, let
   [MATH: <msub><mi mathvariant="bold-italic">D</mi><mi
   mathvariant="bold-italic">G</mi></msub><mo>=</mo><mfenced close="}"
   open="{" separators=",,,;,,;,,"><msubsup><mi>D</mi><mi
   mathvariant="italic">iG</mi><mfenced close=")" open="("><mrow><mi
   mathvariant="italic">jr</mi><mfenced close=")"
   open="("><mi>j</mi></mfenced></mrow></mfenced></msubsup><mrow><mi>i</mi
   ><mo>=</mo><mn>1</mn></mrow><mo>…</mo><mi>K</mi><mrow><mi>j</mi><mo>=</
   mo><mn>1</mn></mrow><mo>…</mo><mi>T</mi><mrow><mi>r</mi><mfenced
   close=")"
   open="("><mi>j</mi></mfenced><mo>=</mo><mn>1</mn></mrow><mo>…</mo><mrow
   ><mi>R</mi><mfenced close=")"
   open="("><mi>j</mi></mfenced></mrow></mfenced> :MATH]
   denote the set of measurements of all poly(A) sites in this gene, where
   [MATH: <msubsup><mi>D</mi><mi mathvariant="italic">iG</mi><mfenced
   close=")" open="("><mrow><mi mathvariant="italic">jr</mi><mfenced
   close=")" open="("><mi>j</mi></mfenced></mrow></mfenced></msubsup>
   :MATH]
   is the measurement for the i^th poly(A) site of gene G at the r(j)^th
   replicate of the j^th experiment. Given two genes P and Q each with m
   and n poly(A) sites (assuming m ≤ n), the objective is to quantify
   their relationship based on D[P] and D[Q]. We adopted CCA to obtain the
   maximum correlation coefficients for D[P] and D[Q] by seeking weights α
   and β for D[P] and D[Q] which results in the maximun correlation
   coefficient for the linear combination of the m poly(A) sites in gene
   P, A = α^RD[P] and the linear combination of the n poly(A) sites in
   gene Q, B = β^RD[Q]. This is equivalent to solving the following
   problem:
   [MATH: <mtable
   displaystyle="true"><mtr><mtd><mo>max</mo><mo>:</mo><mtext>corr</mtext>
   <mfenced close=")" open="(" separators=","><mi
   mathvariant="normal">A</mi><mi
   mathvariant="normal">B</mi></mfenced><mo>=</mo><mfrac><mrow><msup><mi>α
   </mi><mi>R</mi></msup><msub><mo>∑</mo><mi
   mathvariant="italic">PQ</mi></msub><mi>β</mi></mrow><mrow><msqrt><mrow>
   <msup><mi>α</mi><mi>R</mi></msup><msub><mo>∑</mo><mi
   mathvariant="italic">PP</mi></msub><mi>α</mi></mrow></msqrt><msqrt><mro
   w><msup><mi>β</mi><mi>R</mi></msup><msub><mo>∑</mo><mi
   mathvariant="italic">QQ</mi></msub><mi>β</mi></mrow></msqrt></mrow></mf
   rac></mtd></mtr><mtr><mtd><msub><mo>∑</mo><mi
   mathvariant="italic">PP</mi></msub><mo>=</mo><mtext
   mathvariant="italic">corr</mtext><mfenced close=")" open="("
   separators=","><msub><mi mathvariant="bold-italic">D</mi><mi
   mathvariant="bold-italic">p</mi></msub><msub><mi
   mathvariant="bold-italic">D</mi><mi
   mathvariant="bold-italic">p</mi></msub></mfenced><mo>,</mo><msub><mo>∑<
   /mo><mi mathvariant="italic">QQ</mi></msub><mo>=</mo><mtext
   mathvariant="italic">corr</mtext><mfenced close=")" open="("
   separators=","><msub><mi mathvariant="bold-italic">D</mi><mi
   mathvariant="bold-italic">Q</mi></msub><msub><mi
   mathvariant="bold-italic">D</mi><mi
   mathvariant="bold-italic">Q</mi></msub></mfenced></mtd></mtr><mtr><mtd>
   <msub><mo>∑</mo><mi mathvariant="italic">PQ</mi></msub><mo>=</mo><mtext
   mathvariant="italic">corr</mtext><mfenced close=")" open="("
   separators=","><msub><mi mathvariant="bold-italic">D</mi><mi
   mathvariant="bold-italic">p</mi></msub><msub><mi
   mathvariant="bold-italic">D</mi><mi
   mathvariant="bold-italic">Q</mi></msub></mfenced><mo>,</mo><msub><mo>∑<
   /mo><mi mathvariant="italic">QP</mi></msub><mo>=</mo><mtext
   mathvariant="italic">corr</mtext><mfenced close=")" open="("
   separators=","><msub><mi mathvariant="bold-italic">D</mi><mi
   mathvariant="bold-italic">Q</mi></msub><msub><mi
   mathvariant="bold-italic">D</mi><mi
   mathvariant="bold-italic">P</mi></msub></mfenced><mo>.</mo></mtd></mtr>
   </mtable> :MATH]
   1

   Here ∑are the correlation matrices of samples.To obtain ∑[PP], ∑[PQ],
   ∑[QQ],  and∑[QP],we solved the correlation coefficient matrix for the m
   poly(A) sites in gene P and n poly(A) sites in gene Q. PCC is one of
   the most popular ways to calculate the correlation between two vectors,
   however, using the average value of each experiment or considering each
   replicate as an independent experiment would neglect the information
   concerning the variance among replicates. Considering the
   between-replicate variance, Yeung et al. proposed the standard
   deviation weighted correlation coefficient (SDCC) to model the
   variability of replicates, which showed higher accuracy and stability
   than using PCC [[194]32]. However, when the number of replicates is
   much smaller relative to the number of genes, SDCC could be inaccurate
   in estimating errors [[195]33]. Due to high labour and time costs,
   microarray or RNA-seq experiments are usually measured with limited
   number of replicates (e.g., < 5), SDCC is unfortunately not suitable
   for general gene expression data. To avoid the inaccuracy introduced by
   the small number of replicates, here we employed the shrinkage
   correlation coefficient (SCC) which has been applied in the analysis of
   replicated microarray data [[196]33]. SCC can fully exploit the
   similarity between replicates for the robust statistical estimation of
   errors of replicated expression data.

   Given T experiments, the real squared measurement errors of these
   experiments are denoted as δ(1), δ(2), … , δ(T). Initially, a
   T-dimensional model is required to estimate these T parameters,
   however, estimating parameters based on higher dimension would produce
   higher variance on the same data set. To reduce the estimation error,
   we projected the original T-dimensional model to the restricted
   one-dimensional sub model by using the mean of these T parameters,
   [MATH: <mspace width="0.25em"></mspace><mi
   mathvariant="normal">Θ</mi><mfenced close=")"
   open="("><mi>t</mi></mfenced><mo>=</mo><mfrac><mn>1</mn><mi>T</mi></mfr
   ac><munderover><mo>∑</mo><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi
   >T</mi></munderover><mi>δ</mi><mfenced close=")"
   open="("><mi>t</mi></mfenced> :MATH]
   . However, another type of estimation error would be introduced if we
   simply replace δ(1), δ(2), … , δ(T) with Θ(t). To balance both types of
   errors, we adopted the shrinkage error estimate. Given
   [MATH: <msubsup><mi>D</mi><mi mathvariant="italic">mG</mi><mrow><mi
   mathvariant="italic">tr</mi><mfenced close=")"
   open="("><mi>t</mi></mfenced></mrow></msubsup> :MATH]
   as the measurement for the m^th poly(A) site of gene G at the r(t)^th
   replicate of the t^th experiment,
   [MATH: <msub><mover accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w></msub><mo>=</mo><munderover><mo>∑</mo><mrow><mi>r</mi><mfenced
   close=")"
   open="("><mi>t</mi></mfenced><mo>=</mo><mn>1</mn></mrow><mrow><mi>R</mi
   ><mfenced close=")"
   open="("><mi>t</mi></mfenced></mrow></munderover><msubsup><mi
   mathvariant="normal">D</mi><mi>mG</mi><mrow><mi>tr</mi><mfenced
   close=")" open="("><mi
   mathvariant="normal">t</mi></mfenced></mrow></msubsup><mo>/</mo><mi>R</
   mi><mfenced close=")" open="("><mi>t</mi></mfenced> :MATH]
   is the average value of all replicates of this poly(A) site in the t^th
   experiment and
   [MATH:
   <msubsup><mi>K</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow><mn>2</mn
   ></msubsup><mo>=</mo><mfrac><mn>1</mn><mrow><mi>R</mi><mfenced
   close=")"
   open="("><mi>t</mi></mfenced><mo>−</mo><mn>1</mn></mrow></mfrac><munder
   over><mo>∑</mo><mrow><mi>r</mi><mfenced close=")"
   open="("><mi>t</mi></mfenced><mo>=</mo><mn>1</mn></mrow><mrow><mi>R</mi
   ><mfenced close=")"
   open="("><mi>t</mi></mfenced></mrow></munderover><msup><mfenced
   close=")" open="("><mrow><msubsup><mi>D</mi><mi
   mathvariant="italic">mG</mi><mrow><mi
   mathvariant="italic">tr</mi><mfenced close=")"
   open="("><mi>t</mi></mfenced></mrow></msubsup><mo>−</mo><msub><mover
   accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w></msub></mrow></mfenced><mn>2</mn></msup> :MATH]
   is the variance of the m^th poly(A) site in the t^th experiment.

   If the standard deviation (SD) is used as an estimate of the
   measurement error, then the SD-weighted average expression of poly(A)
   site m over the experiment t is:
   [MATH: <msubsup><mover accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w><mi
   mathvariant="italic">SD</mi></msubsup><mo>=</mo><msubsup><mo>∑</mo><mro
   w><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></msubsup><mfrac><msub
   ><mover accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w></msub><msubsup><mi>K</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow>
   <mn>2</mn></msubsup></mfrac><mo>/</mo><msubsup><mo>∑</mo><mrow><mi>t</m
   i><mo>=</mo><mn>1</mn></mrow><mi>T</mi></msubsup><mfrac><mn>1</mn><msub
   sup><mi>K</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow><mn>2</mn></ms
   ubsup></mfrac><mo>.</mo> :MATH]
   2

   To overcome the limitation of the SDCC method [[197]32], we introduced
   the shrinkage error to substitute the mean square error in eq. ([198]2)
   for a more accurate estimation of errors. We defined the unbiased
   estimate of the squared measurement error as
   [MATH: <msubsup><mover accent="true"><mi>K</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w><mn>2</mn></msubsup><mo>=</mo><msubsup><mo>∑</mo><mrow><mi>t</mi><mo>
   =</mo><mn>1</mn></mrow><mi>T</mi></msubsup><mfrac><mrow><mi>R</mi><mfen
   ced close=")"
   open="("><mi>t</mi></mfenced><mo>−</mo><mn>1</mn></mrow><mrow><mi>R</mi
   ><mo>−</mo><mi>T</mi></mrow></mfrac><msubsup><mi>K</mi><mrow><mi>m</mi>
   <mo>,</mo><mi>t</mi></mrow><mn>2</mn></msubsup><mo>.</mo> :MATH]
   3

   We then defined a balanced estimate based on the linear regularization
   model [[199]33]:
   [MATH:
   <msub><mi>W</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow></msub><mo>=
   </mo><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><msub><mi>ρ</mi><mi>m</mi></msub></m
   row></mfenced><msubsup><mi>K</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></
   mrow><mn>2</mn></msubsup><mo>+</mo><msub><mi>ρ</mi><mi>m</mi></msub><ms
   ubsup><mover accent="true"><mi>K</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w><mn>2</mn></msubsup><mo>,</mo> :MATH]
   4

   where ρ[m]ϵ[0, 1] is the shrinkage factor.

   Next, the shrinkage factor can be estimated by the quadratic loss
   function [[200]61, [201]62]:
   [MATH: <msub><mover accent="true"><mi>ρ</mi><mi
   mathvariant="normal">ˇ</mi></mover><mi>m</mi></msub><mo>=</mo><mfrac><m
   row><msubsup><mo>∑</mo><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T
   </mi></msubsup><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><mfrac><mrow><mi>R</mi><mfenced
   close=")"
   open="("><mi>t</mi></mfenced><mo>−</mo><mn>1</mn></mrow><mrow><mi>R</mi
   ><mo>−</mo><mi>T</mi></mrow></mfrac></mrow></mfenced><mo
   mathvariant="italic">var</mo><mfenced close=")"
   open="("><msubsup><mi>K</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow>
   <mn>2</mn></msubsup></mfenced></mrow><mrow><msubsup><mo>∑</mo><mrow><mi
   >t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></msubsup><msup><mfenced
   close=")"
   open="("><mrow><msubsup><mi>K</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi><
   /mrow><mn>2</mn></msubsup><mo>−</mo><msubsup><mover
   accent="true"><mi>K</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi
   mathvariant="normal">t</mi></mrow><mn>2</mn></msubsup></mrow></mfenced>
   <mn>2</mn></msup></mrow></mfrac><mo>,</mo> :MATH]
   5

   where
   [MATH: <mo>var</mo><mfenced close=")"
   open="("><msubsup><mi>K</mi><mrow><mi>m</mi><mo>.</mo><mi>t</mi></mrow>
   <mn>2</mn></msubsup></mfenced><mo>=</mo><mfrac><mrow><mi>R</mi><mfenced
   close=")" open="("><mi>t</mi></mfenced></mrow><msup><mfenced close=")"
   open="("><mrow><mi>R</mi><mfenced close=")"
   open="("><mi>t</mi></mfenced><mo>−</mo><mn>1</mn></mrow></mfenced><mn>3
   </mn></msup></mfrac><munderover><mo>∑</mo><mrow><mi>r</mi><mfenced
   close=")"
   open="("><mi>t</mi></mfenced><mo>=</mo><mn>1</mn></mrow><mrow><mi>R</mi
   ><mfenced close=")"
   open="("><mi>t</mi></mfenced></mrow></munderover><msup><mfenced
   close="]" open="["><mrow><msup><mfenced close=")"
   open="("><mrow><msubsup><mi>D</mi><mi
   mathvariant="italic">mG</mi><mrow><mi
   mathvariant="italic">tr</mi><mfenced close=")"
   open="("><mi>t</mi></mfenced></mrow></msubsup><mo>−</mo><msub><mover
   accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w></msub></mrow></mfenced><mn>2</mn></msup><mo>−</mo><msubsup><mi>K</mi
   ><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow><mn>2</mn></msubsup></mrow>
   </mfenced><mn>2</mn></msup> :MATH]
   .

   To restrict ρ[m] between 0 and 1, the final shrinkage factor is
   [MATH:
   <msubsup><mi>ρ</mi><mi>m</mi><mo>∗</mo></msubsup><mo>=</mo><mfenced
   close="" open="{"><mtable><mtr><mtd><mn>0</mn><mo>,</mo><mspace
   width="0.5em"></mspace><msub><mi>ρ</mi><mi>m</mi></msub><mo>≤</mo><mn>0
   </mn></mtd></mtr><mtr><mtd><mtable><mtr><mtd><mspace
   width="0.25em"></mspace><msub><mi>ρ</mi><mi>m</mi></msub><mo>,</mo></mt
   d><mtd><mn>0</mn><mo><</mo><msub><mi>ρ</mi><mi>m</mi></msub><mo><</mo><
   mn>1</mn></mtd></mtr></mtable></mtd></mtr><mtr><mtd><mtable><mtr><mtd><
   mn>1</mn><mo>,</mo></mtd><mtd><msub><mi>ρ</mi><mi>m</mi></msub><mo>≥</m
   o><mn>1</mn></mtd></mtr></mtable></mtd></mtr></mtable></mfenced><mo>.</
   mo> :MATH]
   6

   Substituting
   [MATH: <msubsup><mi>ρ</mi><mi>m</mi><mo>∗</mo></msubsup> :MATH]
   into eq. ([202]4), the balanced estimate is
   [MATH:
   <msubsup><mi>W</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow><mo>∗</mo
   ></msubsup><mo>=</mo><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><msubsup><mi>ρ</mi><mi>m</mi><mo>∗</
   mo></msubsup></mrow></mfenced><msubsup><mi>K</mi><mrow><mi>m</mi><mo>,<
   /mo><mi>t</mi></mrow><mn>2</mn></msubsup><mo>+</mo><msubsup><mi>ρ</mi><
   mi>m</mi><mo>∗</mo></msubsup><msubsup><mover
   accent="true"><mi>K</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w><mn>2</mn></msubsup> :MATH]
   7

   Then the error between the mean of experiment
   [MATH: <msub><mover accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w></msub> :MATH]
   and the corresponding true expression value can be measured by means of
   the shrinkage error
   [MATH:
   <msub><mi>ψ</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow></msub><mo>=
   </mo><msqrt><mfrac><msubsup><mi>W</mi><mrow><mi>m</mi><mo>,</mo><mi>t</
   mi></mrow><mo>∗</mo></msubsup><mrow><mi>R</mi><mfenced close=")"
   open="("><mi>t</mi></mfenced></mrow></mfrac></msqrt><mo>.</mo> :MATH]
   8

   Apparently, by introducing the parameter R(t) denoting the number of
   replicates for tissue t, different numbers of replicates are allowed
   for different experiments. If the number of replicates is the same for
   all experiments, then
   [MATH:
   <msub><mi>ψ</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow></msub><mo>=
   </mo><mi>ε</mi><msubsup><mi>W</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi><
   /mrow><mo>∗</mo></msubsup> :MATH]
   , where ε=1/R(t).

   Substituting the mean square error
   [MATH:
   <msubsup><mi>K</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow><mn>2</mn
   ></msubsup> :MATH]
   with the mean of the shrinkage error ψ[m, t] in eq. ([203]2), the
   shrinkage error- weighted average expression of poly(A) site m in
   experiment t is:
   [MATH: <msubsup><mover accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mi>m</mi><mtext
   mathvariant="italic">SCCA</mtext></msubsup><mo>=</mo><msubsup><mo>∑</mo
   ><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></msubsup><mfrac>
   <msub><mover accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w></msub><msubsup><mi>ψ</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow>
   <mn>2</mn></msubsup></mfrac><mo>/</mo><msubsup><mo>∑</mo><mrow><mi>t</m
   i><mo>=</mo><mn>1</mn></mrow><mi>T</mi></msubsup><mfrac><mn>1</mn><msub
   sup><mi>ψ</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow><mn>2</mn></ms
   ubsup></mfrac><mo>.</mo> :MATH]
   9

   Therefore, the correlation coefficient of the m^th and n^th poly(A)
   site is [[204]63]:
   [MATH: <msub><mi>λ</mi><mi
   mathvariant="italic">mn</mi></msub><mo>=</mo><mfrac><mrow><msubsup><mo>
   ∑</mo><mrow><mi>t</mi><mo>=</mo><mn>1</mn></mrow><mi>T</mi></msubsup><m
   frac><mfenced close=")" open="("><mrow><msub><mover
   accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w></msub><mo>−</mo><msubsup><mover accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mi>m</mi><mtext
   mathvariant="italic">SCCA</mtext></msubsup></mrow></mfenced><msub><mi
   mathvariant="normal">ψ</mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow><
   /msub></mfrac><mspace width="0.25em"></mspace><mfrac><mfenced close=")"
   open="("><mrow><msub><mover accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>n</mi><mo>,</mo><mi>t</mi></mro
   w></msub><mo>−</mo><msubsup><mover accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mi>n</mi><mtext
   mathvariant="italic">SCCA</mtext></msubsup></mrow></mfenced><msub><mi
   mathvariant="normal">ψ</mi><mrow><mi>n</mi><mo>,</mo><mi>t</mi></mrow><
   /msub></mfrac></mrow><msqrt><mrow><msubsup><mo>∑</mo><mrow><mi>t</mi><m
   o>=</mo><mn>1</mn></mrow><mi>T</mi></msubsup><msup><mfenced close=")"
   open="("><mfrac><mfenced close=")" open="("><mrow><msub><mover
   accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mro
   w></msub><mo>−</mo><msubsup><mover accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mi>m</mi><mtext
   mathvariant="italic">SCCA</mtext></msubsup></mrow></mfenced><msub><mi>ψ
   </mi><mrow><mi>m</mi><mo>,</mo><mi>t</mi></mrow></msub></mfrac></mfence
   d><mn>2</mn></msup><msubsup><mo>∑</mo><mrow><mi>t</mi><mo>=</mo><mn>1</
   mn></mrow><mi>T</mi></msubsup><msup><mfenced close=")"
   open="("><mfrac><mfenced close=")" open="("><mrow><msub><mover
   accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mrow><mi>n</mi><mo>,</mo><mi>t</mi></mro
   w></msub><mo>−</mo><msubsup><mover accent="true"><mi>E</mi><mo
   stretchy="true">¯</mo></mover><mi>n</mi><mtext
   mathvariant="italic">SCCA</mtext></msubsup></mrow></mfenced><msub><mi>ψ
   </mi><mrow><mi>n</mi><mo>,</mo><mi>t</mi></mrow></msub></mfrac></mfence
   d><mn>2</mn></msup></mrow></msqrt></mfrac> :MATH]
   10

   Then correlation coefficient matrices of poly(A) sites within specific
   gene(s) are:
   [MATH: <mspace width="0.25em"></mspace><mtext>corr</mtext><mfenced
   close=")" open="("
   separators=","><mi>P</mi><mi>Q</mi></mfenced><mo>=</mo><mfenced
   close="]" open="["><mtable><mtr><mtd><msub><mo>∑</mo><mi
   mathvariant="italic">PP</mi></msub><mspace
   width="0.25em"></mspace><msub><mo>∑</mo><mi
   mathvariant="italic">PQ</mi></msub><mspace
   width="0.25em"></mspace></mtd></mtr><mtr><mtd><mspace
   width="0.25em"></mspace><msub><mo>∑</mo><mi
   mathvariant="italic">QP</mi></msub><mspace
   width="0.25em"></mspace><msub><mo>∑</mo><mi
   mathvariant="italic">QQ</mi></msub><mspace
   width="0.25em"></mspace></mtd></mtr></mtable></mfenced><mo>.</mo>
   :MATH]
   11
   [MATH: <msub><mo>∑</mo><mi>PP</mi></msub><mo>=</mo><mfenced close="]"
   open="["><mtable><mtr><mtd><msub><mi>λ</mi><mn>11</mn></msub></mtd><mtd
   ><mo>⋯</mo></mtd><mtd><msub><mi>λ</mi><mrow><mn>1</mn><mi>m</mi></mrow>
   </msub></mtd></mtr><mtr><mtd><mo>⋮</mo></mtd><mtd><mo>⋱</mo></mtd><mtd>
   <mo>⋮</mo></mtd></mtr><mtr><mtd><msub><mi>λ</mi><mrow><mi>m</mi><mn>1</
   mn></mrow></msub></mtd><mtd><mo>⋯</mo></mtd><mtd><msub><mi>λ</mi><mi
   mathvariant="italic">mm</mi></msub></mtd></mtr></mtable></mfenced><mo>,
   </mo><msub><mo>∑</mo><mi
   mathvariant="italic">QQ</mi></msub><mo>=</mo><mfenced close="]"
   open="["><mtable><mtr><mtd><msub><mi>λ</mi><mn>11</mn></msub></mtd><mtd
   ><mo>⋯</mo></mtd><mtd><msub><mi>λ</mi><mrow><mn>1</mn><mi>n</mi></mrow>
   </msub></mtd></mtr><mtr><mtd><mo>⋮</mo></mtd><mtd><mo>⋱</mo></mtd><mtd>
   <mo>⋮</mo></mtd></mtr><mtr><mtd><msub><mi>λ</mi><mrow><mi>n</mi><mn>1</
   mn></mrow></msub></mtd><mtd><mo>⋯</mo></mtd><mtd><msub><mi>λ</mi><mi
   mathvariant="italic">nn</mi></msub></mtd></mtr></mtable></mfenced><mo>,
   </mo> :MATH]
   [MATH: <msub><mo>∑</mo><mi
   mathvariant="italic">PQ</mi></msub><mo>=</mo><mfenced close="]"
   open="["><mtable><mtr><mtd><msub><mi>λ</mi><mn>11</mn></msub></mtd><mtd
   ><mo>⋯</mo></mtd><mtd><msub><mi>λ</mi><mrow><mn>1</mn><mi>n</mi></mrow>
   </msub></mtd></mtr><mtr><mtd><mo>⋮</mo></mtd><mtd><mo>⋱</mo></mtd><mtd>
   <mo>⋮</mo></mtd></mtr><mtr><mtd><msub><mi>λ</mi><mrow><mi>m</mi><mn>1</
   mn></mrow></msub></mtd><mtd><mo>⋯</mo></mtd><mtd><msub><mi>λ</mi><mi
   mathvariant="italic">mn</mi></msub></mtd></mtr></mtable></mfenced><mo>,
   </mo><msub><mo>∑</mo><mi
   mathvariant="italic">QP</mi></msub><mo>=</mo><mfenced close="]"
   open="["><mtable><mtr><mtd><msub><mi>λ</mi><mn>11</mn></msub></mtd><mtd
   ><mo>⋯</mo></mtd><mtd><msub><mi>λ</mi><mrow><mn>1</mn><mi>m</mi></mrow>
   </msub></mtd></mtr><mtr><mtd><mo>⋮</mo></mtd><mtd><mo>⋱</mo></mtd><mtd>
   <mo>⋮</mo></mtd></mtr><mtr><mtd><msub><mi>λ</mi><mrow><mi>n</mi><mn>1</
   mn></mrow></msub></mtd><mtd><mo>⋯</mo></mtd><mtd><msub><mi>λ</mi><mi
   mathvariant="italic">nm</mi></msub></mtd></mtr></mtable></mfenced><mo>,
   </mo> :MATH]

   The Lagrange multiplier method can be used to solve problem ([205]1)
   [[206]35, [207]64], then
   [MATH: <mtable displaystyle="true"><mtr><mtd><mfenced close=")"
   open="("><mrow><msub><mo>∑</mo><mi
   mathvariant="italic">QP</mi></msub><msubsup><mo>∑</mo><mi
   mathvariant="italic">PP</mi><mrow><mo>−</mo><mn>1</mn></mrow></msubsup>
   <msub><mo>∑</mo><mi
   mathvariant="italic">PQ</mi></msub><mo>−</mo><msup><mi>ξ</mi><mn>2</mn>
   </msup><msub><mo>∑</mo><mi
   mathvariant="italic">QQ</mi></msub></mrow></mfenced><msub><mi
   mathvariant="bold-italic">D</mi><mi
   mathvariant="bold-italic">Q</mi></msub><mo>=</mo><mn>0</mn></mtd></mtr>
   <mtr><mtd><mfenced close=")" open="("><mrow><msub><mo>∑</mo><mi
   mathvariant="italic">PQ</mi></msub><msubsup><mo>∑</mo><mi
   mathvariant="italic">QQ</mi><mrow><mo>−</mo><mn>1</mn></mrow></msubsup>
   <msub><mo>∑</mo><mi
   mathvariant="italic">QP</mi></msub><mo>−</mo><msup><mi>ξ</mi><mn>2</mn>
   </msup><msub><mo>∑</mo><mi
   mathvariant="italic">PP</mi></msub></mrow></mfenced><msub><mi
   mathvariant="bold-italic">D</mi><mi
   mathvariant="bold-italic">P</mi></msub><mo>=</mo><mn>0</mn><mo>,</mo></
   mtd></mtr></mtable> :MATH]
   12

   where ξ^2 is the eigenvalue of matrix
   [MATH: <mi>P</mi><mo>=</mo><msubsup><mo>∑</mo><mi
   mathvariant="italic">PP</mi><mrow><mo>−</mo><mn>1</mn></mrow></msubsup>
   <msub><mo>∑</mo><mi
   mathvariant="italic">PQ</mi></msub><msubsup><mo>∑</mo><mi
   mathvariant="italic">QQ</mi><mrow><mo>−</mo><mn>1</mn></mrow></msubsup>
   <msub><mo>∑</mo><mi mathvariant="italic">QP</mi></msub> :MATH]
   . An alternative way to obtain the maximum value of ξ is to solve the
   matrix to get k positive eigenvalues
   [MATH:
   <msubsup><mi>ξ</mi><mn>1</mn><mn>2</mn></msubsup><mo>≥</mo><msubsup><mi
   >ξ</mi><mn>2</mn><mn>2</mn></msubsup><mo>≥</mo><mo>⋯</mo><mo>≥</mo><msu
   bsup><mi>ξ</mi><mi>k</mi><mn>2</mn></msubsup><mo>,</mo> :MATH]
   13

   where k = min {m, n}.

   ξ[1] is the first canonical correlation coefficient, which reflects the
   greatest degree of correlation. ξ[k] is the k^th canonical correlation
   coefficient. Next we test the significance of each canonical
   correlation coefficient by using a hypothetical test based on the
   maximum likelihood criterion [[208]65] to obtain statistically
   significant canonical correlation coefficients.
   [MATH: <mtable
   displaystyle="true"><mtr><mtd><msub><mi>H</mi><mn>0</mn></msub><mo>:</m
   o><msub><mi>ξ</mi><mn>1</mn></msub><mo>=</mo><msub><mi>ξ</mi><mn>2</mn>
   </msub><mo>=</mo><mo>⋯</mo><mo>=</mo><msub><mi>ξ</mi><mi>k</mi></msub><
   mo>=</mo><mn>0</mn></mtd></mtr><mtr><mtd><msub><mi>H</mi><mn>1</mn></ms
   ub><mo>:</mo><msub><mi>ξ</mi><mn>1</mn></msub><mo>≠</mo><mn>0</mn><mspa
   ce width="0.25em"></mspace><mtext>or</mtext><mspace
   width="0.25em"></mspace><msub><mi>ξ</mi><mn>2</mn></msub><mo>≠</mo><mn>
   0</mn><mspace
   width="0.25em"></mspace><mtext>or</mtext><mo>,</mo><mo>⋯</mo><mo>,</mo>
   <msub><mi>ξ</mi><mi>k</mi></msub><mo>≠</mo><mn>0</mn><mspace
   width="0.25em"></mspace><mtext>and</mtext><mspace
   width="0.25em"></mspace><msub><mi>ξ</mi><mn>1</mn></msub><mo>≥</mo><msu
   b><mi>ξ</mi><mn>2</mn></msub><mo>≥</mo><mo>⋯</mo><mo>≥</mo><msub><mi>ξ<
   /mi><mi>k</mi></msub><mo>></mo><mn>0</mn><mo>.</mo></mtd></mtr></mtable
   > :MATH]
   14

   Given a sufficiently large g, the statistic of likelihood ratio for the
   j^th canonical correlation coefficient (jϵ[0, k]) is
   [MATH:
   <msubsup><mi>χ</mi><mi>j</mi><mn>2</mn></msubsup><mo>=</mo><mo>−</mo><m
   fenced close="]"
   open="["><mrow><mi>g</mi><mo>−</mo><mfrac><mn>1</mn><mn>2</mn></mfrac><
   mfenced close=")"
   open="("><mrow><mi>m</mi><mo>+</mo><mi>n</mi></mrow></mfenced></mrow></
   mfenced><msubsup><mo>∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><
   mi>k</mi></msubsup><mo>log</mo><mfenced close=")"
   open="("><mrow><mn>1</mn><mo>−</mo><msubsup><mi>ξ</mi><mi>i</mi><mn>2</
   mn></msubsup></mrow></mfenced><mo>.</mo> :MATH]
   15

   The canonical correlation coefficients follow the χ^2-distribution with
   (m − j)(n − j) degrees of freedom.

   Given a confidence interval (e.g., α = 0.05), if j = 0 and the null
   hypothesis is accepted, then ξ[1] = 0 indicates that the two sets of
   variables are uncorrelated. If the null hypothesis is rejected, then it
   means that at least one of the canonical correlation coefficients is
   greater than 0, therefore the first pair of canonical variables is
   considered as significantly correlated. The hypothesis test is
   continued in the same way to verify whether the second canonical
   correlation coefficient is significant or not. This process is repeated
   until all non-zero canonical correlation coefficients are found.

   For each pair of gene, at most k non-zero canonical correlation
   coefficients can be obtained. Although the first canonical correlation
   coefficient reflects the greatest degree of correlation between the two
   genes, solely using the first coefficient will neglect contributions of
   other canonical correlation coefficients. Here we used p-values of all
   non-zero coefficients from the hypothetical test to obtain the weight
   for each pair of gene that quantifies the degree of correlation
   [[209]35].
   [MATH:
   <mi>w</mi><mo>=</mo><mfrac><mrow><msubsup><mo>∑</mo><mn>1</mn><mi>k</mi
   ></msubsup><msub><mi>ξ</mi><mi>k</mi></msub><mi>L</mi><mfenced
   close=")"
   open="("><mrow><mo>log</mo><msub><mi>P</mi><mi>k</mi></msub></mrow></mf
   enced></mrow><mrow><msubsup><mo>∑</mo><mn>1</mn><mi>k</mi></msubsup><mi
   >L</mi><mfenced close=")"
   open="("><mrow><mo>log</mo><msub><mi>P</mi><mi>k</mi></msub></mrow></mf
   enced></mrow></mfrac><mspace width="0.25em"></mspace> :MATH]
   16

   where
   [MATH: <mi>L</mi><mfenced close=")"
   open="("><mrow><mo>log</mo><mi>P</mi></mrow></mfenced><mo>=</mo><mfence
   d close=""
   open="{"><mtable><mtr><mtd><mn>0</mn><mo>,</mo><mi>P</mi><mo>></mo><mn>
   0.05</mn></mtd></mtr><mtr><mtd><mo>−</mo><mo>log</mo><mi>P</mi><mo>,</m
   o><mi>P</mi><mo>≤</mo><mn>0.05</mn></mtd></mtr></mtable></mfenced>
   :MATH]
   . P[k] is the p-value from the hypothetical test for the k^th
   correlation coefficient.

Cluster analysis and network inference

   Weights of all gene pairs obtained from PASCCA are first transformed to
   a similarity matrix, then the matrix is further used for clustering and
   network inference. In this study, we adopted the widely-used clustering
   method, hierarchical clustering, to cluster genes, which was
   implemented by the R function hclust with default parameters. WGCNA
   (v1.51) [[210]39] was used to infer network modules (parameters:
   softPower = 6; mergeCutHeight = 0.05, minModuleSize = 30). Various
   metrics were employed to evaluate the clusters and modules obtained
   from different methods.

Implementation of PASCCA

   PASCCA is available as an R package, which is available for download
   via [211]https://github.com/BMILAB/PASCCA. Computations in this study
   were carried out on a desktop computer with configuration “Intel(R)
   Core(TM) i5-4460T CPU @ 1.90GHz, and 8G RAM”. For practical application
   purpose for large scale data, we have leveraged the MPI (Message
   Passing Interface) framework to run PASCCA in parallel across many
   cores and nodes, which could drastically reduce the computing time.
   This package allows users to quantify associations between genes
   with/without repeated measurements using their own poly(A) site data
   and conduct downstream cluster analysis and network inference to
   explore important APA specific biological mechanism.

Additional files

   [212]Additional file 1:^ (206.4KB, pptx)

   Supplemental Figures. This file contains all the Supplemental Figures.
   (PPTX 206 kb)
   [213]Additional file 2:^ (107.5KB, xlsx)

   Hub genes for all gene modules obtained from PASCCA (XLSX 107 kb)

Acknowledgements