Abstract

Background

   Estrogen is a chemical messenger that has an influence on many breast
   cancers as it helps cells to grow and divide. These cancers are often
   known as estrogen responsive cancers in which estrogen receptor
   occupies the surface of the cells. The successful treatment of breast
   cancers requires understanding gene expression, identifying of tumor
   markers, acquiring knowledge of cellular pathways, etc. In this paper
   we introduce our proposed triclustering algorithm δ-TRIMAX that aims to
   find genes that are coexpressed over subset of samples across a subset
   of time points. Here we introduce a novel mean-squared residue for such
   3D dataset. Our proposed algorithm yields triclusters that have a
   mean-squared residue score below a threshold δ.

Results

   We have applied our algorithm on one simulated dataset and one
   real-life dataset. The real-life dataset is a time-series dataset in
   estrogen induced breast cancer cell line. To establish the biological
   significance of genes belonging to resultant triclusters we have
   performed gene ontology, KEGG pathway and transcription factor binding
   site enrichment analysis. Additionally, we represent each resultant
   tricluster by computing its eigengene and verify whether its eigengene
   is also differentially expressed at early, middle and late estrogen
   responsive stages. We also identified hub-genes for each resultant
   triclusters and verified whether the hub-genes are found to be
   associated with breast cancer. Through our analysis CCL2, CD47, NFIB,
   BRD4, HPGD, CSNK1E, NPC1L1, PTEN, PTPN2 and ADAM9 are identified as
   hub-genes which are already known to be associated with breast cancer.
   The other genes that have also been identified as hub-genes might be
   associated with breast cancer or estrogen responsive elements. The TFBS
   enrichment analysis also reveals that transcription factor POU2F1 binds
   to the promoter region of ESR1 that encodes estrogen receptor α.
   Transcription factor E2F1 binds to the promoter regions of coexpressed
   genes MCM7, ANAPC1 and WEE1.

Conclusions

   Thus our integrative approach provides insights into breast cancer
   prognosis.

   Keywords: Time series gene expression data, Tricluster, Mean-squared
   residue, Eigengene, Affirmation score, Gene ontology, KEGG pathway,
   TRANSFAC

Background

   In the context of genomics research, the functional approach is based
   on the ability to analyze genome-wide patterns of gene expression and
   the mechanisms by which gene expression is coordinated. Microarray
   technology and other high-throughput methods are used to measure
   expression values of thousands of genes over different
   samples/experimental conditions. In recent years the microarray
   technology has been used to measure in a single experiment expression
   values of thousands of genes under a huge variety of experimental
   conditions across different time points. This kind of datasets can be
   referred to as time series microarray datasets. Because of the large
   data volume, computational methods are used to analyze such datasets.
   Clustering is one of the most common methods for identifying
   coexpressed genes [[34]1]. This kind of analysis is facilitative for
   constructing gene regulatory networks in which single or groups of
   genes interact with other genes. Besides this, coexpression analysis
   also reveals information about some unknown genes that form a cluster
   with some known genes.

   A clustering algorithm is used to group genes that are coexpressed over
   all conditions/samples or to group experimental conditions over all
   genes based on some similarity/dissimilarity metric. However clustering
   may fail to find the group of genes that are similarly expressed over a
   subset of samples/experimental conditions i.e. clustering algorithms
   are unable to find such local patterns in the gene expression dataset.
   To deal with that problem, biclustering algorithms are used. A
   bicluster can be defined as a subset of genes that are coexpressed over
   a subset of samples/experimental conditions. The first biclustering
   algorithm that was used to analyse gene expression datasets was
   proposed by Cheng and Church and they used a greedy search heuristic
   approach to retrieve largest possible bicluster having mean squared
   residue (MSR) under a predefined threshold value δ (δ-bicluster)
   [[35]2]. But nowadays, biologists are eager to analyze 3D microarray
   dataset to answer the question: “Which genes are coexpressed under
   which subset of experimental conditions/samples across which subset of
   time points?” Biclustering is not able to deal with such 3D datasets.
   So, in this case we need some other clustering technique that can mine
   3D datasets. Hence the term Triclustering has been defined and a
   tricluster can be delineated as a subset of genes that are similarly
   expressed across a subset of experimental conditions/samples over a
   subset of time points. Zhao and Zaki proposed a triclustering algorithm
   TRICLUSTER that is based on graph-based approach. They defined
   coherence of a tricluster as
   [MATH: <mfrac><mrow><mtext
   mathvariant="italic">max</mtext><mo>(</mo><msub><mrow><mi>e</mi></mrow>
   <mrow><mtext
   mathvariant="italic">ib</mtext></mrow></msub><mo>/</mo><msub><mrow><mi>
   e</mi></mrow><mrow><mtext
   mathvariant="italic">ia</mtext></mrow></msub><mo>,</mo><msub><mrow><mi>
   e</mi></mrow><mrow><mtext
   mathvariant="italic">jb</mtext></mrow></msub><mo>/</mo><msub><mrow><mi>
   e</mi></mrow><mrow><mtext
   mathvariant="italic">ja</mtext></mrow></msub><mo>)</mo></mrow><mrow><mt
   ext
   mathvariant="italic">min</mtext><mo>(</mo><msub><mrow><mi>e</mi></mrow>
   <mrow><mtext
   mathvariant="italic">ib</mtext></mrow></msub><mo>/</mo><msub><mrow><mi>
   e</mi></mrow><mrow><mtext
   mathvariant="italic">ia</mtext></mrow></msub><mo>,</mo><msub><mrow><mi>
   e</mi></mrow><mrow><mtext
   mathvariant="italic">jb</mtext></mrow></msub><mo>/</mo><msub><mrow><mi>
   e</mi></mrow><mrow><mtext
   mathvariant="italic">ja</mtext></mrow></msub><mo>)</mo></mrow></mfrac><
   mo>-</mo><mn>1</mn> :MATH]
   , where e[ia],e[ib] denote the expression values of two columns a and b
   respectively for a row i. A tricluster is valid if it has a ratio below
   a maximum ratio threshold ϵ[[36]3].

   Here we introduce an efficient triclustering algorithm δ-TRIMAX[[37]4]
   that aims to cope with noisy 3D gene expression dataset and is less
   sensitive to input parameters. The normalization method does not
   influence the performance of our algorithm, as it produces the same
   results for both normalized and raw datasets. Here we propose a novel
   extension of MSR [[38]2] for 3D gene expression data and use a greedy
   search heuristic approach to retrieve triclusters, having MSR values
   below a threshold δ. Hence the triclusters can be defined as
   δ-tricluster.

   In this work we have applied our proposed δ-TRIMAX algorithm on a
   time-series gene expression data in estrogen induced breast cancer
   cell. Estrogen, a chemical messenger plays an instrumental role in
   normal sexual development, regulating woman’s menstrual cycles and
   normal development of the breast. Estrogen is also needed for heart and
   healthy bones. As estrogen plays vital role in stimulating breast cell
   division, has an effect on other hormones implicated in breast cell
   division and provides support to the growth of estrogen-responsive
   tumors, it may be involved in risk for breast cancer [[39]5]. Though
   since last decade, some research has been done to decipher some unknown
   questions on breast cancer risk, still some questions such as
   involvement of genes in breast cancer risk etc. remain unanswered. Here
   our coexpression analysis reveals some genes that have already been
   found to be associated with estrogen induced breast cancer and some
   other genes that might play an important role in this context.
   Additionally, our coregulation analysis brings out some important
   information such as which transcription factor binds the promoter
   regions of genes and play an important role in this context.

   In section 2, we have described our proposed triclustering algorithm in
   detail. Section 3 shows results of our algorithm using one artificial
   dataset and one real-life dataset. In section 4, we conclude our work.

Methods

Definitions

   Definition 1 (Time Series Microarray Gene Expression Dataset). We can
   model a time series microarray gene expression dataset (D) as a G × C ×
   T matrix and each element of D (d [ijk]) corresponds to the expression
   value of gene i over jth sample/experimental condition across time
   point k, where i ∈ (g[1],g[2],...,g [G]), j ∈ (c[1],c[2],...,c [C]) and
   k ∈ (t[1],t[2],...,t [T]).

   Definition 2 (Tricluster). A tricluster is defined as a submatrix
   M(I,J,K) = [m [ijk]], where i ∈ I, j ∈ J and k ∈ K. The submatrix M
   represents a subset of genes (I) that are coexpressed over a subset of
   conditions (J) across a subset of time points (K).

   Definition 3 (Perfect Shifting Tricluster). A Tricluster M(I,J,K) = m
   [ijk], where i ∈ I, j ∈ J and k ∈ K, is called a perfect shifting
   tricluster if each element of the submatrix M is represented as: m
   [ijk]= Γ + α[i] + β[j] + η[k], where Γ is a constant value for the
   tricluster, α[i], β[j] and η[k] are shifting factors of ith gene, jth
   samples/experimental condition and kth time point, respectively. As the
   noise is present in microarray datasets, the deviation from actual
   value and expected value of each element in the dataset also exists.
   For this deviation, every tricluster is not a perfect one.

   Cheng and Church proposed an algorithm for retrieving large and maximal
   biclusters that have mean squared residue score (MSR) below a threshold
   δ in 2D microarray gene expression dataset. They also showed that MSR
   of a perfect δ-bicluster and perfect shifting bicluster is zero (S = δ
   = 0) [[40]2,[41]6]. Now extending this idea, here we present a novel
   definition of Mean Squared Residue score for 3D microarray gene
   expression datasets. The MSR (S) of a perfect shifting tricluster
   becomes also zero, where each element m[ijk] = Γ + α[i] + β[j] + η[k].
   For delineating new MSR score (S), at first we need to define the
   residue score:

   Let the mean of ith gene (m [ijk]):
   [MATH: <msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>=</mo><mfrac><mrow><m
   n>1</mn></mrow><mrow><mo>|</mo><mi>J</mi><mo>|</mo><mo>|</mo><mi>K</mi>
   <mo>|</mo></mrow></mfrac><msub><mrow><mi
   mathsize="big">∑</mi></mrow><mrow><mi>j</mi><mo>∈</mo><mi>J</mi><mo>,</
   mo><mi>k</mi><mo>∈</mo><mi>K</mi></mrow></msub><mspace
   width="0.3em"></mspace><msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub> :MATH]
   , the mean of jth sample/experimental condition (m [ijk]):
   [MATH: <msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>=</mo><mfrac><mrow><m
   n>1</mn></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|</mo><mi>K</mi>
   <mo>|</mo></mrow></mfrac><msub><mrow><mi
   mathsize="big">∑</mi></mrow><mrow><mi>i</mi><mo>∈</mo><mi>I</mi><mo>,</
   mo><mi>k</mi><mo>∈</mo><mi>K</mi></mrow></msub><msub><mrow><mi>m</mi></
   mrow><mrow><mtext mathvariant="italic">ijk</mtext></mrow></msub> :MATH]
   , the mean of kth time point (m [ijk]):
   [MATH: <msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>=</mo><mfrac><mrow><m
   n>1</mn></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|</mo><mi>J</mi>
   <mo>|</mo></mrow></mfrac><msub><mrow><mi
   mathsize="big">∑</mi></mrow><mrow><mi>i</mi><mo>∈</mo><mi>I</mi><mo>,</
   mo><mi>j</mi><mo>∈</mo><mi>J</mi></mrow></msub><msub><mrow><mi>m</mi></
   mrow><mrow><mtext mathvariant="italic">ijk</mtext></mrow></msub> :MATH]
   , and the mean of tricluster (m [ijk]):
   [MATH: <msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>=</mo><mfrac><mrow><m
   n>1</mn></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|</mo><mi>J</mi>
   <mo>|</mo><mo>|</mo><mi>K</mi><mo>|</mo></mrow></mfrac><msub><mrow><mi
   mathsize="big">∑</mi></mrow><mrow><mi>i</mi><mo>∈</mo><mi>I</mi><mo>,</
   mo><mi>j</mi><mo>∈</mo><mi>J</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</m
   i></mrow></msub><msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub> :MATH]
   . Now the mean of the tricluster can be considered as the value of
   constant i.e. Γ=m[IJK]. We can define the shifting factor for the ith
   gene (α[i]) as the difference between m [ijk] and m [ijk] i.e.
   α[i]=m[iJK]-m[IJK]. Similarly, we can define shifting factor for the
   jth condition (β[j]) as β[j]=m[IjK]-m[IJK] and shifting factor for the
   kth time point (η[k]) can be defined as η[k]=m[IJk]-m[IJK]. Hence we
   can define each element of a perfect shifting tricluster as
   m[ijk]=Γ+α[i]+β[j]+η[k]=m[IJK]+(m[iJK]-m[IJK])+(m[IjK]-m[IJK])+(m[IJk]-
   m[IJK])=(m[iJK]+m[IjK]+m[IJk]-2m[IJK]). But usually noise is evident in
   microarray gene expression dataset. Therefore to evaluate the
   difference between the actual value of an element (m [ijk]) and its
   expected value, obtained from above equation, the term “residue” can be
   used [[42]6]. Thus the residue of a tricluster (r [ijk]) can be defined
   as follows:
   r[ijk]=m[ijk]-(m[iJK]+m[IjK]+m[IJk]-2m[IJK])=(m[ijk]-m[iJK]-m[IjK]-m[IJ
   k]+2m[IJK]).

   Definition 4 (Mean Squared Residue). We define the term Mean Squared
   Residue MSR(I,J,K) or S of a tricluster M(I,J,K) to estimate the
   quality of a tricluster i.e. the level of coherence among the elements
   of a tricluster as follows:
   [MATH: <mtable class="align" columnalign="left"><mtr><mtd
   class="align-1"><mspace width="-23.0pt"></mspace><mtext
   mathvariant="bold">S</mtext></mtd><mtd
   class="align-2"><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mo>|</mo
   ><mi>I</mi><mo>|</mo><mo>|</mo><mi>J</mi><mo>|</mo><mo>|</mo><mi>K</mi>
   <mo>|</mo></mrow></mfrac><munder><mrow><mi
   mathsize="big">∑</mi></mrow><mrow><mi>i</mi><mo>∈</mo><mi>I</mi><mo>,</
   mo><mi>j</mi><mo>∈</mo><mi>J</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</m
   i></mrow></munder><msubsup><mrow><mi>r</mi></mrow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow><mrow><mn>2</mn></mrow></msubsup
   ><mspace width="2em"></mspace></mtd></mtr><mtr><mtd
   class="align-1"></mtd><mtd
   class="align-2"><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mo>|</mo
   ><mi>I</mi><mo>|</mo><mo>|</mo><mi>J</mi><mo>|</mo><mo>|</mo><mi>K</mi>
   <mo>|</mo></mrow></mfrac><munder><mrow><mi
   mathsize="big">∑</mi></mrow><mrow><mi>i</mi><mo>∈</mo><mi>I</mi><mo>,</
   mo><mi>j</mi><mo>∈</mo><mi>J</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</m
   i></mrow></munder><mspace width="0.3em"></mspace><mspace
   width="0.3em"></mspace><mspace
   width="0.3em"></mspace><msup><mrow><mo>(</mo><msub><mrow><mi>m</mi></mr
   ow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>+</mo><mn>2</mn><msub
   ><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>)</mo></mrow><mrow><m
   n>2</mn></mrow></msup><mi>.</mi><mspace
   width="2em"></mspace></mtd></mtr></mtable> :MATH]
   (1)

   Lower residue score represents larger coherence and better quality of a
   tricluster.

Proposed method

   δ-TRIMAX aims to find largest and maximal triclusters in a 3D
   microarray gene expression dataset. It is an extension of Cheng and
   Church biclustering algorithm [[43]2] that deals with 2-D microarray
   datasets. In contrast, our algorithm is capable to mine 3D gene
   expression dataset. There is always a submatrix in an expression
   dataset that has a perfect MSR(I,J,K) or S score i.e. S = 0 and this
   submatrix is each element of the dataset. But as mentioned above, our
   algorithm finds maximal triclusters having S score under a threshold δ,
   hence we have used a greedy heuristic approach to find triclusters. Our
   algorithm therefore starts with the entire dataset containing all
   genes, all samples/experimental conditions and all time points.

Algorithm 1 (δ-TRIMAX)

   Input. D, a matrix that represents 3D microarray gene expression
   dataset, λ> 1, an input parameter for multiple node deletion algorithm,
   δ≥ 0, maximum allowable MSR score.

   Output. All possible δ-triclusters.

   Initialization. Missing elements in D ← random numbers, D’ ← D

   Repeat

   a. D’[1] ← Results of Algorithm 2 on D’ using delta and λ. If the no.
   of genes (conditions/samples and/or no. of time points) is 50 (This
   value can be choosen experimentally. Large value increases the
   execution time of the algorithm as it then executes more number of
   iterations.), then do not apply Algorithm 2 on genes
   (conditions/samples and/or time points).

   b. D’[2] ← Results of Algorithm 3 on D’[1] usingδ.

   c. D’[3] ← Results of Algorithm 4 on D’[2].

   d. Return D’[3] and replace the elements that exist in D’ and D’[3]
   with random numbers.

   Until(No gene is found for δ-tricluster)

   Initially, our algorithm removes genes or conditions or time points
   from the dataset to accomplish largest diminishing of score S; this
   step is described in the following section in which a node corresponds
   to a gene or experimental condition or time point in the 3D microarray
   gene expression dataset.

Algorithm 2 (Multiple node deletion)

   Input. D, a matrix of real numbers that represents 3D microarray gene
   expression dataset; δ≥ 0, maximum allowable MSR threshold, λ> 1,
   threshold for multiple node deletion. The value of λ has been set
   experimentally to optimize the speed and performance (to avoid falling
   into local optimum) of the algorithm.

   Output. M [ijk], a δ-tricluster, consisting of a subset(I) of genes, a
   subset(J) of samples/experimental conditions and a subset of time
   points, having MSR score (S) less than or equal to δ.

   Initialization. I ←{set of all genes }, J ←{set of all experimental
   conditions/ samples } and K ←{set of all time points } and to M(I,J,K)
   ← D

   Repeat

   Calculate m [ijk], ∀ i ∈ I; m [ijk], ∀ j ∈ J; m [ijk], ∀ k ∈ K; m [ijk]
   and S.

   If S ≤δ return M(I,J,K)

   Else

   Delete genes i ∈ I that satisfy the following inequality
   [MATH:
   <mfrac><mrow><mn>1</mn></mrow><mrow><mo>|</mo><mi>J</mi><mo>|</mo><mo>|
   </mo><mi>K</mi><mo>|</mo></mrow></mfrac><msub><mrow><mi>Σ</mi></mrow><m
   row><mi>j</mi><mo>∈</mo><mi>J</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</
   mi></mrow></msub><msup><mrow><mo>(</mo><msub><mrow><mi>m</mi></mrow><mr
   ow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>+</mo><mn>2</mn><msub
   ><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>)</mo></mrow><mrow><m
   n>2</mn></mrow></msup><mo>></mo><mi>λ</mi><mtext
   mathvariant="bold">S</mtext> :MATH]

   Recalculate m [ijk], ∀ i ∈ I; m [ijk], ∀ j ∈ J; m [ijk], ∀ k ∈ K; m
   [ijk] and S

   Delete samples/experimental conditions j ∈ J that satisfy the following
   inequality
   [MATH:
   <mfrac><mrow><mn>1</mn></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|
   </mo><mi>K</mi><mo>|</mo></mrow></mfrac><msub><mrow><mi>Σ</mi></mrow><m
   row><mi>i</mi><mo>∈</mo><mi>I</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</
   mi></mrow></msub><msup><mrow><mo>(</mo><msub><mrow><mi>m</mi></mrow><mr
   ow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>+</mo><mn>2</mn><msub
   ><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>)</mo></mrow><mrow><m
   n>2</mn></mrow></msup><mo>></mo><mi>λ</mi><mtext
   mathvariant="bold">S</mtext> :MATH]

   Recalculate m [ijk], ∀ i ∈ I; m [ijk], ∀ j ∈ J; m [ijk], ∀ k ∈ K; m
   [ijk] and S

   Delete time points k ∈ K that satisfy the following inequality
   [MATH:
   <mfrac><mrow><mn>1</mn></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|
   </mo><mi>J</mi><mo>|</mo></mrow></mfrac><msub><mrow><mi>Σ</mi></mrow><m
   row><mi>i</mi><mo>∈</mo><mi>I</mi><mo>,</mo><mi>j</mi><mo>∈</mo><mi>J</
   mi></mrow></msub><msup><mrow><mo>(</mo><msub><mrow><mi>m</mi></mrow><mr
   ow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>+</mo><mn>2</mn><msub
   ><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>)</mo></mrow><mrow><m
   n>2</mn></mrow></msup><mo>></mo><mi>λ</mi><mtext
   mathvariant="bold">S</mtext> :MATH]

   End if

   Until(There is no change in I, J and/or K)

   The complexity of this algorithm is O(max(m,n,p)) where m, n and p are
   the number of genes, samples and time points in the 3D microarray
   dataset.

   In the second step, we delete one node at each iteration from the
   resultant submatrix, produced by Algorithm 2, until the score S of the
   resultant submatrix is less than or equal to δ. This step results in a
   δ-tricluster.

Algorithm 3 (Single node deletion)

   Input. D, a matrix of real numbers that represents 3D microarray gene
   expression dataset; δ≥ 0, maximum allowable MSR threshold.

   Output. M [ijk], a δ-tricluster, consisting of a subset(I) of genes, a
   subset(J) of samples/experimental conditions and a subset of time
   points, having MSR score (S) less than or equal to δ.

   Initialization.I ←{set of all genes in D }, J ←{set of experimental
   conditions/samples in D } and K ←{set of time points in D } and to
   M(I,J,K) ← D

   Calculate m [ijk], ∀ i ∈ I; m [ijk], ∀ j ∈ J; m [ijk], ∀ k ∈ K; m [ijk]
   and S.

   While S >δ

   Detect gene i ∈ I that has the highest score
   [MATH:
   <mi>μ</mi><mo>(</mo><mi>i</mi><mo>)</mo><mo>=</mo><mfrac><mrow><mn>1</m
   n></mrow><mrow><mo>|</mo><mi>J</mi><mo>|</mo><mo>|</mo><mi>K</mi><mo>|<
   /mo></mrow></mfrac><msub><mrow><mi>Σ</mi></mrow><mrow><mi>j</mi><mo>∈</
   mo><mi>J</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</mi></mrow></msub><msu
   p><mrow><mo>(</mo><msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>+</mo><mn>2</mn><msub
   ><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>)</mo></mrow><mrow><m
   n>2</mn></mrow></msup> :MATH]

   Detect sample/experimental condition j ∈ J that has the highest score
   [MATH:
   <mi>μ</mi><mo>(</mo><mi>j</mi><mo>)</mo><mo>=</mo><mfrac><mrow><mn>1</m
   n></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|</mo><mi>K</mi><mo>|<
   /mo></mrow></mfrac><msub><mrow><mi>Σ</mi></mrow><mrow><mi>i</mi><mo>∈</
   mo><mi>I</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</mi></mrow></msub><msu
   p><mrow><mo>(</mo><msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>+</mo><mn>2</mn><msub
   ><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>)</mo></mrow><mrow><m
   n>2</mn></mrow></msup> :MATH]

   Detect time point k ∈ K that has the highest score
   [MATH:
   <mi>μ</mi><mo>(</mo><mi>k</mi><mo>)</mo><mo>=</mo><mfrac><mrow><mn>1</m
   n></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|</mo><mi>J</mi><mo>|<
   /mo></mrow></mfrac><msub><mrow><mi>Σ</mi></mrow><mrow><mi>i</mi><mo>∈</
   mo><mi>I</mi><mo>,</mo><mi>j</mi><mo>∈</mo><mi>J</mi></mrow></msub><msu
   p><mrow><mo>(</mo><msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>+</mo><mn>2</mn><msub
   ><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>)</mo></mrow><mrow><m
   n>2</mn></mrow></msup> :MATH]

   Delete gene or sample/experimental condition or time point that has
   highest μ score and modify I or J or K. Recalculate m [ijk], ∀ i ∈ I; m
   [ijk], ∀ j ∈ J; m [ijk], ∀ k ∈ K; m [ijk] and S.

   End while

   Return M(I,J,K)

   The complexity of first and second steps is O(mnp) as those will
   iterate (m+n+p) times. The complexity of selection of best genes,
   samples and time points is O(log m + log n + log p). So it is suggested
   to use algorithm II before algorithm 3.

   As the goal of our algorithm is to find maximal triclusters, having MSR
   score (S) below the threshold δ, the resultant tricluster M(I,J,K) may
   not be the largest one. That means some genes and/or experimental
   conditions/samples and/or time points may be added to the resultant
   tricluster T produced by node deletion algorithm, so that the MSR score
   of new tricluster T’ produced after node addition does not exceed the
   MSR score of T. Now the third step of our algorithm is described below.

Algorithm 4 (Node addition)

   Input. D, a matrix of real numbers that represents δ-tricluster, having
   a subset of genes (I), a subset of experimental conditions/samples (J)
   and a subset of time points (K).

   Output. M
   [MATH:
   <msub><mrow></mrow><mrow><msup><mrow><mi>I</mi></mrow><mrow><mo>′</mo><
   /mrow></msup><msup><mrow><mi>J</mi></mrow><mrow><mo>′</mo></mrow></msup
   ><msup><mrow><mi>K</mi></mrow><mrow><mo>′</mo></mrow></msup></mrow></ms
   ub> :MATH]
   , a δ-tricluster, consisting of a subset of genes (I’), a subset of
   samples/experimental conditions (J’) and a subset of time points (K’),
   such that I ⊂ I ^′, J ⊂ J ^′, K ⊂ K’ and MSR(I ^′,J ^′,K’) ≤ MSR of D.

   Initialization. M(I,J,K) ← D

   Repeat

   Calculate m [ijk], ∀ i; m [ijk], ∀ j; m [ijk], ∀ k; m [ijk] and S. Add
   genes i ∉ I that satisfy the following inequality
   [MATH:
   <mfrac><mrow><mn>1</mn></mrow><mrow><mo>|</mo><mi>J</mi><mo>|</mo><mo>|
   </mo><mi>K</mi><mo>|</mo></mrow></mfrac><msub><mrow><mi>Σ</mi></mrow><m
   row><mi>j</mi><mo>∈</mo><mi>J</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</
   mi></mrow></msub><msup><mrow><mo>(</mo><msub><mrow><mi>m</mi></mrow><mr
   ow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>+</mo><mn>2</mn><msub
   ><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>)</mo></mrow><mrow><m
   n>2</mn></mrow></msup><mo>≤</mo><mtext mathvariant="bold">S</mtext>
   :MATH]

   Recalculate m [ijk], ∀ j; m [ijk], ∀; m [ijk] and S

   Add samples/experimental conditions j ∉ J that satisfy the following
   inequality
   [MATH:
   <mfrac><mrow><mn>1</mn></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|
   </mo><mi>K</mi><mo>|</mo></mrow></mfrac><msub><mrow><mi>Σ</mi></mrow><m
   row><mi>i</mi><mo>∈</mo><mi>I</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</
   mi></mrow></msub><msup><mrow><mo>(</mo><msub><mrow><mi>m</mi></mrow><mr
   ow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>+</mo><mn>2</mn><msub
   ><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>)</mo></mrow><mrow><m
   n>2</mn></mrow></msup><mo>≤</mo><mtext mathvariant="bold">S</mtext>
   :MATH]

   Recalculate m [ijk], ∀ i; m [ijk], ∀ k; m [ijk] and S

   Add time points k ∉ K that satisfy the following inequality
   [MATH:
   <mfrac><mrow><mn>1</mn></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|
   </mo><mi>J</mi><mo>|</mo></mrow></mfrac><msub><mrow><mi>Σ</mi></mrow><m
   row><mi>i</mi><mo>∈</mo><mi>I</mi><mo>,</mo><mi>j</mi><mo>∈</mo><mi>J</
   mi></mrow></msub><msup><mrow><mo>(</mo><msub><mrow><mi>m</mi></mrow><mr
   ow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>-</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>+</mo><mn>2</mn><msub
   ><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>)</mo></mrow><mrow><m
   n>2</mn></mrow></msup><mo>≤</mo><mtext mathvariant="bold">S</mtext>
   :MATH]

   Until(There is no change in I, J and/or K)

   I’ ← I, J’ ← J, K’ ← K Return I’, J’, K’

   The complexity of this algorithm is O(mnp) as each step iterates
   (m+n+p) times.

Tricluster eigengene

   To find tricluster eigengene we applied singular value decomposition
   method (SVD) on the expression data of each tricluster [[44]7]. For
   instance,
   [MATH:
   <msubsup><mrow><mi>X</mi></mrow><mrow><mi>g</mi><mo>×</mo><mo>(</mo><mi
   >c</mi><mo>∗</mo><mi>t</mi><mo>)</mo></mrow><mrow><mi>i</mi></mrow></ms
   ubsup> :MATH]
   represents the expression matrix of ith tricluster, where g, c and t
   represent the number of genes, samples and time points of ith
   tricluster. Now we apply SVD on the data matrix (normalized to mean=0
   and variance=1). Now, the SVD of ith tricluster can be represented as,
   [MATH:
   <msup><mrow><mi>X</mi></mrow><mrow><mi>i</mi></mrow></msup><mo>=</mo><m
   text
   mathvariant="italic">UD</mtext><msup><mrow><mi>V</mi></mrow><mrow><mi>T
   </mi></mrow></msup><mo>,</mo> :MATH]
   (2)

   where U and V are the orthogonal matrices. U^i is a g ∗ (c ∗ t) matrix
   with orthonormal columns, V^i is a (c ∗ t) × (c ∗ t) orthogonal matrix
   and D^i is (c ∗ t) × (c ∗ t) diagonal matrix of singular values.

   Assuming that singular values in matrix D^i are arranged in
   non-decreasing order, we can represent eigengene of ith tricluster by
   the first column of matrix V^i, i.e.
   [MATH:
   <msup><mrow><mi>E</mi></mrow><mrow><mi>i</mi></mrow></msup><mo>=</mo><m
   subsup><mrow><mi>V</mi></mrow><mrow><mn>1</mn></mrow><mrow><mi>i</mi></
   mrow></msubsup><mo>,</mo> :MATH]
   (3)

Results and discussion

Results on simulated dataset

   We have produced one simulated dataset SMD of size 2000 × 30 × 30. At
   first we have implanted three perfect shifting triclusters of size 100
   × 6 × 6, 80 × 6 × 6 and 60 × 5 × 5 into the dataset SMD and then
   implanted three noisy shifting triclusters of the same size mentioned
   before into it. To estimate the degree of similarity between the
   implanted and obtained triclusters, we define affirmation score in the
   same way as Prelic et. al. defined for two sets of biclusters
   [[45]6,[46]8]. So, overall average affirmation score of T[1] with
   respect to T[2] is as follows, where (SM
   [MATH:
   <msubsup><mrow></mrow><mrow><mi>G</mi></mrow><mrow><mo>∗</mo></mrow></m
   subsup> :MATH]
   (T[1], T[2])) is the average gene affirmation score, (SM
   [MATH:
   <msubsup><mrow></mrow><mrow><mi>C</mi></mrow><mrow><mo>∗</mo></mrow></m
   subsup> :MATH]
   (T[1], T[2])) is the average sample affirmation score and (SM
   [MATH:
   <msubsup><mrow></mrow><mrow><mi>K</mi></mrow><mrow><mo>∗</mo></mrow></m
   subsup> :MATH]
   (T[1], T[2])) is the average time point affirmation score of T[1] with
   respect to T[2]:
   [MATH: <mspace width="-14.0pt"></mspace><mtable><mtr><mtd
   columnalign="left"><mi>S</mi><msup><mrow><mi>M</mi></mrow><mrow><mo>∗</
   mo></mrow></msup><mo>(</mo><msub><mrow><mi>T</mi></mrow><mrow><mn>1</mn
   ></mrow></msub><mo>,</mo><msub><mrow><mi>T</mi></mrow><mrow><mn>2</mn><
   /mrow></msub><mo>)</mo></mtd></mtr><mtr><mtd columnalign="left"><mspace
   width="3em"></mspace><mo>=</mo><msqrt><mrow><mo>(</mo><mi>S</mi><msubsu
   p><mrow><mi>M</mi></mrow><mrow><mi>G</mi></mrow><mrow><mo>∗</mo></mrow>
   </msubsup><mo>(</mo><msub><mrow><mi>T</mi></mrow><mrow><mn>1</mn></mrow
   ></msub><mo>,</mo><msub><mrow><mi>T</mi></mrow><mrow><mn>2</mn></mrow><
   /msub><mo>)</mo><mspace width="0.3em"></mspace><mo>×</mo><mspace
   width="0.3em"></mspace><mi>S</mi><msubsup><mrow><mi>M</mi></mrow><mrow>
   <mi>C</mi></mrow><mrow><mo>∗</mo></mrow></msubsup><mo>(</mo><msub><mrow
   ><mi>T</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>,</mo><msub><mrow><
   mi>T</mi></mrow><mrow><mn>2</mn></mrow></msub><mo>)</mo><mo>×</mo><mi>S
   </mi><msubsup><mrow><mi>M</mi></mrow><mrow><mi>T</mi></mrow><mrow><mo>∗
   </mo></mrow></msubsup><mo>(</mo><msub><mrow><mi>T</mi></mrow><mrow><mn>
   1</mn></mrow></msub><mo>,</mo><msub><mrow><mi>T</mi></mrow><mrow><mn>2<
   /mn></mrow></msub><mo>)</mo><mo>)</mo></mrow></msqrt></mtd></mtr></mtab
   le> :MATH]
   (4)

   Suppose, we have two sets of triclusters T [im] and T [res] where T
   [im]represents the set of implanted triclusters and T [res] corresponds
   to the set of triclusters retrieved by any triclustering algorithm.
   Hence SM ^∗(T[im],T[res]) denotes how well the triclustering algorithm
   finds the true triclusters that have been implanted into the dataset.
   This score varies from 0 to 1 (if T [im] = T [res]).

   For the dataset containing perfect shifing triclusters, we have
   assigned 0.35 and 1.0005 to the parameters δ and λ, respectively. The
   value of δ varies from one dataset to another dataset. Then we have
   added noisy triclusters having different levels i.e. standard
   deviations (σ = 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5, 1.7). To have
   an idea about the δ value, we have first clustered the genes over all
   time points and then the time points over the subset of genes for each
   gene cluster in each sample plane using the K-means algorithm. Then we
   have computed the MSR value(S) of the submatrix, considering a randomly
   selected sample plane, gene and time-pont cluster for 100 times. Then
   we have taken the lowest value as the value of δ. For these noisy
   datasets, we have assigned 3.75 and 1.004 to the parameters δ and λ,
   respectively. In Figure [47]1 we have compared the performance of our
   algorithm with that of the TRICLUSTER algorithm [[48]3] in terms of
   affirmation score using the artificial dataset. Our δ-TRIMAX algorithm
   performs better than TRICLUSTER algorithm for the noisy dataset. For
   perfect additive triclusters, performances of both these algorithms are
   comparable with each other.

Figure 1.

   [49]Figure 1
   [50]Open in a new tab

   Comparison in terms of Affirmation Scores.a. Comparison of Affirmation
   scores produced by -TRIMAX and TRICLUSTER algorithm. b. Comparison of
   running time of -TRIMAX and TRICLUSTER algorithm on the synthetic
   dataset.

Results on real-life dataset

Datasets for genome-wide analysis of estrogen receptor binding sites

   This dataset contains 54675 affymetrix probe-set ids, 3 biological
   replicates and 4 time points. In this experiment MCF7 cells are
   stimulated with 100 nm estrogen for 0, 3, 6 and 12 hours and the
   experiments are performed in triplicate. This dataset is publicly
   available at Gene Expression Omnibus (GEO) (dataset id- GSE 11324). It
   was used for discovering of cis-regulatory sites in previously
   uninvestigated regions and cooperating transcription factors underlying
   estrogen signaling in breast cancer [[51]9]. We assign 0.012382 and 1.2
   to δ and λ respectively. In this case our algorithm results in 115
   triclusters. From Figure [52]2, we observe that the genes in tricluster
   4 have similar expression profiles over all three samples across 0, 6
   and 12 hours but not at 3 hour.

Figure 2.

   [53]Figure 2
   [54]Open in a new tab

   Expression Profiles. Figures in first row show the expression profiles
   of genes ESR1, HOXA11, FAM71A, SPEF2, IFIH1, FPR2, SPAG9, NCF4, ADAM3A,
   CCNYL1 respectively of tricluster 4 over all samples; The red-colored
   time point (3 hrs.) is not a member of this tricluster. The figures in
   second row show the expression profiles of the same genes across 0, 6
   and 12 hours.

   To compare the performance of our proposed algorithm with TRICLUSER
   algorithm on real-life dataset, we have used three validation indexes.

Coverage

   Coverage for any triclustering algorithm can be delineated as
   [MATH: <mtext mathvariant="italic">Coverage</mtext><mo>=</mo><mfenced
   open="("
   close=")"><mrow><mfrac><mrow><msub><mrow><mi>g</mi></mrow><mrow><mtext
   mathvariant="italic">alg</mtext></mrow></msub><mo>×</mo><msub><mrow><mi
   >c</mi></mrow><mrow><mtext
   mathvariant="italic">alg</mtext></mrow></msub><mo>×</mo><msub><mrow><mi
   >t</mi></mrow><mrow><mtext
   mathvariant="italic">alg</mtext></mrow></msub></mrow><mrow><mi>G</mi><m
   o>×</mo><mi>C</mi><mo>×</mo><mi>T</mi></mrow></mfrac></mrow></mfenced><
   mo>×</mo><mn>100</mn><mo>,</mo> :MATH]
   (5)

   where g[alg], c[alg] and t[alg] denote total number of genes,
   experimental samples and time points retrieved by the triclustering
   algorithm. G, C and T represent number of all genes, experimental
   samples and time points in the dataset.

Triclustering Quality Index (TQI)

   We can elucidate Triclusering Quality Index of a tricluster by equation
   4.
   [MATH: <mtext
   mathvariant="italic">TQI</mtext><mo>=</mo><mfrac><mrow><mtext
   mathvariant="italic">MS</mtext><msub><mrow><mi>R</mi></mrow><mrow><mi>i
   </mi></mrow></msub></mrow><mrow><mtext
   mathvariant="italic">Volum</mtext><msub><mrow><mi>e</mi></mrow><mrow><m
   i>i</mi></mrow></msub></mrow></mfrac><mo>,</mo> :MATH]
   (6)

   where MSR[i] and Volume[i] represent mean-squared residue and volume of
   ith tricluster. Lower TQI score represents better quality of
   tricluster.

Statistical Difference from Background (SDB)

   Here we have introduced another quality measurement, termed as
   Statistical Differences from Background (SDB) [[55]10] as
   [MATH: <mtext
   mathvariant="italic">SDB</mtext><mo>=</mo><mfrac><mrow><mn>1</mn></mrow
   ><mrow><mi>n</mi></mrow></mfrac><munderover><mrow><mi
   mathsize="big">∑</mi></mrow><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>n</mi></mrow></munderover><mfrac><mrow><mtext
   mathvariant="italic">MS</mtext><msub><mrow><mi>R</mi></mrow><mrow><mi>i
   </mi></mrow></msub></mrow><mrow><mfrac><mrow><mn>1</mn></mrow><mrow><mi
   >r</mi></mrow></mfrac><msubsup><mrow><mi
   mathsize="big">∑</mi></mrow><mrow><mi>j</mi><mo>=</mo><mn>1</mn></mrow>
   <mrow><mi>r</mi></mrow></msubsup><mtext
   mathvariant="italic">RMS</mtext><msub><mrow><mi>R</mi></mrow><mrow><mi>
   j</mi></mrow></msub><mo>-</mo><mtext
   mathvariant="italic">MSRi</mtext></mrow></mfrac><mo>,</mo> :MATH]
   (7)

   where n is the total number of triclusters extracted by the algorithm.
   MSR[i] represents mean squared residue of ith tricluster retrieved by
   the algorithm and RMSR[j] represents mean squared residue of jth random
   tricluster having the same number of genes, experimental samples and
   time points as that of ith resultant tricluster. Here higher value of
   the denominator denotes better quality of the resultant tricluster.
   Hence, lower SDB score signifies better performance of the algorithm.
   Table [56]1 shows the comparison between proposed δ-TRIMAX algorithm
   and TRICLUSTER algorithm in terms of coverage, SDB and TQI score.

Table 1.

   Comparison between δ -TRIMAX and TRICLUSTER algorithm using coverage,
   Statistical Difference of from Background (SDB) and Triclustering
   Quality Index (TQI)
   Algorithm   Coverage SDB       Average TQI
   δ-TRIMAX
     __________________________________________________________________

   93.7412
     __________________________________________________________________

   0.4670856
     __________________________________________________________________

   3.082684e-05
     __________________________________________________________________

   TRICLUSTER  72.34019 0.4775341 3.348486e-05
   [57]Open in a new tab

Biological significance

   We have established the biological significance of genes belonging to
   each resultant tricluster by performing (a) Gene Ontology (GO) and KEGG
   pathway enrichment analysis, (b) cogitating each tricluster with
   different estrogen-responsive stages (early (3 hour), middle (6 hour)
   and late (12 hour)), (c) identifying hub genes of each tricluster and
   (d) Transcription Factor Binding Site (TFBS) enrichment analysis.

GO and KEGG pathway enrichment analysis

   We have used GOStats package [[58]11] in R to perform GO and KEGG
   pathway enrichment analysis for establishing biological significance of
   genes belonging to each tricluster. We have adjusted the p-values using
   Benjamini-Hochberg FDR method [[59]12] and considered those terms as
   significant ones that have a p-value below a threshold of 0.05. The
   smaller p-value represents higher significance level. We have found
   statistically enriched GO terms for genes belonging to each tricluster.
   We have compared the performance of our proposed δ-TRIMAX algorithm
   with that of TRICLUSTER algorithm on real-life dataset. For comparison
   of the performances we have considered GO Biological Processes (GOBP)
   and KEGG pathway terms that have already been reported to play an
   important role in estrogen induced breast cancer cell. Table [60]2
   shows the comparison between δ-TRIMAX and TRICLUSTER algorithm in terms
   corrected p-values of GOBP and KEGG pathway terms cell adhesion and Wnt
   signaling pathway that are observed to be associated with estrogen
   induced breast cancer [[61]13,[62]14], respectively.

Table 2.

   Comparison between δ -TRIMAX and TRICLUSTER algorithm in terms of
   p-values of GO and KEGG pathway term enrichment analysis
   Algorithm                GOBP term   KEGG pathway terms
   δ-TRIMAX
     __________________________________________________________________

   GO:0007155: cell adhesion
     __________________________________________________________________

   KEGG:04310: Wnt signaling
     __________________________________________________________________


     __________________________________________________________________

     (4.31e-08)
     __________________________________________________________________

     pathway (0.011)
     __________________________________________________________________

   TRICLUSTER
     __________________________________________________________________

   GO:0007155: cell adhesion
     __________________________________________________________________

   KEGG:04310: Wnt signaling
     __________________________________________________________________

                              (0.00022)   pathway (0.03)
   [63]Open in a new tab

Association of triclusters with different stages of response to estrogen
stimulus

   To cogitate each tricluster with different estrogen responsive stages
   of the experiment, we represent each tricluster by eigengene. Then we
   have examined whether the eigengene of each tricluster is
   differentially expressed at early, middle and late estrogen responsive
   stages using Limma package in R [[64]15] (FDR-BH corrected p-value
   cutoff 0.05). If eigengene of one tricluster is found to be
   differentially expressed at any possible responsive stages, then the
   genes having highly correlated expression profiles with that of
   eigengene can also be considered to be significantly expressed at the
   same stages. In total our algorithm results in 115 triclusters.
   Eigengene of tricluster 7 has been found to be differentially expressed
   between 0 hour - 6 hours, 0 hour - 12 hours, 3 hours - 12 hours and 6
   hours - 12 hours. 429 genes among 505 genes are found to be
   differentially expressed in this tricluster. KEGG pathway term mTOR
   signaling pathway is observed to be meliorated in this tricluster and
   has been reported to be associated with estrogen induced breast cancer
   cell [[65]16]. Genes PIK3CA, PRKAA1, RPS6, ULK2 participate in that
   pathway. The genes belonging to tricluster 50 are coexpressed over all
   samples across 0, 6 and 12 hours. The eigengene of tricluster 50 has
   been observed to be differentially expressed between 0 hour - 12 hours
   and 6 hours - 12 hours. 96% of the genes belonging to this tricluster
   are found to be differentially expressed. The genes in this tricluster
   are meliorated with the KEGG pathway term ubiquitin mediated
   proteolysis (UBE2K, CUL4B, PIAS1, CDC23). It has been reported in a
   previous study that there is crosstalk between ER α and targets of ER α
   for ubiquitin mediated proteolysis [[66]17]. In tricluster 71 time
   points 3, 6 and 12 hours are present in that tricluster and the
   eigengene is significantly expressed between 3 hours and 12 hours. 44
   genes out of 52 genes in this tricluster are significantly expressed
   between 3 and 12 hours. Genes belonging to this tricluster are also
   enriched with the KEGG pathway term ubiquitin mediated proteolysis
   (UBA6, BIRC6, ANAPC1, CUL5). Genes belonging to tricluster 48 are
   coexpressed across 0, 3 and 12 hours. Eigengene of tricluster 48 is
   significantly expressed between 0 and 12 hours, 3 and 12 hours. The
   KEGG pathway term TGF-beta signaling pathway is meliorated in this
   tricluster and the crosstalk between TGF-beta signaling pathway and ER
   α has been reported in a previous study [[67]18]. Genes SKP1, BMPR2 are
   found to play a role in the enriched pathway. Eigengene of tricluster
   95 is significantly upregulated between 0 and 12 hours. 60% of all
   genes belonging to this tricluster are differentially up regulated at
   late responsive stage. The genes in this tricluster has been found to
   be coexpressed across 0 hour, 12 hours over all samples. The KEGG
   pathway term apoptosis (XIAP, IRAK4, CASP6) is observed to be
   meliorated in this tricluster and it has been found in a recent study
   that apoptosis can be induced by estrogen in estrogen
   deprivation-resistant breast cancer cell [[68]19]. The genes of that
   tricluster 75 have been observed to be coexpressed over all samples
   across 3 and 12 hours. The eigengene of tricluster 75 is differentially
   expressed between 3 hours and 12 hours. 39 genes among 64 genes are
   significantly expressed between 3 and 12 hours. In this case we have
   observed enrichment for KEGG pathway terms ErbB signaling pathway that
   is found to be associated with estrogen induced breast cancer cell
   [[69]20]. The coexpressed genes NCK1, SOS2 in this tricluster
   participate in that pathway.

Identification and roles of hub genes

   To identify hub genes of each tricluster, we have computed tricluster
   membership of each gene by calculating Pearson correlation coefficient
   between each gene and the eigengene of that tricluster. We have
   considered the top fifteen genes as hub genes having highest
   correlation coefficient with the eigengene of that tricluster. For
   tricluster 1, we have identified NPC1L1, TMEM161B-AS1, POU5F1P3,
   POU5F1P4, POU5F1B, CCL2 as hub-genes that are coexpressed over all-time
   points. It has been observed in a previous study that high doses of
   estrogen augment intestinal cholesterol absorption attributable in part
   to an up-regulated expression of NPC1L1 which is known as intestinal
   sterol influx transporter [[70]21]. CCL2 is found to play an important
   role in mediating cross-talk between cancer cells and stromal
   fibroblasts in breast cancer cells [[71]22]. DNAJC3-AS1, ITSN2, TRPC1,
   CD47, ZNF286A, TSC22D2, PHF17, ZNF286B, TMEM67, NFIB, JKAMP, DENND4A,
   HPGD are identified as hub-genes that are coexpressed over all samples
   and 0, 3, 6 and 12 hours in tricluster 7. NFIB has been reported as a
   potential target of ER negative breast cancers [[72]23]. Transient
   receptor potential cation channel (TRPC1) is known to play an important
   role in breast cancer [[73]24]. CD47 has been found to intervene
   killing of breast cancer cells [[74]25]. HPGD plays important role in
   epithelial-mesenchymal transition and migration in breast cancer cells
   [[75]26]. In tricluster 4, the genes are coexpressed over 0, 6 and 12
   hours and IGKV1-13, FAM69C, SGCD, CSNK1E, TRMU, CRYBA2, IGKV1D-13,
   IGSF11, PACS1, IQCK are identified as hub-genes. CSNK1E has been
   observed to play an important role proliferation of breast cancer cells
   and act as a regulator of activated -catenin driven transcription
   [[76]27]. For tricluster 13 we have identified ESYT3, SERINC2, LRRC14,
   ALDH4A1, RPL10, BRD4, DEC1, ZFP30, TCP11L2, ALDOA as hub-genes. The
   gene for Aldolase A (ALDOA) plays an instrumental role in hypoxia which
   is a feature of solid tumors in breast cancer [[77]28]. Besides this
   BRD4 known as Bromodomain 4 is found to be associated with breast
   cancer progression [[78]29]. Genes PFKFB1, TAF1, PIKFYVE, MEMO1P1,
   KIF1B, PHF20L1, ARHGAP24, TSC22D1, AK7, DPY30, MEMO1, PTEN, ADAM9,
   PTPN2, MTSS1L are found as hubgenes of tricluster 95. PTEN is known to
   be a tumor suppressor gene in breast cancer [[79]30,[80]31]. PTPN2,
   ADAM9 have been reported to be associated with breast cancer in
   previous studies [[81]32,[82]33]. PIKFYVE has been found to intervene
   epidermal growth factor receptor that is associated with human breast
   cancers [[83]34]. In case of tricluster 42, ANTXR2, RHBDL2, GSTCD,
   DENND1B, KLC3, PREP, NOS1, STOML3, CDK5R1, CLEC7A, HGD, FOXC1, MSRB3,
   TEX34, SLC36A1 are appeared as hub-genes that are coexpressed over all
   samples and across 0, 12 hours. In a recent work, the activity of
   RHBDL2 has been identified in many tumour cells including breast cancer
   [[84]35]. The role of FOXC1 as a regulator of human breast cancer cells
   by activating NF κB signaling has been discovered in a recent work
   [[85]36].

TFBS enrichment analysis

   To analyse the potential coregulation of coexpressed genes, we have
   done transcription factor binding site (TFBS) enrichment analysis using
   the TRANSFAC library (version 2009.4) [[86]37] that contains eukaryotic
   transcription factors, their experimentally proven binding sites, and
   regulated genes. Here we used 42,544,964 TFBS predictions that have
   high affinity scores and are conserved between human, mouse, dog and
   cow [[87]38]. Out of these 42 million conserved TFBSs we have selected
   the best 1% for each TRANSFAC matrix individually to identify the most
   specific regulator (transcription factor) - target interactions. We
   have used hypergeometric test [[88]39] and Benjamini Yekutieli-FDR
   method [[89]40] for p-value correction to find over-represented binding
   sites (p-value ≤ 0.05) in the upstream regions of genes belonging to
   each tricluster. Table [90]3 shows the list of triclusters where we
   have found statistically meliorated TFBSs. From Table [91]3, we can
   observe that the genes in tricluster 26 are enriched with
   helix-turn-helix, zinc-coordinating DNA-binding and basic domain
   transcription factors. The helix-turn-helix domain transcription factor
   E2F1, to which TRANSFAC matrix V$E2F_Q2 is associated acts as a
   regulator of cell proliferation in estrogen-induced breast cancer cell
   [[92]41]. The zinc finger transcription factors Sp1 and Sp4, asociated
   with matrix V$SP1_Q6_01 have already been reported to play an important
   role in estrogen-induced MCF-7 breast cancer cell line [[93]42,[94]43].
   In tricluster 17, the basic domain transcrption factor CREB (matrix
   V$CREB_01) is important for malignancy in breast cancer cell. ATF1,
   ATF2, ATF3, ATF4, ATF5 (matrix V$CREBATF_Q6) likewise play an important
   role in breast cancer cell [[95]44]. We have observed enrichment for
   matrix V$NFAT1_Q6. The corresponding transcription factor (NFATC1) has
   been found to be associated with clinical characteristics in breast
   cancer cell [[96]45]. In tricluster 4 POU2F1, the TF associated with
   matrix V$OCT1_03 is a helix-turn-helix domain transcription factor
   (Oct-1) and has been reported to be estrogen-responsive in a previous
   study [[97]46]. Table [98]4 shows some statistically enriched KEGG
   pathway terms for coexpressed and differentially expressed (using
   adjusted p-value ≤ 0.05) genes the promoters of which are bound by
   aforementioned transcription factors.

Table 3.

   TRANSFAC Matrices for Triclusters, having statistically enriched TFBS
   for real-life dataset
   Tricluster (no. of genes) 20 most significant TRANSFAC matrices (in
   ascending order of p-values)   FDR-BY corrected p-value of top-most
   matrix
   Tricluster 3 (875)
     __________________________________________________________________

   V$NCX_02, V$MSX1_02, V$PAX4_02, V$POU3F2_01, V$TBP_01, V$BRN3C_01,
   V$BARX2_01, V$HB24_02, V$HOXD10_01, V$BARX1_01, V$DBX1_01, V$HMBOX1_01,
   V$HDX_01, V$BSX_01, V$NKX52_01, V$HMX3_02, V$LBX2_01, V$HOXD13_01,
   V$NFAT1_Q6, V$HOXD8_01
     __________________________________________________________________


     __________________________________________________________________

   4.29e-08
     __________________________________________________________________

   Tricluster 1 (4477)
     __________________________________________________________________

   V$NCX_02, V$HDX_01, V$BCL6_01, V$ZNF333_01, V$DLX2_01, V$DLX7_01,
   V$DLX5_01, V$SRY_02, V$BARX1_01, V$SOX4_01, V$NKX24_01, V$HOXD3_01,
   V$LBX2_01, V$LHX61_02, V$SRY_01, V$TST1_01, V$DLX3_01, V$XVENT1_01,
   V$EVX1_01, V$BARX2_01
     __________________________________________________________________


     __________________________________________________________________

   1.27e-05
     __________________________________________________________________

   Tricluster 26 (3177)
     __________________________________________________________________

   V$E2F_Q2, V$ZF5_01,V$USF2_Q6, V$SP1_Q6_01, V$KID3_01, V$CHCH_01
     __________________________________________________________________


     __________________________________________________________________

   2.99e-05
     __________________________________________________________________

   Tricluster 4 (3482)
     __________________________________________________________________

   V$BCL6_01, V$HOXA10_01, V$SRY_01, V$NKX23_01, V$WT1_Q6, V$HOXB9_01,
   V$ISL2_01, V$HOXD10_01, V$HOXD8_01, V$NCX_02, V$X1_02, V$PAX4_04,
   V$BARHL2_01, V$DLX1_01, V$SRY_02, V$OCT1_03, V$DLX5_01, V$LHX9_01,
   V$DBX2_01, V$HMGIY_Q6
     __________________________________________________________________


     __________________________________________________________________

   9.51e-05
     __________________________________________________________________

   Tricluster 2 (2186)
     __________________________________________________________________

   V$CHCH_01, V$MOVOB_01, V$MAZ_Q6, V$PAX4_03, V$CACD_01, V$GEN_INI3B_B,
   V$GEN_INI_B, V$CKROX_Q2
     __________________________________________________________________


     __________________________________________________________________

   0.0001
     __________________________________________________________________

   Tricluster 12 (476)
     __________________________________________________________________

   V$SRY_02, V$NCX_02, V$BCL6_01, V$HB24_01, V$HOXA10_01, V$NKX25_02,
   V$SRY_01, V$PBX1_02, V$HOXD10_01
     __________________________________________________________________


     __________________________________________________________________

   0.002
     __________________________________________________________________

   Tricluster 17 (999)
     __________________________________________________________________

   V$CREB_01, V$CREBATF_Q6, V$SP1_Q6_01, V$ATF3_Q6, V$CREBP1CJUN_01
     __________________________________________________________________


     __________________________________________________________________

   0.004
     __________________________________________________________________

   Tricluster 50 (182)
     __________________________________________________________________

   V$ETF_Q6
     __________________________________________________________________


     __________________________________________________________________

   0.006
     __________________________________________________________________

   Tricluster 18 (260)
     __________________________________________________________________

   V$STAT1STAT1_Q3
     __________________________________________________________________


     __________________________________________________________________

   0.042
     __________________________________________________________________

   Tricluster 31 (2465) V$SP1_Q6_01   0.046
   [99]Open in a new tab

Table 4.

   Statistically enriched KEGG pathway terms for differentially expressed
   and coexpressed targets of TRANSFAC matrices V$NFAT1_Q6, V$OCT1_03,
   V$CREB_01, V$CREBATF_Q6, V$E2F_Q2 and V$SP1_Q6_01
   Tricluster TRANSFAC matrix KEGG pathway terms (corrected p-value ≤
   0.05)
   4
     __________________________________________________________________

   V$NFAT1_Q6
     __________________________________________________________________

   KEGG: 00471: D-Glutamine and D-glutamate metabolism (GLS), KEGG: 04310:
   Wnt signaling pathway (PPP2R1B, ROCK1, TBL1X), KEGG: 04350: TGF-beta
   signaling pathway (ROCK1, TBL1X),
     __________________________________________________________________

   4
     __________________________________________________________________

   V$OCT1_03
     __________________________________________________________________

   KEGG: 04961: Endocrine and other factor-regulated calcium reabsorption
   (SLC8A1, ESR1)
     __________________________________________________________________

   17
     __________________________________________________________________

   V$CREB_01
     __________________________________________________________________

   KEGG: 00030: Pentose phosphate pathway (RBKS), KEGG: 04012: ErbB
   signaling pathway (PAK1), KEGG: 05211: Renal cell carcinoma (PAK1)
     __________________________________________________________________

   17
     __________________________________________________________________

   V$CREBATF_Q6
     __________________________________________________________________

   KEGG: 04660: T cell receptor signaling pathway (PAK1), KEGG: 04650:
   Natural killer cell mediated cytotoxicity (PAK1, KEGG: 05120:
   Epithelial cell signaling in Helicobacter pylori infection (PAK1),
   KEGG: 04360: Axon guidance (PAK1)
     __________________________________________________________________

   26
     __________________________________________________________________

   V$E2F_Q2
     __________________________________________________________________

   KEGG: 04110: Cell cycle (MCM7, ANAPC1, WEE1), KEGG: 03030: DNA
   replication (MCM7, POLA1)
     __________________________________________________________________

   26 V$SP1_Q6_01 KEGG: 00100: Steroid biosynthesis (SQLE), KEGG: 00270:
   Cysteine and methionine metabolism (MAT2A), KEGG: 04962:
   Vasopressin-regulated water reabsorption (CREB1), KEGG: 04623:
   Cytosolic DNA-sensing pathway (MAVS)
   [100]Open in a new tab

Conclusion

   In this work we have proposed δ-TRIMAX triclustering algorithm that
   aims to retrieve large and coherent groups of genes, having an MSR
   score below a threshold δ. Genes belonging to each tricluster are
   coexpressed over a subset of samples/ experimental conditions and
   across subset of time points. The results of GO and KEGG pathway
   enrichment analysis show that our proposed algorithm is able to extract
   group of coexpressed genes that are biologically significant. We have
   performed TFBS enrichment analysis to establish the fact that the
   promoter regions of the genes having similar expression profile are
   bound by the same transcription factors. We have compared the
   performance of our algorithm with that of existing algorithm using one
   artificial dataset in terms of affirmation score and one real-life
   dataset in terms of coverage, statistical difference from background
   and triclustering quality index score. In case of these two datasets
   our proposed algorithm outperformed the existing one. Additionally,
   here we have represented the expression profiles of genes belonging to
   each tricluster by eigengene and then identified hub genes using the
   profile of eigengene. We have observed that most of the identified
   hub-genes are previously reported to be associated with breast cancer
   and estrogen responsive elements. The other identified hub genes might
   be associated with breast cancer and need to be verified
   experimentally. Hence our integrative approach and findings might
   provide new insights into breast cancer prognosis.

Competing interests

   The authors declare that they have no competing interests.

Authors’ contributions

   AB, MH, AM, UM and SB carried out the literature study and preplanning
   of this work. AB, MH collected the datasets. MH participated in the
   prediction of transcription factor binding sites. AB developed the
   code, did the experiments, analysed the results and wrote the draft of
   the manuscript. MH, AM, UM, SB and EW corrected the draft. EW
   supervised the entire work. All authors read and approved the final
   manuscript.

Contributor Information

   Anirban Bhar, Email: anirban.bhar@bioinf.med.uni-goettingen.de.

   Martin Haubrock, Email: martin.haubrock@bioinf.med.uni-goettingen.de.

   Anirban Mukhopadhyay, Email: anirban@klyuniv.ac.in.

   Ujjwal Maulik, Email: umaulik@cse.jdvu.ac.in.

   Sanghamitra Bandyopadhyay, Email: sanghami@isical.ac.in.

   Edgar Wingender, Email: edgar.wingender@bioinf.med.uni-goettingen.de.

Acknowledgements