Abstract

Background

   Exploratory analysis of multi-dimensional high-throughput datasets,
   such as microarray gene expression time series, may be instrumental in
   understanding the genetic programs underlying numerous biological
   processes. In such datasets, variations in the gene expression profiles
   are usually observed across replicates and time points. Thus mining the
   temporal expression patterns in such multi-dimensional datasets may not
   only provide insights into the key biological processes governing
   organs to grow and develop but also facilitate the understanding of the
   underlying complex gene regulatory circuits.

Results

   In this work we have developed an evolutionary multi-objective
   optimization for our previously introduced triclustering algorithm
   δ-TRIMAX. Its aim is to make optimal use of δ-TRIMAX in extracting
   groups of co-expressed genes from time series gene expression data, or
   from any 3D gene expression dataset, by adding the powerful
   capabilities of an evolutionary algorithm to retrieve overlapping
   triclusters. We have compared the performance of our newly developed
   algorithm, EMOA- δ-TRIMAX, with that of other existing triclustering
   approaches using four artificial dataset and three real-life datasets.
   Moreover, we have analyzed the results of our algorithm on one of these
   real-life datasets monitoring the differentiation of human induced
   pluripotent stem cells (hiPSC) into mature cardiomyocytes. For each
   group of co-expressed genes belonging to one tricluster, we identified
   key genes by computing their membership values within the tricluster.
   It turned out that to a very high percentage, these key genes were
   significantly enriched in Gene Ontology categories or KEGG pathways
   that fitted very well to the biological context of cardiomyocytes
   differentiation.

Conclusions

   EMOA- δ-TRIMAX has proven instrumental in identifying groups of genes
   in transcriptomic data sets that represent the functional categories
   constituting the biological process under study. The executable file
   can be found at
   [31]http://www.bioinf.med.uni-goettingen.de/fileadmin/download/EMOA-del
   ta-TRIMAX.tar.gz.

Electronic supplementary material

   The online version of this article (doi:10.1186/s12859-015-0635-8)
   contains supplementary material, which is available to authorized
   users.

   Keywords: Microarray gene expression data, Developmental biology,
   Tricluster, Multi-objective optimization, Eigen gene, Affirmation
   score, TRANSFAC

Background

   One of the main aims of functional genomics is to understand the
   dynamic features encoded in the genome such as the regulation of gene
   activities. It often refers to high-throughput approaches devised to
   gain a complete picture about all genes of an organism in one
   experiment. Several steps, such as transcription, RNA splicing and
   translation are involved in the process of gene expression, which is
   subject to a great many of regulatory mechanisms. Analysis of such gene
   expression data provides enormous leverages to understand the
   principles of cellular systems, diseases mechanisms, molecular networks
   etc. Genes having similar expression profiles are frequently found to
   be regulated by similar mechanisms. Previous studies elucidated the
   impact of highly connected intra-modular hub genes on such regulations
   [[32]1–[33]3]. Detecting hub genes and analyzing their roles may
   facilitate understanding the basal control mechanisms of a certain
   normal or disease cellular phenotype to develop.

   Microarray technology is used to measure the expression of thousands of
   genes over a set of biological replicates simultaneously. In recent
   years, such expression signatures have increasingly been monitored for
   sets of time points in order to follow the course of biological
   processes. In case of such three-dimensional datasets, at each time
   point the activity of all genes is measured for a number of biological
   replicates. Although the experimental setups are kept identical for
   these replicates, variations between them can still occur. For
   instance, stochastic effects can result in delays or accelerations of a
   certain cell state transition. Thus, grouping similar biological
   replicates may facilitate the analysis of time series gene expression
   data. Moreover, expression profiles of genes may also vary over
   different time points. Appropriate computational methods are therefore
   required to analyze such high-throughput datasets specifically to
   identify temporal expression patterns over biological replicates and
   time points. Clustering, one of the unsupervised learning approaches,
   has been used to explore such two-dimensional gene expression datasets.
   Clustering algorithms aim to maximize similarity within or to minimize
   similarity between clusters, based on a distance measure [[34]4].
   Clustering is able to group genes or samples over a set of samples or
   genes, respectively, but it has been reported in previous studies that
   genes are not necessarily to be co-expressed over all samples. Hence to
   find such local patterns, i.e. genes having similar expression profiles
   over a subset of samples in 2D gene expression datasets biclustering
   algorithms are used [[35]5]. In previous studies, biclusters have been
   found to be biologically more significant as these algorithms aim to
   extract groups of correlated genes from a subset of samples. Such
   subspace clustering techniques find clusters in multiple overlapping
   subspaces. To deal with time series gene expression datasets,
   biclustering algorithms fail to extract genes that have similar
   expression profiles over a subset of samples during a subset of time
   points. To perform co-expression analysis in such three-dimensional
   gene expression datasets triclustering algorithms have to be employed.
   Zhao et al. proposed the TRICLUSTER algorithm that aims to retrieve
   groups of genes that have similar expression profiles over a subset of
   samples during a subset of time points [[36]6]. In a recent work,
   Tchagang et al. proposed a triclustering algorithm (OPTricluster) for
   mining short time series gene expression datasets. OPTricluster
   effectively mines time series gene expression data having approximately
   3-8 time points and 2-5 samples. According to their definition of a
   tricluster, genes belonging to a tricluster must have constant,
   coherent or order preserving expression patterns over a subset of
   samples during a subset of time points. In case of an order-preserving
   tricluster, there must be a permutation of the time points such that
   expression levels of genes form a monotonic function [[37]7]. In our
   previous work we have proposed a triclustering algorithm δ-TRIMAX by
   introducing a novel mean squared residue score (MSR) to mine a 3D gene
   expression dataset and each tricluster must have an MSR score below a
   threshold δ [[38]8, [39]9]. In spite of its proven merits [[40]8,
   [41]9], δ-TRIMAX has some limitations: a) it can not retrieve
   overlapping triclusters, b) due to its greedy approach it often gets
   stuck at local optima. Finding overlapping triclusters is important in
   biological context, since each gene may participate in several
   biological processes, thus being subject to multiple regulatory
   influences [[42]10]. A subset of genes may therefore be involved in a
   set of biological processes and consequently belong to several
   triclusters. However, the goals of δ-TRIMAX algorithm were to maximize
   the volume and minimize the MSR score of the resultant triclusters.
   Hence the problem of optimizing such multiple conflicting objectives
   can be classified as multi-objective optimization problem where a set
   of alternative solutions of equivalent quality exists instead of one
   single optimal solution. To optimize the conflicting objectives of
   δ-TRIMAX we have used a non-dominated sorting genetic algorithm-II
   (NSGA-II) [[43]11] as a multi-objective optimization method to develop
   EMOA- δ-TRIMAX (Evolutionary Multi-objective Optimization Algorithm for
   δ-TRIMAX). It could demonstrate that EMOA- δ-TRIMAX effectively copes
   with the problems of δ-TRIMAX.

   The main purpose of studying developmental biology is to gain insight
   into the biological processes by which an organism, or one particular
   organ, grows and develops. Cell differentiation refers to the
   biological processes by which a less specialized cell develops into a
   specialized cell type. For instance, stem cells can differentiate into
   different specialized cell types such as cardiomyocytes, neural
   progenitors etc. [[44]12, [45]13]. In this work we aim at analyzing
   gene expression profiles during the differentiation of human induced
   pluripotent stem cells (hiPSCs) into cardiomyocytes in order to reveal
   key genes, potential biological processes and/ or pathways by which
   stem cells gain new phenotypic features of adult heart cells. To study
   the temporal expression patterns over developmental time points and
   biological replicates, we have applied our proposed triclustering
   algorithm EMOA- δ-TRIMAX on a real-life dataset that contains mRNA
   expression profiles of hiPSCs differentiating into cardiomyocytes
   [[46]12]. Figure [47]1 shows the general work flow of this work. After
   retrieval of triclusters by applying EMOA- δ-TRIMAX we first performed
   enrichment analyses of KEGG pathways and transcription factor binding
   sites (TFBSs) among the clustered genes to demonstrate biological
   significance of each resultant tricluster. In the next step, we
   identified key genes for each resultant tricluster and performed
   biological process and KEGG pathway enrichment analysis to uncover
   potential biological processes that may govern stem cell
   differentiation towards adult heart.

Fig. 1.

   Fig. 1
   [48]Open in a new tab

   Workflow. General workflow applied in this work

Methods

Definitions

   Time series gene expression dataset (D): Such a dataset can be modeled
   as a G × C × T matrix, of which each element d [ijk] corresponds to the
   expression value of the ith gene over the jth sample and across the kth
   time point where i ∈ (g [1],g [2],....,g [G]), j ∈ (c [1],c [2],....,c
   [C]), k ∈ (t [1],t [2],....,t [T]).

   Tricluster (M): A tricluster can be defined as a sub-matrix M(I,J,K) =
   [ m [ijk]], where i ∈ I, j ∈ J, k ∈ K. Sub-matrix M represents a subset
   of genes (I) that have similar expression profiles over a subset of
   samples (J) during a subset of time points (K).

   Perfect shifting tricluster: A tricluster M(I,J,K) is called perfect
   shifting tricluster if each element of the tricluster is represented
   as: m [ijk]=Γ+α [i]+β [j]+η [k], where Γ is a constant value of the
   tricluster and α [i], β [j] and η [k] are the shifting factors of ith
   gene, jth sample, kth time point respectively.

   Mean squared residue: Mean squared residue score (MSR) of shifting
   tricluster M(I,J,K) can be modeled as [[49]8, [50]9]
   [MATH: <mtable><mtr><mtd><mtext
   mathvariant="italic">MSR</mtext><mo>=</mo></mtd><mtd><mspace
   width="0.3em"></mspace><mfrac><mrow><mn>1</mn></mrow><mrow><mo>|</mo><m
   i>I</mi><mo>|</mo><mo>|</mo><mi>J</mi><mo>|</mo><mo>|</mo><mi>K</mi><mo
   >|</mo></mrow></mfrac><munder><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><m
   o>∈</mo><mi>I</mi><mo>,</mo><mi>j</mi><mo>∈</mo><mi>J</mi><mo>,</mo><mi
   >k</mi><mo>∈</mo><mi>K</mi></mrow></munder><msubsup><mrow><mi>r</mi></m
   row><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow><mrow><mn>2</mn></mrow></msubsup
   ><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mo>|</mo><mi>I</mi><mo>
   |</mo><mo>|</mo><mi>J</mi><mo>|</mo><mo>|</mo><mi>K</mi><mo>|</mo></mro
   w></mfrac><munder><mrow><mo>∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>
   I</mi><mo>,</mo><mi>j</mi><mo>∈</mo><mi>J</mi><mo>,</mo><mi>k</mi><mo>∈
   </mo><mi>K</mi></mrow></munder></mtd></mtr><mtr><mtd><msup><mrow><mo>(<
   /mo><msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub><mo>−</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>−</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>−</mo><msub><mrow><mi
   >m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>+</mo><mn>2</mn><msub
   ><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>)</mo></mrow><mrow><m
   n>2</mn></mrow></msup><mo>,</mo></mtd></mtr></mtable> :MATH]
   (1)

   where the mean of the ith gene is
   [MATH: <msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">iJK</mtext></mrow></msub><mo>=</mo><mfrac><mrow><m
   n>1</mn></mrow><mrow><mo>|</mo><mi>J</mi><mo>|</mo><mo>|</mo><mi>K</mi>
   <mo>|</mo></mrow></mfrac><munder><mrow><mo>∑</mo></mrow><mrow><mi>j</mi
   ><mo>∈</mo><mi>J</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</mi></mrow></m
   under><msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub> :MATH]
   , the mean of the jth sample is
   [MATH: <msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IjK</mtext></mrow></msub><mo>=</mo><mfrac><mrow><m
   n>1</mn></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|</mo><mi>K</mi>
   <mo>|</mo></mrow></mfrac><munder><mrow><mo>∑</mo></mrow><mrow><mi>i</mi
   ><mo>∈</mo><mi>I</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</mi></mrow></m
   under><msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub> :MATH]
   , the mean of the kth time point is
   [MATH: <msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJk</mtext></mrow></msub><mo>=</mo><mfrac><mrow><m
   n>1</mn></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|</mo><mi>J</mi>
   <mo>|</mo></mrow></mfrac><munder><mrow><mo>∑</mo></mrow><mrow><mi>i</mi
   ><mo>∈</mo><mi>I</mi><mo>,</mo><mi>j</mi><mo>∈</mo><mi>J</mi></mrow></m
   under><msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub> :MATH]
   , and the mean of tricluster is
   [MATH: <msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">IJK</mtext></mrow></msub><mo>=</mo><mfrac><mrow><m
   n>1</mn></mrow><mrow><mo>|</mo><mi>I</mi><mo>|</mo><mo>|</mo><mi>J</mi>
   <mo>|</mo><mo>|</mo><mi>K</mi><mo>|</mo></mrow></mfrac><munder><mrow><m
   o>∑</mo></mrow><mrow><mi>i</mi><mo>∈</mo><mi>I</mi><mo>,</mo><mi>j</mi>
   <mo>∈</mo><mi>J</mi><mo>,</mo><mi>k</mi><mo>∈</mo><mi>K</mi></mrow></mu
   nder><msub><mrow><mi>m</mi></mrow><mrow><mtext
   mathvariant="italic">ijk</mtext></mrow></msub> :MATH]
   .

   The MSR score of a tricluster represents the level of coherence among
   the elements of the tricluster. Hence a lower MSR score means better
   quality of a tricluster. For a perfect shifting tricluster the MSR
   score is zero. If we use some global normalization, like min–max
   normalization globally on the whole dataset, it does not affect the
   algorithm. Moreover, it can be shown that gene–wise Z-normalization
   only on a tricluster does not affect the MSR score. However, when we
   apply similar normalization on the whole dataset, it affects the
   triclusters, and in turn affects our algorithm. Still we prefer to
   normalize the dataset in order to eliminate the variability in gene
   expression profiles due to experimental errors and noises and as
   normalization reduces the effects of scaling patterns, scaling patterns
   could also be identified partially.

Steps of EMOA- δ-TRIMAX

   The steps the of δ-TRIMAX algorithm [[51]8, [52]9] have been described
   in the Additional file [53]1. Figure [54]2 shows the steps of our
   proposed EMOA- δ-TRIMAX algorithm.

Fig. 2.

   Fig. 2
   [55]Open in a new tab

   EMOA- δ-TRIMAX algorithm. Steps of the EMOA- δ-TRIMAX algorithm

Multi-objective optimization problem

   The multi-objective optimization problem is equivalent to finding the
   vector
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover></mrow><mrow><mo
   >∗</mo></mrow></msup><mo>=</mo><msup><mrow><mo>[</mo><msubsup><mrow><mi
   >x</mi></mrow><mrow><mn>1</mn></mrow><mrow><mo>∗</mo></mrow></msubsup><
   mo>,</mo><msubsup><mrow><mi>x</mi></mrow><mrow><mn>2</mn></mrow><mrow><
   mo>∗</mo></mrow></msubsup><mo>,</mo><mo>…</mo><mo>,</mo><msubsup><mrow>
   <mi>x</mi></mrow><mrow><mi>n</mi></mrow><mrow><mo>∗</mo></mrow></msubsu
   p><mo>]</mo></mrow><mrow><mi>T</mi></mrow></msup> :MATH]
   of decision variables that satisfies a number of equality and
   inequality constraints by optimizing the vector function
   [MATH: <mover
   accent="true"><mrow><mi>f</mi></mrow><mo>¯</mo></mover><mo>(</mo><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover><mo>)</mo><mo>=<
   /mo><msup><mrow><mo>[</mo><msub><mrow><mi>f</mi></mrow><mrow><mn>1</mn>
   </mrow></msub><mo>(</mo><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover><mo>)</mo><mo>,<
   /mo><msub><mrow><mi>f</mi></mrow><mrow><mn>2</mn></mrow></msub><mo>(</m
   o><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover><mo>)</mo><mo>,<
   /mo><mo>…</mo><mo>,</mo><msub><mrow><mi>f</mi></mrow><mrow><mi>r</mi></
   mrow></msub><mo>(</mo><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover><mo>)</mo><mo>]<
   /mo></mrow><mrow><mi>T</mi></mrow></msup> :MATH]
   subject to some constraints. Here the constraints correspond to the
   feasible region F that holds all the acceptable solutions;
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover></mrow><mrow><mo
   >∗</mo></mrow></msup> :MATH]
   stands for an optimal solution. For a minimization problem, Pareto
   optimality can be formally delineated as: A decision vector
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover></mrow><mrow><mo
   >∗</mo></mrow></msup> :MATH]
   is referred to as Pareto optimal if and only if there is no
   [MATH: <mover accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover>
   :MATH]
   such that ∀i∈{1,2,..,r},
   [MATH:
   <msub><mrow><mi>f</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>(</mo><m
   over
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover><mo>)</mo><mo>≤<
   /mo><msub><mrow><mi>f</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>(</m
   o><msup><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover></mrow><mrow><mo
   >∗</mo></mrow></msup><mo>)</mo> :MATH]
   and ∃i∈{1,2,…,r},
   [MATH:
   <msub><mrow><mi>f</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>(</mo><m
   over
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover><mo>)</mo><mo><<
   /mo><msub><mrow><mi>f</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>(</m
   o><msup><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover></mrow><mrow><mo
   >∗</mo></mrow></msup><mo>)</mo> :MATH]
   . In words,
   [MATH: <msup><mrow><mover
   accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover></mrow><mrow><mo
   >∗</mo></mrow></msup> :MATH]
   is called Pareto optimal if there exists no possible vector
   [MATH: <mover accent="true"><mrow><mi>x</mi></mrow><mo>¯</mo></mover>
   :MATH]
   that induces a diminution of some criterion without a contemporaneous
   increase of at least one other criterion [[56]11, [57]14].

Genetic algorithm

   A genetic algorithm is a search heuristic that imitates the process of
   Darwinian evolution [[58]11, [59]14]. Here the population is generated
   randomly and consists of a set of chromosomes that encode the
   parameters of the search space. A fitness function corresponds to the
   objective function to be optimized and is used to estimate the goodness
   of each chromosome in the population. Genetic operators such as
   selection, crossover and mutation are used to evolve subsequent
   generations. If some particular criterion is met or the maximum
   generation limit is reached, then the algorithm finishes its execution.

Encoding chromosome

   Each chromosome is represented by a binary string that has three parts.
   A chromosome encodes a possible tricluster. For a time series gene
   expression dataset having G number of genes, C number of samples and T
   number of time points, the first G bits correspond to genes, the next C
   bits represent the samples and the last T positions stand for the time
   points. Hence each string is represented by (G+C+T) bits, having a
   value either 1 or 0. A value 1 means the corresponding gene or sample
   or time point is a member of the tricluster. Suppose for a 3D gene
   expression dataset having 10 genes, 5 samples and 8 time points, a
   string {10010011100011101010101} represents that genes {g [1],g [4],g
   [7],g [8],g [9]}, samples {s [3],s [4],s [5]} and time points {t [2],t
   [4],t [6],t [8]} are the members of the tricluster. The initial
   population consists of a set of randomly generated chromosomes.
   Retrieval of overlapping genes belonging to several triclusters are
   guaranteed by the step of chromosome encoding. As each bit of a
   chromosome in the population represents the presence or absence of
   genes, replicates and time points in one resultant tricluster, often we
   could find an overlap between the positions of any two chromosomes
   containing a value 1. Thus different chromosomes can encode overlapping
   triclusters. Some genes and/ or samples and/ or time points could be
   added to the initial population inspite of lying far away from the
   feature space. To remove such nodes from the population, δ-TRIMAX has
   been used as a local search heuristic.

Objective functions

   After applying δ-TRIMAX as a local search heuristic, each string in the
   population represents one δ-tricluster having an MSR below a threshold
   δ. Now we compute values of the following three objective functions for
   each individual in the population. The first objective function is
   [MATH:
   <msub><mrow><mi>f</mi></mrow><mrow><mn>1</mn></mrow></msub><mo>=</mo><m
   frac><mrow><mtext
   mathvariant="italic">MSR</mtext></mrow><mrow><mi>δ</mi></mrow></mfrac><
   mo>,</mo> :MATH]
   (2)

   where MSR is the mean squared residue score of one tricluster. Hence, f
   [1] needs to be minimized. The second objective function is
   [MATH:
   <msub><mrow><mi>f</mi></mrow><mrow><mn>2</mn></mrow></msub><mo>=</mo><m
   frac><mrow><mfenced close="|" open="|"
   separators=""><mrow><mi>I</mi></mrow></mfenced><mo>∗</mo><mfenced
   close="|" open="|"
   separators=""><mrow><mi>J</mi></mrow></mfenced><mo>∗</mo><mfenced
   close="|" open="|"
   separators=""><mrow><mi>K</mi></mrow></mfenced></mrow><mrow><mfenced
   close="|" open="|"
   separators=""><mrow><mi>G</mi></mrow></mfenced><mo>∗</mo><mfenced
   close="|" open="|"
   separators=""><mrow><mi>C</mi></mrow></mfenced><mo>∗</mo><mfenced
   close="|" open="|"
   separators=""><mrow><mi>T</mi></mrow></mfenced></mrow></mfrac><mo>,</mo
   > :MATH]
   (3)

   where (|I|∗|J|∗|K|) is the volume of the tricluster and (|G|∗|C|∗|T|)
   is the volume of the dataset. Our goal is to maximize the value of f
   [2]. Finally the third objective function is
   [MATH:
   <msub><mrow><mi>f</mi></mrow><mrow><mn>3</mn></mrow></msub><mo>=</mo><m
   fenced close="|" open="|"
   separators=""><mrow><mn>1</mn><mo>−</mo><mfrac><mrow><mn>6</mn><mo>∑</m
   o><munderover><mrow><mi>d</mi></mrow><mrow><mi>i</mi></mrow><mrow><mn>2
   </mn></mrow></munderover></mrow><mrow><mi>n</mi><mo>(</mo><msup><mrow><
   mi>n</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>−</mo><mn>1</mn><mo>)
   </mo></mrow></mfrac></mrow></mfenced><mo>,</mo> :MATH]
   (4)

   where d [i] is the difference between the ranks of average expression
   values (sorted either in ascending or descending order) over a subset
   of samples at ith time point of each pair of genes in one tricluster
   and n is the number of time points in that tricluster. Here the goal is
   to maximize the non-parametric Spearman correlation coefficient (f [3])
   [[60]15] of the resultant triclusters.

Motivations of objective functions

   As the aim of our proposed algorithm is to find triclusters having a
   lower MSR score and a higher volume, the first two objective functions
   (f [1] and f [2]) ensure to accomplish those goals. Moreover the
   objective function f [3] is used to maximize the correlation
   coefficients among genes belonging to the resultant triclusters. We
   have taken the absolute values of the correlation coefficients just
   considering the fact that coregulated genes can be both up- and
   down-regulated by the transcription factors across a subset of time
   points.

Genetic operators

   Here, non-dominated sorting and crowding distance are used for fitness
   assignment and comparison [[61]11]. A crossover is a generalization of
   several mutations performed at once, which we have not applied in this
   work [[62]16]. Instead, we have used bit string mutations with a high
   mutation probability to generate offspring population from a parent
   population. In this case, the mutation occurs at random positions
   through bit flips. For instance, for a binary string {1011010010} we
   generate a random number ranges from 0 to 1 for each bit of the string.
   If this random number for a particular bit is less than or equal to the
   mutation probability, mutation occurs and the value 1 or 0 is changed
   to a value 0 or 1, respectively. The mutation probability remains same
   for each of the bits of chromosome. After applying the mutation
   operator on each individual of the population, some genes/samples/time
   points can be added to the population that are lying far away from the
   feature space. To cope with this problem we have applied δ-TRIMAX as a
   local search heuristic.

Elitism

   We have included elitism to keep track of non-dominated Pareto optimal
   solutions after each generation [[63]11]. Stopping criteria is measured
   by the convergence metric delineated in equation ([64]8).

Tricluster eigengene

   We applied the singular value decomposition method (SVD) on the
   expression data of each resultant tricluster to detect its eigengene
   [[65]17]. For instance,
   [MATH:
   <msubsup><mrow><mi>X</mi></mrow><mrow><mi>g</mi><mo>×</mo><mo>(</mo><mi
   >c</mi><mo>∗</mo><mi>t</mi><mo>)</mo></mrow><mrow><mi>i</mi></mrow></ms
   ubsup> :MATH]
   stands for the expression matrix of ith tricluster, where g, c and t
   represent the number of genes, samples and time points of ith
   tricluster. Now we apply SVD on the data matrix (normalized to mean=0
   and variance=1). Now, the SVD of ith tricluster can be represented as,
   [MATH:
   <msup><mrow><mi>X</mi></mrow><mrow><mi>i</mi></mrow></msup><mo>=</mo><m
   text
   mathvariant="italic">UD</mtext><msup><mrow><mi>V</mi></mrow><mrow><mi>T
   </mi></mrow></msup><mo>,</mo> :MATH]
   (5)

   where U and V are the orthogonal matrices. U ^i is a g ∗ (c ∗ t) matrix
   with orthonormal columns, V ^i is a (c ∗ t) × (c ∗ t) orthogonal matrix
   and D ^i is (c ∗ t) × (c ∗ t) diagonal matrix of singular values.
   Assuming that singular values in matrix D ^i are arranged in
   non-decreasing order, we can represent the eigengene of the ith
   tricluster by the first column of matrix V ^i, i.e.
   [MATH:
   <msup><mrow><mi>E</mi></mrow><mrow><mi>i</mi></mrow></msup><mo>=</mo><m
   subsup><mrow><mi>V</mi></mrow><mrow><mn>1</mn></mrow><mrow><mi>i</mi></
   mrow></msubsup><mi>.</mi> :MATH]
   (6)

KEGG pathway enrichment

   To establish the biological significance of the genes belonging to each
   resultant tricluster for both datasets we have performed a KEGG pathway
   enrichment analysis using the GOStats package in R with a p-value
   cutoff (BH-corrected p-value) of 0.05 [[66]18, [67]19].

TFBS enrichment analysis

   Genes that exhibit similar expression profiles are supposed to be
   regulated by the same mechanism. To analyze the potential co-regulation
   of co-expressed genes, we have done a transcription factor binding site
   (TFBS) enrichment analysis using the TRANSFAC library (version 2012.2)
   [[68]20]. Here we used 52 million TFBS predictions that are conserved
   between human, mouse, dog and cow [[69]21]. Out of these 52 million
   conserved TFBSs we have selected the highest-scoring 1 % for each
   TRANSFAC matrix to identify the most specific regulator (transcription
   factor) - target interactions. We have applied a hypergeometric test
   and Benjamini Yekutieli-FDR for p-value correction to find
   over-represented binding sites (p-value ≤ 0.05) in the upstream regions
   of genes belonging to each tricluster [[70]22, [71]23].

Datasets

Description of the artificial datasets

Artificial dataset 1 (AD1):

   First, we have applied the proposed algorithm to an artificial dataset
   containing 1000 genes, 5 samples and 4 time points. We have then
   embedded 3 perfect shifting triclusters (standard deviation (σ)=0) of
   size 100 × 4 × 4, 80 × 4 × 4 and 60 × 4 × 4 into the dataset. In the
   next step, we have implanted 3 noisy triclusters with different levels
   of noise (σ=0.1,0.3,0.5,0.7,0.9) into the synthetic dataset.

Artificial dataset 2 (AD2):

   Moreover, we have generated another artificial dataset which contains
   200 genes, 10 replicates and 10 time points. Afterwards, we have
   implanted 3 perfect shifting triclusters (standard deviation (σ)=0) of
   size 50 × 3 × 3, 50 × 3 × 3 and 50 × 3 × 3 into the dataset. In the
   next step, we have added different levels of noise
   (σ=0.1,0.3,0.5,0.7,0.9) into the synthetic dataset.

Artificial dataset 3 (AD3):

   To evaluate the performance of the proposed algorithm in case of the
   datasets containing different number of time points, we have generated
   three additional artificial datasets of size 200 (genes) × 10
   (replicates) × 20 (time points), 200 (genes) × 10 (replicates) × 25
   (time points) and 200 (genes) × 10 (replicates) × 30(time points) in
   which we have embedded 3 perfect shifting triclusters of size 30 × 3 ×
   8, 30 × 3 × 6 and 30 × 3 × 4.

Artificial dataset 4 (AD4):

   In order to show the performance of the algorithm for the dataset
   containing missing values, we have randomly deleted the values of 0.5
   %, 1 %, 1.5 % and 2 % of all elements of one artificial dataset of size
   200 × 10 × 20 containing three triclusters of size 30 × 3 × 8, 30 × 3 ×
   6 and 30 × 3 × 4.

Description of real-life datasets

Dataset 1:

   In this work, this previously published dataset has only been used for
   comparing the performance of the proposed algorithm with that of the
   other existing triclustering algorithms since one of the algorithms we
   wanted to compare our approach with, OPTricluster, can only be
   efficiently applied to a short time series gene expression dataset and
   thus, was not suitable to be used for dataset 2 (see below) [[72]7].
   Dataset 1 holds 54675 Affymetrix human genome U133 plus 2.0 probe ids,
   3 samples and 4 time points (0, 3, 6 and 12 hours) ([73]GSE11324)
   [[74]24]. The goal of this experiment was to determine cis-regulatory
   sites in previously uncharted genome regions, responsible for conveying
   estrogen responses, and to identify the cooperating transcription
   factors that also contribute to estrogen signaling in MCF7 breast
   cancer cells.

Dataset 2:

   This dataset contains 48803 Illumina HumanWG-6 v3.0 probe ids, 3
   replicates and 12 time points (days 0, 3, 7, 10, 14, 20, 28, 35, 45,
   60, 90 and 120) ([75]GSE35671) [[76]12]. All these replicates are
   independent of each other. The aim of this study was to provide
   insights into the molecular regulation of hiPSC differentiation to
   cardiomyocytes.

Dataset 3:

   This experiment was carried out to study the dynamics of expression
   profiles of 54675 Affymetrix human genome U133 plus 2.0 probe ids in
   response to IFN-beta-1b treatment across four time points over 6
   patients ([77]GSE46280) [[78]25].

Results and discussion

Results on an artificial dataset

   To evaluate the performance of the proposed algorithm on the artificial
   datasets described above (2.6.1), we have used the affirmation score
   [[79]8, [80]9] defined as
   [MATH: <mspace
   width="-15.0pt"></mspace><mtable><mtr><mtd><mi>S</mi><msup><mrow><mi>M<
   /mi></mrow><mrow><mo>∗</mo></mrow></msup><mo>(</mo><msub><mrow><mi>T</m
   i></mrow><mrow><mtext
   mathvariant="italic">im</mtext></mrow></msub><mo>,</mo><msub><mrow><mi>
   T</mi></mrow><mrow><mtext
   mathvariant="italic">res</mtext></mrow></msub><mo>)</mo><mo>=</mo><msqr
   t><mrow><mi>S</mi><msubsup><mrow><mi>M</mi></mrow><mrow><mi>G</mi></mro
   w><mrow><mo>∗</mo></mrow></msubsup><mo>(</mo><msub><mrow><mi>T</mi></mr
   ow><mrow><mtext
   mathvariant="italic">im</mtext></mrow></msub><mo>,</mo><msub><mrow><mi>
   T</mi></mrow><mrow><mtext
   mathvariant="italic">res</mtext></mrow></msub><mo>)</mo><mo>×</mo><mi>S
   </mi><msubsup><mrow><mi>M</mi></mrow><mrow><mi>C</mi></mrow><mrow><mo>∗
   </mo></mrow></msubsup><mo>(</mo><msub><mrow><mi>T</mi></mrow><mrow><mte
   xt
   mathvariant="italic">im</mtext></mrow></msub><mo>,</mo><msub><mrow><mi>
   T</mi></mrow><mrow><mtext
   mathvariant="italic">res</mtext></mrow></msub><mo>)</mo><mo>×</mo><mi>S
   </mi><msubsup><mrow><mi>M</mi></mrow><mrow><mi>T</mi></mrow><mrow><mo>∗
   </mo></mrow></msubsup><mo>(</mo><msub><mrow><mi>T</mi></mrow><mrow><mte
   xt
   mathvariant="italic">im</mtext></mrow></msub><mo>,</mo><msub><mrow><mi>
   T</mi></mrow><mrow><mtext
   mathvariant="italic">res</mtext></mrow></msub><mo>)</mo></mrow></msqrt>
   <mo>,</mo></mtd></mtr></mtable> :MATH]
   (7)

   where, T [im] is the set of implanted triclusters, T [res] represents
   the set of triclusters extracted by any triclustering algorithm,
   [MATH:
   <mi>S</mi><msubsup><mrow><mi>M</mi></mrow><mrow><mi>G</mi></mrow><mrow>
   <mo>∗</mo></mrow></msubsup><mo>(</mo><msub><mrow><mi>T</mi></mrow><mrow
   ><mtext
   mathvariant="italic">im</mtext></mrow></msub><mo>,</mo><msub><mrow><mi>
   T</mi></mrow><mrow><mtext
   mathvariant="italic">res</mtext></mrow></msub><mo>)</mo> :MATH]
   is the average gene affirmation score,
   [MATH:
   <mi>S</mi><msubsup><mrow><mi>M</mi></mrow><mrow><mi>C</mi></mrow><mrow>
   <mo>∗</mo></mrow></msubsup><mo>(</mo><msub><mrow><mi>T</mi></mrow><mrow
   ><mtext
   mathvariant="italic">im</mtext></mrow></msub><mo>,</mo><msub><mrow><mi>
   T</mi></mrow><mrow><mtext
   mathvariant="italic">res</mtext></mrow></msub><mo>)</mo> :MATH]
   is the average sample affirmation score and
   [MATH:
   <mi>S</mi><msubsup><mrow><mi>M</mi></mrow><mrow><mi>T</mi></mrow><mrow>
   <mo>∗</mo></mrow></msubsup><mo>(</mo><msub><mrow><mi>T</mi></mrow><mrow
   ><mtext
   mathvariant="italic">im</mtext></mrow></msub><mo>,</mo><msub><mrow><mi>
   T</mi></mrow><mrow><mtext
   mathvariant="italic">res</mtext></mrow></msub><mo>)</mo> :MATH]
   is the average time point affirmation score of T [res] with respect to
   T [im]. The value of S M ^∗(T [im],T [res]) ranges from 0 to 1. If T
   [res]=T [im], then the affirmation score is 1.

   The affirmation score was also used to compare the performance of the
   proposed algorithm with that of the other triclustering algorithms and
   one biclustering algorithm [[81]26]. Before applying the biclustering
   algorithm to the artificial dataset we have converted G × C × T dataset
   into a G × CT dataset. To compute the value of δ, we have first
   clustered the genes over all time points and then the time points over
   the subset of genes for each gene cluster in each sample plane using
   the k-means algorithm. Taking a randomly selected sample plane, we have
   computed the MSR score of the submatrix of each gene and time-point
   cluster and repeated this procedure 100 times. Then we have taken the
   lowest value as the value of δ. Although it is possible to minimize the
   MSR score without introducing the threshold parameter δ, minimizing MSR
   without using any threshold may either yield some small sized
   triclusters which may not provide any biologically meaningful
   information or produce large sized triclusters which may contain genes
   and/or samples and/or time points lying far apart from the feature
   space. Thus using a threshold parameter δ may balance the size and
   quality of the resultant triclusters. The value of λ has been
   experimentally set to maximize the speed of the proposed algorithm and
   minimize the risk falling into a local optimum. The values of δ and λ
   used to run the proposed algorithm and our previously proposed δ-TRIMAX
   algorithm on the artificial datasets are enlisted in Tables [82]1,
   [83]2, [84]3 and [85]4. We have tuned the input parameters of other
   triclustering algorithms rather than using the default parameter values
   in order to achieve better results on the artificial datasets. The
   OPTricluster algorithm, which was proposed to mine only short time
   series gene expression datasets and yields triclusters containing all
   the time points, has only been applied to the dataset AD1 as this
   dataset contains triclusters having the same number of time points as
   the entire dataset. For the rest of the artificial datasets used in
   this work, the time point affirmation scores will be deteriorated for
   OPTricluster algorithm which in turn affects the overall affirmation
   score. From Figs. [86]3 and [87]4 we can observe that the proposed
   algorithm outperforms the other algorithms for each of the artificial
   datasets in terms of affirmation score. Figure [88]5 shows that
   although the affirmation scores of the proposed algorithm become worse
   in case of the dataset containing missing data points, it still
   outperforms the other algorithms. Moreover, Table [89]5 indicates the
   fact that EMOA- δ-TRIMAX can effectively deal with the datasets having
   different number of time points.

Table 1.

   Values of input parameters of EMOA- δ-TRIMAX namely, λ and δ for
   different levels of noise in case of the artificial dataset 1 (AD1)
   Noise levels (σ) Values of λ Values of δ
   0                1.2         0.0002
   0.1              1.2         0.025
   0.3              1.2         0.115
   0.5              1.2         0.26
   0.7              1.2         0.49
   0.9              1.2         0.85
   [90]Open in a new tab

Table 2.

   Values of input parameters of EMOA- δ-TRIMAX namely, λ and δ for
   different levels of noise in case of the artificial dataset 2 (AD2)
   Noise Levels (σ) Values of λ Values of δ
   0                1.2         0.00002
   0.1              1.2         0.045
   0.3              1.2         0.06
   0.5              1.2         0.29
   0.7              1.2         0.59
   0.9              1.2         0.8
   [91]Open in a new tab

Table 3.

   Values of input parameters of EMOA- δ-TRIMAX namely, λ and δ for
   different levels of noise in case of the artificial dataset 3 (AD3_{a,
   b, c})
   λ (AD3 _a) δ (AD3 _a) λ (AD3 _b) δ (AD3 _b) λ (AD3 _c) δ (AD3 _c)
   1.2        0.0002     1.2        0.02       1.2        0.02
   [92]Open in a new tab

Table 4.

   Values of input parameters of EMOA- δ-TRIMAX namely, λ and δ for
   different levels of noise in case of the artificial dataset 4 (AD4)
   % of missing values λ   δ
   0.5                 1.2 0.02
   1                   1.2 0.02
   1.5                 1.2 0.02
   2                   1.2 0.02
   [93]Open in a new tab

Fig. 3.

   Fig. 3
   [94]Open in a new tab

   Comparison in terms of affirmation score. Comparison between EMOA-
   δ-TRIMAX, δ-TRIMAX, TRICLUSTER, OPTricluster and the biclustering
   algorithm proposed by Cheng and Church in terms of affirmation score
   for the artificial dataset 1

Fig. 4.

   Fig. 4
   [95]Open in a new tab

   Comparison in terms of affirmation score. Comparison between EMOA-
   δ-TRIMAX, δ-TRIMAX and TRICLUSTER algorithm in terms of affirmation
   score for the artificial dataset 2

Fig. 5.

   Fig. 5
   [96]Open in a new tab

   Comparison in terms of affirmation score. Comparison between EMOA-
   δ-TRIMAX, δ-TRIMAX and TRICLUSTER algorithm in terms of affirmation
   score for the artificial dataset 4

Table 5.

   Comparison between EMOA- δ-TRIMAX, δ-TRIMAX and TRICLUSTER algorithm in
   terms of affirmation score for the artificial dataset 3 (AD3 _a, AD3
   _b, AD3 _c)
   Dataset EMOA- δ-TRIMAX δ-TRIMAX TRICLUSTER
   AD3 _a  1              1        1
   AD3 _b  1              1        1
   AD3 _c  1              1        1
   [97]Open in a new tab

   Moreover we compared the performance of the proposed algorithm with
   that of the existing ones in terms of CPU time. From Fig. [98]6, we can
   see that the proposed algorithm takes relatively more time to retrieve
   one tricluster as Non-dominated Sorting Genetic Algorithm-II (NSGA-II)
   has been used to optimize multiple objectives.

Fig. 6.

   Fig. 6
   [99]Open in a new tab

   Comparison in terms of CPU time. Comparison between EMOA- δ-TRIMAX,
   δ-TRIMAX and TRICLUSTER algorithm in terms of CPU time for the
   artificial datasets 1 (a), 2 and 3 (b)

Robustness of the evolutionary algorithm

   In order to show the robustness of the proposed algorithm, we have used
   the artificial datasets 1 and 2 with different levels of noise
   described above (2.6.1). For each of these datasets, we have run the
   proposed algorithm for 20 times and reported the standard deviations of
   the affirmation scores obtained after each run in Table [100]6 which
   establishes the robustness of the proposed algorithm as in case of each
   of the two datasets, the affirmation scores are very close to the mean.

Table 6.

   Standard deviations of the affirmation scores yielded by the EMOA-
   δ-TRIMAX algorithm for artificial dataset 1 (AD1) and 2 (AD2)
   Noise levels (σ) Standard deviation (AD1) Standard deviation (AD2)
   0                0.05                     0.003
   0.1              0.05                     0.004
   0.3              0.02                     0.02
   0.5              0.005                    0.02
   0.7              0.003                    0.03
   0.9              0.004                    0.02
   [101]Open in a new tab

Results on real-life datasets

   As a data preprocessing step, we have used robust multi-array average
   (RMA) method to normalize the datasets. The values of the input
   parameters of EMOA- δ-TRIMAX are provided in Table [102]7. We have set
   the values of λ and δ of EMOA- δ-TRIMAX and our previously proposed
   δ-TRIMAX algorithms for each of the real-life datasets according to our
   criteria explained in section ‘Results on an artificial dataset’. As
   using default parameter values may often produce poor results, the
   input parameters of other algorithms were tuned in order to obtain
   better results on each of the real-life datasets. Table [103]8 shows
   the percentage of probe ids, replicates and time points that are
   covered by the triclusters obtained with the proposed algorithm.

Table 7.

   Values of input parameters of EMOA- δ-TRIMAX for each of the real-life
   datasets
   Datasets              Dataset 1 Dataset 2 Dataset 3
   λ                     1.2       1.2       1.2
   δ                     0.012382  0.008     0.008754
   Number of generations 100       100       100
   Population Size       100       100       100
   Mutation probability  0.9       0.9       0.9
   [104]Open in a new tab

Table 8.

   Percentage of probe ids, replicates and time points covered by the
   resultant triclusters for each of the real-life datasets
   Datasets                Dataset 1 Dataset 2 Dataset 3
   Coverage of probe ids   99.02 %   88.14 %   93 %
   Coverage of replicates  100 %     100 %     100 %
   Coverage of time points 100 %     100 %     100 %
   [105]Open in a new tab

Convergence of solutions

   In order to show the convergence of solutions towards the Pareto
   optimal front around its center region, we have computed minSum values
   in each generation as follows
   [MATH: <mtext
   mathvariant="italic">minSum</mtext><mo>(</mo><mi>Ψ</mi><mo>)</mo><mo>=<
   /mo><munder><mrow><mo>min</mo></mrow><mrow><mi>x</mi><mo>∈</mo><mi>Ψ</m
   i></mrow></munder><mo>(</mo><msub><mrow><mi>f</mi></mrow><mrow><mn>1</m
   n></mrow></msub><mo>(</mo><mi>x</mi><mo>)</mo><mo>+</mo><mo>(</mo><mn>1
   </mn><mo>−</mo><msub><mrow><mi>f</mi></mrow><mrow><mn>2</mn></mrow></ms
   ub><mo>(</mo><mi>x</mi><mo>)</mo><mo>)</mo><mo>+</mo><mo>(</mo><mn>1</m
   n><mo>−</mo><msub><mrow><mi>f</mi></mrow><mrow><mn>3</mn></mrow></msub>
   <mo>(</mo><mi>x</mi><mo>)</mo><mo>)</mo><mo>)</mo><mo>,</mo> :MATH]
   (8)

   where Ψ denotes the current population and f [1], f [2] and f [3]
   correspond to the objective functions defined in section
   ‘[106]Methods’. We have found that the solutions converge towards a
   Pareto optimal front in case of each of the real-life datasets (Fig.
   [107]7).

Fig. 7.

   Fig. 7
   [108]Open in a new tab

   Convergence of solutions. Convergence of solutions towards the Pareto
   optimal front. minSum values are plotted for Dataset 1 (a), Dataset 2
   (b) and Dataset 3 (c)

Performance comparison

   We have applied our proposed algorithm on the three aforementioned
   real-life datasets and compared its performance with that of other
   triclustering algorithms. For this comparison, we have computed a
   Tricluster Diffusion (TD) score and a Statistical Difference from
   Background (SDB) score [[109]27]. The TD score has been defined by
   equation [110]9.
   [MATH:
   <mi>T</mi><msub><mrow><mi>D</mi></mrow><mrow><mi>i</mi></mrow></msub><m
   o>=</mo><mfrac><mrow><mtext
   mathvariant="italic">MS</mtext><msub><mrow><mi>R</mi></mrow><mrow><mi>i
   </mi></mrow></msub></mrow><mrow><mtext
   mathvariant="italic">Volum</mtext><msub><mrow><mi>e</mi></mrow><mrow><m
   i>i</mi></mrow></msub></mrow></mfrac><mo>,</mo> :MATH]
   (9)

   where M S R [i] and V o l u m e [i] stand for the mean squared residue
   score (see eq. ([111]1)) and for the volume of each resultant
   tricluster i. The volume of the ith tricluster can be defined as (|I
   [i]|∗|J [i]|∗|K [i]|), where |I [i]|, |J [i]| and |K [i]|, represent
   the number of genes, samples and time points of the ith tricluster,
   respectively. A lower TD score represents better quality of
   triclusters. Figures [112]8, [113]9 and [114]10 plot the TD scores (in
   log scale) of the resultant triclusters produced by all algorithms,
   showing that EMOA- δ-TRIMAX yields triclusters having lower TD scores
   than those produced by other algorithms for each of the three datasets.

Fig. 8.

   Fig. 8
   [115]Open in a new tab

   Comparison in terms of Tricluster Diffusion score for Dataset 1.
   Performance comparison between EMOA- δ-TRIMAX, δ-TRIMAX, TRICLUSTER and
   OPTricluster in terms of TD (in log scale) score for Dataset 1

Fig. 9.

   Fig. 9
   [116]Open in a new tab

   Comparison in terms of Tricluster Diffusion score for Dataset 2.
   Performance comparison between EMOA- δ-TRIMAX, δ-TRIMAX and TRICLUSTER
   in terms of TD (in log scale) score for Dataset 2

Fig. 10.

   Fig. 10
   [117]Open in a new tab

   Comparison in terms of Tricluster Diffusion score for Dataset 3.
   Performance comparison between EMOA- δ-TRIMAX, δ-TRIMAX, TRICLUSTER and
   OPTricluster in terms of TD (in log scale) score for Dataset 3

   The statistical difference from background (SDB) score signifies
   whether a set of n triclusters is statistically different from the
   background data matrix. The SDB score is defined by equation [118]10. A
   higher SDB score signifies better performance of the algorithm.
   [MATH: <mtext
   mathvariant="italic">SDB</mtext><mo>=</mo><mfrac><mrow><mn>1</mn></mrow
   ><mrow><mi>n</mi></mrow></mfrac><munderover><mrow><mo>∑</mo></mrow><mro
   w><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow><mi>n</mi></mrow></mundero
   ver><mfrac><mrow><mfrac><mrow><mn>1</mn></mrow><mrow><mi>r</mi></mrow><
   /mfrac><munderover><mrow><mo>∑</mo></mrow><mrow><mi>j</mi><mo>=</mo><mn
   >1</mn></mrow><mrow><mi>r</mi></mrow></munderover><mtext
   mathvariant="italic">RMS</mtext><msub><mrow><mi>R</mi></mrow><mrow><mi>
   j</mi></mrow></msub><mo>−</mo><mtext
   mathvariant="italic">MS</mtext><msub><mrow><mi>R</mi></mrow><mrow><mi>i
   </mi></mrow></msub></mrow><mrow><mtext
   mathvariant="italic">MS</mtext><msub><mrow><mi>R</mi></mrow><mrow><mi>i
   </mi></mrow></msub></mrow></mfrac><mo>,</mo> :MATH]
   (10)

   where n is the total number of triclusters extracted by the algorithm.
   M S R [i] represents the mean squared residue of the ith tricluster
   retrieved by the algorithm and R M S R [j] stands for the mean squared
   residue of the jth random tricluster having the same number of genes,
   experimental samples and time points as the ith resultant tricluster.
   Here a higher value of the numerator indicates a better quality of the
   resultant tricluster. In our study we have set r to 100. OPTricluster
   can not be applied to Dataset 2 as it effectively mines only short time
   series gene expression data having approximately 3-8 time points. From
   Tables [119]9, [120]10 and [121]11 we can observe the highest SDB
   scores for EMOA- δ-TRIMAX algorithm in case of the dataset 1, dataset 2
   and dataset 3.

Table 9.

   Performance comparison between EMOA- δ-TRIMAX, δ-TRIMAX, TRICLUSTER and
   OPTricluster in terms of SDB score for Dataset 1
   Algorithms     SDB score
   EMOA- δ-TRIMAX 2.49851
   δ-TRIMAX       2.140935
   TRICLUSTER     2.094091
   OPTricluster   0.4956035
   [122]Open in a new tab

Table 10.

   Performance comparison between EMOA- δ-TRIMAX, δ-TRIMAX and TRICLUSTER
   in terms of SDB score for Dataset 2
   Algorithms     SDB score
   EMOA- δ-TRIMAX 13.88559
   δ-TRIMAX       12.10529
   TRICLUSTER     7.520363
   [123]Open in a new tab

Table 11.

   Performance comparison between EMOA- δ-TRIMAX, δ-TRIMAX, TRICLUSTER and
   OPTricluster in terms of SDB score for Dataset 3
   Algorithms     SDB score
   EMOA- δ-TRIMAX 9.454915
   δ-TRIMAX       8.945816
   TRICLUSTER     7.076184
   OPTricluster   0.4383489
   [124]Open in a new tab

Biological significance

   KEGG pathway enrichments have been found for each resultant tricluster
   for datasets 1, 2 and 3. To compare the performance of our proposed
   algorithm with that of the other algorithms using KEGG pathway
   enrichment we used a hit score [[125]28]. The hit score for KEGG
   pathway enrichment can be delineated by equation [126]11.
   [MATH: <mtext
   mathvariant="italic">Hit</mtext><mo>(</mo><mi>K</mi><mo>)</mo><mo>=</mo
   ><mfrac><mrow><mo>max</mo><mo>{</mo><mfenced close="|" open="|"
   separators=""><mrow><munderover accent="false"
   accentunder="false"><mrow><mi>N</mi></mrow><mrow><mi>T</mi></mrow><mrow
   ><mn>1</mn></mrow></munderover></mrow></mfenced><mo>,</mo><mfenced
   close="|" open="|" separators=""><mrow><munderover accent="false"
   accentunder="false"><mrow><mi>N</mi></mrow><mrow><mi>T</mi></mrow><mrow
   ><mn>2</mn></mrow></munderover></mrow></mfenced><mo>,</mo><mo>…</mo><mo
   >,</mo><mfenced close="|" open="|" separators=""><mrow><munderover
   accent="false"
   accentunder="false"><mrow><mi>N</mi></mrow><mrow><mi>T</mi></mrow><mrow
   ><mi>n</mi></mrow></munderover></mrow></mfenced><mo>}</mo></mrow><mrow>
   <mfenced close="|" open="|"
   separators=""><mrow><mi>T</mi></mrow></mfenced></mrow></mfrac><mo>,</mo
   > :MATH]
   (11)

   where
   [MATH:
   <msubsup><mrow><mi>N</mi></mrow><mrow><mi>T</mi></mrow><mrow><mi>i</mi>
   </mrow></msubsup> :MATH]
   is the intersection gene set of tricluster T and its enriched KEGG
   pathway term i; |T| is the total number of genes in tricluster T. A
   higher hit score signifies that more genes in T participate in a
   canonical pathway.

   We have observed TFBS enrichment for 98 %, 96 % and 94 % of all
   resultant triclusters for datasets 1, 2 and 3, respectively. We used a
   Hit score (equation ([127]12)) to compare the performance of EMOA-
   δ-TRIMAX with that of other triclustering algorithms using the results
   of TFBS enrichment.
   [MATH: <mtext mathvariant="italic">Hit</mtext><mo>(</mo><mtext
   mathvariant="italic">TF</mtext><mo>)</mo><mo>=</mo><mfrac><mrow><mo>max
   </mo><mo>{</mo><mfenced close="|" open="|"
   separators=""><mrow><munderover accent="false"
   accentunder="false"><mrow><mi>P</mi></mrow><mrow><mi>T</mi></mrow><mrow
   ><mn>1</mn></mrow></munderover></mrow></mfenced><mo>,</mo><mfenced
   close="|" open="|" separators=""><mrow><munderover accent="false"
   accentunder="false"><mrow><mi>P</mi></mrow><mrow><mi>T</mi></mrow><mrow
   ><mn>2</mn></mrow></munderover></mrow></mfenced><mo>,</mo><mo>…</mo><mo
   >,</mo><mfenced close="|" open="|" separators=""><mrow><munderover
   accent="false"
   accentunder="false"><mrow><mi>P</mi></mrow><mrow><mi>T</mi></mrow><mrow
   ><mi>n</mi></mrow></munderover></mrow></mfenced><mo>}</mo></mrow><mrow>
   <mfenced close="|" open="|"
   separators=""><mrow><mi>T</mi></mrow></mfenced></mrow></mfrac><mo>,</mo
   > :MATH]
   (12)

   where
   [MATH:
   <msubsup><mrow><mi>P</mi></mrow><mrow><mi>T</mi></mrow><mrow><mi>i</mi>
   </mrow></msubsup> :MATH]
   is the intersection gene set of tricluster T and its enriched TRANSFAC
   matrix i; |T| is the total number of genes in tricluster T. A higher
   hit score signifies that more genes in T are regulated by a common
   transcription factor.

   At first we have calculated the hit scores Hit(K) and Hit(TF) for each
   resultant tricluster using KEGG pathway and TFBS enrichment results,
   respectively. For each tricluster (T) we generated 100 random gene
   lists having the same size as the tricluster (T). The Hit scores for
   each randomly generated gene list were computed using KEGG pathway and
   TFBS enrichment results. As final step we have applied the
   non-parametric Mann-Whitney-Wilcoxon test to compute the significance
   between these two sets of hit scores in terms of p-values [[128]29].
   From Figs. [129]11 and [130]12 it can be seen that EMOA- δ-TRIMAX
   yields more significant triclusters than the other algorithms in terms
   of hit scores computed from the enriched KEGG pathways and TRANSFAC
   matrices for each of the real-life datasets because a higher percentage
   of triclusters obtained by the proposed algorithm have a smaller
   p-value than those produced by the other algorithms for each of the
   real-life datasets. Particularly striking is the inverse trend of Hit
   Scores in the TFBS enrichment observed with EMOA- δ-TRIMAX, which has
   by far the largest population at the lowest p-values, and the other
   algorithms, where an increasing number of clusters is found with
   increasing p-values (Fig. [131]12 [132]a–[133]c).

Fig. 11.

   Fig. 11
   [134]Open in a new tab

   Comparison in terms of Hit score using KEGG pathway enrichment.
   Performance comparison between EMOA- δ-TRIMAX, δ-TRIMAX, TRICLUSTER and
   OPTricluster in terms of Hit scores for Datasets 1 (a), 2 (b) and 3 (c)

Fig. 12.

   Fig. 12
   [135]Open in a new tab

   Comparison in terms of Hit score using TFBS enrichment. Performance
   comparison between EMOA- δ-TRIMAX, δ-TRIMAX, TRICLUSTER and
   OPTricluster in terms of Hit scores for Datasets 1 (a), 2 (b) and 3 (c)

Importance of clustering biological replicates in 3D gene expression datasets

   Time series microarray experiments are performed to measure the
   expression profiles of genes at a set of time points. At each time
   point, the experiments are often repeated for a certain number of
   times, which in turn yield the expression profiles of the genes over a
   set of biological replicates at each time point. Though the expression
   profiles of these biological replicates are measured at the same time
   point keeping the experimental setup unchanged, peculiarities in
   experimental protocol or physiological variation of the population may
   cause disparity in the expression profiles of technical or biological
   replicates, respectively. Thus grouping those replicates which exhibit
   similar expression profiles might play an important role to identify
   those replicates that behave similarly. This enables us to retrieve
   biologically meaningful information from these samples rather than
   leveling effects by forcing together samples exhibiting dissimilar
   expression profiles. Here, we have tried to unravel the reason of not
   always getting all the replicates as the members of each resultant
   tricluster. In Fig. [136]13, we have plotted the mean of the Euclidean
   distances between the expression profiles of each pair of clustered
   samples over the clustered genes and time points along with that of
   each pair of all replicates. From this figure, we can notice the
   enhancement of the average intra-cluster distances between replicates
   while incorporating the missing replicates into our resultant
   triclusters for each of the real-life datasets. Thus, grouping the
   closest biological replicates improves the quality of the resultant
   triclusters and thus may play instrumental roles in extracting more
   biologically meaningful information. From Fig. [137]13 we can see that
   Dataset 3 has the most divergent replicates. It is not astonishing to
   us as in Dataset 1, the expression profiles were measured during the
   response of a well-controlled cell culture to estrogen; in Dataset 2,
   the expression profiles were monitored during the development
   differentiation of a cell culture over a long time period, with many
   chance to diverge whereas, in case of Dataset 3, replicates correspond
   to six human individuals where the applied interferon-beta elicited
   highly divergent responses.

Fig. 13.

   Fig. 13
   [138]Open in a new tab

   Importance of clustering replicates. Average Euclidean distances
   between the expression profiles of each pair of clustered (red) and all
   (blue, dashed line) replicates over the clustered genes and time points
   for Dataset 1 (a), Dataset 2 (b) and Dataset 3 (c)

Identifying key genes of triclusters and analyzing their roles during hiPSC
differentiation into cardiomyocytes

   To detect key genes, we have first represented each tricluster by its
   eigen-gene and then computed the Pearson correlation coefficient
   between each gene of the tricluster and its eigen-gene. We then ranked
   the probe-ids in descending order of Pearson correlation coefficient.
   We consistently observed that the genes corresponding to, for instance
   the 10 top-most probe-ids exhibited clear functional characteristics
   with relevance for cardiac development (or concomitant processes, see
   below) when being mapped to GOBPs (Gene Ontology Biological Processes)
   or metabolic pathways (from KEGG). Therefore, we considered them as
   “key genes” of that tricluster. Usually, no similarly clear
   categorizations were found for all the genes of one tricluster. For
   instance, if we perform biological process enrichment test using all
   genes of tricluster 64, we would not find the biological processes like
   S-adenosylhomocysteine, lipoprotein metabolic processes as the enriched
   ones. From Fig. [139]14 we can see that the identified key genes of the
   triclusters are highly correlated with the corresponding eigen-gene
   vectors. It has been stated in the original work that cardiomyocyte
   differentiation (dataset 2) occurs during days 0, 3, 7, 10, 14, 20, 28,
   35 whereas days 35, 45, 60, 90, 120 are the post-differentiation time
   points [[140]12]. On day 14, the heart beating was first observed.
   Figure [141]15 summarizes the corresponding GOBPs (Gene Ontology
   Biological Process) and metabolic pathways of the corresponding
   tricluster key genes during different stages of cardiomyocytes
   differentiation. It is of interest to observe enrichment of several
   biosynthetic and metabolic processes such as lipoprotein, naphthalene,
   S-adenosylhomocysteine, serotonin, fucose, putrescine, ketone,
   prostanoid, fatty acid, carbohydrate, spermidine etc. and amine,
   putrescine, folate biosynthetic processes during stem cell
   differentiation into cardiomyocyte. Each of the aforementioned
   metabolic and biosynthetic processes is known to play an instrumental
   role in either heart development or in preventing cardiovascular
   diseases [[142]30–[143]41]. Moreover, the enriched biological processes
   show the parallel occurrence of neural and cardiac development
   [[144]42]. This is not too surprising since a previous study reported
   that the crosstalk between the neuronal and the cardiovascular system
   may play a pivotal role in the development of both systems [[145]43].
   The lists of enriched processes also reveal the occurrences of smooth,
   cardiac and skeletal muscle cell differentiations during cardiomyocyte
   development; this finding is also supported by previous reports
   [[146]44]. Moreover, the instrumental role of the canonical Wnt
   receptor signaling pathway involved in heart development can also be
   deduced from the list of enriched biological processes involved in all
   stages of differentiation. A previous study inferred Wnt signaling
   pathway as an important regulator during cardiomyocyte differentiation
   [[147]45]. Furthermore, through our analysis we have identified the
   enrichment of biological process such as histone H3 acetylation or the
   hippo signaling which are also inferred to be functionally associated
   with the characteristics of hiPSC-derived cardiomyocytes [[148]46,
   [149]47]. The Additional file [150]1: Tables S1-S5 contain the lists of
   enriched GOBPs/ KEGG pathways of the triclusters shown in Fig. [151]15.
   Additionally, in Additional file [152]1: Tables S6-S8, we have enlisted
   genes that are already known to play important roles in cardiovascular
   diseases and development, in addition to genes that are hypothesized to
   be functional in this context by interpreting and associating their
   general biological functions.

Fig. 14.

   Fig. 14
   [153]Open in a new tab

   Average Pearson correlation coefficient (PCC) between key genes.
   Average Pearson correlation coefficient (PCC) between 10 top-most probe
   ids of triclusters and the corresponding eigen-gene vectors during
   different phases of cardiomyocyte differentiation

Fig. 15.

   Fig. 15
   [154]Open in a new tab

   Summarizations of GOBPs and metabolic pathways of the key genes of
   resultant triclusters. Summarization of enriched GOBPs and metabolic
   pathways of the key genes of the mentioned triclusters during hiPSC
   differentiation to cardiomyocytes. Green, red, blue and black colored
   boxes represent the time points Days 0 to 35, Days 35 to 120, Days 14
   to 120 and Days 0 to 120, respectively

Conclusion

   In this work, we have shown that the improved version of our previously
   proposed triclustering algorithm EMOA- δ-TRIMAX outperforms the other
   algorithms when applied to four synthetic datasets as well as on three
   real-life datasets used in this work. Moreover, after retrieving groups
   of co-expressed and co-regulated genes over a subset of samples and
   across a subset of time points from a microarray gene expression
   dataset of hiPSC-derived cardiomyocyte differentiation, using the
   singular value decomposition method we have detected tricluster key
   genes most of which have already been shown or inferred to play
   instrumental roles in cardiac development. Thus, the other identified
   key genes can be hypothesized to be meaningful in this context as well,
   which needs to be experimentally validated. Furthermore, the enriched
   biological processes for the identified key genes of each tricluster
   not only resulted in a set of biological processes, associated with
   stem cell differentiation into cardiomyocytes but also a set of
   metabolic processes, the majority of which are known to play crucial
   roles in preventing cardiac diseases. Thus, the identified metabolic
   processes can be used to provide insights into potential therapeutic
   strategies to the treatment of cardiovascular diseases. Moreover, the
   triclusters for which the identified key genes are found to be involved
   in heart development might be facilitative to unravel regulatory
   mechanisms during different stages of cardiomyocyte development.

Acknowledgements