Abstract

   Metabolic transformations have been reported as involved in neoplasms
   survival. This suggests a role of metabolic pathways as potential
   cancer pharmacological targets. Modulating tumor's energy production
   pathways may become a substantial research area for cancer treatment.
   The significant role of metabolic deregulation as inducing
   transcriptional instabilities and consequently whole-system failure, is
   thus of foremost importance. By using a data integration approach that
   combines experimental evidence for high-throughput genome wide gene
   expression, a non-equilibrium thermodynamics analysis, nonlinear
   correlation networks as well as database mining, we were able to
   outline the role that transcription factors MEF2C and MNDA may have as
   main master regulators in primary breast cancer phenomenology, as well
   as the possible interrelationship between malignancy and metabolic
   dysfunction. The present findings are supported by the analysis of 1191
   whole genome gene expression experiments, as well as probabilistic
   inference of gene regulatory networks, and non-equilibrium
   thermodynamics of such data. Other evidence sources include pathway
   enrichment and gene set enrichment analyses, as well as motif
   comparison with a comprehensive gene regulatory network (of homologue
   genes) in Arabidopsis thaliana. Our key finding is that the
   non-equilibrium free energies provide a realistic description of
   transcription factor activation that when supplemented with gene
   regulatory networks made us able to find deregulated pathways. These
   analyses also suggest a novel potential role of transcription factor
   energetics at the onset of primary tumor development. [31]Results are
   important in the molecular systems biology of cancer field, since
   deregulation and coupling mechanisms between metabolic activity and
   transcriptional regulation can be better understood by taking into
   account the way that master regulators respond to physicochemical
   constraints imposed by different phenotypic conditions.

Introduction

   It is known that tumors could depend on energy production pathways that
   are different from those of normal cells. These unique pathways require
   in some cases the expression and function of so-called tumor-specific
   enzymes. Some of these glycolytic enzymes, as well as other modulators
   of tumor behavior, have recently been analyzed in search for a clue
   that inhibition of such enzymes or appropriate tuning of such
   modulators should deprive tumors of energy, while leaving
   non-transformed cells unaffected. Recent findings seem to point out to
   several so-called metabolic transformations that permit neoplasms
   survival, thus suggesting a role of metabolic pathways as potential
   pharmacological targets [32][1]. In fact, preliminary experiments on
   animals with hepatocellular carcinoma have indeed shown very
   encouraging results. It appears that modulating the energy production
   pathways of tumors is poised to become a substantial research area for
   cancer treatment [33][2]. The role of perturbed local cell energetics
   in association with cancer is not new. In the past, under several
   instances, relationships seem to appear between metabolic variation and
   tumorigenesis, spread and dissemination of malignancy. In recent times
   a growing interest (or best a revival of it) has taken place and
   evidence seem to suggest closer connections than those suspected. For
   instance, the importance of glycolysis in cancer development [34][3].
   It has been discussed how a combination of agents that inhibit both
   energy production and cell signaling may provide a novel and effective
   approach to target pancreatic cancer effectively.

   Thermodynamic studies at the transcriptional [35][4], epigenetic
   [36][5], [37][6], and metabolic [38][7] levels have pointed out to
   energetics as playing a non-trivial role in the onset and development
   of malignancy. In the particular case of this paper, we will focus on
   the relationship between transcriptional de-regulation of a set of
   genes that present transcription factor (TF) and metabolic activity
   (some of them) while at the same time have been associated with the
   presence of breast cancer. We will then study its regulatory and
   thermodynamical behavior by means of gene expression data obtained from
   genome-wide analysis experiments in RNA from biopsy-captured tissue of
   both primary breast cancer and normal breast.

   The role of gene interaction networks have also been extensively
   mentioned in relation to cancer phenomenology, it has been claimed that
   these network effects are, in fact much more important than individual
   gene contributions [39][8]. Some of these networks are indeed related
   to energetic and metabolic processes [40][9], tyrosine-related
   deregulation [41][10], and immunity weakening [42][11]. One usually
   think of tumor cells as having successful mechanisms to evade normal
   control and cell regulation of proliferation and apoptosis. Alterations
   in gene expression have become a better (but far from completely)
   understood component of normal development and disease progression. In
   particular, TFs have become a promising target for therapy. In brief,
   gross alterations in TF regulation would result in cascade triggering
   affecting both the whole cell cycle and the metabolic activity thus
   resulting in possible development of cancer. Many people have come to
   conclude that cancer is a transcriptional disorder disease
   [43][12]–[44][15], while, as we have mentioned other authors have
   recently turned their attention to the metabolic and energetic
   component [45][7], [46][9], hence a possible connection between these
   two approaches could be found in the energetic deregulation Inline
   graphic transcriptional disorder leading both to cascade triggering and
   metabolic disorders related to neoplasm formation and development. For
   these reasons this paper will attempt to model the role of TFs at both
   the energetics (thermodynamic) level and the network approach.

Analysis

   One of the cornerstones of contemporary genomic studies, in particular
   of the systems biology approach, is data integration (DI). DI is useful
   to make sense out of the extremely large corpus of experimental
   evidence given, for instance by genome-wide expression analysis. With
   the continuous advent of novel techniques in high throughput molecular
   biology and the ‘omics maybe just one thing has been established:
   Complex biological systems need to be studied from several standpoints
   to unveil the actual mechanisms behind them. In the present case, our
   aim is to sketch some hints for a proposal of functional mechanisms
   behind gene expression in cancer and cell energetics. The analysis
   work-flow for the present study was as follows (see also [47]Figure 1):

Figure 1. Flow chart design for this work.

   [48]Figure 1
   [49]Open in a new tab

   It is noticeable that this is an hybrid model which incorporates both
   data-driven discovery (gene expression analysis, gene-set enrichment
   and probabilistic network inference) and hypothesis driven inquiries
   (data mining and non-equilibrium thermodynamics modeling).
    1. Statistical pre-processing of the microarray gene expression data.
    2. Determination of differentially expressed genes and statistical
       significance assessment.
    3. Data mining for functional features within the statistically
       significant differential expression gene set.
    4. Non-equilibrium thermodynamics calculations ([50]Figures 2, [51]3,
       [52]4, [53]5).
    5. Probabilistic inference of gene regulatory networks.
    6. Pathway statistical enrichment analysis.
    7. Search for common non-linear correlations found for human MEF2C in
       this work ([54]Figure 6) that are present also in a highly curated
       A. Thaliana transcription factor database, indicating modular
       conservation among species.
    8. Gene Set Enrichment [55]Analysis applied to the 1191 samples
       expression matrix to look up for dysregulated functions and
       pathways as a complement for the gene analysis in cancer and
       metabolic pathways ([56]Figure 7 and [57]Figure 8).

Figure 2. Gene Expression Intensity profile.

   [58]Figure 2
   [59]Open in a new tab

   We can notice strong stochasticity in the signals. However, a definite
   background tendency may be identified, as it may be clearer when
   examining the concentration profile.

Figure 3. mRNA concentration profile.

   [60]Figure 3
   [61]Open in a new tab

   In spite of strong stochasticity, we can notice some trends. For
   instance, there are some genes with a low concentration and low
   variability while others present larger concentrations and
   variabilities.

Figure 4. Transcriptional affinities profile.

   [62]Figure 4
   [63]Open in a new tab

   , a strong stochastic behavior can be noted, however when considering
   the associated chemical potentials of transcription (see [64]Fig. 5)
   definite trends arise.

Figure 5. Chemical potentials of transcription profile.

   [65]Figure 5
   [66]Open in a new tab

   As in the case of the concentration profiles we can notice some trends
   in spite of stochasticity. There are some genes with a low
   concentration and low variability while others present larger
   concentrations and variabilities.

Figure 6. Inferred gene regulatory network [67][10], [68][16].

   Figure 6
   [69]Open in a new tab

   Links colored in red represent interactions associated (in the
   literature, from data mining) with breast cancer, yellow links are
   interactions associated with other types of cancer, turquoise links are
   interactions associated with metabolic disorders and navy blue links
   are otherwise. We could notice that some genes are regulated by more
   than one of these transcription factors as is the case with RAD52,
   ADH1C, OIP5, ELK4, PEX10, GAA, FTHP1 and ADAP1 which are co-regulated
   by MEF2C and SMAD3; of STMN1, STAU1 and the c10ORF10 transcript which
   are co-regulated by SMAD3 and MNDA; of CSH1, FANCI, FHIT, CDKN2A, PRC1,
   CENPF, MAP3K1, BNIP3, HCLS1, TAF15, PCBP2, SPAI7, SLC38A1, PASK,
   HIST1H2BD, POLR2I, UPK3B, EHD1, PRICKLE3, LOC91316, ANAPC2, LST1,
   RRP12, c9ORF6, BRP44, CLEC4A, TMEM194A, TPO3B, HIST1H2AM and ZFP37
   which are co-regulated by MEF2C and MNDA; and by c14ORF1, DOK5, AFF1,
   and CADM4 which are co-regulated by MEF2C and POU2AF1. In the case of
   these double-regulatory interactions there are some cases in which both
   regulatory interactions are associated with a certain phenotype and
   other cases in which different (or no) phenotype is associated for each
   link.

Figure 7. Cancer related deregulated pathways.

   [70]Figure 7
   [71]Open in a new tab

   Statistical enrichment of deregulated pathways within our differential
   gene-sets include canonical cancer pathways such as DNA damage repair,
   AMPK and RAS. Molecules marked with a red star correspond to
   differentially expressed genes in the cancer/control contrast.

Figure 8. Metabolism related deregulated pathways.

   [72]Figure 8
   [73]Open in a new tab

   Statistical enrichment of deregulated pathways within our differential
   gene-sets include metabolic pathways. For instance, both branches of
   the cholesterol biosynthesis pathway are affected. Molecules marked
   with a red star correspond to differentially expressed genes in the
   cancer/control contrast.

Differentially expressed genes

   After pre-processing (background correction, normalization and
   summarization) of the samples [74][16] according to the RMA algorithm
   [75][17], we proceeded to implement a statistical analysis by using
   linear modeling (limma) to look up for significant differentially
   expressed genes (full data matrix available upon request). Empirical
   Bayes and other shrinkage methods are used to borrow information across
   genes making the analyses stable even for experiments with small number
   of arrays. This method allows very general experiments to be analyzed
   as easily as single replica experiments. The approach requires two
   matrices, the first one is called the design matrix which gives a
   representation of the different RNA targets which have been hybridized
   to the arrays. The second one, or contrast matrix which allows the
   coefficients defined by the design matrix to be combined into contrasts
   of interest. Each contrast corresponds to a comparison of interest
   between the RNA targets [76][18].

Data mining for metabolic and transcription factor activity

   Once we had a set of differentially expressed genes, we proceeded to
   implement a data mining search over it. Search parameters include the
   following constraints:
     * Genes that are well known transcription factors, reported not only
       by sequence homology but also by actual experimental evidence.
     * Genes that have been associated in the literature with the presence
       of breast cancer (higher scores) or any other tumors -liquid
       neoplasms were excluded- (lower scores).
     * Genes whose protein products are related to cellular level
       metabolic pathways.
     * Genes whose transcripts possess a complete physicochemical
       characterization, e.g. Affymetrix® calibration probes have reported
       free energies of formation.

   From the set of genes included in the GeneChip® under study (namely
   Affymetrix HGU133-A) which were statistically significant in their
   differential expression between tumors and controls, we built sets that
   satisfy the aforementioned constraints. Then we made the intersection
   set of all these. This set, that we will call hereon a Core set
   consisted in four genes, namely MNDA, POU2AF1, MEF2C and SMAD3. In what
   follows we will analyze in detail the non-equilibrium thermodynamics of
   transcription as well as the regulatory network structure of such genes
   within a sample set of 1191 microarrays.

Non-equilibrium thermodynamic model

   On general grounds, the finding of relevant genes associated with a
   cancer phenotype is based on determining features such as differential
   expression patterns. However, two important issues that should be also
   taken into account besides the gene expression levels, are the
   energetics within the cell and the physicochemical properties of the
   biomolecules involved in transcriptional regulation (transcription
   factors in particular). Both features could be responsible for TF
   activity since they affect the mechanism behind the activation of
   target genes according to previous specific cellular energetic
   conditions.

   An interesting trend in the transcriptional energetics in some
   well-studied genes [77][19] is that, the values of the activation
   energies are in general lower for genes that act as transcription
   factors and higher for genes with no-known TF-activity. The
   physicochemical meaning of this finding seem to point-out to
   transcription factors as genes whose expression is regulated by lower
   activation-energy barriers. Since TF's are involved in the
   transcriptional activation of other genes, it is expected that they are
   synthesized first when energy is started to being released by metabolic
   processes in the cell. Transcriptional targets should, in general be
   synthesized later and with higher activation energies. These higher
   saturation limits for the chemical potentials of TFs suggest both
   stability and spontaneity in the expression of these as compared to
   target genes.

   In brief, since master TFs are needed in early stages of whole-genome
   transcription (i.e. upstream in the regulatory cascade) in order to
   kick-start such processes and, by taking into account that
   transcriptional regulation has been characterized as an activated
   process [78][20]; a hierarchy in the synthesis of mRNA templates (and,
   further on, in their product proteins) is established in terms of the
   corresponding activation energies -as given by their chemical
   potentials- and of the availability of free energy in the cellular
   environment. Then, by calculating thermodynamic properties for TFs, it
   is possible to unveil the order or priority in their activation which
   is dependent on energy accessibility.

   A non-equilibrium thermodynamical theory of gene regulation has already
   been proposed [79][20]. As it was shown the thermodynamic analysis of
   transcriptional regulation presents several challenges, in particular
   associated with the fact that the cell is a small system, in the sense
   that the role of fluctuations and noise plays a rather fundamental role
   for its characterization. Systems outside the domain of the
   thermodynamic limit are characterized by large fluctuations and hence
   stochastic effects need to be taken into account. An extremely
   important conundrum in contemporary thermal physics lies in the
   connections between probability and thermodynamics. A developed theory
   exists however, called mesoscopic nonequilibrium thermodynamics (MNET)
   [80][21] which specifically addresses the issue by considering the
   stochastic nature of the time evolution of small non-equilibrium
   systems, in a context which is extremely close to our work. To account
   for stochasticity one needs to recognize that scaling down the
   description of a physical system brings up energy contributions that
   are commonly neglected in thermodynamical descriptions. The
   time-evolution of these systems could be described as a generalized
   diffusion process over a potential landscape in the space of mesoscopic
   variables. This process is driven by a generalized
   mesoscopic-thermodynamic force whose stochastic origin could be tracked
   back by means of, for example, a Fokker-Planck-like analysis [81][21].
   These classes of formalisms are appropriated in the case of activated
   processes, for instance, a system crossing a potential barrier. The
   complex biochemical reactions involved in transcriptional regulation
   belong also to this category.

   The present theoretical framework [82][20] shares similar ideas with
   MNET (although treated in a less formal way due to present
   unavailability of information regarding the non-local probability
   distributions) and deals with intensity levels of gene activity as
   measured in whole genome gene expression profiling on GeneChips. It is
   based on the thermodynamics of hybridization [83][22] that considers a
   basic two-state model that quantifies mRNA concentrations by
   competitive hybridization [84][23], [85][24]. This non-equilibrium
   thermodynamical theory has been used to study the role of transcription
   factors in the phenomena of anomalous transcriptional bursts [86][19],
   [87][25].

   As is usual in non-equilibrium thermodynamics we will assume that a
   generalized entropy-like function Inline graphic exists, which may be
   written in the form [88][26], [89][27]:
   graphic file with name pone.0042678.e003.jpg (1)

   [90]Eq. 1 is a formal extension of the Gibbs relation of equilibrium
   thermodynamics. The quantities appearing are as usual: Inline graphic
   is the internal energy, Inline graphic is the local temperature, Inline
   graphic and Inline graphic the pressure and volume, Inline graphic is
   the chemical potential for the Inline graphic -th mRNA species. Inline
   graphic and Inline graphic are extended thermodynamical fluxes and
   forces [91][26]. For a multicomponent mRNA mixture (under fixed volume
   and pressure), the set of relevant variables consists in the
   temperature Inline graphic and concentration of each gene species
   Inline graphic as the slow varying (classical) parameters set and the
   mass flux of these species Inline graphic and their corresponding
   forces Inline graphic as fast variables. These latter variables will
   take into account the presence of inhomogeneous regions (concentration
   domains formed because of the gene regulatory interactions) to correct
   the predictions based on the local equilibrium hypothesis.

   The non-equilibrium Gibbs free energy for a mixture of Inline graphic ,
   mRNA transcripts reads:
   graphic file with name pone.0042678.e017.jpg (2)

   Quantities are local fields defined as usual, (e.g. Inline graphic ),
   within the mentioned formalism one can consider that a generalized
   entropy-like function Inline graphic exists [92][26], [93][27], also T
   is the temperature, p the pressure, Inline graphic the chemical
   potential, etc.; Inline graphic is the concentration for species Inline
   graphic , Inline graphic , with Inline graphic the absolute temperature
   and Inline graphic the gas constant, Inline graphic is the free energy
   of hybridization of Inline graphic , Inline graphic is a parameter that
   sets the scale of intensity [94][22] corresponding to the saturation
   limit Inline graphic ; Inline graphic and Inline graphic are extended
   thermodynamical fluxes and forces that take into account non-local
   effects.

   If we recall from reference [95][20] given the relation between gene
   expression intensity Inline graphic and concentration Inline graphic ,
   we have that:
   graphic file with name pone.0042678.e034.jpg (3)

   After this, a proposal on the form for the extended fluxes and forces
   should be given. Hence, we are proposing a system of linear (in the
   forces) coupled fluxes with memory [96][20].

   The constitutive equations are,
   graphic file with name pone.0042678.e035.jpg (4)
   graphic file with name pone.0042678.e036.jpg (5)

   The Inline graphic 's are time-independent amplitudes, Inline graphic
   is a unit vector in the direction of mass flow (the nature of Inline
   graphic will not affect the rest of our description, since we will be
   dealing with the magnitude of the mass flux Inline graphic ) and Inline
   graphic 's are relaxation times considered path-independent scalars.
   Since we have a linear relation between thermodynamic fluxes and forces
   some features of the Onsager-Casimir formalism will still hold.

   Irreversible coupling is given by [97]Eq. 4 and [98]5, nevertheless due
   to the fact that actual transcription measurement experiments are made
   either on homeostasis (steady state) settings or within time series
   designs with intervals several orders of magnitude larger than the
   associated relaxation times (which are of the order of a few molecular
   collision times) it is possible to take the limits Inline graphic and
   Inline graphic , then the integrals become evaluated delta functions to
   give:
   graphic file with name pone.0042678.e044.jpg (6)
   graphic file with name pone.0042678.e045.jpg (7)

   Also due to the spatial nature of the experimental measurements (either
   RNA blots or DNA/RNA chips measure space-averaged mRNA concentrations)
   it is possible to work with the related scalar quantities instead, to
   give:
   graphic file with name pone.0042678.e046.jpg (8)
   graphic file with name pone.0042678.e047.jpg (9)

   Substituting [99]Eq. 8 and [100]9 into [101]Eq. 2 one gets:
   graphic file with name pone.0042678.e048.jpg (10)

   If we assume that the generalized transport coefficient Inline graphic
   is independent of the flux Inline graphic we can write:
   graphic file with name pone.0042678.e051.jpg (11)

   Or in terms of the transcription regulation chemical potentials Inline
   graphic
   graphic file with name pone.0042678.e053.jpg (12)

   In the constant transport coefficient approximation, [102]Eq. 12 reads:
   graphic file with name pone.0042678.e054.jpg (13)

   Defining generalized transport coefficients Inline graphic .

   If we change variables in [103]equation 2 from concentration to gene
   expression intensity (by using [104]equation 3) and introduce the
   fluxes, forces and generalized transport coefficients, we could
   rephrase it as follows:
   graphic file with name pone.0042678.e056.jpg (14)

   The resulting affinity of transcription i.e. the thermodynamic
   conjugate variable to the probe intensity Inline graphic is given by:
   graphic file with name pone.0042678.e058.jpg (15)

   The quantity Inline graphic plays the role of a chemical affinity for
   the gene expression process. As gene expression is a process that
   follows thermal activation kinetics, Inline graphic is a temperature
   dependent variable. This dependency shows up both explicitly (by the
   value Inline graphic ) and also due to indirect temperature dependency
   given by the saturation constants Inline graphic . Inline graphic are
   coefficients related to gene cross-regulation and Inline graphic are
   the chemical potentials associated with transcriptional regulation
   [105][19], [106][20].

   The related chemical potential of transcription [107][20] is given by:
   graphic file with name pone.0042678.e065.jpg (16)

   Since we have reliable experimental data for the expression levels
   Inline graphic from 1191 whole-genome gene expression experiments, it
   is possible to calculate the gene transcriptional affinities and
   chemical potentials for the set of genes of interest from
   [108]equations 15 and [109]16 (we also have good values for the
   constants Inline graphic taken from spike-in experimental data provided
   by the gene-chip manufacturer). Since these experiments have been made
   without the use of any knock-out or knock-down techniques (i.e. all
   genes are subject to their corresponding regulatory interactions), the
   gene expression levels Inline graphic used to calculate the chemical
   potentials and transcriptional affinities have already incorporated
   (although in an implicit way) the effect of transcriptional regulation
   as given by the gene regulatory mechanisms depicted in the third term
   at the r.h.s. of [110]equation 14.

Non-linear correlation inference of regulatory networks

   To deconvolute a Gene Regulatory Network (GRN) related to primary
   breast cancer we applied a methodology based on a local pattern-sharing
   measure as surrogate to actual gene-gene interactions to our
   dataset(see [111]Materials and Methods - Inline graphic Experimental
   datasets). The goal of deconvolution methods is inferring GRNs based on
   statistical dependencies within the joint probability distribution of
   gene expression for all genes within a given gene set. Typical means to
   reach this goal consist in the quantification of the new information
   content that arise when we look at the full joint probability
   distribution when compared to a series of successive independence
   approximations. According with the method given in reference [112][10],
   we calculated measures of non-linear correlation between the normalized
   expression values (22238 probesets) and the core set of 4 genes.
   Threshold-analysis was made on such measures to look up for statistical
   significance in the inference and based on their IBS index value we
   selected a set of 712 genes. The optimal network (within the given
   approximations) was found by a Maximum Entropy Method as is shown
   elsewhere [113][10] and was validated by several (mostly in silico and
   database-mining) methods [114][16].

Biochemical pathway statistical enrichment analysis

   Reactome [115][28] biological/biochemical pathway over-representation
   analysis was performed to determine the Reactome pathways in which gene
   IDs in our list were strongly enriched. Reactome is a an open-source,
   open access, manually curated and peer-reviewed pathway database that
   may help to understand the biological context of genomic data.
   Significance assessment was also made by means of ‘urn model’
   hypergeometric distribution tests.

Materials and Methods

Experimental datasets

   A curated set of 1191 whole genome gene expression profiles was
   generated [116][16] from datasets for several publicly available
   experiments deposited in the GEO database [117][29]. These experiments
   were performed in total mRNA extracted under the [118]GPL96 protocol
   [119][30] which is based on the Affymetrix HGU133A microarray GeneChip
   platform. In the case of experiments including some kind of treatment
   or cell modification we only took the unaltered samples to include them
   in our analyses. Details are given in [120]Table 1. Further information
   is available in the corresponding GEO entries and/or may be available
   upon request. In the case of human mRNA samples taken directly from
   organ tissue (by a biopsy) and not from cultured cell-lines, it is
   extremely difficult to design time-course experiments. Thus, in order
   to study a surrogate model of transcriptional de-regulation, we
   proposed the following alternative to look for correlations: After
   quality control pre-processing, background correction and normalization
   of the microarrays, the samples were prioritized (ordered) according to
   their BNIP3 (Affymetrix-probe ID 201848_s_at) expression level. BNIP3
   is a well known marker of progression and malignancy in primary breast
   cancer that correlates both with lab tests and clinical trials
   [121][31]. By ordering the independent, steady-state samples in this
   manner it is now possible to look up for correlation patterns of gene
   expression.

Table 1. GEO [122][29] identifier and references for the Microarray