Abstract Metabolic transformations have been reported as involved in neoplasms survival. This suggests a role of metabolic pathways as potential cancer pharmacological targets. Modulating tumor's energy production pathways may become a substantial research area for cancer treatment. The significant role of metabolic deregulation as inducing transcriptional instabilities and consequently whole-system failure, is thus of foremost importance. By using a data integration approach that combines experimental evidence for high-throughput genome wide gene expression, a non-equilibrium thermodynamics analysis, nonlinear correlation networks as well as database mining, we were able to outline the role that transcription factors MEF2C and MNDA may have as main master regulators in primary breast cancer phenomenology, as well as the possible interrelationship between malignancy and metabolic dysfunction. The present findings are supported by the analysis of 1191 whole genome gene expression experiments, as well as probabilistic inference of gene regulatory networks, and non-equilibrium thermodynamics of such data. Other evidence sources include pathway enrichment and gene set enrichment analyses, as well as motif comparison with a comprehensive gene regulatory network (of homologue genes) in Arabidopsis thaliana. Our key finding is that the non-equilibrium free energies provide a realistic description of transcription factor activation that when supplemented with gene regulatory networks made us able to find deregulated pathways. These analyses also suggest a novel potential role of transcription factor energetics at the onset of primary tumor development. [31]Results are important in the molecular systems biology of cancer field, since deregulation and coupling mechanisms between metabolic activity and transcriptional regulation can be better understood by taking into account the way that master regulators respond to physicochemical constraints imposed by different phenotypic conditions. Introduction It is known that tumors could depend on energy production pathways that are different from those of normal cells. These unique pathways require in some cases the expression and function of so-called tumor-specific enzymes. Some of these glycolytic enzymes, as well as other modulators of tumor behavior, have recently been analyzed in search for a clue that inhibition of such enzymes or appropriate tuning of such modulators should deprive tumors of energy, while leaving non-transformed cells unaffected. Recent findings seem to point out to several so-called metabolic transformations that permit neoplasms survival, thus suggesting a role of metabolic pathways as potential pharmacological targets [32][1]. In fact, preliminary experiments on animals with hepatocellular carcinoma have indeed shown very encouraging results. It appears that modulating the energy production pathways of tumors is poised to become a substantial research area for cancer treatment [33][2]. The role of perturbed local cell energetics in association with cancer is not new. In the past, under several instances, relationships seem to appear between metabolic variation and tumorigenesis, spread and dissemination of malignancy. In recent times a growing interest (or best a revival of it) has taken place and evidence seem to suggest closer connections than those suspected. For instance, the importance of glycolysis in cancer development [34][3]. It has been discussed how a combination of agents that inhibit both energy production and cell signaling may provide a novel and effective approach to target pancreatic cancer effectively. Thermodynamic studies at the transcriptional [35][4], epigenetic [36][5], [37][6], and metabolic [38][7] levels have pointed out to energetics as playing a non-trivial role in the onset and development of malignancy. In the particular case of this paper, we will focus on the relationship between transcriptional de-regulation of a set of genes that present transcription factor (TF) and metabolic activity (some of them) while at the same time have been associated with the presence of breast cancer. We will then study its regulatory and thermodynamical behavior by means of gene expression data obtained from genome-wide analysis experiments in RNA from biopsy-captured tissue of both primary breast cancer and normal breast. The role of gene interaction networks have also been extensively mentioned in relation to cancer phenomenology, it has been claimed that these network effects are, in fact much more important than individual gene contributions [39][8]. Some of these networks are indeed related to energetic and metabolic processes [40][9], tyrosine-related deregulation [41][10], and immunity weakening [42][11]. One usually think of tumor cells as having successful mechanisms to evade normal control and cell regulation of proliferation and apoptosis. Alterations in gene expression have become a better (but far from completely) understood component of normal development and disease progression. In particular, TFs have become a promising target for therapy. In brief, gross alterations in TF regulation would result in cascade triggering affecting both the whole cell cycle and the metabolic activity thus resulting in possible development of cancer. Many people have come to conclude that cancer is a transcriptional disorder disease [43][12]–[44][15], while, as we have mentioned other authors have recently turned their attention to the metabolic and energetic component [45][7], [46][9], hence a possible connection between these two approaches could be found in the energetic deregulation Inline graphic transcriptional disorder leading both to cascade triggering and metabolic disorders related to neoplasm formation and development. For these reasons this paper will attempt to model the role of TFs at both the energetics (thermodynamic) level and the network approach. Analysis One of the cornerstones of contemporary genomic studies, in particular of the systems biology approach, is data integration (DI). DI is useful to make sense out of the extremely large corpus of experimental evidence given, for instance by genome-wide expression analysis. With the continuous advent of novel techniques in high throughput molecular biology and the ‘omics maybe just one thing has been established: Complex biological systems need to be studied from several standpoints to unveil the actual mechanisms behind them. In the present case, our aim is to sketch some hints for a proposal of functional mechanisms behind gene expression in cancer and cell energetics. The analysis work-flow for the present study was as follows (see also [47]Figure 1): Figure 1. Flow chart design for this work. [48]Figure 1 [49]Open in a new tab It is noticeable that this is an hybrid model which incorporates both data-driven discovery (gene expression analysis, gene-set enrichment and probabilistic network inference) and hypothesis driven inquiries (data mining and non-equilibrium thermodynamics modeling). 1. Statistical pre-processing of the microarray gene expression data. 2. Determination of differentially expressed genes and statistical significance assessment. 3. Data mining for functional features within the statistically significant differential expression gene set. 4. Non-equilibrium thermodynamics calculations ([50]Figures 2, [51]3, [52]4, [53]5). 5. Probabilistic inference of gene regulatory networks. 6. Pathway statistical enrichment analysis. 7. Search for common non-linear correlations found for human MEF2C in this work ([54]Figure 6) that are present also in a highly curated A. Thaliana transcription factor database, indicating modular conservation among species. 8. Gene Set Enrichment [55]Analysis applied to the 1191 samples expression matrix to look up for dysregulated functions and pathways as a complement for the gene analysis in cancer and metabolic pathways ([56]Figure 7 and [57]Figure 8). Figure 2. Gene Expression Intensity profile. [58]Figure 2 [59]Open in a new tab We can notice strong stochasticity in the signals. However, a definite background tendency may be identified, as it may be clearer when examining the concentration profile. Figure 3. mRNA concentration profile. [60]Figure 3 [61]Open in a new tab In spite of strong stochasticity, we can notice some trends. For instance, there are some genes with a low concentration and low variability while others present larger concentrations and variabilities. Figure 4. Transcriptional affinities profile. [62]Figure 4 [63]Open in a new tab , a strong stochastic behavior can be noted, however when considering the associated chemical potentials of transcription (see [64]Fig. 5) definite trends arise. Figure 5. Chemical potentials of transcription profile. [65]Figure 5 [66]Open in a new tab As in the case of the concentration profiles we can notice some trends in spite of stochasticity. There are some genes with a low concentration and low variability while others present larger concentrations and variabilities. Figure 6. Inferred gene regulatory network [67][10], [68][16]. Figure 6 [69]Open in a new tab Links colored in red represent interactions associated (in the literature, from data mining) with breast cancer, yellow links are interactions associated with other types of cancer, turquoise links are interactions associated with metabolic disorders and navy blue links are otherwise. We could notice that some genes are regulated by more than one of these transcription factors as is the case with RAD52, ADH1C, OIP5, ELK4, PEX10, GAA, FTHP1 and ADAP1 which are co-regulated by MEF2C and SMAD3; of STMN1, STAU1 and the c10ORF10 transcript which are co-regulated by SMAD3 and MNDA; of CSH1, FANCI, FHIT, CDKN2A, PRC1, CENPF, MAP3K1, BNIP3, HCLS1, TAF15, PCBP2, SPAI7, SLC38A1, PASK, HIST1H2BD, POLR2I, UPK3B, EHD1, PRICKLE3, LOC91316, ANAPC2, LST1, RRP12, c9ORF6, BRP44, CLEC4A, TMEM194A, TPO3B, HIST1H2AM and ZFP37 which are co-regulated by MEF2C and MNDA; and by c14ORF1, DOK5, AFF1, and CADM4 which are co-regulated by MEF2C and POU2AF1. In the case of these double-regulatory interactions there are some cases in which both regulatory interactions are associated with a certain phenotype and other cases in which different (or no) phenotype is associated for each link. Figure 7. Cancer related deregulated pathways. [70]Figure 7 [71]Open in a new tab Statistical enrichment of deregulated pathways within our differential gene-sets include canonical cancer pathways such as DNA damage repair, AMPK and RAS. Molecules marked with a red star correspond to differentially expressed genes in the cancer/control contrast. Figure 8. Metabolism related deregulated pathways. [72]Figure 8 [73]Open in a new tab Statistical enrichment of deregulated pathways within our differential gene-sets include metabolic pathways. For instance, both branches of the cholesterol biosynthesis pathway are affected. Molecules marked with a red star correspond to differentially expressed genes in the cancer/control contrast. Differentially expressed genes After pre-processing (background correction, normalization and summarization) of the samples [74][16] according to the RMA algorithm [75][17], we proceeded to implement a statistical analysis by using linear modeling (limma) to look up for significant differentially expressed genes (full data matrix available upon request). Empirical Bayes and other shrinkage methods are used to borrow information across genes making the analyses stable even for experiments with small number of arrays. This method allows very general experiments to be analyzed as easily as single replica experiments. The approach requires two matrices, the first one is called the design matrix which gives a representation of the different RNA targets which have been hybridized to the arrays. The second one, or contrast matrix which allows the coefficients defined by the design matrix to be combined into contrasts of interest. Each contrast corresponds to a comparison of interest between the RNA targets [76][18]. Data mining for metabolic and transcription factor activity Once we had a set of differentially expressed genes, we proceeded to implement a data mining search over it. Search parameters include the following constraints: * Genes that are well known transcription factors, reported not only by sequence homology but also by actual experimental evidence. * Genes that have been associated in the literature with the presence of breast cancer (higher scores) or any other tumors -liquid neoplasms were excluded- (lower scores). * Genes whose protein products are related to cellular level metabolic pathways. * Genes whose transcripts possess a complete physicochemical characterization, e.g. Affymetrix® calibration probes have reported free energies of formation. From the set of genes included in the GeneChip® under study (namely Affymetrix HGU133-A) which were statistically significant in their differential expression between tumors and controls, we built sets that satisfy the aforementioned constraints. Then we made the intersection set of all these. This set, that we will call hereon a Core set consisted in four genes, namely MNDA, POU2AF1, MEF2C and SMAD3. In what follows we will analyze in detail the non-equilibrium thermodynamics of transcription as well as the regulatory network structure of such genes within a sample set of 1191 microarrays. Non-equilibrium thermodynamic model On general grounds, the finding of relevant genes associated with a cancer phenotype is based on determining features such as differential expression patterns. However, two important issues that should be also taken into account besides the gene expression levels, are the energetics within the cell and the physicochemical properties of the biomolecules involved in transcriptional regulation (transcription factors in particular). Both features could be responsible for TF activity since they affect the mechanism behind the activation of target genes according to previous specific cellular energetic conditions. An interesting trend in the transcriptional energetics in some well-studied genes [77][19] is that, the values of the activation energies are in general lower for genes that act as transcription factors and higher for genes with no-known TF-activity. The physicochemical meaning of this finding seem to point-out to transcription factors as genes whose expression is regulated by lower activation-energy barriers. Since TF's are involved in the transcriptional activation of other genes, it is expected that they are synthesized first when energy is started to being released by metabolic processes in the cell. Transcriptional targets should, in general be synthesized later and with higher activation energies. These higher saturation limits for the chemical potentials of TFs suggest both stability and spontaneity in the expression of these as compared to target genes. In brief, since master TFs are needed in early stages of whole-genome transcription (i.e. upstream in the regulatory cascade) in order to kick-start such processes and, by taking into account that transcriptional regulation has been characterized as an activated process [78][20]; a hierarchy in the synthesis of mRNA templates (and, further on, in their product proteins) is established in terms of the corresponding activation energies -as given by their chemical potentials- and of the availability of free energy in the cellular environment. Then, by calculating thermodynamic properties for TFs, it is possible to unveil the order or priority in their activation which is dependent on energy accessibility. A non-equilibrium thermodynamical theory of gene regulation has already been proposed [79][20]. As it was shown the thermodynamic analysis of transcriptional regulation presents several challenges, in particular associated with the fact that the cell is a small system, in the sense that the role of fluctuations and noise plays a rather fundamental role for its characterization. Systems outside the domain of the thermodynamic limit are characterized by large fluctuations and hence stochastic effects need to be taken into account. An extremely important conundrum in contemporary thermal physics lies in the connections between probability and thermodynamics. A developed theory exists however, called mesoscopic nonequilibrium thermodynamics (MNET) [80][21] which specifically addresses the issue by considering the stochastic nature of the time evolution of small non-equilibrium systems, in a context which is extremely close to our work. To account for stochasticity one needs to recognize that scaling down the description of a physical system brings up energy contributions that are commonly neglected in thermodynamical descriptions. The time-evolution of these systems could be described as a generalized diffusion process over a potential landscape in the space of mesoscopic variables. This process is driven by a generalized mesoscopic-thermodynamic force whose stochastic origin could be tracked back by means of, for example, a Fokker-Planck-like analysis [81][21]. These classes of formalisms are appropriated in the case of activated processes, for instance, a system crossing a potential barrier. The complex biochemical reactions involved in transcriptional regulation belong also to this category. The present theoretical framework [82][20] shares similar ideas with MNET (although treated in a less formal way due to present unavailability of information regarding the non-local probability distributions) and deals with intensity levels of gene activity as measured in whole genome gene expression profiling on GeneChips. It is based on the thermodynamics of hybridization [83][22] that considers a basic two-state model that quantifies mRNA concentrations by competitive hybridization [84][23], [85][24]. This non-equilibrium thermodynamical theory has been used to study the role of transcription factors in the phenomena of anomalous transcriptional bursts [86][19], [87][25]. As is usual in non-equilibrium thermodynamics we will assume that a generalized entropy-like function Inline graphic exists, which may be written in the form [88][26], [89][27]: graphic file with name pone.0042678.e003.jpg (1) [90]Eq. 1 is a formal extension of the Gibbs relation of equilibrium thermodynamics. The quantities appearing are as usual: Inline graphic is the internal energy, Inline graphic is the local temperature, Inline graphic and Inline graphic the pressure and volume, Inline graphic is the chemical potential for the Inline graphic -th mRNA species. Inline graphic and Inline graphic are extended thermodynamical fluxes and forces [91][26]. For a multicomponent mRNA mixture (under fixed volume and pressure), the set of relevant variables consists in the temperature Inline graphic and concentration of each gene species Inline graphic as the slow varying (classical) parameters set and the mass flux of these species Inline graphic and their corresponding forces Inline graphic as fast variables. These latter variables will take into account the presence of inhomogeneous regions (concentration domains formed because of the gene regulatory interactions) to correct the predictions based on the local equilibrium hypothesis. The non-equilibrium Gibbs free energy for a mixture of Inline graphic , mRNA transcripts reads: graphic file with name pone.0042678.e017.jpg (2) Quantities are local fields defined as usual, (e.g. Inline graphic ), within the mentioned formalism one can consider that a generalized entropy-like function Inline graphic exists [92][26], [93][27], also T is the temperature, p the pressure, Inline graphic the chemical potential, etc.; Inline graphic is the concentration for species Inline graphic , Inline graphic , with Inline graphic the absolute temperature and Inline graphic the gas constant, Inline graphic is the free energy of hybridization of Inline graphic , Inline graphic is a parameter that sets the scale of intensity [94][22] corresponding to the saturation limit Inline graphic ; Inline graphic and Inline graphic are extended thermodynamical fluxes and forces that take into account non-local effects. If we recall from reference [95][20] given the relation between gene expression intensity Inline graphic and concentration Inline graphic , we have that: graphic file with name pone.0042678.e034.jpg (3) After this, a proposal on the form for the extended fluxes and forces should be given. Hence, we are proposing a system of linear (in the forces) coupled fluxes with memory [96][20]. The constitutive equations are, graphic file with name pone.0042678.e035.jpg (4) graphic file with name pone.0042678.e036.jpg (5) The Inline graphic 's are time-independent amplitudes, Inline graphic is a unit vector in the direction of mass flow (the nature of Inline graphic will not affect the rest of our description, since we will be dealing with the magnitude of the mass flux Inline graphic ) and Inline graphic 's are relaxation times considered path-independent scalars. Since we have a linear relation between thermodynamic fluxes and forces some features of the Onsager-Casimir formalism will still hold. Irreversible coupling is given by [97]Eq. 4 and [98]5, nevertheless due to the fact that actual transcription measurement experiments are made either on homeostasis (steady state) settings or within time series designs with intervals several orders of magnitude larger than the associated relaxation times (which are of the order of a few molecular collision times) it is possible to take the limits Inline graphic and Inline graphic , then the integrals become evaluated delta functions to give: graphic file with name pone.0042678.e044.jpg (6) graphic file with name pone.0042678.e045.jpg (7) Also due to the spatial nature of the experimental measurements (either RNA blots or DNA/RNA chips measure space-averaged mRNA concentrations) it is possible to work with the related scalar quantities instead, to give: graphic file with name pone.0042678.e046.jpg (8) graphic file with name pone.0042678.e047.jpg (9) Substituting [99]Eq. 8 and [100]9 into [101]Eq. 2 one gets: graphic file with name pone.0042678.e048.jpg (10) If we assume that the generalized transport coefficient Inline graphic is independent of the flux Inline graphic we can write: graphic file with name pone.0042678.e051.jpg (11) Or in terms of the transcription regulation chemical potentials Inline graphic graphic file with name pone.0042678.e053.jpg (12) In the constant transport coefficient approximation, [102]Eq. 12 reads: graphic file with name pone.0042678.e054.jpg (13) Defining generalized transport coefficients Inline graphic . If we change variables in [103]equation 2 from concentration to gene expression intensity (by using [104]equation 3) and introduce the fluxes, forces and generalized transport coefficients, we could rephrase it as follows: graphic file with name pone.0042678.e056.jpg (14) The resulting affinity of transcription i.e. the thermodynamic conjugate variable to the probe intensity Inline graphic is given by: graphic file with name pone.0042678.e058.jpg (15) The quantity Inline graphic plays the role of a chemical affinity for the gene expression process. As gene expression is a process that follows thermal activation kinetics, Inline graphic is a temperature dependent variable. This dependency shows up both explicitly (by the value Inline graphic ) and also due to indirect temperature dependency given by the saturation constants Inline graphic . Inline graphic are coefficients related to gene cross-regulation and Inline graphic are the chemical potentials associated with transcriptional regulation [105][19], [106][20]. The related chemical potential of transcription [107][20] is given by: graphic file with name pone.0042678.e065.jpg (16) Since we have reliable experimental data for the expression levels Inline graphic from 1191 whole-genome gene expression experiments, it is possible to calculate the gene transcriptional affinities and chemical potentials for the set of genes of interest from [108]equations 15 and [109]16 (we also have good values for the constants Inline graphic taken from spike-in experimental data provided by the gene-chip manufacturer). Since these experiments have been made without the use of any knock-out or knock-down techniques (i.e. all genes are subject to their corresponding regulatory interactions), the gene expression levels Inline graphic used to calculate the chemical potentials and transcriptional affinities have already incorporated (although in an implicit way) the effect of transcriptional regulation as given by the gene regulatory mechanisms depicted in the third term at the r.h.s. of [110]equation 14. Non-linear correlation inference of regulatory networks To deconvolute a Gene Regulatory Network (GRN) related to primary breast cancer we applied a methodology based on a local pattern-sharing measure as surrogate to actual gene-gene interactions to our dataset(see [111]Materials and Methods - Inline graphic Experimental datasets). The goal of deconvolution methods is inferring GRNs based on statistical dependencies within the joint probability distribution of gene expression for all genes within a given gene set. Typical means to reach this goal consist in the quantification of the new information content that arise when we look at the full joint probability distribution when compared to a series of successive independence approximations. According with the method given in reference [112][10], we calculated measures of non-linear correlation between the normalized expression values (22238 probesets) and the core set of 4 genes. Threshold-analysis was made on such measures to look up for statistical significance in the inference and based on their IBS index value we selected a set of 712 genes. The optimal network (within the given approximations) was found by a Maximum Entropy Method as is shown elsewhere [113][10] and was validated by several (mostly in silico and database-mining) methods [114][16]. Biochemical pathway statistical enrichment analysis Reactome [115][28] biological/biochemical pathway over-representation analysis was performed to determine the Reactome pathways in which gene IDs in our list were strongly enriched. Reactome is a an open-source, open access, manually curated and peer-reviewed pathway database that may help to understand the biological context of genomic data. Significance assessment was also made by means of ‘urn model’ hypergeometric distribution tests. Materials and Methods Experimental datasets A curated set of 1191 whole genome gene expression profiles was generated [116][16] from datasets for several publicly available experiments deposited in the GEO database [117][29]. These experiments were performed in total mRNA extracted under the [118]GPL96 protocol [119][30] which is based on the Affymetrix HGU133A microarray GeneChip platform. In the case of experiments including some kind of treatment or cell modification we only took the unaltered samples to include them in our analyses. Details are given in [120]Table 1. Further information is available in the corresponding GEO entries and/or may be available upon request. In the case of human mRNA samples taken directly from organ tissue (by a biopsy) and not from cultured cell-lines, it is extremely difficult to design time-course experiments. Thus, in order to study a surrogate model of transcriptional de-regulation, we proposed the following alternative to look for correlations: After quality control pre-processing, background correction and normalization of the microarrays, the samples were prioritized (ordered) according to their BNIP3 (Affymetrix-probe ID 201848_s_at) expression level. BNIP3 is a well known marker of progression and malignancy in primary breast cancer that correlates both with lab tests and clinical trials [121][31]. By ordering the independent, steady-state samples in this manner it is now possible to look up for correlation patterns of gene expression. Table 1. GEO [122][29] identifier and references for the Microarray