Abstract
Background
Integration of transcriptomic and metabolomic data improves functional
interpretation of disease-related metabolomic phenotypes, and
facilitates discovery of putative metabolite biomarkers and gene
targets. For this reason, these data are increasingly collected in
large (> 100 participants) cohorts, thereby driving a need for the
development of user-friendly and open-source methods/tools for their
integration. Of note, clinical/translational studies typically provide
snapshot (e.g. one time point) gene and metabolite profiles and,
oftentimes, most metabolites measured are not identified. Thus, in
these types of studies, pathway/network approaches that take into
account the complexity of transcript-metabolite relationships may
neither be applicable nor readily uncover novel relationships. With
this in mind, we propose a simple linear modeling approach to capture
disease-(or other phenotype) specific gene-metabolite associations,
with the assumption that co-regulation patterns reflect functionally
related genes and metabolites.
Results
The proposed linear model, metabolite ~ gene + phenotype +
gene:phenotype, specifically evaluates whether gene-metabolite
relationships differ by phenotype, by testing whether the relationship
in one phenotype is significantly different from the relationship in
another phenotype (via a statistical interaction gene:phenotype
p-value). Statistical interaction p-values for all possible
gene-metabolite pairs are computed and significant pairs are then
clustered by the directionality of associations (e.g. strong positive
association in one phenotype, strong negative association in another
phenotype). We implemented our approach as an R package, IntLIM, which
includes a user-friendly R Shiny web interface, thereby making the
integrative analyses accessible to non-computational experts. We
applied IntLIM to two previously published datasets, collected in the
NCI-60 cancer cell lines and in human breast tumor and non-tumor
tissue, for which transcriptomic and metabolomic data are available. We
demonstrate that IntLIM captures relevant tumor-specific
gene-metabolite associations involved in known cancer-related pathways,
including glutamine metabolism. Using IntLIM, we also uncover
biologically relevant novel relationships that could be further tested
experimentally.
Conclusions
IntLIM provides a user-friendly, reproducible framework to integrate
transcriptomic and metabolomic data and help interpret metabolomic data
and uncover novel gene-metabolite relationships. The IntLIM R package
is publicly available in GitHub
([41]https://github.com/mathelab/IntLIM) and includes a user-friendly
web application, vignettes, sample data and data/code to reproduce
results.
Electronic supplementary material
The online version of this article (10.1186/s12859-018-2085-6) contains
supplementary material, which is available to authorized users.
Keywords: Metabolomics, Transcriptomics, Linear Modeling, Integration
Background
Metabolomics data is increasingly collected in human biospecimens to
identify putative biomarkers in diseases such as cancer [[42]1–[43]6].
Metabolites (small molecules < 1500 Da) are ideal candidates for
biomarker discovery because they directly reflect disease phenotype and
downstream effects of post-translational modifications [[44]6].
However, interpretation of metabolomics data, including understanding
how metabolite levels are modulated, is challenging. Reasons for this
challenge include the presence of many (hundreds) of unidentified
metabolites when untargeted approaches are applied [[45]7, [46]8], and
the fact that metabolomics profiles generated in human biospecimens are
typically ‘snapshots’ or time-averaged representations of a disease
state. Despite these difficulties, analyzing metabolomics data in light
of other omics information, such as the transcriptome, can help to
functionally interpret metabolomics phenotypes [[47]9–[48]15]. Data
integration, or the use of multiple sources of information or data to
provide a better model and understand a biological system [[49]16],
offers the opportunity to combine metabolomics data with other omics
data-sets (e.g. transcriptome). Measurement and integration of the
transcriptome and metabolome in the same cells, samples, or
individuals, are thus increasingly applied to elucidate mechanisms that
drive diseases, and to uncover putative biomarkers (metabolites) and
targets (genes).
Current approaches that integrate transcriptomic and metabolomic data
can be broadly categorized as numerical or pathway/network based.
Numerical approaches include multivariate analyses (e.g. logistic
regression, principal component analysis, partial least squares) and
correlation-based approaches (e.g. canonical correlations)
[[50]17–[51]19]. Differential correlation or coexpression methods have
also been developed to capture changes in relationships between
conditions [[52]20]. Open-source tools, including MixOmics [[53]21,
[54]22] and DiffCorr [[55]23], are available for integrating data but
generally require in-depth statistical knowledge for their use and may
not be as accessible to non-computational experts. Of note, such
numerical approaches typically do not capture the complex and indirect
relationships between transcripts and metabolites. For example,
non-linear reaction kinetics mechanisms, metabolite-metabolite
connections that regulate metabolite levels, and post-translational
modifications all contribute to the complexity of gene-metabolite
relationships [[56]24, [57]25]. To better capture these complex
relationships, pathway or network based approaches can be applied.
Open-source tools such as Metaboanalyst [[58]26], INMEX [[59]27], XCMS
Online [[60]28], Metabox [[61]29], and IMPALA [[62]30] integrate
transcriptomic and metabolomics data at a pathway level. One caveat of
these approaches is that they rely on curated pathways or
reaction-level information (knowledge of which enzymes produce a given
metabolite) [[63]18]. Pathway approaches are thus limited to
metabolites that are identified and that can be mapped to pathways,
which represents a fraction of what can be measured. In fact, of the
114,100 metabolites in the Human Metabolome Database [[64]31–[65]33],
only 18,558 are detected and quantified, and of those, only 3115 (17%)
map to KEGG pathways. Further, network approaches that attempt to study
the complex many to many associations between genes and metabolites may
not scale well when tens of thousands of gene-metabolite pairs are
evaluated.
Importantly, previous studies have shown that functionally related
genes and metabolites show coherent co-regulation patterns [[66]20,
[67]34, [68]35]. We make this functionality assumption here and propose
a linear modeling approach for integrating metabolomics and
transcriptomics data to identify phenotype-specific gene-metabolite
relationships. Of note, typical numerical integration approaches
uncover patterns of molecular features that are globally correlated or
aim to predict phenotype [[69]20]. However, these methods do not
directly and statistically test whether associations between
metabolites and gene expression differ by phenotype. This distinction
is important because global associations between genes and metabolites
may not only reflect one phenotype of interest, but could reflect other
features (e.g., environment, histology). As for methods that uncover
differentially correlated pairs between conditions [[70]35], they
either do not capture pairs of features that are correlated in one
group and not correlated in another group, or they bin relationships
into different types (e.g. positive correlation in one group, negative
correlation in another group), thereby making it difficult to compare
more than 2 phenotypes [[71]20, [72]34, [73]35]. Further, these
approaches are not implemented into user-friendly frameworks. Our
approach is thus advantageous because it directly evaluates the
relationship between genes and metabolites in the context of phenotype,
it can easily incorporate potential covariates, and is applicable to
categorical (> = 2 groups) or continuous phenotypes. Further, our
approach is implemented as a publicly available R package IntLIM
(Integration through Linear Modeling), available at our GitHub
repository [[74]36], which includes an R Shiny web interface making it
user-friendly to non-computational experts. In the wake of increasing
amounts of metabolomics and transcriptomic data generated, availability
of open-source, user-friendly, and streamlined approaches is key for
reproducibility. Using IntLIM, we evaluated phenotype-specific
relationships between gene and metabolite levels measured in the NCI-60
cancer cell lines [[75]10], and in tumor and adjacent non-tumor tissue
of breast cancer patients [[76]9]. We demonstrate that IntLIM is useful
for uncovering known and novel gene-metabolite relationships (which
would require further experimental validation).
Methods
NCI-60 cell line data pre-processing
The NCI-60 cancer cell line metabolomics (Metabolon platform) and gene
expression data (Affymetrix U133 microarray) were downloaded from the
Developmental Therapeutics Program (National Cancer Institute) website
[[77]10, [78]37]. Metabolomics and gene expression data, available in
57 cell lines, were pre-processed and normalized according to the
Metabolon and Affymetrix MAS5 algorithms [[79]38, [80]39],
respectively. The metabolomics data contains 353 metabolites, of which
198 are unidentified. Each cell line is measured in triplicates
(technical replicates), except for A498 and A549/ATCC, which had 4 and
2 technical replicates, respectively. The median of coefficients of
variation (CVs) within technical replicate samples was calculated for
each metabolite to assess consistency of abundance measurements.
Metabolites with CVs < 0.3 were removed (280 metabolites remaining),
abundances were log2 transformed, and the average technical replicate
value was calculated for each metabolite. Next, the number of imputed
values was estimated for each metabolite. The standard imputation
method used by Metabolon is to impute missing values for a given
metabolite by the minimum value of that metabolite across all samples.
Thus, for each metabolite, the number of samples with a value equal to
the minimum value (for that metabolite across all samples) minus “1”
(one of those values is the true minimum value and should be
subtracted) was used as an estimate of the number of missing values per
metabolite. Metabolites with more than 80% imputed values were filtered
out resulting in 220 metabolites, 111 of which are unidentified. Probes
from the Chiron Affymetrix U133 microarrays were mapped to genes using
the Bioconductor Ensembl database hgu133.plus.db [[81]40]. In cases
where more than one probe was matched to a given gene, the probe with
the highest mean expression across all samples was retained for
analysis, resulting in 17,987 genes with available expression. Lastly,
we removed the 10% (arbitrary cutoff) of the lowest expressing genes,
resulting in a total of 16,188 genes. For the linear modeling analyses,
220 metabolites and 16,188 genes were input.
For the NCI-60 cell line data, the phenotypes compared were leukemia
cell lines vs. breast/prostate/ovarian (BPO) cell lines. Because this
dataset was used to develop our approach, we purposefully chose cells
from cancers that are known to be highly different in terms of their
molecular profiles (e.g. blood cancer vs. solid tumor). The breast,
prostate, and ovarian cancer cell lines were grouped together because
they share susceptibility loci [[82]41] and our aim was to increase
sample size.
Breast cancer data pre-processing
Normalized gene expression (Affymetrix Gene Chip Human Gene 1.0 ST
Arrays) and metabolomics (Metabolon) data in tumor and adjacent
non-tumor tissue of breast cancer patients are publicly available
through the Gene Expression Omnibus ([83]GSE37751) and the
supplementary data of the original publication, respectively [[84]9,
[85]42]. The data was normalized using the Metabolon algorithm
(metabolites) and RMA algorithm [[86]43] (genes), as previously
described [[87]9]. Both gene and metabolite levels are available for 61
tumor and 47 adjacent non-tumor breast tissue. The metabolomics data
consists of 536 metabolites (203 of which are unidentified) in tumor
and non-tumor tissue. Metabolites with more than 80% imputed values
were removed, resulting in 379 metabolites, 119 of which are
unidentified. Probes from the gene expression data not mapping to a
gene symbol (Human Gene 1.0 ST Arrays) were removed. Similar to the
NCI-60 data pre-processing, the probe with the highest mean expression
was used for analysis when multiple probes mapped to a single gene.
This resulted in 20,254 genes measured in tumor and non-tumor tissue.
After removing the 10% lowest expression genes, we analyzed 18,228
genes. With this breast cancer data, our aim was to compare
gene-metabolite associations between tumor and non-tumor tissue. A
total of 379 metabolites and 18,228 genes were used for this analysis.
IntLIM: Integration through linear modeling approach
The linear model applied to integrate transcriptomic and metabolomic
data is:
[MATH:
m=β1+β2g+β3
p+β4g:p+ε :MATH]
1
where “m” and “g” are normalized (see data pre-processing above) and
log2-transformed metabolite abundances and gene levels respectively,
“p” is phenotype (e.g. cancer type, tumor vs. normal), “(g:p)” is the
statistical interaction [[88]44] between gene expression and phenotype,
and “ε” is the error term that is assumed to be independent and
normally distributed (ε = N(0, σ)) . A statistically significant
two-tailed p-value of the “(g:p)” interaction term indicates that the
slope relating gene expression and metabolite abundance is different
from one phenotype compared to the other. Through this model, we can
identify gene-metabolite associations that are specific to a particular
phenotype (Fig. [89]1). This model has been applied to all possible
gene-metabolite pairs including those involving unidentified
metabolites in the publicly available NCI-60 cancer cell line data
[[90]10] as well as previously published data from a breast cancer
study [[91]9]. Two-tailed p-values are subsequently corrected for
multiple comparisons using the method by Benjamini and Hochberg to
control the false discovery rate (FDR) [[92]45]. Gene-metabolite pairs
with an FDR-adjusted interaction p-value less than 0.10 or 0.05 in the
NCI-60 cell line and breast cancer data, respectively, were used to
determine statistical significance. (Due to the larger sample size in
the breast cancer data set and the much larger amount of significant
gene-metabolite pairs, our threshold for significance was more
stringent).
Fig. 1.
Fig. 1
[93]Open in a new tab
IntLIM defines phenotype-specific gene-metabolite pairs by uncovering
gene-metabolite pairs that show an association in one phenotype (e.g.
tumors) and another or no association in another phenotype (e.g.
non-tumors)
To filter and cluster the list of statistically significant
gene-metabolite pairs, the difference in Spearman correlations between
the two phenotypic groups being compared (leukemia vs. BPO for NCI-60
cells and tumor vs. non-tumor for breast cancer tissue) was used as an
effect size. Volcano plots of the difference in Spearman correlations
vs. the –log10 (FDR-adjusted p-values) are depicted to visualize the
distributions and help determine appropriate p-value and effect size
cutoffs (Additional file [94]1: Figure S3). For both datasets, a
minimum absolute difference in correlations of 0.5 was used as an
effect size cutoff.
The results can be visualized via a hierarchically clustered heatmap of
gene-metabolite Spearman correlations calculated for each phenotypic
group. Hierarchical clustering is performed with the hclust function.
The Euclidean distance is used as the distance metric and the complete
linkage method is used for agglomeration. The resulting dendrogram is
used to create a heatmap that helps visualize how relevant
gene-metabolite pairs cluster by their effect size (e.g. differences in
Spearman correlation between the two phenotypic groups).
IntLIM R package
A pipeline has been developed in the form of an R package to streamline
integration of metabolomics and gene expression data using IntLIM. The
package has been optimized and can solve a high number of linear models
(3–7 million gene-metabolite pairs) in 2 to 6 min on a laptop with
2.7GHz quad-core Intel Core i7 processor and 16 GB, 2133 MHz memory. Of
note, IntLIM requires less than 3% of the time to solve all possible
linear models compared to iterating through each model using the lm
function in R for performing linear regression analysis as it contains
a matrix algebra implementation of that function [[95]46]. Extensive
documentation is available in the package, including a vignette, and
formatted NCI-60 and breast cancer datasets are linked and available in
the IntLIM GitHub repository [[96]36]. The steps for analysis are:
1. Load data: input CSV files containing normalized and
log2-transformed gene expression data, normalized and
log2-transformed metabolite abundance data, metadata for the
samples (e.g. cancer status), and optionally metadata information
on the genes and metabolites
2. Filter data: gene expression and metabolomics data are optionally
filtered by gene and metabolite abundances and missing values
3. Run IntLIM: run linear models for all possible gene-metabolite
pairs and extract FDR-adjusted interaction p-values and effect
sizes (e.g. differences in slope/correlations between the groups)
4. Filter gene-metabolite pairs: filter results by user-input cutoffs
of FDR-adjusted p-values and effect size. A volcano plot (absolute
difference in correlation vs. –log10(FDR-adjusted p-values) is
shown to help users determine appropriate adjusted p-value and
effect size cutoffs. Resulting pairs are then clustered with
hierarchical clustering, based on correlations within each, and
visualized through heatmaps.
5. Visualize relevant gene-metabolite pairs: user-selected
gene-metabolite pairs can be visualized through scatterplots,
color-coded by phenotypic groups of interest (e.g. leukemia vs.
BPO, tumor vs. non-tumor).
The IntLIM package also includes an RShiny web interface, a powerful
tool that transforms complex analysis pipelines into interactive,
user-friendly web applications [[97]47]. The App guides users through
all steps available in the package, as mentioned above. Of note, most
plots are coded in highcharter [[98]48] or plotly [[99]49, [100]50] so
users can promptly assess the effect of changing parameters on analysis
results (e.g. immediate updates of tables and plots resulting from user
changes of effect size and p-value cutoffs). We believe this
interactivity accelerates data analysis and hence discovery of
phenotype-specific gene-metabolite pairs. Further the app makes the
analysis accessible to non-computational researchers. More information
can be found in the Additional file [101]2: IntLIM documentation.
Pathway analysis
Pathway and upstream regulator analyses were performed using the
Ingenuity Pathway Analysis (IPA) software. The list of genes or
identified metabolites from each cluster (e.g. highly correlated in one
group but no correlation in the other) of statistically significant
gene-metabolite pairs were input to conduct pathway analysis to analyze
input genes or metabolites in the context of biological pathways or
functions [[102]51]. IPA also includes an upstream regulator analysis
to determine whether those molecules were associated with a particular
upstream regulator. P-values, calculated from the Right-tailed Fisher’s
Exact Test, reflect whether the number of overlapping molecules
associated with a particular pathway or upstream regulator is greater
than expected by chance [[103]52]. For upstream regulator analysis,
both direct and indirect relationships between molecules and their
targets were considered (confidence = Experimentally observed).
[[104]53].
Results
IntLIM (integration through LInear modeling)
Our goal is to find gene-metabolite pairs that have a strong
association in one phenotype (e.g. leukemia vs. breast/prostate/ovarian
cancers (BPO), tumor vs. non-tumor) and an inverse or no association in
another phenotype. We anticipate that gene-metabolite relationships
that are phenotype dependent will help interpret metabolomics
phenotypes and will highlight molecular functions and pathways worth
evaluating further. With accumulating transcriptomic and metabolomics
data generated in the same samples, uncovering phenotype-specific
relationships could elucidate novel co-regulation patterns. Because
commonly leveraged untargeted metabolomics approaches produce large
amounts of unidentified metabolites, approaches that rely on
reaction-level or pathway annotations may not be sufficient to capture
all or novel relationships. To accomplish our goal, we thus rely on
numerical data integration and developed a linear modeling approach
that predicts metabolite levels from gene expression in a
phenotype-dependent manner (Fig. [105]1) (see [106]Methods). Unlike
correlation-based and logistic regression approaches, our approach
specifically evaluates whether the association between gene and
metabolite levels is related to a phenotype. Furthermore, it is
important to keep in mind that metabolite abundances can be modulated
by a group of enzymes, which in turn are regulated by a myriad of
regulatory processes (e.g. transcription, post-translational
modifications). Thus, gene expression, protein abundances, and
metabolite levels do not always have linear relationships. While these
more complex relationships will not be readily detected using our
approach [[107]14], co-regulated gene-metabolite relationships tend to
share biological functions [[108]34] and we make this assumption here.
Our approach is implemented as an R package, which is publicly
available through GitHub (See [109]Methods and IntLIM Documentation in
Additional file [110]2) [[111]36].
Application to NCI-60 data
The NCI-60 cell lines [[112]10] were developed as a drug-screening tool
focusing on a range of cancer types, including renal, colon, prostate,
breast, ovarian, leukemia, and non-small cell lung cancer [[113]54].
Transcriptomic (Affymetrix) and metabolomic (Metabolon) data are
available for 57 of those cell lines [[114]10] We applied IntLIM to
identify cancer-type specific gene-metabolite associations. The two
major subgroups compared were leukemia (6 cell lines: CCRF-CEM, HL-60
(TB), K-562, MOLT-4, RPMI-8226, SR) vs. the breast/prostate/ovarian
(BPO) cancer cell lines (14 total cell lines: BT-549, DU-145, HS 578 T,
IGROV1, MCF7, MDA-MB-231/ATCC, NCI/ADR-RES, OVCAR-3, OVCAR-4, OVCAR-5,
OVCAR-8, PC-3, SK-OV-3, T-47D) consisting of 16,188 genes and 220
metabolites (see [115]Methods). The latter cancers were grouped
together as they share common susceptibility loci [[116]41].
Unsupervised clustering using principal components analysis (PCA) on
the log2-transformed and filtered metabolomics and gene expression data
(Additional file [117]3: Figure S1A and B) clearly delineates the two
major subgroups (Additional file [118]3: Figure S1C and D).
All possible combinations of gene-metabolite pairs (3,561,360 models
run) were evaluated, using “BPO” and “leukemia” as cancer type. We
identified 1009 cancer-type dependent gene-metabolite associations
(FDR-adjusted p-value < 0.1 and correlation difference effect size
> 0.5, Additional file [119]4: Data S1, Additional file [120]1: Figure
S3A) involving 785 genes and 68 metabolites, of which 37 are
unidentified. Clustering of these pairs by the direction of association
(e.g. positive or negative correlation) within each cancer type
subgroup revealed two major clusters (Fig. [121]3). First, the
“leukemia correlated cluster” consists of 545 gene-metabolite pairs
(429 unique genes and 54 unique metabolites of which 31 are
unidentified) with relatively high positive correlations in leukemia
cell lines and low or negative correlations in BPO cell lines
(Fig. [122]2a). Second, the “leukemia anti-correlated cluster” consists
of 464 gene-metabolite pairs (356 unique genes and 45 unique
metabolites of which 24 are unidentified) with relatively high negative
correlations in leukemia cell lines and positive or low negative
correlations in BPO cell lines. Two of the top ranked gene-metabolite
pairs (ranked in descending order of absolute value of Spearman
correlation differences between BPO and leukemia) in the leukemia
correlated and leukemia anti-correlated clusters are FSCN1-malic acid
(Fig. [123]2b) and DLG4-leucine (Fig. [124]2c), respectively. FSCN1 and
malic acid (Fig. [125]2b) are positively correlated in leukemia
(r = 0.94) but negatively correlated in BPO cancers (r = − 0.75) (Fig.
[126]2b). FSCN1 is associated with the progression of prostate cancer
[[127]55], while malic acid (or ionized malate) is an intermediate
involved in glutamine metabolism pathways that play major roles in
cancer metastasis [[128]56, [129]57]. DLG4 and leucine (Fig. [130]2c)
are negatively correlated in leukemia (r = − 0.92) but positively
correlated (r = 0.78) in BPO cancers (Fig. [131]2c). DLG4 is
downregulated in human cervical cancer cell lines infected with human
papillomavirus and may act as a tumor suppressor [[132]58], while
leucine deprivation inhibits cell proliferation and induces apoptosis
in breast cancer cells [[133]59]. Interestingly, leucine
supplementation has been shown to enhance pancreatic cancer growth in
mouse models [[134]60]. These opposing correlations of DLG4-leucine and
FSCN1-malic acid between leukemia and BPO suggest possible
tissue-specific relationships that can be differentially targeted.
Fig. 3.
Fig. 3
[135]Open in a new tab
Results of IntLIM applied to a breast cancer datase. a Clustering of
Spearman correlations of 2842 identified gene-metabolite pairs(18,228
genes and 379 metabolites, with 61 tumor and 47 non-tumor samples)
(FDR-adjusted p-value of interaction coefficient < 0.05 with Spearman
correlation difference of > 0.5) in tumor and non-tumor tissue from
breast cancer tissue. b GPT2 association with 2-hydroxyglutarate
(FDR-adjusted p-value = 0.046, Normal Spearman Correlation = − 0.11,
Tumor Spearman Correlation = 0.40). c Lack of association between
2-hydroxygutarate with MYC (FDR adj. p-value = 0.90, Normal Spearman
Correlation = − 0.20, Tumor Spearman Correlation = 0.04)
Fig. 2.
Fig. 2
[136]Open in a new tab
Results of IntLIM applied to NCI-60 data. a Clustering of Spearman
correlations of 1009 identified gene-metabolite pairs (16,188 genes and
220 metabolites, 57 cell lines) (FDR adjusted p-value of interaction
coefficient < 0.10 with Spearman correlation difference of > 0.5) in
“BPO” and leukemia NCI-60 cell lines. Examples of two gene-metabolite
associations with significant differences: (b) FSCN1 and malic acid
(FDR adj. p-value = 0.082, BPO Spearman Correlation = − 0.75, Leukemia
Spearman Correlation = 0.94), (c) DLG4 and leucine (FDR adj.
p-value = 0.0399, BPO Spearman Correlation = 0.78, Leukemia Spearman
Correlation = − 0.93)
Pathway analysis on 419 unique and mappable genes in the “leukemia
correlated cluster” showed enrichment of the following pathways: acute
phase response signaling, 1D–myo-inositol hexakisphosphate
biosynthesis, hepatic fibrosis/hepatic stellate cell activation, CDK5
signaling, and PAK signaling (Additional file [137]5: Table S1). The
“leukemia anti-correlated cluster” genes (N = 351) were enriched for
endothelial NOS signaling, CREB signaling in neurons, dTMP de novo
biosynthesis, Huntington’s Disease signaling, and the P2Y purigenic
receptor signaling pathway (Additional file [138]5: Table S1). Most of
these pathways are relevant to cancer biology. For example, nitric
oxide has been found to have both tumor suppressive (e.g. promoting
apoptosis, inhibition of cancer growth) and tumor promoting properties
(promotion of angiogenesis, DNA repair mechanisms) [[139]61].
cAMP-regulator element binding protein (CREB) has been shown to be
over-expressed and phosphorylated in several cancers (including acute
myeloid leukemia) and might play a role in cancer pathogenesis
[[140]62]. These preliminary results demonstrate how different pathways
may be differentially regulated in a cancer-type dependent manner.
Since only 9 of 54 and 10 of 45 metabolites in the leukemia correlated
and leukemia anti-correlated clusters, respectively, could be mapped to
Human Metabolome Database (HMDB) IDs [[141]31–[142]33], pathway
analyses were not possible for the metabolites.
Application to breast cancer data
We further applied IntLIM to a previously published breast cancer study
[[143]9]. Gene expression and metabolomics profiling of tumor (n = 61)
and adjacent non-tumor tissue samples (n = 47) was measured in tissue
from breast cancer patients [[144]9]. Importantly, gene expression and
metabolomics were measured in the same tissue biospecimens. The
original study identified a relationship between MYC activation and
2-hydroxyglutarate (2-HG) accumulation as associated with poor
prognosis in breast cancer [[145]9]. Studies involving MYC
overexpression and knockdown in human mammary epithelial and breast
cancer cells further corroborated this relationship [[146]9]. When
assessing the relationship between MYC gene expression and 2-HG though,
we did not observe this association at the transcription level (Fig.
[147]3c). Our goal was thus to identify other potential regulators of
2-hydroxyglutarate accumulation in breast cancer tissue, and to assess
whether other gene-metabolite associations were specific to either
tumor or non-tumor tissue. The data consists of 18,228 genes and 379
metabolites (119 unidentified) measured in 61 tumor samples and 47
adjacent non-tumor samples (Additional file [148]6: Figure S2A and B).
Unsupervised clustering of gene and metabolite abundances separated
tumor from non-tumor tissue (Additional file [149]6: Figure S2C and D).
IntLIM was applied to all possible combinations of gene-metabolite
pairs (6,908,412 models), with tumor and non-tumor as the phenotype.
Our approach identified 2842 tumor-dependent gene-metabolite
correlations (FDR-adjusted interaction p-value < 0.05, and a Spearman
correlation difference > 0.5) involving 761 genes and 212 metabolites
of which 48 are unidentified. (Additional file [150]7: Data S2,
Additional file [151]1: Figure S3B). The resulting heatmap of
gene-metabolite Spearman correlations for tumor and non-tumor groups is
divided into two major clusters (Fig. [152]3a). The first is a
“tumor-correlated cluster” of 1038 gene-metabolite pairs (288 unique
genes and 155 metabolites of which 35 are unidentified) with relatively
high correlations in tumor samples and mostly negative correlations in
non-tumor samples. The second major cluster, “tumor anti-correlated
cluster”, comprises 1804 gene-metabolite pairs (479 unique genes and
188 metabolites of which 39 are unidentified) with high negative
correlations for tumor samples and mostly negative correlations for
non-tumor samples.
Upstream analysis of the genes involved in the tumor-correlated cluster
(N = 283) did identify MYC as an upstream transcriptional regulator
(Right-tailed Fisher’s Exact Test p-value = 6 × 10^− 3), even though
MYC and 2-HG are not differentially associated (Fig. [153]3c). 2-HG
was, however, found to be associated with GPT2 (FDR adj
p-value = 0.046, r = 0.40 in tumors, and r = − 0.11 in non-tumors)
(Fig. [154]3b, Additional file [155]7: Data S2). GPT2 plays a role in
glutamine metabolism and encodes a glutamic-pyruvic transaminase that
catalyzes reverse transamination between alanine and 2-oxoglutarate to
generate pyruvate and glutamate [[156]63]. Cancer cells exhibit a
metabolic reprogramming that results in increased lactate acid
production in the Warburg effect and the use of glutamine to replenish
the tricarboxylic acid cycle (TCA) [[157]64, [158]65]. The role of GPT2
serves to drive the utilization of glutamine as a carbon source for TCA
analplerosis [[159]63, [160]65]. While the exact mechanisms underlying
increased levels of 2-hydroxyglutarate in breast cancer cells are not
all known, our results suggest that metabolic reprogramming changes the
relationship between GPT2 and 2-hydroxyglutarate. Furthermore, GPT2 is
found to be in 18 (FDR adjusted p-value < 0.05 and correlation
difference > 0.5) other tumor-specific gene-metabolite associations
(Additional file [161]7: Data S2).
In addition to GPT2 and 2-HG, we identified 15 other gene-metabolite
pairs involving metabolites linked to glutamine metabolism. Of those
genes paired with glutamine, ASNS, which encodes asparagine synthetase,
is directly involved in metabolizing glutamine [[162]66] and SLC7A1 is
involved in glutamine transport [[163]64] (Additional file [164]7: Data
S2). Furthermore, there are 65 gene-metabolite pairs with glutamate and
25 pairs involving alanine (Additional file [165]7: Data S2), and 5
gene-metabolite pairs involving the WIF gene, which is part of the Wnt
signaling pathway [[166]9] (Additional file [167]7: Data S2).
Pathway analysis revealed that genes in the “tumor-correlated cluster”
(283 mapped into IPA out of 288 genes) were enriched for oxidative
phosphorylation, mitochondrial dysfunction, protein ubiquitination
pathway, GDP-mannose biosynthesis, and the pyridoxal 5′-phosphate
salvage pathway (Additional file [168]8: Table S2). Genes in the “tumor
anti-correlated cluster” (468 mapped onto IPA out of 479 genes) were
enriched for hepatic fibrosis/hepatic stellate cell activation, FAK
signaling, actin cytoskeleton signaling, signaling by Rho family
GTPases, and circadian rhythm signaling (Additional file [169]8: Table
S2). Expectedly, we find that pathways such as FAK signaling, actin
cytoskeleton, the protein ubiquitination pathway, and circadian rhythm
signaling have strong links to breast cancer pathogenesis
[[170]67–[171]71]. Of note, the top two pathways in the
tumor-correlated cluster (oxidative phosphorylation and mitochondrial
dysfunction) play roles in cellular energetics [[172]72].
Pathway analysis of the metabolites in the “tumor-correlated cluster”
(100 mapped onto IPA out of 155 metabolites) resulted in enrichment of
pathways related to tRNA charging and nucleotide degradation
(Additional file [173]9: Table S3). The “tumor anti-correlated cluster”
(115 mapped onto IPA out of 188 metabolites) was also enriched for tRNA
charging, citrulline metabolism, urea cycle, purine nucleotide
degradation, and purine ribonucleosides degradation to
ribose-1-phosphate (Additional file [174]9: Table S3). Pathways related
to tRNA and the urea cycle have been implicated in cancer
[[175]73–[176]75]. Citrulline metabolism and the urea cycle have also
been linked to glutamine metabolism [[177]57, [178]76, [179]77]. These
findings are consistent with previous studies [[180]9, [181]57,
[182]63, [183]64] that highlight the role of glutamine metabolism in
cancer cell proliferation and maintenance, especially with regards to
breast cancer [[184]9]. Further, the urea cycle has been shown to be
implicated in breast cancer and is linked to glutamine metabolism
[[185]77]. Notably, our IntLIM results identify 2 gene-metabolite pairs
with urea and 5 gene-metabolite pairs with arginine (FDR-adjusted
p-value of 0.05 or less, absolute Spearman Correlation
difference > 0.5), a major metabolite in the urea cycle (Additional
file [186]7: Data S2) [[187]77].
Discussion
As more and more transcriptomic and metabolomic data are collected in
the same samples or individuals, there is a need for streamlined
methods and associated user-friendly tools that integrate these data.
We implemented a novel linear modeling approach into an IntLIM R
package that includes a user-friendly web interface, to statistically
test whether gene and metabolite associations differ by phenotype.
Formally testing this dependency on phenotype differentiates our
approach from other numerical integration approaches such as logistic
regression and canonical correlations. Compared to other existing
methods that take into account phenotype dependency [[188]20, [189]34],
IntLIM is user-friendly, it uses a well-developed methodology (linear
model interactions), can easily account for other covariables (e.g.
gender, BMI, etc.), and can be applied to phenotypes that have more
than two categories or are continuous. Ultimately, uncovering
phenotype-specific relationships can provide insight into how
metabolites are being regulated by genes and on which pathways may be
involved in these phenotype-specific changes.
While knowledge of relevant pathways is powerful in developing
potential disease interventions and treatments, pathway enrichment
analyses are hampered by the large fraction of metabolites that are
identified or cannot be mapped to pathways. Importantly, IntLIM
uncovers phenotype-dependent gene-metabolite associations without a
priori curated information on pathways and networks, allowing discovery
of potentially novel associations (that would require further
experimental validation). Because untargeted metabolomics data produces
many unidentified features, phenotype-specific associations with IntLIM
could help further characterize these unidentified molecules. These
data-driven discoveries would require further experimental validation
and could generate new hypothesis to be tested. When pathway
annotations are available though, pathway enrichment analysis of genes
and metabolites that show similar patterns (e.g. positive correlation
in tumors but no correlation in non-tumors) can offer greater insight
onto pathways that are altered between phenotypes. With this in mind,
IntLIM produces a list of relevant genes and metabolites that could be
input into pathway integration approaches and software [[190]26,
[191]28–[192]30].
To demonstrate the utility of IntLIM to uncover cancer-relevant
gene-metabolite relationships, we evaluated transcriptomic and
metabolomics data measured in the NCI-60 cell lines [[193]10] and
breast tumor/adjacent non-tumor tissue [[194]9] (Figs. [195]2 and
[196]3). In both these data sets, we uncovered biologically relevant
gene-metabolite relationships and pathways. For example, glutamine
metabolism clearly stood out as an altered pathway in the breast cancer
data, in line with previous published results [[197]9]. Interestingly,
we also uncovered novel putative associations, such as the possible
modulation by GPT2 of 2-hydroxyglutarate accumulation in breast cancer
tissue (validation of this relationships would require further
experimentation).
While this first iteration of IntLIM uncovers phenotype-specific
gene-metabolite pairs, the approach can easily be extended to other
omics data (e.g., metabolomics/microbiome data, metabolomics/proteome,
proteome/transcriptome). Of note, because IntLIM makes use of a linear
model, we assume that the independent variables (e.g. metabolite
levels) are normally distributed to meet the normality assumption. We
have verified the normality assumption in the NCI-60 and breast cancer
datasets and leave it up to the user to appropriately transform and
check the normality of their data prior to using IntLIM. Furthermore,
our current linear model does not make use of the fact that some of the
samples may be paired. In our breast cancer data [[198]9], only a
subset of the patients (N = 41) have both tumor and adjacent non-tumor
available. It would be feasible to take into consideration the paired
nature of the samples using a mixed model methodology, and thereby
increase our power to detect significant relationships. Finally, future
developments of IntLIM will accommodate greater flexibility in defining
models. For example, we will include the capability of testing whether
phenotype-specific gene-metabolite associations are independent of
other putative confounders (e.g. age, gender, race, etc). Further,
while IntLIM currently only supports a binary phenotype, it is readily
generalizable to multicategorical phenotypes.
Like most approaches, IntLIM and the studies conducted are not without
limitations. The biochemical pathways that drive gene expression to
protein production to post-translational modifications to metabolite
production/consumption are complex [[199]24]. The abundance of a given
metabolite typically depend on a group of enzymes that produce/consume
that metabolite. Additionally, those enzymes have distinct kinetic
parameters, and their activity depends on a range of posttranslational
modifications and regulatory processes. As a result, transcript levels
are not the only factors that modulate metabolite abundance, and the
gene-metabolite relationship may not be linear. In this regard, IntLIM
may not adequately capture these complex relationships. Nonetheless,
linear-based approaches are well-developed, have successfully been
applied when integrating omics data, and co-regulated genes and
metabolites tend to be associated with functional roles [[200]10,
[201]20, [202]34]. Further, we demonstrate that this simple approach
can identify biologically meaningful, putative phenotype-dependent
gene-metabolite relationships that can be investigated with further
experiments. Another limitation is that IntLIM does not take into
consideration time-dependency of biochemical reaction steps, especially
given the time delay between gene expression and protein production and
further on metabolite production/consumption. However, in clinical and
translational applications, metabolomic and transcriptomic data is
typically collected at a “snapshot” in time, where time-dependent
analyses are not possible [[203]78]. Lastly, our approach, along with
other numerical and pathway based integration approaches, does not take
into account cellular heterogeneity in specimens analyzed, even though
this heterogeneity could impact gene-metabolite correlations in
different regions of cells or tissues [[204]79]. Because IntLIM remains
agnostic to the input, especially with regards to cell/tissue
heterogeneity, it is the user’s responsibility to interpret the data as
well as design future experiments to test findings from results.
Despite these limitations, IntLIM provides a user-friendly,
reproducible framework to integrate metabolomics and transcriptomics
data, or other omics data and provides a readily implementable first
step in integration.
Conclusions
Metabolomics and transcriptomic data are increasingly collected in the
same samples to uncover putative metabolite biomarkers and gene
therapeutic targets. User-friendly approaches that integrate these data
types will thus facilitate data interpretation in these studies, and
could generate data-driven hypothesis. With this in mind, we developed
a novel linear modeling approach that statistically tests whether
gene-metabolite associations are specific to particular phenotypes
(tumor vs. non-tumor, cancer-type, etc.). Our approach is available as
a publicly available R package, IntLIM, with an associated
user-friendly web application. We applied IntLIM to two cancer datasets
and uncovered known and novel gene-metabolite pairs and pathways that
were associated with cancer phenotypes. It is our hope that IntLIM will
assist researchers, with or without computational expertise, in
formulating novel hypothesis and proposing new studies especially with
regards to the gene-metabolite pairs identified. Integrating the
results with pathway analysis tools will provide further insight. The
IntLIM R package and App are available for download via GitHub and a
sample data-set and vignette are provided for users.
Additional files
[205]Additional file 1:^ (604.6KB, pdf)
Figure S3. “Volcano plots” of Spearman correlation differences vs. FDR-
adjusted p-values (of interaction term in linear model, see
[206]Methods) for A) NCI-60 cell line analysis and B) breast cancer
data analysis. (PDF 604 kb)
[207]Additional file 2:^ (1.4MB, docx)
Documentation on IntLIM. (DOCX 1444 kb)
[208]Additional file 3:^ (555.3KB, pdf)
Figure S1. Preliminary analysis of filtered NCI-60 data involving 14
breast/prostrate/ovarian cancer (BPO) lines and 6 leukemia cell lines
with 220 filtered metabolites and 16,188 genes. A) Distribution of
normalized (Metabolon method) metabolite abundances among NCI-60 cell
lines. B) Distribution of normalized (MAS5 algorithm) gene expression
data. C, D) Principal component analysis of metabolomics and gene
expression data, respectively. In the IntLIM package Rshiny app, these
plots are interactive and hovering over points will provide information
on those points (e.g. sample names). (PDF 555 kb)
[209]Additional file 4:^ (104.8KB, xlsx)
Data S1. NCI-60 Results with FDR Adjusted p-value < 0.10 and
Correlation Difference > 0.50. (XLSX 104 kb)
[210]Additional file 5:^ (56.9KB, pdf)
Table S1. NCI-60 Data pathway analysis results of genes. Ingenuity
Pathway Analysis Canonical Pathways from Genes involved in
Gene-Metabolite Pairs of the Leukemia Correlated Cluster and Leukemia
Anti-Correlated Cluster. P-values are all calculated from right-tailed
Fisher’s Exact Test. (PDF 56 kb)
[211]Additional file 6:^ (543.5KB, pdf)
Figure S2. Preliminary analysis of filtered breast cancer data
involving 108 samples (61 tumor and 47 non-tumor) with 379 metabolites
and 18,228 genes. A, B) Distribution of normalized metabolite levels
(Metabolon method) and RMA-normalized gene expression levels for all
samples, respectively. C,D) Principal component analysis of
metabolomics and gene expression data, respectively. In the IntLIM
package Rshiny app, these plots are interactive and hovering over
points will provide information on those points (e.g. sample names).
(PDF 543 kb)
[212]Additional file 7:^ (220.6KB, xlsx)
Data S2. Breast Cancer Results. FDR Adjusted p-value < 0.05 and
Spearman Correlation > 0.5. (XLSX 220 kb)
[213]Additional file 8:^ (55.1KB, pdf)
Table S2. Breast Cancer Data pathway analysis results of genes.
Ingenuity Pathway Analysis Canonical Pathways from Genes involved in
Gene-Metabolite Pairs of the Tumor Correlated Cluster and Tumor
Anti-Correlated Cluster. P-values are all calculated from right-tailed
Fisher’s Exact Test. (PDF 55 kb)
[214]Additional file 9:^ (53.1KB, pdf)
Table S3. Breast Cancer Data pathway analysis results of metabolites.
Ingenuity Pathway Analysis Canonical Pathways from Metabolites involved
in Gene-Metabolite Pairs of the Tumor Correlated Cluster. P-values are
all calculated from right-tailed Fisher’s Exact Test. (PDF 53 kb)
Acknowledgements