Abstract
Metabolism is controlled to ensure organismal development and
homeostasis. Several mechanisms regulate metabolism, including
allosteric control and transcriptional regulation of metabolic enzymes
and transporters. So far, metabolism regulation has mostly been
described for individual genes and pathways, and the extent of
transcriptional regulation of the entire metabolic network remains
largely unknown. Here, we find that three‐quarters of all metabolic
genes are transcriptionally regulated in the nematode Caenorhabditis
elegans. We find that many annotated metabolic pathways are
coexpressed, and we use gene expression data and the iCEL1314 metabolic
network model to define coregulated subpathways in an unbiased manner.
Using a large gene expression compendium, we determine the conditions
where subpathways exhibit strong coexpression. Finally, we develop
“WormClust,” a web application that enables a gene‐by‐gene query of
genes to view their association with metabolic (sub)‐pathways. Overall,
this study sheds light on the ubiquity of transcriptional regulation of
metabolism and provides a blueprint for similar studies in other
organisms, including humans.
Keywords: gene regulation, metabolic network model, metabolism, systems
biology, transcription
Subject Categories: Chromatin, Transcription & Genomics; Computational
Biology; Metabolism
__________________________________________________________________
Systems‐level analyses in C. elegans show widespread transcriptional
regulation of metabolic genes, coexpression of metabolic
(sub)‐pathways, and activation/repression conditions of (sub)‐pathways.
graphic file with name MSB-19-e11443-g009.jpg
Introduction
All organisms regulate their metabolism during development and to
maintain homeostasis under fluctuating dietary and environmental
conditions. In humans, failure to maintain homeostasis can lead to a
variety of metabolic disorders such as inborn errors in human
metabolism, obesity, hypertension, and diabetes (Sharma
et al, [34]2008; DeBerardinis & Thompson, [35]2012). Metabolism can be
regulated through different mechanisms. One well‐known mechanism is
allostery, a fast‐acting mechanism where metabolites directly modulate
enzyme activity. For instance, the enzyme phosphofructokinase, which
regulates the conversion of fructose 6‐phosphate to fructose
1,6‐biphosphate, is allosterically regulated during glycolysis. This
reaction is coupled to ATP hydrolysis where ATP binding to
phosphofructokinase inhibits enzyme activity by decreasing its affinity
for fructose 6‐phosphate, while conversion to AMP reverses the
inhibitory effect and increases the activity of the enzyme (Blangy
et al, [36]1968; Schirmer & Evans, [37]1990). Metabolism can also be
regulated transcriptionally by activating or repressing the expression
of genes encoding metabolic enzymes or transporters. This mechanism is
relatively slow and allows the organism to adapt to changing cellular
or environmental conditions. Well‐known examples of the transcriptional
regulation of metabolism include induction of the lac operon in
Escherichia coli in response to a switch from glucose to lactose as a
carbon source (Jacob & Monod, [38]1961; Gilbert &
Muller‐Hill, [39]1966); the Leloir pathway in Saccharomyces cerevisiae,
which is transcriptionally activated by galactose (Caputto
et al, [40]1949; Hopper et al, [41]1978); and mammalian cholesterol
biosynthesis genes, which are activated by the transcription factor
(TF) SREBP when cholesterol levels are low (Brown &
Goldstein, [42]1997; DeBose‐Boyd & Ye, [43]2018). Another example of
transcriptional rewiring of metabolism involves propionate degradation
in the nematode Caenorhabditis elegans. Like humans, C. elegans
utilizes a vitamin B12‐dependent pathway to break down this short‐chain
fatty acid. When dietary vitamin B12 is low, propionate metabolism is
transcriptionally rewired to an alternative degradation pathway
referred to as the propionate shunt, thereby preventing toxic
propionate accumulation (Watson et al, [44]2014, [45]2016; Bulcha
et al, [46]2019).
The contribution of transcriptional regulation of metabolism has mostly
been studied at a systems, or network, level, in single‐cell organisms
such as E. coli and S. cerevisiae and to a lesser extent in plants
(Ihmels et al, [47]2004; Kharchenko et al, [48]2005; Seshasayee
et al, [49]2009; Ledezma‐Tejeida et al, [50]2017; Tang
et al, [51]2021). However, the extent to which overall metabolic
activity is under transcriptional control in animals remains unclear.
C. elegans is an excellent multicellular animal model to study the
transcriptional regulation of metabolism at a systems level: Its fixed
lineage of 959 somatic cells was fully described (Sulston &
Horvitz, [52]1977), its metabolism shows extensive conservation with
human metabolism (Lai et al, [53]2000; Shaye & Greenwald, [54]2011),
many gene expression datasets are available, and a genome‐scale
metabolic network model (MNM) has been reconstructed (Yilmaz &
Walhout, [55]2016). The most up‐to‐date MNM, iCEL1314, contains 907
metabolites, 2,230 reactions and 1,314 genes (Yilmaz et al, [56]2020).
By using flux balance analysis (FBA), iCEL1314 can be used to gain
insight into the metabolic state of C. elegans during different
nutritional conditions or in different tissues. An additional set of
metabolic genes has been predicted based on homology with known
metabolic enzymes in other organisms or based on the presence of
domains found in metabolic enzymes (Yilmaz & Walhout, [57]2016;
Bhattacharya et al, [58]2022).
Guilt‐by‐association is a powerful concept in systems biology that can
be used to identify genes with shared functions. One way this can be
done is by coexpression analysis where a functional association can be
predicted when genes are coexpressed in many transcriptomic datasets
(Eisen et al, [59]1998; Hughes et al, [60]2000; Kim et al, [61]2001;
Segal et al, [62]2003; Stuart et al, [63]2003). In C. elegans,
coexpression analysis has been used to study global, temporal, and
spatial gene expression (Reinke et al, [64]2000; Kim et al, [65]2001,
[66]2016; Spencer et al, [67]2011; Liu et al, [68]2018).
Here, we investigated the extent of transcriptional regulation of
C. elegans metabolism. We developed a computational pipeline to
identify genes of which the corresponding mRNA varies significantly
during development, in different tissues, and across a gene expression
compendium consisting of different conditions. Using both a supervised
and a semisupervised method, we identified coexpressed metabolic
pathways and subpathways. Overall, we found that three‐quarters of
metabolic genes exhibit variation in expression, which is comparable to
the proportion in nonmetabolic genes. Further, we found that most
annotated metabolic pathways contain genes that are significantly
coexpressed. With a custom‐made semisupervised method, we identified
clusters of genes that define coexpressed subpathways or combinations
of subpathways that likely form functional metabolic units. We
extracted conditions where coexpressed clusters of genes are
coordinately activated or repressed, revealing how these clusters may
contribute to metabolic homeostasis. We developed a web application we
named “WormClust” that is available on WormFlux website (Yilmaz &
Walhout, [69]2016). WormClust enables querying of C. elegans genes to
identify metabolic (sub‐) pathways with which these genes are
coexpressed. Altogether, our findings show that transcriptional
regulation of metabolic genes and pathways is ubiquitous in C. elegans,
indicating that this principle is broadly conserved from single‐cell
organisms to metazoa. Finally, our analyses and tools provide a
platform for similar studies in other organisms, including humans.
Results
Three‐Quarter of metabolic genes are transcriptionally regulated
mRNA levels are determined by a combination of synthesis and
degradation. Here, we used variation in mRNA levels as a first
approximation for transcriptional regulation. We evaluated the
expression of metabolic genes during development, in different tissues,
and under different conditions to identify metabolic genes that are
highly variant and therefore likely transcriptionally regulated. We
used all annotated metabolic genes (Yilmaz & Walhout, [70]2016; Yilmaz
et al, [71]2020) and grouped them into four classes based on current
annotation (Dataset [72]EV1): Class A, iCEL1314 genes (N = 1,308; after
removal of six pseudogenes, see [73]Materials and Methods); class B,
genes annotated to reactions that cannot yet be connected to the
iCEL1314 model (N = 192); class C, genes encoding proteins with
homology to metabolic enzymes in other organisms (N = 860); and class
D, genes encoding proteins with a domain found in known metabolic
enzymes (N = 132). Hereafter, we refer to the 1,308 genes in class A as
“iCEL1314 genes” and the remaining 1,184 as “other metabolic genes”.
We first identified metabolic genes that vary in expression during
larval development by using a high‐quality postembryonic time‐resolved
RNA‐seq dataset, hereafter referred to as the “development dataset”
(Kim et al, [74]2013; Figs [75]1A and [76]EV1A, and Dataset [77]EV2).
Briefly, this dataset contains expression profiles of
stage‐synchronized animals that were collected every 2 h after hatching
for 48 h at 20°C. In the original paper, genes were grouped into 12
clusters based on similarity in developmental expression profiles. One
of these clusters contains 5,045 genes, including 995 metabolic genes,
with relatively invariant temporal expressions. We will refer to this
cluster as the “flat cluster.” However, although the expression levels
of most of the flat cluster genes are relatively stable during
development, we noticed that some did exhibit considerable variation.
Additionally, many invariant genes from other clusters were not
included in the flat cluster. Therefore, we used an unbiased
statistical method, called variation score (VS) to stringently define
variation in developmental gene expression. This included calculating
deviation from the flat cluster genes' expression and then empirically
establishing a conservative VS threshold (0.169; Fig [78]EV1A and
[79]B, see details in [80]Materials and Methods). We excluded 3,552
genes, including 213 metabolic genes, because they were expressed at
levels too low for variability analysis. For the remaining metabolic
genes, we found that 754 (31.4%, VS ≥ 0.169) are highly variant, and 98
were invariant (4%, VS = 0; Fig [81]1B). The remaining 1,332 metabolic
genes (0 < VS <0.169) were annotated as moderately variant (Figs [82]1B
and [83]EV1C). About a quarter of iCEL1314 genes (329, or 26%) are
highly variant, which is lower than the proportion of other metabolic
genes (37%) and nonmetabolic genes (41%; Fig [84]EV1C and [85]D, and
Dataset [86]EV2). The percentage of highly variant metabolic genes is
lower than that of nonmetabolic genes across most VS thresholds
(Fig [87]1C).
Figure 1. Analysis of metabolic gene expression during development, in
different tissues and in a gene expression compendium.
Figure 1
[88]Open in a new tab
1. Computational pipeline to identify C. elegans metabolic genes that
change in expression during development, across tissues, and
compendium of multiple conditions. Statistically significant
differences in gene expression were calculated in the developmental
dataset using a variation score (VS), in the tissue dataset using
coefficient of variation (CV) and by number of datasets with
CV ≥ 0.75 in the compendium (collection of 177 datasets).
2. Pie charts of metabolic and nonmetabolic gene expression variation
in the three different datasets: development, tissue, and
compendium separately and combined. Bar graph shows metabolic genes
in iCEL1314 and other (predicted) metabolic genes.
3. Comparison of percentage of highly variant metabolic versus
nonmetabolic genes at different VS thresholds.
4. Comparison of percentage of highly variant metabolic versus
nonmetabolic genes at different CV thresholds.
5. Scatter plot of VS (development) versus CV (tissue) of metabolic
genes. The plot is divided into four quadrants: Q1 with
moderate/low VS and moderate/low CV; Q2 with high VS and
moderate/low CV; Q3 with moderate/ low VS and high CV; and Q4 with
high VS and high CV. The Pearson and Spearman correlation
coefficients and the corresponding P‐values are indicated. Examples
of Q4 genes that are highly variant both during development and in
different tissues include ugt‐13, ugt‐18, and ugt‐34 (UGT enzymes);
gcy‐3 (guanylate cyclases); and ddo‐3, gln‐2, argk‐1, and phy‐3
(amino acid metabolism).
6. Comparison of percentage of highly variant metabolic versus
nonmetabolic genes at different cutoffs of number of datasets with
high CV (≥ 0.75).
7. Comparison of percentage of highly variant metabolic versus
nonmetabolic genes at different CV cutoffs in at least three
datasets in the compendium.
8. Comparison of percentage of invariant metabolic versus nonmetabolic
genes at different cutoffs of the fraction of datasets with low CV
(< 0.3).
9. Venn diagram of highly variant metabolic genes in the different
datasets.
Figure EV1. Transcriptional regulation of metabolism during development.
Figure EV1
[89]Open in a new tab
1. Computational pipeline to identify highly variant metabolic genes
during development. Genes that showed either moderate or high
expression in at least one time point were selected, reducing the
number of genes from 18,113 (2,397 metabolic) to 14,561 (2,184
metabolic). For each gene, VS was calculated using the deviation
from a flat reference profile at each time point (see [90]Materials
and Methods). The red line indicates the mean value and light red
shaded area the standard deviation of the flat reference profile.
The profile of acdh‐1 is plotted in blue as an example of a
developmentally regulated gene. d[5] denotes the deviation of
acdh‐1 expression from the flat reference profile at the 5^th data
point.
2. Distribution of VS of genes belonging to the flat cluster versus
those belonging to other clusters. The vertical lines at 1 and 3
represent the iCEL1314 genes nduf‐7 and ugt‐42, which have the
lowest and highest VS of the flat set, respectively. The vertical
line at 2 indicates the gene C27A7.3 with selected threshold of VS
at the 97% quantile of the flat cluster (VS = 0.169).
3. Diagram showing criterion of categorizing metabolic and
nonmetabolic genes into four categories across development: lowly
expressed, invariant, moderately variant and highly variant.
4. Bar graph shows the distinction of low expressed, invariant,
moderately variant and highly variant genes during development in
iCEL1314 and other metabolic genes. Color legend as indicated in
(C).
To identify metabolic genes that exhibit differential expression across
tissues, we selected a high‐quality single‐cell RNA sequencing dataset
that measured gene expression during L2 stage of C. elegans across
seven major tissues: body wall muscle, glia, gonad, hypodermis,
intestine, neurons, and pharynx (Cao et al, [91]2017; Figs [92]1A and
[93]EV2A, and Dataset [94]EV3). This dataset is hereafter referred to
as the “tissue dataset.” Unlike the development dataset, the tissue
dataset does not have a defined cluster of invariant genes. Therefore,
we used the less sophisticated coefficient of variation (CV) measure to
identify variation in gene expression across the seven tissues
(Fig [95]EV2A). We previously found that the five genes comprising the
propionate shunt are differentially expressed in different tissues
(Watson et al, [96]2016; Yilmaz et al, [97]2020), and each of these
genes had a CV greater than 0.75 (Fig [98]EV2B). Visual inspection of
genes with CV values from 0.15 to 1.2 indicates that CV = 0.75 provides
a clear, yet conservative threshold to annotate highly variant genes
across tissues (Fig [99]EV2C). We further classified genes with a CV
less than 0.75 but greater than or equal to 0.3 as moderately variant
and genes with a CV less than 0.3 as invariant (Fig [100]EV2D, see
Fig [101]EV2B and [102]C for examples). A total of 6,370 genes,
including 348 metabolic genes, were not included in this analysis
because they are expressed at low levels (Yilmaz et al, [103]2020;
details in [104]Materials and Methods). We identified ~60 and ~25% of
metabolic genes as highly and moderately variant, respectively. These
include 781 highly variant and 405 moderately variant iCEL1314 genes
(Figs [105]1B and [106]EV2E). A very small number of metabolic genes
(13, or 1%) were invariant across tissues. Even though the analysis of
the different datasets used a different statistical approach, these
results suggest that more metabolic genes are variant and, therefore,
likely transcriptionally regulated in different tissues than during
development (Fig [107]1B). In contrast to development, the percentage
of metabolic genes that are highly variant across tissues at any CV
cutoff is greater than nonmetabolic genes (Fig [108]1D), indicating
that metabolic processes exhibit a relatively high level of tissue
specificity.
Figure EV2. Transcriptional regulation of metabolic genes across different
tissues.
Figure EV2
[109]Open in a new tab
1. Computational pipeline to determine highly variant metabolic genes
across tissues. Genes that show either moderate or high expression
in at least one tissue are selected for analysis reducing the
number of C. elegans genes from 19,675 (2,491 metabolic) to 13,305
(2,143 metabolic). The coefficient of variation (CV) of each gene
was calculated by dividing the standard deviation of expression
across tissues by the mean expression. Different thresholds of CV
were titrated to select a stringent CV to categorize genes as
variant and therefore potentially transcriptionally regulated.
Examples for each threshold are provided in (B).
2. Tissue expression and CV values of propionate shunt genes.
3. Bar graphs showing expression profiles of example genes across
tissues with CV 0.15, 0.3, 0.45, 0.6, 0.75, 0.9, 1.05 and 1.2.
Numbering of examples is according to the corresponding threshold
lines in (A).
4. Diagram showing criterion of categorizing metabolic and
nonmetabolic genes into four categories across tissues: lowly
expressed, invariant, moderately variant and highly variant.
5. Bar graph shows the distinction of low expressed, invariant,
moderately variant and highly variant genes across tissues in
iCEL1314 and other metabolic genes. Color legend as indicated in
(D).
Overall, metabolic gene expression showed higher variation across
tissues at a fixed time point (L2) than larval development. However,
because we used two different statistical methods for the development
and tissue datasets, we confirmed that it held true when we applied the
same CV measure we used in the tissue dataset to the development
dataset (Appendix Fig [110]S1A). The two datasets also have different
resolution: the development dataset has great temporal but no spatial
resolution because it was measured by bulk RNA‐seq while the tissue
dataset, which was measured by single‐cell RNA‐seq, has great spatial
but no temporal resolution. Therefore, we examined genes that are
highly tissue‐specific, because they are highly expressed in a single
tissue in the tissue dataset, and found that only 57% of these are also
highly variant in the development dataset (Appendix Fig [111]S1B).
Therefore, we conclude that transcriptional regulation of metabolic
genes more frequently establishes spatial than temporal gene expression
patterns.
To directly compare metabolic gene expression in tissues and
development, we plotted VS values of metabolic genes across development
versus CV values across tissues and found that these two parameters are
moderately correlated (Fig [112]1E). We divided the scatter plot into
four quadrants, based on the thresholds used in each dataset
(Dataset [113]EV4). To determine whether there are any functional
enrichments, we performed pathway enrichment analysis (PEA) on the
metabolic genes for each quadrant using the tool provided on the
WormFlux website (Yilmaz & Walhout, [114]2016). The first quadrant (Q1)
consists of genes with moderate/low developmental variation and
moderate/low tissue variation. It has 595 metabolic genes, including
385 iCEL1314 genes that are enriched in several metabolic pathways,
such as the electron transport chain (ETC), aminoacyl‐tRNA
biosynthesis, the tricarboxylic acid (TCA) cycle, the pentose phosphate
pathway and glycolysis/gluconeogenesis (Appendix Fig [115]S2A). The
second quadrant (Q2), with high developmental variation and
moderate/low tissue variation, consists of only 176 genes, including 68
iCEL1314 genes that are enriched in sulfur, cysteine, and methionine
metabolism (Appendix Fig [116]S2A). The 891 genes in the third quadrant
(Q3) consist of genes with moderate/low developmental variation and
high tissue variation. They include 504 iCEL1314 genes that are highly
enriched in lipid metabolism. Notably, genes involved in peroxisomal
fatty acid (FA) metabolism vary more in expression than mitochondrial
FA degradation (Appendix Fig [117]S2A). Finally, the 577 genes in the
fourth quadrant (Q4) show high developmental variation and high tissue
variation. They include 261 iCEL1314 genes, which are enriched in
UDP‐glucuronosyltransferases (UGT) enzymes, guanylate cyclases,
glyoxylate and dicarboxylate metabolism, and amino acid metabolism,
such as arginine and proline metabolism and glutamate/glutamine
metabolism (Appendix Fig [118]S2A). Interestingly, there are
differences among different types of metabolic genes. For instance,
amino acid metabolism genes are variant in both development and in
tissues, while lipid metabolism genes are mostly variant in tissues,
and growth and energy metabolism are relatively invariant in both
development and in tissues.
To evaluate metabolic gene expression more broadly, we combined 177
expression profiling datasets into an expression compendium, an earlier
version of which we have used to study TF paralogs (Reece‐Hoyes
et al, [119]2013; Figs [120]1A and [121]EV3A, Datasets [122]EV5 and
[123]EV6, see [124]Materials and Methods). Using a CV threshold ≥ 0.75
in at least three datasets, we found that 775 of the 2,492 metabolic
genes (~31%), including 284 iCEL1314 genes, are highly variant in the
compendium, which is lower than nonmetabolic genes (44%, Figs [125]1B
and [126]EV3B). This difference holds true for different cutoffs of the
number of datasets showing high variation (Fig [127]1F) and across
different CV thresholds (Fig [128]1G). However, the percentage of
invariant genes is similar between metabolic and nonmetabolic genes
using different CV cutoffs (Fig [129]1H).
Figure EV3. Transcriptional regulation of metabolic genes across a
compendium.
Figure EV3
[130]Open in a new tab
1. Computational pipeline to determine transcriptionally regulated
metabolic genes across the gene expression compendium. Datasets
were evaluated for batch effects using correlation distribution.
Some skewed datasets were corrected for batch effect by splitting
into separate datasets, while some were removed if the source of
skewness was not clear. This resulted in 177 datasets for analyses.
CV of each gene was calculated. The criterion of classifying genes
is based on the number of datasets with high CV and fraction of
datasets with low CV.
2. Bar graph showing low expressed, invariant, moderately variant and
highly variant genes in the compendium in iCEL1314 and other
metabolic genes. Color legend as indicated in (A).
When we compared highly variant genes in development, tissue, and
compendium, we found that a total of 1,867 metabolic genes (75%) are
highly variant in at least one of the three datasets and that 283
metabolic genes are highly variant across all three datasets
(Fig [131]1B and [132]I). Using phenotypes provided in WormBase WS282
(Harris et al, [133]2020b), we found that the 75% highly variant
metabolic genes are enriched in conditional response variants such as
chemical and pathogen response, and depleted in essential phenotypes
such as lethality, larval arrest, slow growth, and sterility (Appendix
Fig [134]S2B). Finally, the remaining 624 (25%) of metabolic genes that
are not highly variant in any dataset are similar in pathway enrichment
as the Q1 genes discussed previously and are enriched in the essential
phenotypes (Appendix Fig [135]S2C and D).
Altogether, our analyses indicate that at least 75% (1,867 out of
2,492) of metabolic genes vary in mRNA levels and are therefore likely
transcriptionally regulated, including 902 iCEL1314 genes (~69%;
Fig [136]1B). This is similar to the proportion of varying nonmetabolic
genes (~79%), indicating that metabolic genes are overall at least as
much under transcriptional control as other genes (Fig [137]1B).
A supervised approach shows widespread Coexpression of genes comprising
metabolic pathways
We previously found that the five genes comprising the propionate shunt
pathway are coordinately activated in response to propionate
accumulation (Watson et al, [138]2016; Bulcha et al, [139]2019). In
addition, we found strong coexpression of genes functioning in the
methionine/S‐adenosylmethionine (Met/SAM) cycle, for instance when flux
through this pathway is perturbed (Giese et al, [140]2020). To
systematically test which C. elegans metabolic pathways exhibit
coexpression, we developed a custom pathway enrichment analysis
pipeline (Fig [141]2A) based on gene set enrichment analysis (GSEA; see
[142]Materials and Methods; Subramanian et al, [143]2005) and applied
it to the compendium. We ran this pipeline using metabolic pathways,
enzyme complexes, and enzyme families as defined in WormPaths (Walker
et al, [144]2021). Henceforth, we use “category” to refer to a group of
metabolic genes that best fit in an enzyme complex or related set of
enzymes such as aminoacyl‐tRNA synthetases. We calculated an enrichment
score (ES) that defines the enrichment of relatively high coexpression
within that set. A normalized ES (NES) indicates relative strength of
this enrichment compared with randomized tests, the significance of
which is measured as a false discovery rate (FDR; Fig [145]2A). With an
FDR cutoff of ≤ 0.05, 52 of 84 metabolic pathways or categories (~61%)
exhibit coexpression, which is significantly more than expected by
chance (Fig [146]2B and [147]C, Dataset [148]EV7, Appendix
Fig [149]S3). As expected, the 52 coexpressed metabolic pathways and
categories include the propionate shunt and the Met/SAM cycle
(Fig [150]2D and [151]E). When we examined coexpression in pathways and
categories separately, we found that 78% of categories showed
significant coexpression compared to 58% of pathways (Fig [152]2B).
Examples of metabolic pathways that exhibit high coexpression include
peroxisomal FA degradation and starch and sucrose metabolism
(Fig [153]2F and [154]G). Examples of coexpressed categories include
vacuolar ATPases, ETC complex I, and aminoacyl‐tRNA synthetases
(Appendix Fig [155]S3). There are 32 categories and pathways that do
not exhibit self‐enrichment, including pantothenate and CoA
biosynthesis and mevalonate metabolism (Fig [156]2H and [157]I, and
Appendix Fig [158]S3). Such pathways may either not be regulated at
all, may be regulated by allostery, or only one or a few genes in these
pathways are transcriptionally regulated and may therefore function as
key regulatory genes. Alternatively, these pathways maybe coregulated
in conditions that were not yet profiled and therefore are not included
in the compendium.
Figure 2. Supervised approach to investigate coexpression of metabolic
pathways.
Figure 2
[159]Open in a new tab
* A
Custom computational pathway enrichment analysis pipeline that
determines coexpressed genes functioning in the same metabolic
pathway. Pairwise coexpression was based on the gene expression
compendium. For every annotated metabolic pathway, the coexpression
of pathway genes (columns) to all metabolic genes (rows) was
extracted. A ranked list of genes was obtained for each pathway by
taking the mean of coexpression values in rows while ignoring
self‐correlations. Weighted gene set enrichment analysis was then
performed to find significantly enriched pathways. If a pathway is
self‐enriched with FDR ≤ 0.05, it is annotated as coexpressed.
* B
Bar graph indicating the percentage of metabolic pathways and
categories that show significant coexpression compared with ones
that are not self‐enriched for coexpression.
* C
Histogram denoting the number of significantly coexpressed
metabolic pathways obtained by 1,000 randomizations while
maintaining the structure of the data.
* D–I
Mountain plots showing self‐enrichment of (D) propionate shunt
pathway, (E) Met/SAM cycle, (F) peroxisomal fatty acid degradation
pathway, (G) starch and sucrose metabolism, (H) pantothenate and
CoA biosynthesis, and (I) mevalonate metabolism.
* J
Metabolic pathways often consist of reactions catalyzed by single
genes, OR genes and AND genes. All genes involved in the same
pathway are collectively annotated as all pathway genes. Genes that
are associated with distinct reactions are annotated as PW genes.
PW gene pairs exclude AND and OR gene pairs. Met/SAM cycle pathway,
which consists of 13 metabolic genes, is shown as an example.
* K
Percentages of pairs of AND genes, OR genes, other paralogs, operon
genes, and random metabolic genes categorized as coexpressed using
different coexpression values as cutoffs. Coexpression values are
based on the gene expression compendium.
* L
Percentage of random, all pathway, PW, and PO gene pairs
categorized as coexpressed using different coexpression values as
cutoffs. Coexpression values are based on the gene expression
compendium.
The extent of within‐pathway coexpression of metabolic genes can be
potentially confounded because metabolic reactions are often associated
with multiple genes in gene‐protein‐reaction (GPR) annotations (Kim
et al, [160]2008; Thiele & Palsson, [161]2010). There are two reasons
for this. First, some metabolic reactions are catalyzed by enzyme
complexes comprising two or more proteins. In such cases, all genes
need to be expressed for the reaction to take place and are therefore
annotated here as “AND” genes. Second, some genes are part of larger
families (paralogs) that encode isozymes or highly similar proteins.
Metabolic network reconstruction efforts use protein sequence homology
to associate genes with reactions (Thiele & Palsson, [162]2010; Yilmaz
& Walhout, [163]2016, [164]2017). As a result, multiple highly
homologous paralogs may be associated with the same metabolic reaction.
Such paralogs are annotated here as “OR” genes. Some reactions are
associated with a combination of AND and OR genes (Fig [165]EV4A).
Finally, for some gene families it may be that one member catalyzes one
reaction and another member catalyzes another. Paralogs that are
associated with distinct reactions are referred to here as “other
paralogs” (Fig [166]EV4B). Pathways can be associated with multiple
types of AND and OR genes (Fig [167]2J).
Figure EV4. Reaction‐level analysis of metabolic pathways.
Figure EV4
[168]Open in a new tab
* A
The conversion of succinate to fumarate, which is part of complex
II of the ETC and of the TCA cycle, is carried out by succinate
dehydrogenase. Diagram showing that succinate dehydrogenase is
composed of the OR genes sdha‐1 and sdha‐2 that each function
together with the rest of the genes as AND genes. The GPR of this
reaction is noted as [(sdha‐1 | sdha‐2) & sdhb‐1] & mev‐1 &
sdhd‐1]. Color legend is indicated.
* B
Example of a gene family (aagr) where members occur as paralogous
OR gene pairs in separate reactions. Pairs of paralogs associated
with different reactions are called other paralogs.
* C
Violin plot comparing coexpression for different populations of
gene pairs including random, AND, OR, operon and other paralogs
gene pairs in compendium of expression datasets.
* D, E
Percentages of AND, OR, other paralogs, operon genes and random
metabolic gene pairs categorized as coexpressed using different
coexpression values as cutoffs across developmental time (D) and
tissues (E). Color legend as indicated in (D).
* F–H
Violin plots showing coexpression of random gene pairs, all pathway
genes, PW genes and PO genes across compendium (F), different
developmental stages (G) and tissues (H).
AND genes encode proteins that function together in complexes, and such
genes are often strongly coexpressed (Jansen et al, [169]2002). For
example, genes encoding ETC complex members are coexpressed and
coregulated (van Waveren & Moraes, [170]2008). Therefore, we wondered
whether this holds true for AND genes in iCEL1314 and, if so, whether
this would inflate pathway coexpression enrichment. To test this, we
systematically assessed coexpression of different types of gene pairs.
As expected, we found that AND genes are significantly more coexpressed
than random gene pairs, OR genes and other paralogs (Figs [171]2K and
[172]EV4C–E). Both OR genes and other paralogs are also more
coexpressed than random metabolic gene pairs in all three datasets
(Figs [173]2K and [174]EV4C–E). Surprisingly, OR genes are more
coexpressed than other paralogs across tissues (Figs [175]2K and
[176]EV4C–E).
In C. elegans, ~18% of genes are transcribed from operons (Blumenthal
et al, [177]2002). In total, 26% of metabolic genes occur in operons.
However, they most frequently occur as a pair with a nonmetabolic gene.
In total, 242 metabolic genes (~10% of all metabolic genes) occur in a
pair with another metabolic gene in an operon (Dataset [178]EV1).
As expected, these operon gene pairs are more coexpressed than random
gene pairs, OR genes and other paralogs, thus serving as a validation
for our coexpression analysis. However, these pairs are less
coexpressed than AND genes and there is no overlap between the two
categories. This shows that enzyme complexes are strongly coregulated
and their coregulation mechanism is largely independent of operonic
organization (Figs [179]2K and [180]EV4C–E).
Based on the analysis of AND, OR and operon genes, it is difficult to
determine the contribution of the coexpression of such gene pairs to
pathway enrichment. Therefore, we examined coexpression of gene pairs
that are annotated with distinct reactions in a pathway, which we refer
to as pathway (PW) genes (Fig [181]2J). We found that PW gene pairs are
significantly more coexpressed than random gene pairs (Figs [182]2L and
[183]EV4F–H). We also examined coexpression of gene pairs that are not
part of an operon, which we refer to as pathway excluding operon (PO)
genes. There are only three pathway gene pairs that are part of operon;
hence, there is no significant difference between pathway genes and PO
genes coexpression (Figs [184]2L and [185]EV4F–H). Therefore, pathway
coexpression is not just driven by AND, OR and operon genes, indicating
that pathway genes' coexpression is a true feature of many metabolic
pathways.
A Semisupervised approach extracts coexpressed subpathways
Our finding that metabolic pathways and categories exhibit extensive
coexpression was based on previously annotated pathways (Walker
et al, [186]2021). However, these pathways connect into the larger
metabolic network and the definition of the start and ending of each
pathway is somewhat arbitrary. Since there is extensive coexpression of
genes that function together in predefined pathways, we reasoned that
we may be able to use coexpression to extract metabolic (sub)‐pathways
in an unbiased manner. To specifically focus on metabolic genes that
function in connected reactions in the metabolic network, we developed
a “coflux” metric that calculates flux dependency between metabolic
genes using the network model (see details in [187]Materials and
Methods; Dataset [188]EV8). Reactions in linear pathways have complete
flux dependence (i.e., coflux = 1), while in branched pathways flux
dependency may be partial (coflux = between 0 and 1), and in uncoupled
reactions, there is no dependence (coflux = 0). We then used a custom
semisupervised approach that multiplies coflux and coexpression values
and clustered the resulting product matrix with a relatively stringent
set of parameters (Fig [189]3A and [190]B, Dataset [191]EV9, Appendix
Fig [192]S4A, see [193]Materials and Methods for details).
Figure 3. Semisupervised approach to extract coexpressed and flux‐dependent
metabolic genes.
Figure 3
[194]Open in a new tab
1. Computational pipeline to extract tightly coregulated units in the
metabolic network: Functional relationships are provided through
theoretical flux associations (coflux) calculated using C. elegans
metabolic network model iCEL1314, while expression correlations
come from the compendium of 177 expression datasets. Hierarchical
clustering on the product of coflux and coexpression matrix gives
coexpressed metabolic pathways.
2. Heatmaps showing coflux and coexpression of iCEL1314 genes and
clustered heatmap showing added modularity to coexpression space by
product of coexpression and coflux. Color legend is indicated.
3. Distinct clusters denoted by clustered heatmap of genes in
canonical and shunt pathways of propionate degradation were
extracted using dynamic cut tree algorithm with stringent
parameters (deepSplit = 2, minClusterSize = 3). Color legend as
indicated in (B).
As expected, the propionate shunt pathway genes formed a tight cluster
(Fig [195]3C and Appendix Fig [196]S4B). Interestingly, while the first
four genes, acdh‐1, ech‐6, hach‐1, and hphd‐1, occurred closely
together, the fifth gene, alh‐8, was not part of the same cluster. This
could be explained in two ways. First, alh‐8 encodes an enzyme that
functions at a junction in the pathway where its substrate malonate
semialdehyde is converted either to acetyl‐coa or, potentially, to
beta‐alanine. Therefore, metabolic flux is divided in two directions
and is not linearly coupled with the shunt pathway flux like the first
four reactions. Second, alh‐8 is annotated to another reaction where
2‐methyl‐3‐oxopropionate is converted to propionyl‐CoA (Fig [197]3C).
This approach also revealed another cluster comprising the canonical,
vitamin B12‐dependent propionate degradation pathway, indicating that,
like the propionate shunt, this pathway may also be transcriptionally
activated or repressed under specific conditions (Fig [198]3C,
Dataset [199]EV9).
The semisupervised approach reveals pathway boundaries
The propionate shunt example above shows that the semisupervised
approach can extract subpathways (e.g., the propionate shunt) from
previously defined pathways (e.g., propionate degradation) based on
coexpression and coflux. We therefore used other clusters defined by
the semisupervised approach to better define starts and ends of
different pathways.
An example of a pre‐annotated WormPaths pathway that was fully captured
with the semisupervised approach is peroxisomal FA degradation
(Fig [200]4A, Dataset [201]EV9). The first reaction in peroxisomal
beta‐oxidation is catalyzed by acyl‐CoA oxidases (encoded by acox
genes). Only acox‐1.1 and acox‐3 in the acox family genes are
coexpressed with the other peroxisomal FA oxidation genes, indicating
that they are more likely to function in this pathway than the other
acox genes, which are coexpressed with each other, and with
mitochondrial FA degradation genes (Dataset [202]EV9).
Figure 4. Semisupervised approach defines metabolic pathway boundaries.
Figure 4
[203]Open in a new tab
1. Clustered heatmap denoting a distinct cluster consisting of at
least one gene from every reaction in peroxisomal fatty acid
degradation. Heatmap genes are shown in bold font. Color legend,
indicated here, applies to all panels.
2. Clustered heatmap showing a distinct cluster formed by the tyrosine
degradation genes separate from the rest of the tyrosine
metabolism.
3. Clustered heatmap showing a distinct cluster formed by a boundary
within the histidine degradation pathway.
4. Clustered heatmap showing a distinct cluster formed by genes
traversing pathway boundaries that are parts of propionate,
alanine, and pyrimidine metabolism.
Examples where only a subset of annotated pathway genes clustered
together include tyrosine metabolism and histidine degradation.
Tyrosine can be metabolized via different reactions, in different
pathway branches (Fig [204]4B). In one pathway branch, tyrosine is
degraded in five steps to produce fumarate and acetoacetate. The genes
in this branch; gst‐43, C31H2.4, Y53G8B.1, hpd‐1, hgo‐1, fah‐1, and
gst‐42 form a tight cluster (Fig [205]4B, Dataset [206]EV9). This
cluster consists of OR genes hpd‐1 OR C31H2.4; and Y53G8B.1 OR gst‐42
OR gst‐43, which suggests that these genes are correctly annotated to
this pathway branch. Histidine can also be degraded via two pathway
branches: one converting histidine to glutamate through
N‐formyl‐L‐glutamate and the other converting histidine to
3‐methylimidazoleacetic acid. The four genes associated with the
conversion of histidine to N‐formyl‐L‐glutamate; haly‐1, Y51H4A.7,
amdh‐1, and cpin‐1, form one of the top‐ranked clusters (Fig [207]4C,
Dataset [208]EV9). However, the genes in the other branch are not
coexpressed.
We also found clusters consisting of genes that traversed different
pathways. For instance, alh‐8 and gta‐1, which are functionally
associated but not strongly coexpressed with the propionate shunt
(Fig [209]3D), cluster with T09B4.8 (alanine metabolism) and four other
genes belonging to pyrimidine metabolism: dpyd‐1, dhp‐1, dhp‐2 and
upb‐1 (Fig [210]4D, Dataset [211]EV9). This observation functionally
connects genes in what were heretofore separately annotated pathways,
that is, pyrimidine, alanine, and propionate metabolism. The
coexpression of these genes suggests that thymine is degraded, leading
to the formation of propionyl‐CoA, L‐3‐amino‐isobutanoate and
acetyl‐CoA. This observation also suggests that alh‐8 levels have a
stronger functional role in the conversion of 2‐methyloxopropanoate to
propionyl‐CoA than in the propionate shunt. Remarkably, this further
indicates that alh‐8 may participate in both the generation and
degradation of propionyl‐CoA.
The stringent cluster derivation parameters we used favor small
clusters, and as a result, interconnections between different pathways
may be lost in the analysis. To unveil such connections, we relaxed the
parameters to allow the derivation of larger clusters of coexpressed
genes (Dataset [212]EV10, Appendix Fig [213]S4C). With these settings,
the propionate shunt cluster expanded and included bckd‐1A, bckd‐1B,
dbt‐1, Y43F4A.4, ard‐1, acdh‐3, acdh‐9, and B0250.5, which are
annotated to branched‐chain amino acids (BCAA) isoleucine and valine
degradation pathways, but not alh‐8 or gta‐1 (Fig [214]EV5A, Appendix
Fig [215]S4D, Dataset [216]EV10). Propionyl‐CoA, the starting
metabolite of the propionate shunt, is produced by the breakdown of
valine and isoleucine. We recently proposed that the propionate shunt
not only functions to detoxify excess propionate but also to produce
acetyl‐CoA for ketone body and energy production (Ponomarova
et al, [217]2022). The coexpression of valine and isoleucine breakdown
genes with the propionate shunt indicates a functional connection
between these pathways to produce energy.
Figure EV5. Semisupervised clustering of product matrix using relaxed
parameters.
Figure EV5
[218]Open in a new tab
1. Distinct cluster of known coregulated metabolic pathway propionate
shunt genes together with isoleucine and valine degradation genes
obtained using relaxed clustering parameters (deepSplit = 3,
minClusterSize = 6) with the dynamic cut tree algorithm.
2. Clusters of Met/SAM cycle genes and genes of related pathways
obtained using relaxed (deepSplit = 3, minClusterSize = 6) and
stringent (deepSplit = 2, minClusterSize = 3) parameters. Stringent
clusters are shown near the drawn pathways of clustered genes.
3. Pathway boundary (green shade) within selenocompound metabolism
defined by genes in a high‐quality cluster obtained by relaxed
parameters and shown by the clustered heatmap (left). Mountain
plots showing comparison of self‐enrichment analysis statistics
(NES, ES and FDR) of selenocompound metabolism from WormPaths with
that of selenocompound cluster obtained by the semisupervised
approach (right).
The Met/SAM cycle provides another example of different degrees of
clustering that can be unveiled with different parameter settings
(Fig [219]EV5B). The smaller clusters with stringent clustering
captured different parts of one‐carbon metabolism with their
connections to Met/SAM cycle, while relaxed clustering combined these
genes into one single cluster. This cluster acts as a subsystem that
connects the Met/SAM cycle on one side with glycerophospholipid
metabolism, specifically phosphatidylcholine biosynthesis, which
depends on methylation reactions using SAM (Walker et al, [220]2011),
as well as with purine metabolism (Ducker & Rabinowitz, [221]2017).
Second, Met/SAM cycle genes are highly coexpressed with the folate
cycle gene which produces the methyl group that is used to convert
homocysteine into methionine in the Met/SAM cycle (Ducker &
Rabinowitz, [222]2017; Giese et al, [223]2020). Together, these results
confirm that the Met/SAM cycle is overall coexpressed (Giese
et al, [224]2020) and show that additional co‐functioning genes can be
identified.
The semisupervised approach also identified gene clusters that are not
part of any coexpressed pathway identified by the supervised method
above. An example is selenocompound metabolism, where a set of seven
genes form a highly coexpressed cluster (FDR = 0.025, NES = 1.7;
Fig [225]EV5C). In comparison, the respective FDR and NES values for
self‐enrichment of the WormPaths pathway of selenocompound metabolism
were 0.75 and 0.98 (Dataset [226]EV7). Altogether, these results
illustrate that not all genes in a pathway are coexpressed and further
indicate that a subset of a pathway or a combination of subsets from
multiple pathways may be under transcriptional control, illustrating
the utility of semisupervised approach as an addition to the predefined
metabolic pathways in WormPaths.
Metabolic pathway communities reveal coexpression among complexes and
pathways
To explore additional coexpression clusters than those that were
captured by the relaxed settings described previously, we visually
inspected the product matrix and extracted three clusters we refer to
as metabolic pathway “communities” (Fig [227]5A). We analyzed these
communities by WormPaths PEA (Walker et al, [228]2021). The first
community is enriched in ETC complexes I, III, and IV, indicating broad
transcriptional control of energy production (Fig [229]5B,
Dataset [230]EV11). The second community is enriched in mitochondrial
and peroxisomal FA degradation, FA biosynthesis, ascaroside
biosynthesis, and BCAA degradation (Fig [231]5C, Dataset [232]EV11).
The connection between FA metabolism and BCAA degradation may reflect
the fact that some FAs are synthesized from BCAA breakdown products.
For instance, branched‐chain fatty acids (BCFAs) are synthesized from
the branched‐chain alpha‐keto acids of valine, leucine, and isoleucine
such as isovaleryl‐CoA and isobutyryl‐CoA after their decarboxylation
and further chain elongation (Daschner et al, [233]2001; Jia
et al, [234]2016; Wallace et al, [235]2018). The third community is
enriched in aminoacyl‐tRNA biosynthesis, N‐glycan biosynthesis,
collagen biosynthesis, iron metabolism, and mevalonate metabolism, all
of which produce biomass precursors. While aminoacyl‐tRNA synthetases
play a major role in protein biosynthesis by linking amino acids to
their cognate transfer RNAs (tRNAs), mevalonate metabolism provides
precursors for glycan, collagen biosynthesis provides collagen for the
formation of the cuticle and other extracellular matrices, and iron
metabolism is important for many aspects of metabolism, including the
production of heme groups of heme proteins. This result points toward
the possibility that growth is transcriptionally regulated by a central
mechanism controlling pathways that produce biomass precursors and
assemble biomass (Fig [236]5D, Dataset [237]EV11). Taken together, we
confirmed the coexpression of metabolic pathways and revealed
coexpressed subpathways, as well as coexpression among pathways.
Figure 5. Extraction of metabolic communities.
Figure 5
[238]Open in a new tab
* A
Clustered heatmap indicating communities formed by multiplying
coflux and coexpression values of iCEL1314 genes. (B, C, and D)
define three major communities shown in the respective parts of
this fig.
* B–D
PEA of communities B (B), C (C), and D (D).
Metabolic subpathways are activated or repressed under different conditions
The gene expression compendium is comprised of 177 expression profiling
datasets that measure relative mRNA levels in a variety of experimental
conditions and genotypes. Therefore, we next asked whether we could
identify specific conditions in which different metabolic gene clusters
are activated or repressed. Using a custom computational pipeline
(Fig [239]6A), we first identified those datasets that best represent
the coexpression of a particular cluster. We then manually investigated
the top datasets for each cluster (see [240]Materials and Methods). To
validate this approach, we first examined the expression of propionate
shunt cluster genes in top‐scoring datasets. We previously showed that
propionate shunt genes are repressed in animals fed Comamonas aquatica
DA1877, a bacterium that (unlike the standard E. coli OP50 diet)
produces vitamin B12, thus enabling flux through the canonical
propionate degradation pathway (MacNeil et al, [241]2013; Watson
et al, [242]2013, [243]2014, [244]2016). The dataset from that study,
labeled as dataset 15 in the compendium, scored as most significant for
propionate shunt gene coexpression, where the genes are expressed in
animals fed E. coli OP50, but not in animals fed C. aquatica
(Fig [245]6B, Dataset [246]EV4). Interestingly, propionate shunt genes
are also highly coexpressed in a dataset that measured expression in
spr‐5 mutants versus wild‐type animals across 1(f1), 13(f13) and
26(f26) generations (Fig [247]6B, dataset 139). Propionate shunt genes
are more highly expressed in the N2 reference strain compared with
spr‐5 mutant animals (Fig [248]6B). Since spr‐5 encodes a histone
demethylase, this may indicate that the expression of shunt genes is
regulated not only by TFs but also by epigenetic mechanisms.
Figure 6. Condition analysis of metabolic gene coexpression.
Figure 6
[249]Open in a new tab
* A
Computational pipeline to extract activating or repressing
conditions of metabolic clusters. The mean coexpression of all gene
pairs in each cluster in each dataset is calculated separately. To
rank datasets that showed highest coexpression uniquely for each
cluster, these mean coexpression values are normalized using
z‐scoring across each dataset (as shown by heatmap). Thirty best
datasets that potentially represent activation/repression
conditions of each cluster are identified by the z‐score values of
mean coexpression. Then, for each cluster, datasets are manually
inspected in the order of decreasing mean coexpression along with
its associated published paper and the heatmap to understand
activation/repression conditions.
* B–F
Mean coexpression of the 30 best datasets for clusters of
propionate shunt (B), peroxisomal fatty acid degradation (C),
histidine degradation (D), tyrosine degradation (E), and canonical
propionate degradation (F), followed by heatmap examples from
selected datasets as indicated by bold‐blue dataset numbers. Color
bar and heat map legend as indicated in (B).
Peroxisomal FA degradation genes were most significantly coexpressed in
dataset 132, which measured gene expression in precisely staged embryos
during the first quarter of embryonic development (Fig [250]6C). The
time course included a stage of exclusively maternal transcripts
(four‐cell), the transition to zygotic transcription (28‐cell), and the
presumptive commitment to the major cell fates (55‐, 95‐, and 190‐cell
stages; Yanai & Hunter, [251]2009). Peroxisomal FA degradation genes
were lowly expressed in four‐cell embryos and their expression
increased in later embryonic stages, which may reflect a change in
carbon source for energy and biomass generation prior to hatching and
feeding. Peroxisomal FA degradation genes are also upregulated in
animals fed P. aeruginosa compared with the standard E. coli OP50 diet
(Fig [252]6C, dataset 51). This may reflect the high energy demand
during infection (Nhan et al, [253]2019).
Inspection of expression of the histidine degradation cluster discussed
above revealed that it is highly expressed in animals fed C. aquatica
(Fig [254]6D, dataset 15). However, these genes were not affected in
animals fed E. coli supplemented with vitamin B12 (Bulcha
et al, [255]2019), suggesting that the effect of C. aquatica on this
cluster may be independent of this cofactor. Animals showed lower
expression of this cluster upon exposure to UV treatment (Fig [256]6D,
dataset 33). In humans, UV converts cis‐uruconate (the product of first
reaction of histidine degradation, Fig [257]4C) to trans‐uruconate,
which has been proposed to play a protective role in skin (Brosnan &
Brosnan, [258]2020). Our result indicates that, in C. elegans, UV
exposure rewires metabolic flux to avoid histidine degradation, for
instance to preserve histidine that could be converted to
trans‐uruconate.
The tyrosine degradation cluster was most significantly coexpressed in
dataset 11, wherein expression profiles of gas‐1 mutant animals, which
are deficient in mitochondrial respiration, were compared with
wild‐type animals (Falk et al, [259]2008; Fig [260]6E). This study
revealed that free tyrosine levels are decreased in gas‐1 mutants. In
addition, there is a failure of NAD^+‐dependent ketoacid oxidation in
mitochondrial respiratory chain mutants (Falk et al, [261]2008). To
compensate for this respiratory dysfunction, multiple pathways are
upregulated, including the TCA cycle and ketone body metabolism (Falk
et al, [262]2008). Since the end products of tyrosine degradation
pathway cluster are the TCA cycle intermediate fumarate and the ketone
body acetoacetate (Fig [263]4B), the function of the upregulation of
tyrosine degradation during mitochondrial dysfunction may be to supply
metabolites for compensatory pathways.
Altogether, these results show that the gene expression compendium can
be used to gain insight into the conditions that most greatly affect
the activation or repression of different metabolic gene clusters.
However, one needs to be careful to manually inspect the conditions of
interest because sometimes coexpression can be biased by an outlier
experiment. An example of this is dataset 47, one of the top datasets
in which canonical propionate degradation pathway genes are
coexpressed. Even though the conditions in this dataset are not related
to canonical propionate breakdown, this dataset falsely appears as one
of the top datasets due to coexpression driven by one bad outlier
sample. It is however also possible that this outlier sample was
unknowingly contaminated to change the nutritional or environmental
state, hence driving the variable expression of these genes
(Fig [264]6F).
Overall, our systematic analysis revealed specific conditions of when
metabolic gene clusters are activated or repressed, reinforcing our
overall finding that transcriptional regulation plays an important role
in the control of metabolism.
WormClust web application enables gene‐by‐gene query to identify coexpression
with metabolic (sub)‐pathways
A major premise of this study is the assumption that variance in mRNA
levels results, at least in part, from transcriptional regulation,
which in turn suggests that genes are coexpressed because they are
coregulated. In reverse engineering of gene regulatory networks,
coexpression of TFs with their target genes has been used to define
causal relationships (Segal et al, [265]2004; MacNeil &
Walhout, [266]2011). To make our data available to the community as
well as to enable the easy identification of TFs and other C. elegans
genes that are coexpressed with metabolic (sub)‐pathways, we developed
a web application named WormClust, which is available on the WormFlux
website (Yilmaz & Walhout, [267]2016). This tool takes any C. elegans
gene as input and evaluates its coexpression with metabolic
(sub)‐pathways. If the query gene is an iCEL1314 gene, the output is a
clustered heatmap of the coexpressed genes in the model based on
product matrix, and according to the selected level of stringency, that
is, relaxed, or stringent. If the query gene is not an iCEL1314 gene,
then an association of the gene with annotated metabolic pathways is
provided. The threshold for this association can be based on FDR and/or
NES (Fig [268]7A).
Figure 7. WormClust: a web application that enables querying of genes to
identify coexpression with metabolic (sub)‐pathways.
Figure 7
[269]Open in a new tab
1. Diagram showing the workflow of WormClust. A C. elegans gene is
taken as input. If the gene is part of iCEL1314, a clustered
heatmap of closely associated genes in product matrix of coflux and
coexpression is displayed, based on stringency level of clustering.
If the input gene is not an iCEL1314 gene, enrichment bar graphs of
the gene to annotated metabolic pathways are displayed, based on
selected FDR and NES thresholds.
2. Bar graph of pathways that are significantly coexpressed with
nhr‐68 with NES ≥ 2 and FDR ≤ 0.05.
3. Plot showing significant coexpression of nhr‐31 with vacuolar
ATPases (FDR ≤ 0.05 and NES ≥ 2).
4. Plot showing significant coexpression of nhr‐79 with peroxisomal
fatty acid degradation (FDR ≤ 0.05 and NES ≥ 2).
5. Plot showing significant coexpression of vha‐20 with vacuolar
ATPases (FDR ≤ 0.05 and NES ≥ 2).
6. Bar graph of pathways that are significantly coexpressed with
acdh‐11 with NES ≥ 2 and FDR ≤ 0.05 (left). Examples of specific
reactions in metabolic network model where acdh‐11 can be annotated
as OR gene (right).
We, and others, previously found that nuclear hormone receptor (NHRs)
TFs frequently associate with metabolic genes in different types of
assays and dataset (Van Gilst et al, [270]2005; Arda et al, [271]2010;
Mori et al, [272]2017; Bhattacharya et al, [273]2022). To illustrate
the utility of WormClust, we tested three NHRs with known metabolic
pathway associations for coexpression with annotated metabolic pathways
in the compendium of 177 C. elegans expression datasets. All of these
showed coexpression with their target metabolic pathways
(Fig [274]7B–D): nhr‐68 was highly coexpressed with the propionate
shunt (FDR = 0.02 and NES = 2.1; Bulcha et al, [275]2019), nhr‐31
associated with vacuolar ATPases (FDR = 0.036, NES = 2.04;
Hahn‐Windgassen & Van Gilst, [276]2009), and nhr‐79 is coexpressed with
peroxisomal FA degradation (FDR = 0.019, NES = 2.3; Zeng
et al, [277]2021). In addition to pathways, we also performed
enrichment of clusters or subpathways from our semisupervised analysis
with TFs. We found that nhr‐79 is enriched to cluster 16 (stringent),
which contains peroxisomal FA degradation genes; nhr‐31 shows
enrichment to cluster 5 (relaxed) that consists of vacuolar ATPases;
and nhr‐68 shows enrichment to cluster 12 (stringent) which contains
propionate shunt genes, albeit with higher FDR. In addition, nhr‐68
shows enrichment to cluster 40 consisting of mans‐2, hex‐2, and fut‐8
(N‐glycan biosynthesis), and cluster 51 consisting of bgal‐1, gana‐1
(galactose metabolism) and hex‐1 (sphingolipid metabolism; Appendix
Fig [278]S5A–C, Dataset [279]EV9). This observation suggests that
nhr‐68 may play a broader role in the regulation of metabolic gene
expression.
WormClust also provides an opportunity to annotate new metabolic genes.
For example, vha‐20, a vacuolar ATPase that is not part of the original
iCEL model because it was only recently annotated by WormBase and KEGG,
shows significant enrichment to vacuolar ATPases (Fig [280]7E). This
result suggests that WormClust can be used to “deorphan” unannotated
metabolic genes. As another example, we found that acdh‐11 is highly
coexpressed with mitochondrial fatty acid degradation and isoleucine
degradation genes (Fig [281]7F). Therefore, we propose that acdh‐11 can
now be added to the iCEL model as another OR gene to reactions in these
pathways. We envision future systematic studies of orphan metabolic
genes and TFs to increase the annotation of metabolic genes and
elucidate the transcriptional mechanisms that regulate their
expression.
Discussion
In this study, we performed a systems‐level analysis of mRNA level
variation as a proxy for the transcriptional regulation in C. elegans.
Even with our conservative approach, we found that most metabolic genes
are under transcriptional control. By a combination of supervised and
semisupervised methods, we found that genes in many metabolic pathways
are coexpressed and identified gene clusters that represent parts of
pathways, or combinations of pathways with strong coexpression. These
results build upon earlier work in single‐cell organisms such as
E. coli and S. cerevisiae (Ihmels et al, [282]2004; Kharchenko
et al, [283]2005; Seshasayee et al, [284]2009; Ledezma‐Tejeida
et al, [285]2017; Tang et al, [286]2021) implying that coexpression of
metabolic genes that function together is a principle that is
evolutionarily conserved.
Our data indicate metabolic genes are more subject to transcriptional
regulation across tissues than during development. What is the purpose
of transcriptional activation or repression of metabolic genes? We
propose that the transcriptional regulation of metabolic genes can
serve different purposes. First, there is extensive transcriptional
regulation of metabolic genes in different tissues. This can be viewed
as the setup of metabolic network functionality depending on a tissue's
needs. Indeed, tissues have different needs. For instance, the
C. elegans intestine serves as the entry point of bacterial nutrients
and requires the expression of enzymes that aid digestion and
metabolite transport to other tissues. Similarly, the animal's muscle
needs to produce energy to support movement and is therefore highly
catabolic. Second, metabolic genes are transcriptionally regulated
during development. This likely reflects the need for different aspects
of metabolism as tissues differentiate and grow and as different
metabolic functions are necessary. We refer to the expression of
different metabolic genes and pathways in different tissues and
different developmental stages as “metabolic network wiring.” A third
function of metabolic gene activation or repression is under different
conditions that dictate the need for different metabolic functions.
This can be for the breakdown of different nutrients, for example,
carbohydrates versus fats, versus protein, or to rewire metabolic
pathways when others are perturbed. An example of this is the
propionate shunt, which is transcriptionally activated when flux
through the preferred, vitamin B12‐dependent pathway is perturbed
(Watson et al, [287]2016; Bulcha et al, [288]2019). We refer to the
transcriptional rerouting of metabolism as “metabolic rewiring.”
Finally, metabolic genes can be transcriptionally activated when flux
through the pathway in which they function is hampered. An example of
this is the Met/SAM cycle in C. elegans, which is transcriptionally
activated when flux through the cycle is low, for instance under low
vitamin B12 dietary conditions (Giese et al, [289]2020). For genes
encoding enzymes that function in multiple reactions and metabolic
pathways (e.g., alh‐8), we can learn with which pathways they are more
strongly coexpressed. Our study provides a facile portal to investigate
the tissues, developmental stages, or conditions under which particular
metabolic genes and pathways are highly coexpressed, which will help to
formulate hypotheses for detailed follow‐up studies.
Genes that are coexpressed often function together and are frequently
coexpressed with their transcriptional regulators (Eisen
et al, [290]1998; Hughes et al, [291]2000; Kim et al, [292]2001; Segal
et al, [293]2003; Stuart et al, [294]2003). We used this principle to
develop WormClust with which any C. elegans gene, including TFs, can be
used to search for metabolic pathways with which it is coexpressed.
However, it is important to note that the most critical regulators,
those that respond to the initial information, are often not
coexpressed with their target genes. Indeed, nhr‐10, which is essential
for activation of the propionate shunt in response to high levels of
propionate, does not change much in expression under relevant
conditions (Bulcha et al, [295]2019). To identify such “first
responders,” it will be useful to employ promoter‐reporter strains with
large‐scale RNAi screens (MacNeil et al, [296]2015; Bhattacharya
et al, [297]2022). Further, at least half of all C. elegans metabolic
genes are not yet associated with reactions or pathways (Yilmaz
et al, [298]2020). We have already shown the utility of WormClust in
“deorphaning” genes and connecting them to metabolic network such as in
case of vha‐20 and acdh‐11 (Fig [299]7E and [300]F). We propose that
more such genes may be “deorphaned” in the future. Longer term, we
envision that association of other types of regulators, such as RNA
binding proteins and microRNAs, with metabolic pathways can be used to
gain broader insights into the functional connections among different
biological processes.
Finally, the approaches used here should be broadly applicable to any
organism, including humans, for which large gene expression profile
compendia and high‐quality metabolic network models are available. By
applying these approaches, deeper insights into the transcriptional
control of metabolism will be obtained, as well as insights into the
conditions under which metabolic genes and pathways are activated or
repressed.
Materials and Methods
Reagents and Tools table
Reagent/Resource Reference or Source Identifier or Catalog Number
Software
Python 3.6.6 [301]https://www.python.org N/A
MATLAB 2019a [302]https://www.mathworks.com N/A
GSEApy 0.1.18 [303]https://doi.org/10.1093/bioinformatics/btac757 N/A
dynamicTreeCut 0.1.0
[304]https://CRAN.R‐project.org/package=dynamicTreeCut (Langfelder
et al, [305]2008) N/A
Numpy 1.18.3 Harris et al ([306]2020a) N/A
Pandas 1.1.0 [307]https://pandas.pydata.org/ N/A
Matplotlib 3.2.1 [308]https://matplotlib.org/ N/A
Scikit‐learn 0.20.3 Pedregosa et al ([309]2011) N/A
Scipy 1.4.1 Virtanen et al ([310]2020) N/A
Seaborn 0.9 [311]https://joss.theoj.org/papers/10.21105/joss.03021 N/A
Statannot 0.2.3 [312]https://doi.org/10.5281/zenodo.7213391 N/A
Sleipnir Huttenhower et al ([313]2008) N/A
[314]Open in a new tab
Methods and Protocols
Preprocessing of genes
The master list of C. elegans genes considered for analysis was
downloaded from WormBase public ftp site (release WS282; Harris
et al, [315]2020b). The genes were filtered out to obtain only live and
protein‐coding genes, which amounted to 19,985 genes in total.
Development dataset
Postembryonic expression profiles were based on published RNA‐seq data
(Kim et al, [316]2013). Briefly, the authors measured the transcriptome
of wild‐type (N2) animals from hatching to 48‐h posthatching every 2 h.
This dataset includes 21,714 protein‐coding genes, including 2,405
metabolic genes. Genes were classified into 12 clusters based on their
expression profiles (Kim et al, [317]2013). We refer to the cluster
showing relatively invariant expression as the “flat cluster” and the
genes within this cluster as “flat genes.” Selecting only live
protein‐coding genes (WS282) resulted in a total of 18,113 genes,
including 2,397 metabolic genes. Of these, 4,689 are flat genes,
including 995 metabolic genes. We generated a histogram of average gene
expression across development using the logarithm of reads per kilobase
per transcript (RPKM) values at base 2. This resulted in a bimodal
expression distribution that was fitted by two superposed Gaussian
curves, representing a high‐expression subpopulation and a
low‐expression subpopulation (LES). Genes that showed expression values
less than the mean plus the standard deviation of LES at all the time
points were filtered out to avoid false fluctuations in gene
expression. After this step, a total of 14,561 genes were left,
including 2,184 metabolic genes. The number of flat genes was reduced
to 4,646, including 986 metabolic genes.
Tissue dataset
Tissue‐level expression profiles were based on a single‐cell RNA‐seq
dataset of animals at the second larval stage (L2; Cao
et al, [318]2017). This dataset provides gene expression as transcripts
per million (TPM) for 20,271 protein‐coding genes including 2,506
metabolic genes across seven major tissues: body wall muscle, glia,
gonad, hypodermis, intestine, neurons, and pharynx. Selecting only live
genes (WS282) reduced the number of genes to 19,675, including 2,491
metabolic genes. The dataset was previously processed to label gene
expression in every tissue according to the level of expression into
four categories: high, moderate, low, and rare (Yilmaz
et al, [319]2020). Genes that showed rare or low expression in all
seven tissues were filtered out in this study, resulting in 13,305
genes including 2,143 metabolic genes.
Gene expression compendium
A compendium of gene expression datasets was generated using a
combination of public datasets. First, 374 microarray, RNA‐Seq, and
tiling array datasets related to C. elegans were downloaded from
WormBase (Harris et al, [320]2020b). Then, only those datasets that
consisted of at least 10 conditions were selected, resulting in 169
datasets. These datasets were individually examined for batch effects,
since many were obtained from multiple microarray experiments where
total RNA was not normalized. Initially, histograms of expression
values were analyzed, and 16 microarray datasets that displayed
abnormal distributions where the correlation distributions were skewed
toward +1, hence suggesting that the data may consist of samples that
are highly distinctive from each other or are from separate experiments
altogether. Such datasets were selected for further examination
(Fig [321]EV3A). Twelve of these datasets were found to be composed of
two subsets of data, where all genes in one subset were up or
downregulated with respect to the other except for a few. The samples
forming each subset were independent of the other and seemed to have
different amounts of total RNA or a similar batch effect. Therefore,
these datasets were divided into two separate datasets to correct for
batch effects. The remaining four datasets were removed since the
source of abnormalities in their distributions of expression was not
clear. This processing resulted in a total of 177 datasets. For each
dataset, the expression of every gene was normalized by converting the
expression values to z‐scores based on expression across all conditions
using the Normalizer function of Sleipnir library (Huttenhower
et al, [322]2008). Once all the datasets were z‐normalized, they were
combined to form a compendium with 4,796 conditions (sum of multiple
conditions within 177 datasets) using the Combiner function of the
Sleipnir library (Huttenhower et al, [323]2008), which took a union set
of all genes across the different datasets and converted missing values
to NaN (not a number) for subsequent processes.
Calculation of variation score in the development dataset
We define variation score (VS) as a measure of the deviation of a
gene's expression profile from a flat reference over time in the
development dataset (Kim et al, [324]2013). Prior to any analysis,
expression values of every gene were normalized by total expression in
all time points using equation ([325]1),
[MATH: xinorm=xi∑1ixi, :MATH]
(1)
where x [i ]indicates the expression value of gene x at time i. To
define a reference profile of invariant expression, a line was
constructed in time by joining the mean normalized expression value of
the flat cluster at every time point. An envelope around this line was
then defined by adding and subtracting the standard deviation of each
point. A deviation from this envelope, referred as variation score
(VS), was then computed by taking the average distance between an
individual gene profile and the flat reference profile according to
equation ([326]2),
[MATH: VSg=∑dinwithdi=0, if xinorm ∈ <
msub>μi ± σi else minxinorm ‐μi + σi,xinorm ‐μi − σi
:MATH]
(2)
where n is the number of observations for the gene g, d [i ]is the
distance at time i between the normalized level of expression of the
gene
[MATH: xinorm :MATH]
and the closest border of the reference flat profile, and μ[ i ]and σ[
i ]are mean and standard deviations of normalized expression values of
flat genes at time point i, respectively. A graphical example of this
calculation is provided in Fig [327]EV1A. With this definition, a
VS = 0 means that the profile of a given gene stays within the envelope
of the flat cluster and is therefore perfectly flat, or invariant. To
define highly variable genes, we empirically established a conservative
VS threshold value of 0.169 based on the distribution of VS between
flat genes and all other genes, such that 97% of flat genes were not
annotated as variant (Fig [328]EV1B, see details in [329]Materials and
Methods).
Calculation of coefficient of variation
Coefficient of variation (CV) is a statistical measure that is used to
calculate the dispersion of data. For every gene, CV was calculated by
dividing standard deviation of expression across different samples (σ)
(e.g., different tissues in case of tissue dataset) to the mean of
expression across samples (μ). CV was empirically thresholded using the
CV of known propionate shunt genes to keep the approach conservative.
[MATH: CV=σμ :MATH]
(3)
Categorizing genes based on expression variation in compendium
To be consistent with the approach used with the tissue dataset and to
be conservative in our assessment, a CV threshold of 0.75 was selected
and required that highly variant genes had a CV greater than or equal
to this threshold in at least three datasets. Genes that showed
CV < 0.3 in at least 95% of the datasets were labeled as invariant, and
genes that fit into neither category were annotated as moderately
variant (Fig [330]EV3A). It was further assumed that genes that are not
present in a dataset are lowly expressed, and were removed from the
analysis.
Calculation of coexpression of gene pairs
The correlation in expression of metabolic gene pairs during
development, across tissues, and across the compendium of gene
expression studies was calculated based on Pearson correlation
coefficient (PCC) using the Distancer function of Sleipnir library
(Huttenhower et al, [331]2008). These correlations defined pairwise
coexpression. Differences in the distribution of coexpression values
between random, AND genes, OR genes, other paralogs, all pathway genes,
and PW genes were evaluated using Mann–Whitney U test (Fay &
Proschan, [332]2010).
Custom pathway enrichment analysis pipeline
Pathway‐to‐gene annotations from level 4 of WormPaths (Walker
et al, [333]2021) were used as input gene sets. Each metabolic pathway
(or category such as an enzyme complex; hereafter referred to as
pathway for simplicity) consists of two or more annotated metabolic
genes. The coexpression of genes in each metabolic pathway with all
other genes in the metabolic network was extracted from the compendium.
Subsequently, the mean of the correlations of pathway genes with all
other metabolic genes (excluding the self‐correlations) was calculated.
The mean values were used to define a ranked list of metabolic genes
for every metabolic pathway. GSEA was then performed on the preranked
list of each pathway using the PreRank module (Subramanian
et al, [334]2005). Enrichment score (ES) is the degree to which the
genes in a gene set are overrepresented at the top or bottom of the
entire ranked list of genes. Since the genes that are functionally
related are mostly positively correlated, we only consider the genes at
the top of the list, hence the ones positively contributing to ES.
Leading edge subset enlists the gene hits before the peak while
calculating ES, therefore consisting of genes that contribute the most
to the enrichment score.
NES was derived for each pathway by normalizing the ES values to mean
ES for all permutations of the gene sets. This accounts for differences
in gene set size. FDR is the estimated probability that a gene set with
a given NES represents a false‐positive finding. The significance
cutoff for the GSEA was set at an FDR value of ≤ or =0.05. If a pathway
was found to be significantly self‐enriched, it was categorized as
coexpressed. We validated this result by running the custom pathway
enrichment analysis pipeline on 1,000 randomized gene sets. For these
randomizations, the structure of the data was maintained such that the
correlation matrix, the number of genes in each pathway, and the number
of times an individual gene was repeated across pathways all remained
the same.
Semisupervised approach that combines Coflux and Coexpression
The first part of the approach involves using flux balance analysis
(FBA) to simulate reaction rates (fluxes) in the metabolic network and
then using a flux dependency metric, referred to as coflux, to measure
pairwise associations of genes in the C. elegans metabolic network
model (Yilmaz & Walhout, [335]2016; Yilmaz et al, [336]2020). We
examined all reaction pairs to see whether constraining the flux of one
of the reactions to zero reduces the flux of the other (see below for
the algorithm). The coflux value is zero for independent reactions and
one for reactions that are fully coupled. Reactions that are connected
by a junction to another reaction are usually partially dependent.
After generating coflux values for each pair of reactions, we converted
the reaction matrix to a pairwise gene coflux matrix using GPR
associations. A high coflux value for a gene pair indicates that the
genes encode enzymes acting in the same metabolic process. For the
second part of the approach, a coexpression matrix was derived from the
C. elegans gene expression compendium described above. All negative
correlations were converted to zero to be consistent with the coflux
matrix. The coflux and coexpression matrices were multiplied to obtain
a product matrix. Since both coexpression and coflux values are between
0 and 1, a product takes a high value only if both coflux and
coexpression values are high. Hierarchical clustering was then
performed on the product matrix using dynamic cut tree algorithm using
cutreehybrid package (Langfelder et al, [337]2008).
Coflux algorithm
The coflux value for each gene pair was calculated using FBA with
iCEL1314 (Yilmaz et al, [338]2020). First, the standard bacterial diet
was amended with a minimum set of nutrients (i.e., by allowing uptake
through exchange reactions in the model as indicated in
Dataset [339]EV8) that warranted nonzero flux in all reactions of the
model. Then, the following steps were taken to calculate coflux values:
* For every irreversible reaction i,
+ Calculate v[max,i ], the maximum flux that can be achieved
with the intact network.
+ For every reaction j, calculate v[max,ij ], which is the
maximum flux observed in reaction i when reaction j is
constrained to a flux of zero. If i is included in a
predefined set of 15 redundant reaction pairs (i.e., reactions
with similar reactants and products except for differences
such as the use of NADP instead of NAD as electron carrier,
Dataset [340]EV8), the flux of the corresponding reaction in
the pair was also constrained to zero.
+ For every reaction j, calculate the coflux with i (c [ij ])
using equation ([341]4).
[MATH: cij=Vmax<
/mi>,i−Vmax,ijV
max,i :MATH]
(4)
* For every reversible reaction i,
+ For every reaction j, repeat the above steps to calculate the
coflux with i in forward direction (c [ij,forward]).
+ Calculate v[min,i ], the minimum (i.e., the most negative, as
negative flux indicates flux in reverse direction) flux that
can be achieved with the intact network.
+ For every reaction j, calculate v[min,ij ], which is the
minimum flux observed in reaction i when reaction j is
constrained to a flux of zero. Once again, the reaction
redundant with reaction i is also constrained to zero flux, if
applicable.
+ For every reaction j, calculate the coflux with the reverse
direction of i (c [ij,reverse]) using equation ([342]5).
[MATH: cij,reverse=Vmin<
/mi>,i−V<
mi>min,ijV
min,i :MATH]
(5)
* For every reaction j, calculate final coflux with i as the maximum
of c [ij,forward] and c [ij,reverse] (equation [343]6).
[MATH: cij=maxcij,forwardabscij,reverse :MATH]
(6)
* Since c [ij ]and c [ji ]are not necessarily equal, calculate final
coflux value for every reaction pair using equation ([344]7).
[MATH: cij,final=maxcijcji :MATH]
(7)
* Convert the reaction coflux matrix to a gene coflux matrix based on
gene–reaction associations. If a gene pair is associated through
multiple reaction pairs (i.e., when at least one of the genes is
associated with multiple reactions), take the maximum of coflux
values between reactions to calculate gene coflux.
Hierarchical clustering
Hierarchical clustering was performed using average method of linkage
on the dissimilarity matrix generated by 1 minus the product matrix
value (coflux*coexpression). Dynamic cut tree algorithm from
cutreehybrid package was used to cut the dendrogram generated by this
clustering with stringent parameters deepSplit = 2 and
minClusterSize = 3 and relatively relaxed parameters deepSplit = 3 and
minClusterSize = 6 (Langfelder et al, [345]2008). The stringent setting
was thresholded based on the occurrence of propionate shunt genes
together in one cluster while keeping the size of the smallest cluster
to be at least 3. The relaxed setting was chosen to capture larger
clusters, such as the Met/SAM cycle genes in a single cluster.
Quantifying cluster quality through Silhouette score
Silhouette score determines the quality of clustering by measuring the
cohesiveness of genes within the same cluster and separateness from the
genes in the neighboring clusters (Rousseeuw, [346]1987). It was
calculated using scikit‐learn package (Pedregosa et al ([347]2011). We
first calculated silhouette score of each metabolic gene based on its
placement in each cluster and then calculated mean silhouette score
(MSS) for every cluster. Stringent clustering led to 197 clusters,
ranging in size from three to 27 genes. We ranked these stringent
clusters using MSS, where few of the top‐ranked were inspected in more
detail (Appendix Fig [348]S3A). For relaxed clustering, we followed the
same approach (Appendix Fig [349]S3C).
Finding activation and repression conditions of metabolic clusters
To find activation and repression conditions of each cluster, we first
calculated the mean coexpression of all gene pairs in that cluster in
each dataset separately. To rank datasets that showed highest
coexpression uniquely for each cluster, we normalized these mean
coexpression values of all clusters using z‐scoring across each
dataset. We then identified the 30 best datasets that potentially
represent activation/repression conditions of each cluster by the
z‐score values of mean coexpression. After this, we manually inspected
each dataset in the order of decreasing mean coexpression, its
associated published paper, and the heatmap to understand
activation/repression conditions.
Gene‐centric coexpression with metabolic (sub)‐pathways by WormClust
We developed a custom computational pipeline that identifies
coexpression of C. elegans genes with metabolic genes used in this
study. The pathway gene sets were generated using WormPaths as a GMT
(Gene Matrix Transposed) file, a tab‐delimited file of gene sets
(Walker et al, [350]2021). The ranked coexpression list of metabolic
genes was extracted for each queried gene, from the global coexpression
matrix generated using compendium of 177 datasets. The ranked list of
each queried gene was used to run Gene Set Enrichment Analysis (GSEA)
on the custom metabolic pathway gene sets.
Author contributions
Albertha JM Walhout: Conceptualization; supervision; funding
acquisition; writing – original draft; project administration; writing
– review and editing. Shivani Nanda: Conceptualization; software;
formal analysis; investigation; methodology; writing – original draft;
writing – review and editing. Marc‐Antoine Jacques: Conceptualization;
investigation; methodology; writing – review and editing. Wen Wang:
Methodology. Chad L Myers: Supervision; methodology; writing – review
and editing. L Safak Yilmaz: Conceptualization; formal analysis;
supervision; investigation; methodology; writing – review and editing.
Disclosure and competing interests statement
The authors declare that they have no conflict of interest. AJMW is an
editorial advisory board member. This has no bearing on the editorial
consideration of this article for publication.
Supporting information
Appendix
[351]Click here for additional data file.^ (6.4MB, pdf)
Expanded View Figures PDF
[352]Click here for additional data file.^ (474.8KB, pdf)
Dataset EV1
[353]Click here for additional data file.^ (371.7KB, xlsx)
Dataset EV2
[354]Click here for additional data file.^ (1.1MB, xlsx)
Dataset EV3
[355]Click here for additional data file.^ (1.2MB, xlsx)
Dataset EV4
[356]Click here for additional data file.^ (177.7KB, xlsx)
Dataset EV5
[357]Click here for additional data file.^ (104.9KB, xlsx)
Dataset EV6
[358]Click here for additional data file.^ (2.9MB, xlsx)
Dataset EV7
[359]Click here for additional data file.^ (88.2KB, xlsx)
Dataset EV8
[360]Click here for additional data file.^ (176.4KB, xlsx)
Dataset EV9
[361]Click here for additional data file.^ (47.3KB, xlsx)
Dataset EV10
[362]Click here for additional data file.^ (37.6KB, xlsx)
Dataset EV11
[363]Click here for additional data file.^ (80.2KB, xlsx)
PDF+
[364]Click here for additional data file.^ (10.5MB, pdf)
Acknowledgements