Abstract
We present a comprehensive multi-omic analysis of the EPISTOP
prospective clinical trial of early intervention with vigabatrin for
pre-symptomatic epilepsy treatment in Tuberous Sclerosis Complex (TSC),
in which 93 infants with TSC were followed from birth to age 2 years,
seeking biomarkers of epilepsy development. Vigabatrin had profound
effects on many metabolites, increasing serum deoxycytidine
monophosphate (dCMP) levels 52-fold. Most serum proteins and
metabolites, and blood RNA species showed significant change with age.
Thirty-nine proteins, metabolites, and genes showed significant
differences between age-matched control and TSC infants. Six also
showed a progressive difference in expression between control, TSC
without epilepsy, and TSC with epilepsy groups. A multivariate approach
using enrollment samples identified multiple 3-variable predictors of
epilepsy, with the best having a positive predictive value of 0.987.
This rich dataset will enable further discovery and analysis of
developmental effects, and associations with seizure development in
TSC.
Subject terms: Genetics of the nervous system
__________________________________________________________________
A comprehensive multi-omic analysis of the EPISTOP trial for Tuberous
Sclerosis Complex (TSC). was performed. Here, authors show many
differences in serum proteins and metabolites, and blood RNA species,
including associations with seizure development.
Introduction
Tuberous sclerosis complex (TSC) is a multisystem disorder caused by
inactivating mutations in either TSC1 or TSC2^[104]1,[105]2. Tumors
form in many organs in TSC, following the two hit Knudson mechanism,
including the heart, brain, kidneys, skin, and lungs^[106]1,[107]2.
However, the predominant morbidity in TSC infants and young children is
due to severe and often drug-resistant epilepsy, with infantile spasms
and other seizures occurring in 70–90% of TSC infants^[108]3,[109]4.
Neurodevelopmental comorbidities including intellectual disability and
autism are also common in TSC, are collectively called TSC-associated
neuropsychiatric disorders (TAND)^[110]5,[111]6, and correlate with
seizure severity and lack of effective seizure control.
TSC is often diagnosed prenatally, because cardiac rhabdomyomas are a
cardinal feature of TSC, and are commonly observed on fetal
ultrasound^[112]2. This situation has enabled consideration of
pre-symptomatic antiseizure medication (ASM) for TSC infants, prior to
seizure onset, with the goal of prevention or at least diminution of
seizure severity, ideally leading to long-term seizure control and
improved neurocognitive outcome^[113]7.
EPISTOP (Long-Term, Prospective Study Evaluating Clinical and Molecular
Biomarkers of Epileptogenesis in a Genetic Model of Epilepsy–Tuberous
Sclerosis Complex, [114]NCT02098759) was a controlled multicenter
study, in which the safety and efficacy of pre-symptomatic ASM
treatment with vigabatrin, introduced when focal discharges more than
10% of the time or multifocal discharges were seen on video-EEG, was
compared with conventional ASM started after the onset of
seizures^[115]8. In 54 enrolled EPISTOP subjects, epileptiform EEG
abnormalities were identified before seizures. The time to the first
clinical seizure was significantly longer with presymptomatic than
conventional treatment, and at age 24 months, presymptomatic treatment
reduced the risk of clinical seizures (odds ratio (OR) = 0.21,
p = 0.032), drug-resistant epilepsy (OR = 0.23, p = 0.022), and the
occurrence of infantile spasms (OR = 0, p < 0.001)^[116]8.
In addition to the clinical trial, a key part of EPISTOP was to perform
extensive multi-omic analyses of subject samples with the goal of
identification of biomarkers associated with epilepsy development.
Blood samples were collected from subjects at serial time points
through age 2 years. Subject DNA samples were analyzed by both whole
genome sequencing, and targeted high read-depth sequencing of TSC1/TSC2
to enable identification of mosaic mutations^[117]9. Serial blood/serum
samples were also analyzed for proteomics, metabolomics, by RNA-Seq,
and expression of 45 selected miRNA.
Here we report the results of this analysis, including examination of
age-dependent changes and effects of vigabatrin treatment. We focused
on examination of associations between analyte levels and risk of
seizure development, development of drug-resistant epilepsy, and
ongoing seizures at age 2 years. An age-matched control population was
also sampled to enable comparison with controls.
Results
Sample collection and analyses performed
As part of EPISTOP, blood samples were collected longitudinally from 93
subjects with TSC, from birth until two years of age. Samples were
collected at serial time points, dependent upon EEG findings and
clinical events during the course of the study: 1) at enrollment; 2) at
time of detection of an abnormal epileptiform EEG; 3) at onset of
seizures (clinical or electrographic); 4) at 6 or 12 months of age, if
no abnormal epileptiform EEG or seizure was detected during the study;
5) at age 24 months (Supplementary Fig. [118]1a). Control samples from
58 age-matched children with various conditions but without TSC or
seizure history were also collected. Serum samples were analyzed for:
proteomics by mass spectrometry; 298 preselected water soluble
metabolites by mass spectrometry; and 45 miRNA species by Quantitative
reverse transcriptase PCR (RT-PCR). RNA was extracted from total blood
samples, and analyzed by RNA-Seq (Supplementary Fig. [119]1b). Whole
genome sequencing was performed to permit analysis of single nucleotide
polymorphisms (SNPs) potentially associated with epilepsy. TSC1 and
TSC2 deep sequencing in concert with whole genome sequencing enabled
detection of heterozygous and mosaic pathogenic variants^[120]9.
Four hundred and eighty-one protein groups were identified in the
proteomics analysis (at false discovery rate (FDR) < 1%). 63,677 RNAs
were identified by RNA-Seq (Supplementary Data [121]1). For the
univariate analyses, RNAs and metabolites were retained for further
analysis only if non-zero values were seen in ≥ 50% samples; protein
groups if seen in ≥ 70%. 20,579 RNAs (13854 protein coding genes, 3221
long noncoding, 2598 of other biotypes, including miRNAs, snRNAs,
pseudogenes, etc. and 906 RNAs that could not be assigned to any
biotype), 249 metabolites, and 340 protein groups were retained for
further analysis.
The intent of our analysis was to identify analytes that were
associated with epilepsy outcome in the EPISTOP patient set. We
considered many ways of associating analytes with epilepsy outcomes,
but first assessed confounding effects that could lead to false
associations.
Confounding effects in single dataset analysis
Several confounding effects were expected and identified. First, sample
processing and analysis via mass spectrometry and sequencing was
performed in batches, and batch effects were anticipated. Second,
vigabatrin (VGB) is known to impact metabolite levels^[122]10,[123]11.
VGB therapy was the main therapeutic intervention in the EPISTOP trial,
such that by age 2 years, 89% of all subjects were taking VGB. Third,
we anticipated that multiple analytes would vary naturally from newborn
to age 2 years, so that accounting for such effects was quite important
to avoid confounding by age.
Batch effects
Batch effects were observed in the metabolite dataset (Supplementary
Fig. [124]2a, b), but not in any of the other analyses. Metabolite
batch effects were corrected by a Z-score based correction method (see
Methods for details).
Vigabatrin effects on serum metabolites
VGB therapy clearly affected multiple metabolites, as shown by
Principal Component Analysis (PCA) (Fig. [125]1a). As age also
influenced many metabolite levels, we examined VGB effects in detail
considering samples only from subjects > 40 weeks of postnatal age, an
age range in which age effects were minimal. We compared samples from
individuals who were receiving VGB versus those who were not. Since 89%
of EPISTOP subjects were taking vigabatrin by age 2 years, we increased
the non-VGB group by addition of non TSC samples with similar ages
(n[max] VGB: 72; n[max] non-VGB: 28). 28 metabolites were found to be
significantly different (FDR < 0.05, Wilcoxon rank sum test)
(Supplementary Data [126]2). dCMP, glutathione, glutathione-nega,
aminoadipic acid, 4-aminobutyrate, kynurenic acid and 5 others showed a
>2.5-fold difference (Fig. [127]1c–h, Supplementary Data [128]2). dCMP
showed the greatest fold change of all metabolites, being increased
52-fold in subjects taking VGB compared with those that were not on VGB
(Fig. [129]1g).
Fig. 1. Vigabatrin (VGB) effects on metabolites.
[130]Fig. 1
[131]Open in a new tab
a,b PCA plots of metabolomics demonstrating Vigabatrin (VGB) effect
before (a) and after correction (b) are shown. cdefgh. Correction for
effects of VGB. Plots for kynurenic acid (c), 4 aminobutyrate (d),
glutathione (e), glutathione-nega (f), dCMP (g), and aminoadipic acid
(h) before and after correction for VGB exposure are shown (n[max]
VGB = 72; n[max] no VGB = 28; FDR < 0.05, Wilcoxon rank sum test).
Samples from subjects of age > 40 weeks are shown. The box plot’s
central line denotes the median, and its lower and upper boundaries
indicate the 25th and 75th percentiles of the data, respectively. The
whiskers represent the highest and the lowest values no further than
1.5 times the IQR. All data points are shown. Source data are provided
as a Source Data file.
The VGB effect on selected metabolites was corrected using a Z-score
based method (see Methods for details). Following this correction, PCA
showed no difference according to VGB use (Fig. [132]1b), and all
metabolites showed a similar distribution in those treated without or
with VGB (Fig. [133]1c–h).
Developmental effects on analytes
Multiple serum proteins are known to be highly developmentally
regulated, including alpha feto-protein, which undergoes a decline of
approximately 10,000-fold during the weeks after birth^[134]12. Our
longitudinal study design, newborn to age 2 years, enabled the
detection of both expected and many unexpected age dependent changes in
serum proteins, serum metabolites, and RNA expression. To facilitate
analysis of these effects, we divided our population into age tertiles,
0–10 weeks, 11–40 weeks, and > 40 weeks postnatal age.
There was a strong association between age and the first PCA component
for both protein groups and metabolites (Fig. [135]2a, b). However, an
association between age and the first two components of the PCA was not
readily seen for the RNA data; rather an association was seen with the
3rd, 4th, and 5th components (Fig. [136]2c). In addition, the majority
of metabolites, protein groups, and RNA species showed either a
positive or negative correlation with age as a continuous variable
(Spearman’s rank correlation coefficient, Fig. [137]2d). One such
protein group (tenascin) and one metabolite (ethanolamine) that showed
major changes with age are shown in Fig. [138]2e, f as an example.
Fig. 2. Developmental effects on the proteomic, metabolite and transcriptomic
data.
[139]Fig. 2
[140]Open in a new tab
a–c PCA of each analyte set colored by age groups. d Spearman’s Rank
correlation coefficient of each analyte with age, before age correction
and colored by p value. Each dot represents one analyte. 187 (75%)
metabolites, 286 (62%) protein groups, and 8585 (42%) RNAs showed a
significant correlation with age (FDR < 0.05). A two-sided correlation
test using Spearman’s method was executed, and the results were
adjusted for multiple comparisons using the Benjamini-Hochberg
procedure. The box’s middle line marks the median, its edges represent
the 25th and 75th percentiles, and whiskers extend to data points
within 1.5*IQR. ef. Plots of two individual analytes according to age,
prior to correction. Protein group tenascin (e), Ethanolamine (f)
significantly correlated with age (p < 0.001). The line is drawn to
visualize the trend in the data, representing the linear regression
fit, while the surrounding shaded area denotes the 95% confidence
intervals (CI). Spearman’s Rank correlation coefficients (R) and p
values obtained by two-sided Spearman’s correlation test are indicated.
Source data are provided as a Source Data file.
Kruskal-Wallis analysis (FDR < 0.05) showed that the majority of
analytes had a significant association with age in the 3 group
age-tertile comparison: 285 out of 340 (84%) protein groups, 169 out of
249 (69%) metabolites, and 10,506 out of 20,579 (51%) RNAs
(Supplementary Fig. [141]3a–c, Supplementary Data [142]3).
In the case of protein groups, hemoglobin subunits (alpha, beta,
gamma-1 and gamma-2) showed fold changes >38, with higher levels
detected in the youngest age group, consistent with hemolysis of red
blood cells in blood drawing from infants. As expected,
alpha-fetoprotein showed a fold change of approximately 181 with
inconsistent detection (close to the detection limit of the mass
spectrometer) in the two older age groups (>11 weeks of age)
(Supplementary Fig. [143]4). Overall 147 protein groups showed fold
changes > 1.5. Tenascin, C4b-binding protein alpha chain, collagen
alpha-1(V) chain, collagen alpha-1(I) chain, complement component C1q
receptor, and collagen alpha-1(III) chain all exhibited fold changes >7
between the youngest age group and the middle or oldest age group
(Supplementary Fig. [144]4).
64 metabolites showed median fold changes > 1.5 comparing samples from
age <10 weeks to > 40 weeks. The largest fold changes were detected in
ethanolamine (fold change = 4.1) and leucine-isoleucine levels (fold
change = 4.4), which decreased with increasing age, and shikimate
levels (fold change = 4.7), which increased with increasing age. Four
other metabolites exhibited fold changes > 3 (Supplementary
Fig. [145]5).
2512 of 10506 (24%) age regulated RNAs exhibited fold changes > 1.5.
Fetal hemoglobin genes HBG1 and HBG2 showed reductions of 41.1-fold and
23.4-fold, respectively, an expected developmental event, while KCNG1
showed a major increase (42.5-fold), and EVA1A a major reduction
(8.8-fold), comparing the youngest to the oldest age group
(Supplementary Fig. [146]6). XIST, a long non-coding RNA involved in X
chromosome inactivation^[147]13, showed marked differences in
expression according to sex, as expected (Supplementary Fig. [148]6).
It also appeared to be differentially expressed by age, due to an
imbalance in sex distribution by age.
Hierarchical clustering of age group median Z-scores that were found to
be differentially expressed with age in the Kruskal-Wallis analysis
(FDR < 0.05) was used to identify similar patterns of expression
changes among protein groups, metabolites, and RNA species
(Supplementary Fig. [149]3a–c). Based on visual inspection and tuning,
different clusters with similar changes in levels with age were defined
for protein groups (6 clusters), metabolites (6), and RNAs (8),
respectively (Supplementary Fig. [150]3a–c, Supplementary Data [151]3).
To assess if these clusters of age-dependent analyte changes were due
to common pathways or developmental processes occurring with age,
clusters of analytes were assessed for enrichment in Gene Ontology
(GO), KEGG, and Reactome gene sets (Supplementary Data [152]4a, b).
Protein cluster group 2 showed the greatest decline with age, and was
enriched for genes involved in extracellular matrix organization and
structure in all three gene sets (Supplementary Data [153]4a). Protein
group cluster 3, in which protein levels decreased less strongly over
time, was enriched for genes involved in carbohydrate metabolic process
by GO gene set analysis (Supplementary Data [154]4a). Protein group
clusters 4 and 6, in which protein levels steadily increased with age,
were both enriched for genes involved in complement and coagulation
(Supplementary Fig. [155]3a, Supplementary Data [156]4a).
None of the metabolite age clusters showed enrichment for any of the
Reactome terms which we tested (Supplementary Fig. [157]3b).
In contrast to the protein and metabolite datasets, there was no strong
correlation between gene expression and age. Since it is known that
there is significant variation in the fraction of white blood cell
types in human blood according to age from birth to age 2
years^[158]14–[159]16, we considered the possibility that some of the
variation in expression might be due to variation in the relative
numbers of different white blood cell types. However, age appeared to
be a minor factor affecting gene expression (Fig. [160]2c).
Nonetheless, CIBERSORT^[161]17 was used to analyze RNAseq data to
determine the relative proportions of white blood cell types in each
sample. Correlation between the relative proportion of various
leukocyte populations was with the first 10 principal components (PC)
of the RNA-Seq data (Supplementary Fig. [162]7a). PC1 of the RNA data
showed a positive correlation with the proportion of activated NK cells
and regulatory T-cells. The neutrophil proportion correlated strongly
negatively with PC2, while B cell and various T cell proportions
correlated positively. Several PCs showed some degree of correlation
with age, though not PC1 or PC2. The proportion of both monocytes and
CD4 naive T-cells declined significantly with age, while the neutrophil
fraction more strongly increased with age through two years
(Supplementary Fig. [163]7b).
Differences in the neutrophil vs. lymphocyte fraction with age was
reflected in the RNA cluster analysis, which grouped genes into eight
clusters. Several of these clusters (1, 2, 3, 6) were enriched for
genes related to neutrophil or immune function (Supplementary
Data [164]4b, Supplementary Fig. [165]3c), consistent with maturation
of the immune system with age.
Association between clinical features and analytes
To enable a search for associations between analytes and clinical
events in the EPISTOP cohort, it was necessary to correct for the
strong developmental effects on many analytes. Two independent methods
were used: 1) a linear mixed model (LMM) approach; and 2) calculating
the Z-score for each analyte within each age group, and then Z-score
reversal using the standard deviation and median for all samples. Both
methods were applied independently, and to be conservative, we report
only associations that were significant after FDR correction by each of
these two methods. One example of age-based correction by each of the
two methods is shown for C4b-binding protein α-chain (Supplementary
Fig. [166]2c).
We sought to identify associations between analytes and a broad set of
clinical features, including: I. diagnostic status, TSC vs. control;
II. presence of clinical or electrographic seizures vs. no seizures at
time of sample collection; III. presence of abnormal epileptiform EEG
vs. normal EEG at time of sample collection; IV. drug-resistant
epilepsy vs. seizure free or drug controlled epilepsy at age 24 months;
V. history of clinical or electrographic seizures vs. no detected
seizures during the EPISTOP project using analytes determined at age 24
months (Va) or analytes determined at enrollment (Vb).
(I) Comparisons between non TSC control samples (n[max] = 52) and TSC
samples (n[max] = 126) without prior treatment and no seizure history,
of any age (including samples from individuals that showed abnormal
EEGs at sample draw) identified 39 significant associations: three
protein groups (catalase, collagen alpha-2(I), Peroxiredoxin-2), and
two metabolites (aminoimidazole carboxamide ribonucleotide,
leucine-isoleucine), all five higher in TSC c/w control; and 34 RNAs
(25 higher in TSC subjects) (FDR < 0.05, fold change > 1.5,
Supplementary Data [167]5, Supplementary Fig. [168]8). Although these
differences were statistically significant, no analyte had levels that
could reliably distinguish the control group from the TSC group.
Gene set enrichment analysis of the 25 genes with higher expression in
the TSC subjects identified multiple enriched gene sets using each of
the GO, KEGG, and Reactome gene sets, many of which related to ribosome
structure and function (Supplementary Data [169]6). In contrast, the 9
genes whose expression was lower in the TSC subjects than controls were
enriched in gene sets related to receptor tyrosine kinase and
phosphoinositide 3-kinase activity (Supplementary Data [170]6). These
observations may reflect the consequences of haplo-insufficiency for
either TSC1 or TSC2 in the blood cells from which the RNA was derived,
as complete loss of either TSC1 or TSC2 is known to strongly increase
ribosome biogenesis while suppressing upstream receptor tyrosine kinase
and phosphoinositide 3-kinase signaling^[171]18.
(II) Comparison of samples drawn at the time of, or shortly after,
first seizure occurrence (n[max] = 86) versus samples from seizure free
individuals (n[max] = 95) showed no significant differences in protein
groups, metabolites, or RNAs.
(III) Similarly, comparison of samples drawn at the time of abnormal
EEG detection (n[max] = 45) versus those from subjects with normal EEG
(n[max] = 65) showed no significant differences in any analyte.
(IV) Comparison of samples from EPISTOP subjects with drug-resistant
epilepsy at age 24 m (n[max] = 41) vs. those who were seizure free or
drug controlled at age 24 m (n[max] = 43) led to identification of one
protein group, Collagen alpha-1(XI) chain that was significantly higher
in those with drug-resistant epilepsy, and one metabolite,
Hydroxyphenylacetic acid that was significantly lower in those with
drug-resistant epilepsy (Fig. [172]3). Again, although these
differences were statistically significant, neither could distinguish a
sample from an individual with drug-resistant epilepsy vs. seizure-free
or drug controlled.
Fig. 3. Analytes whose expression at age 24 months was associated with
resistant epilepsy.
Fig. 3
[173]Open in a new tab
Collagen alpha-1(XI) chain (a) and Hydroxyphenylacetic acid (b) are
shown, with sample values corrected by two independent methods. These
were the only two analytes significantly different in comparison of
EPISTOP subjects with refractory epilepsy (n[max] = 41) vs. those
without at age 24 months (n[max] = 43) (Wilcoxon rank sum test;
FDR < 0.05; fold change > 1.5). The median is indicated by the center
line of the box, the edges show the 25th and 75th percentiles, whiskers
reach to values within 1.5*IQR. All data points are shown. Source data
are provided as a Source Data file.
(Va) Comparison of samples drawn at age 24 m from EPISTOP subjects with
seizure history (n[max] = 59) vs. subjects without seizure history
(n[max] = 9) vs. subjects without seizure history but
pre-symptomatically treated with VGB (n[max] = 14) showed no
differences for any analyte. Because of the small sample size of groups
2 and 3 we repeated this analysis combining these two groups, and there
were still no analytes with significant differences identified.
(Vb) Comparison of analytes determined on samples drawn at enrollment
into EPISTOP, comparing subjects who subsequently developed clinical or
electrographic seizures vs. subjects who never developed seizures, was
analyzed in detail. To enhance the sensitivity of this analysis, we
compared three groups: (group 1) non-TSC control; (group 2) TSC
subjects who never developed epilepsy during the two year course of the
study; and (group 3) TSC subjects who did develop epilepsy. Initially
we performed this comparison under the stringent conditions of
excluding subjects from group 2 that never developed clinical or
electrographic seizures, but had been treated pre-symptomatically with
vigabatrin per trial protocol; and excluding subjects from groups 2 and
3 who had an abnormal EEG (but not seizures) at the time of enrollment.
With these conditions, there were 34, 9, and 31 (n[max]) samples in
groups 1, 2, and 3, respectively. Using the Kruskal-Wallis Rank Sum
test, one protein group, Periostin, one metabolite,
glucose-6-phosphate, and four RNAs (ATP5ME, NDUFA1, BHLHA15, CARD16)
were significantly different among the three groups (FDR < 0.05; fold
change > 1.5 comparing group 1 to either group 2 or 3; Table [174]1,
Fig. [175]4).
Table 1.
Results of statistical comparison to identify potential predictive
biomarkers for epilepsy in proteomics, metabolite and RNAseq data
up (+) or down (-) in TSC epilepsy* up (+) or down (-) in TSC no
epilepsy* up (+) or down (-) in TSC epilepsy* TSC epilepsy v CTL TSC no
epilepsy v CTL TSC epilepsy v TSC no epilepsy sample size
molecule_name alternative name molecule type TSC epilepsy v CTL TSC no
epilepsy v CTL TSC epilepsy v TSC no epilepsy KW p value fdr FC FC FC
control group TSC no epilepsy TSC epilepsy
Periostin [176]Q15063 protein group + 6.964E-05 1.74 1.41 1.23 34 9 31
glucose-6-phosphate metabolite + 1.902E-04 1.58 1.49 1.06 25 8 23
ENSG00000125356 NDUFA1 RNA + 1.324E-02 1.83 1.40 1.30 24 7 27
ENSG00000169020 ATP5ME RNA + 1.010E-02 2.06 1.88 1.09 24 7 27
ENSG00000180535 BHLHA15 RNA - - 1.127E-02 0.36 0.64 0.57 23 7 25
ENSG00000204397 CARD16 RNA + 1.010E-02 1.67 1.34 1.25 24 7 27
[177]Open in a new tab
Kruskal-Wallis rank sum test was used to identify significant
regulations (FDR < 0.05; p value correction using Benjamini-Hochberg
procedure; fold change > 1.5). Data was previously corrected for batch,
VGB treatment (metabolite data) via Z-score correction method and age
(protein, metabolite and RNAseq data) using a linear mixed model
correction. Only comparisons showing a “+” or “-” in columns marked
with “*” were found to be significantly regulated using two-sided
Dunn’s Multiple Comparison Test (FDR < 0.05; fold change > 1.5).
Fig. 4. Analytes associated with epilepsy and/or TSC.
[178]Fig. 4
[179]Open in a new tab
a, protein groups; b, metabolites; and c, mRNA species that were
significantly different (Kruskal-Wallis Rank Sum test (FDR < 0.05; fold
change > 1.5) in the comparison of group 1 with either group 2 or group
3 of three groups: (1) non-TSC control (n[max] = 34); (2) TSC subjects
who never developed epilepsy during the two year course of the study
(n[max] = 9); and (3) TSC subjects who did develop epilepsy (n[max] =
31). The median is indicated by the center line of the box, the edges
show the 25th and 75th percentiles, whiskers reach to values within
1.5*IQR. All data points are shown. Source data are provided as a
Source Data file.
We repeated this analysis under less stringent conditions, in which
group 2 included subjects who never developed seizures but had received
pre-symptomatic vigabatrin, per study protocol; and TSC subjects with
an abnormal EEG at enrollment were included in both groups 2 and 3.
Under these conditions, max group sizes were 34, 24, and 59
respectively. Nine protein groups, five metabolites, and 48 genes
showed significant differences in levels (FDR < 0.05; fold change >
1.5, comparing group 1 with either group 2 or 3; Kruskal-Wallis Rank
Sum test) (Supplementary Data [180]7). The single protein group, single
metabolite, and four RNAs that were identified in the more stringent
analysis were also identified in this less stringent analysis.
Eight of nine protein groups were increased in the TSC epilepsy group
compared to controls, of which six were also upregulated in the TSC no
epilepsy group. These protein groups were enriched in ten GO, nine
KEGG, and 11 Reactome gene sets, of diverse types, but showed some
consistency with gene sets involving collagen and extracellular matrix
structure (Supplementary Data [181]8a). Only one protein group, Plasma
protease C1 inhibitor, was increased in the control group. For the five
metabolites, all of which were upregulated in the TSC epilepsy group,
there was no significant enrichment in any pathway.
Expression of 16 genes was increased in TSC epilepsy subjects compared
to controls, while 26 were decreased in that comparison (Supplementary
Data [182]7). There was apparent enrichment of the 16 genes in multiple
gene sets for these genes, which showed limited consistency for
electron transport, respiratory chain, and mitochondrial activity;
while the 26 genes were enriched in diverse pathways without apparent
consistency (Supplementary Data S[183]8b).
Integrative predictive models of seizure development in TSC
As noted, a primary goal of this analysis of the EPISTOP cohort was to
identify predictive biomarkers of epilepsy development. In addition to
the above analysis, we performed an integrative approach, using results
from all initial on-study samples to develop a predictive classifier
for seizure development (Supplementary Fig. [184]1a). For this
analysis, we also included SNP data for SNPs previously associated with
epilepsy development in the general population (n = 86, see methods,
Supplementary Data [185]11), as well as levels of 45 miRNAs determined
for 139 subjects (93 TSC, 46 Control) in 324 serum samples (see
Methods, Supplementary Data [186]10). Due to the large number of
analytes (>20,000, considering RNA species), an initial filtering step
was performed, in which only those analytes showing an
Area-under-the-curve (AUC) > 0.6 for association with epilepsy
development in univariate analysis were retained. In addition, a
maximum of 30 analytes, those with highest AUC values, from each data
type were included. Thus 126 analytes were considered for predictive
classifier analysis.
Logistic regression was used to examine all potential combinations of
variables, for their ability to predict epilepsy development, using an
arbitrary decision boundary for each model. The Matthews Correlation
Coefficient (MCC)^[187]19 was used as a metric for the quality of each
classifier with the observed data. A maximum of three variables per
predictor were considered so that in total ~330,000 1-, 2- and
3-variable models were evaluated. The mean MCC among all classifiers
was 0.4; 295 classifiers had MCC values > 0.70; 5 had values >0.80. A
permutation test was used to assess the quality of the predictor using
randomized data (Supplementary Fig. [188]9). Based on the threshold
defined by the permutation test, 100 models were considered
statistically significant; these had mean MCC values between 0.733 and
0.873. The ten best classifiers and their evaluation metrics
(Table [189]2), and the complete list of statistically significant
classifiers (Supplementary Data [190]9A) are shown. Twenty-two
analytes/characteristics appeared in > 4 of the top 100 classifiers
(Supplementary Data [191]9B), including a SNP on chromosome 16 seen in
55 classifiers, and serum carnitine levels seen in 26 classifiers, for
which lower levels at study entry were associated with seizure risk
(Table [192]3).
Table 2.
Top 10 predictive classifiers for seizure development based on the
Matthew Correlation Coefficient (MCC) metric
Features n total LR threshold MCC test mean MMCE test mean PPV test
mean NPV test mean p-value
hsa-miR-130a-3p + CECR7 + RADX 57 0.321 0.873 0.039 0.987 0.019
hsa-miR-130a-3p + RADX + rs951997 A/G chr2 223567016 56 0.285 0.839
0.046 0.973 0.892 0.025
hsa-miR-130a-3p + RADX + rs16944 A/G chr2 113594867 56 0.205 0.836
0.045 0.968 0.918 0.025
PARM1 + rs1046276 T/C chr16 30914626 + rs951997 A/G chr2 223567016 61
0.632 0.807 0.053 0.973 0.862 0.031
MIR4539 + rs1046276 T/C chr16 30914626 + rs211037 C/T chr5 161528280 61
0.423 0.802 0.051 0.965 0.9 0.032
CORO2B + RADX + rs16944 A/G chr2 113594867 61 0.711 0.796 0.052 0.971
0.033
ENSG00000266408 + rs1046276 T/C chr16 30914626 + r211037 C/T chr5
161528280 61 0.487 0.794 0.051 0.961 0.914 0.034
LINC00689 + LINC02576 + rs211037 C/T chr5 161528280 61 0.317 0.791
0.054 0.967 0.034
carnitine + rs1046276 T/C chr16 30914626 62 0.645 0.79 0.069 0.968
0.821 0.035
TSC1 + LINC00689 + LINC02576 62 0.539 0.79 0.064 0.96 0.034
[193]Open in a new tab
Note: Using only the initial on-study samples; n[max] Seizures = 54;
n[max] no seizures = 11. The p value was determined through a one-sided
permutation test.
Table 3.
Most common analytes in the top 100 classification models (n[max]
Seizures = 54; n[max] no seizures = 11)
Variable Appears in N models up (+) or down(-) in seizures group
Seizures (median (IQR)) No seizures (median (IQR)) Mean MCC Mean MMCE
Mean PPV Mean NPV
rs1046276 T/C chr16 30914626 55 hm: 33% / ht: 67% hm: 90% / ht: 10%
0.75 0.073 0.959 0.82
Carnitine 26 — 0.95 (0.89–1.03) 1.05 (1.03–1.11) 0.748 0.079 0.956
0.818
RADX (ENSG00000147231) 14 + 0.33 (0.11–0.55) 0.02 (0–0.14) 0.774 0.068
0.962 0.836
PARM1 (ENSG00000169116) 12 — 1.54 (0.67–2.44) 3.75 (2.88–4.61) 0.75
0.071 0.967 0.797
IGLV3-24 (ENSG00000253822) 12 — 0.02 (0–0.14) 0.24 (0.06–0.44) 0.742
0.063 0.949 0.871
CAV2 (ENSG00000105971) 10 + 0.65 (0.35–1.18) 0.14 (0.06–0.24) 0.752
0.069 0.943 0.909
rs211037 C/T chr5 161528280 9 hm: 41% / ht: 59% hm: 80% / ht: 20% 0.764
0.06 0.956 0.892
IGKV1-9 (ENSG00000241755) 8 — 2.64 (1.47–4.14) 6.04 (3.91–7) 0.749 0.07
0.956 0.828
GPM6A (ENSG00000150625) 7 — 2.23 (1.05–3.13) 5 (4.05–6.31) 0.741 0.073
0.96 0.8
ENSG00000254859 7 — 0.57 (0.2–1.33) 1.6 (1.17–2.07) 0.755 0.074 0.964
0.806
hsa-miR-130a-3p 7 + 2539342.26 (1278752.8–5076227.22) 1545365.73
(1293279–1811184.58) 0.789 0.062 0.969 0.835
rs6872795 A/G chr5 35677385 7 hm: 46% / ht: 54% hm: 80% / ht: 20% 0.751
0.068 0.959 0.82
[194]Open in a new tab
The strongest predictor was a 3-variable classifier consisting of
miR-130a-3p, CECR7 and RADX (Fig. [195]5a). Seizure-free EPISTOP
subjects had lower normalized expression of miR-130a-3p (<5 M) and RADX
expression (<0.5 CPM), compared to those with seizures. CECR7 had a
more minor effect on seizure prediction, although no patient without
seizure had a CECR7 level <0.5 CPM. This classifier had a mean test MCC
of 0.873, and strong positive predictive power (PPV = 0.987).
Fig. 5. Results of integrative predictive epilepsy analysis.
[196]Fig. 5
[197]Open in a new tab
a 3D plot of three-variable model consisting of miR-130a-3p, CECR7 and
RADX (mean test MCC = 0.873). Low level of miR-130a-3p (<5M normalized
Unit) and RADX expression (<0.5 CPM) corresponds with seizure-free
status. b The two-variable model consisting of carnitine and rs1046276
T/C (mean test MCC = 0.79). Homozygote for C at position rs1046276 T/C
in combination with high levels of carnitine is associated with
remaining seizure free at age 2 years. Source data are provided as a
Source Data file.
Using the two variables most frequently seen in the classifiers, SNP
rs1046276 T/C, and serum carnitine, a 2-variable classifier was
generated. Nine of 10 patients who were seizure-free were homozygous
for C at rs1046276 and had high normalized levels of carnitine (>1).
The mean test MCC for this classifier was 0.79, with mean PPV = 0.97,
and mean negative predictive value (NPV) of 0.82 (Fig. [198]5b).
Discussion
Prior studies examining multiomic changes in children have focused
primarily on the first week of life^[199]20–[200]22. An elegant
integrated study on infants from The Gambia and Papua New Guinea
demonstrated several pathways that were operative during development
from newborn to age 1 week, particularly interferon signaling, the
complement cascade, and neutrophil activity^[201]20. Genes with
decreasing expression (in leukocytes) during the first week of life
were enriched for those involved in cellular responses to stress,
detoxification of reactive oxygen, and heme biosynthesis and iron
uptake. Genes with increasing expression during the first week were
involved in interferon signaling, Toll-like-receptor (TLR), negative
regulation of Retinoic Acid Inducible Gene I (RIG-I) and complement
activation^[202]20. Metabolomic differences detected during this age
interval involved pathways related to plasma steroids and
carbohydrates^[203]20.
We have significantly extended these studies, by expanding the age
range examined from birth to age 2 years. In fact, these prior studies
examined a perinatal time interval that was not available to us for the
vast majority of our samples.
Although many medications undoubtedly have important effects on
metabolite levels, our findings on the effects of vigabatrin on serum
metabolites in infants and children are striking (Fig. [204]1,
Supplementary Data [205]2). Twenty-eight of 249 (11%) of metabolites
showed a significant change in subjects on vigabatrin; 11 showed a
>2.5-fold change; and dCMP showed a huge 52-fold increase in EPISTOP
subjects on VGB (Fig. [206]1). Prior studies have focused on breakdown
products of VGB and effects on brain metabolites^[207]10,[208]11. VGB
is well-recognized as the most effective ASM for treatment of
TSC-associated seizures^[209]23,[210]24. Hence it was selected as the
intervention in this trial to attempt to prevent epilepsy. However, VGB
is associated with a wide variety of side-effects, the most feared of
which is vigabatrin-associated visual field loss^[211]25–[212]27. The
frequency and significance of this visual field loss is
controversial^[213]28,[214]29, but it is reported to occur as a
characteristic pattern of progressive centripetal loss. The possibility
that one of these metabolic derangements induced by VGB contributes to
either its therapeutic benefit or vigabatrin-associated visual field
loss deserves further investigation.
We found that a striking number of all analytes examined showed
significant changes in levels going from newborn to age 2 years
(Fig. [215]2d, e, f). Dividing the EPISTOP and control population into
three tertiles (postnatal age ≤ 10 weeks, 11-40 weeks, and > 40 weeks),
the majority of protein groups, metabolites, and expressed genes showed
significant changes with age. Hierarchical clustering of protein
groups, metabolites, and expressed genes indicated that these
expression changes occurred in similar patterns within these analyte
types (Supplementary Fig. [216]3), and showed enrichment in multiple
gene sets, suggesting that underlying global developmental processes
were driving these changes in expression for each analyte class.
Changes in leukocyte percent and total number are known to occur during
the age of birth to two years^[217]14–[218]16. Total leukocyte counts
are higher in newborns, with relative high levels of monocytes and
lymphocytes. Proportions of monocytes and lymphocytes decline through
age 2 years, and the relative proportion of neutrophils increases
during this interval. We observed these same changes with age in the
EPISTOP cohort, with a specific decline in naïve CD4 + T cells and
monocytes, as assessed using CIBERSORT and the RNA-Seq data
(Supplementary Fig. [219]7a, b). Principal component analysis of the
expression data showed many strong associations between leukocyte
subsets determined by CIBERSORT and the different principal components
(Supplementary Fig. [220]7a), likely reflecting variability in
leukocyte fractions in different samples due to variable clinical
conditions (recent infections, vaccinations, environmental exposures,
diet, etc.).
Comparisons between non-TSC controls and TSC samples at a range of ages
identified three protein groups, two metabolites, and 34 gene RNAs
whose levels were significantly different between these two
populations. As noted, although statistically significant, no analyte
showed differences that could reliably distinguish a sample from one
group versus the other. Gene set enrichment analysis of the 9 genes
whose expression was lower in the TSC subjects than the controls showed
that genes were enriched for receptor tyrosine kinase and
phosphoinositide 3-kinase functions. Since the TSC subjects would have
a single allele loss of TSC1 or TSC2 in general, this may fit with a
reduction in the activity of these pathways due to a negative feedback
loop^[221]18,[222]30.
It is somewhat disappointing to note the lack of significant
differences in any analyte considering samples drawn at time of seizure
or aEEG compared to those who were seizure-free. Although alterations
in any of the analyte classes might have been predicted, metabolomic
differences reflecting effects of seizures or aEEG were anticipated but
were not seen. This may be due in part to timing issues such that blood
samples for analysis were not always drawn on the same day as diagnosis
of seizure/aEEG, due to central review and interpretation of EEGs,
delaying recognition of seizures/aEEG in some instances, as well as
blood-brain barrier effects.
Collagen alpha-1(XI) protein group was significantly higher in those
with resistant epilepsy at age 2 years; while hydroxyphenylacetic acid
was significantly lower in those with resistant epilepsy at that age.
Col11a1 is a marker of early-born pyramidal neurons^[223]6, and it is
possible that its serum marker reflects increased pyramidal neuron
development in TSC patients with resistant epilepsy.
Hydroxyphenylacetic acid levels in cerebrospinal fluid have been
reported to be lower in both schizophrenic and epileptic
patients^[224]31, potentially explaining its reduced levels in TSC
patients with drug-resistant epilepsy.
A primary goal of this analysis was to identify biomarkers that could
be assessed at birth, and might be associated with seizure development
in TSC infants. A reliable biomarker of this kind could be developed
into a test which might be applied to TSC newborns, to help to identify
those TSC infants at high risk of seizure development, and in whom
pre-symptomatic use of vigabatrin or other interventions might be
taken. To enhance the sensitivity of this analysis, we compared three
different groups: control, TSC no epilepsy, and TSC epilepsy groups,
finding 6 analytes that were significantly different (Table [225]1,
Fig. [226]4). Periostin levels were higher in the TSC epilepsy group
than the other two groups, and it is known as a biomarker of asthma
severity^[227]32, though not previously identified in the context of
TSC. ATP5ME and NDUFA1 showed higher expression in blood RNA of the TSC
epilepsy group compared with the other two groups, and encode
mitochondrial components important for mitochondrial energy production.
Their increase in both TSC without seizures and more so in TSC with
seizures (Fig. [228]4) suggests the possibility that TSC alone and with
seizures enhances a need for mitochondrial energy production in blood
cells. The other two genes associated with TSC and epilepsy were
BHLHA15, a basic-loop-helix transcription factor and CARD16, an
inhibitor of apoptosis; both of uncertain relevance to TSC
pathogenesis. However, none of these six markers absolutely
discriminated between those developing seizures and those who did not.
A second effort to identify analyte differences that were associated
with seizure development was a multivariate approach using logistic
regression, resulting in many (100) 3-variable predictors of seizure
development (Table [229]2, Supplementary Data [230]9), with the best
having a mean test MCC of 0.873 with PPV of 0.987 (Fig. [231]5a).
We recognize the limitation of the size of this patient dataset, in
that only 93 subjects with TSC were studied, and only 65 were available
for seizure association analysis, since many had received
pre-symptomatic vigabatrin. In addition, additional follow-up beyond 2
years will be valuable, and is ongoing at this time. Hence we view all
of these observations as exploratory and tentative, and hope that a
replication population may be studied for validation in the near
future.
Methods
Full details on the EPISTOP subjects and the prospective clinical trial
in which they participated have been reported^[232]8. Briefly, this
study was part of the EPISTOP project (Long-Term, Prospective Study
Evaluating Clinical and Molecular Biomarkers of Epileptogenesis in a
Genetic Model of Epilepsy–Tuberous Sclerosis Complex,
ClinicalTrials.gov [233]NCT02098759). It was carried out from March
2014 to October 2018 at 9 sites in Europe and 1 site in Australia. The
study was approved by local ethics committees at all study sites, and
caregivers of all participants signed informed consent before
enrollment. It adhered to the International Conference on Harmonization
Guidelines for Good Clinical Practice and the Declaration of Helsinki.
Blood samples were collected from 93 TSC EPISTOP subjects, 297
different samples in total, and 58 control hospitalized infants without
TSC from whom a single sample was collected. Serial samples were
collected from TSC subjects in a manner depicted in Supplementary
Fig. [234]1. Blood samples were collected at recruitment into the
study, age range birth to age four months. EPISTOP TSC subjects were
monitored intensively as part of this trial, including serial EEG
monitoring for abnormal activity. When an abnormal EEG was detected a
second blood sample was drawn. Another blood sample was collected when
either a clinical seizure or subclinical seizure was identified. If no
seizure was observed during the first 6 months to 12 months, a blood
sample was drawn at one of these timepoints. All TSC subjects had a
last blood sample drawn at age 24 months, the end of the study. Due to
delays in recognition of abnormal EEGs or caregiver’s report of
seizures, those samples were often collected 1–2 weeks following those
clinical events. The metadata available for each sample included but
was not limited to: TSC mutation type, family history of TSC, epilepsy
in the family, presence of seizures (clinical or subclinical), presence
of tubers, presence of radial migration lines, white matter
abnormalities, gender, grouping (observational or randomized study
arm), treatment type (conservative or preventative), infantile spasms,
drug-resistant epilepsy (defined as either: 1) ongoing seizures despite
the administration of 2 ASM with appropriate doses; or 2) special
treatment (epilepsy surgery, ketogenic diet, ACTH)), treatment
sequence, vigabatrin treatment, and presence of autism spectrum
disorder (ASD) and/or developmental delay (DD) by psychological
examination.
Aliquots of serum samples were processed for proteomic analysis as
follows. Samples were thawed and centrifuged for 10 min at 16,000x g at
4 °C. 320 µL MARS buffer A (Agilent Technologies) were added to a
Cellulose Acetate spin filter (0.22 µm, Agilent Technologies). 80 µL of
the centrifuged serum sample was transferred to the MARS buffer A
filled spin filters and the mixture centrifuged for 3 min at 16,000x g
at 4 °C. The filtrate was transferred into a HPLC vial and stored at
4 °C until affinity chromatography depletion.
Affinity chromatography was performed using MARS 14 depletion columns
(Agilent Technologies; Cat: 5188-6558) on an 1100 Agilent HPLC system
following the manufacturer’s instructions. Flowthrough and bound
fraction were collected and precipitation of proteins performed
overnight at 4 °C using Trichloro-acetic acid (TCA). Precipitate was
washed twice using 90% Acetone (v/v), air dried and suspended in 50 µL
8 M Urea and frozen at −20 °C.
Protein concentration was measured using Bradford assay. 10 µg of
protein were reduced with 4.4 mM tris(2-carboxyethyl)phosphine (TCEP)
at room temperature and afterwards alkylated for 30 min in the dark
with 5.9 mM iodacetamide. 2 µL of Lys-C (0.1 µg/µL; Wako Chemicals) was
added to the mixture and a predigestion performed for 2 h at 37 °C. The
mixture was then diluted with 56 µL 50 mM triethylammonium bicarbonate
and trypsin digestion started using 2.5 µL trypsin (0.04 µg/µL;
Promega) at 37 °C overnight. 2.5 µL fresh trypsin (0.04 µg/µL) was
added the next day. After further 4 h of incubation at 37 °C, digested
samples were frozen at −20 °C until acidification with 5 µL 20% (v/v)
formic acid and LCMS-analysis.
2 µg of serum peptides were analyzed via LCMS on a Dionex 3000 Ultimate
HPLC System (Thermo Scientific) coupled to an Orbitrap XL mass
spectrometer (Thermo Scientific) in a data-dependent acquisition mode.
Injected peptides were washed on a temperature-controlled (5 °C)
µ-precolumn (C18 PepMap 100, 5 µm, 100 Ȧ; Thermo Scientific) with a
flow rate of 25 µL/min using a loading buffer composition 0.5% formic
acid and 0.5% acetonitrile in ddH2O. After 4 min peptides were eluted
over an analytical reversed-phase column (ReproSil-Pur 120, C18-AQ, 3
μm 500 × 0.075 mm; Dr. Maisch GmbH) at 450 nL/min (Buffer A: 0.1%
formic acid in ddH2O; Buffer B: 0.1% formic acid in acetonitrile) using
the following gradient: 1–26% B in 149 min, 26% to 46% B in 74 min,
46–90% B in 1 min, holding 90% B for 11 min, 90–1% B in 0.5 min and
holding 1% B for 11.5 min. To reduce the amount of carryover between
each sample, an injection of 10 µL 80% acetonitrile and a 30 minute
wash run was performed using the following gradient: 1% to 90% B in11
min, holding 90% B for 5 min, 90–1% B in 1 min, holding at 1% B 9 min.
The Orbitrap XL was run in positive mode in a data-dependent
acquisition manner. A full scan (375–1600 m/z) in the orbitrap analyzer
at a resolution of 30,000 (maximum injection time 100 ms, AGC target
value 10^6) was followed by up to 10 MSMS in the ion trap analyzer
(minimal intensity threshold 10,000 counts, Isolation width 2.5 m/z,
normalized collision energy CID 2.5 eV, maximum ion injection time
25 ms, AGC target value 10^4). Charge state screening was enabled and
only precursors with a charge state of 2 and higher were allowed for
data dependent fragmentation. Additionally dynamic exclusion was
enabled with a repeat count of 2 in 30 s before exclusion of the
precursor (± 5 ppm) for 900 sec. Early expiration was allowed to remove
precursors from the exclusion list after an expiration count of 7 full
scans with a signal-to-noise ratio of lower than 3.
To assure quality over the processing pipeline in each batch of
processed samples a serum pool derived from various female and male
subjects with and without diagnosed TSC between the age of 3 months and
3 years of age was also processed as described above. These control
samples were measured in a randomized manner in between every 5–10
samples.
Raw files were processed using MaxQuant software (version
1.6.1.0)^[235]33,[236]34 with the following settings: carbamidomethyl
of cysteine as fixed modification; variable modification of methionine
oxidation and deamidation of glutamine and asparagine; match between
runs was enabled with a matching window of 2 min and an alignment
window of 20 min; peptide spectrum match FDR, as well as peptide and
protein FDR were set at 1%, LFQ with a minimum ratio count of 1.
Peptide spectrum matches were analyzed against the MaxQuant
contamination database and a fasta file consisting of only reviewed
entries from Homo sapiens (Uniprot, downloaded 3^rd September 2018).
Downstream processing was performed in Excel, Perseus^[237]35 software
and R.
Blood samples for transcriptomic analysis and processing were collected
and stored in PAXgene tubes. The total RNA was isolated using the
PAXgene blood RNA kit in accordance with the manufacturer’s guidelines
(Qiagen). Library preparation and sequencing was completed at
GenomeScan (Leiden, the Netherlands). Sequencing libraries were
prepared using the NEBNext Ultra Directional RNA Library prep Kit for
Illumina (New England Biolabs) in accordance with the manufacturer’s
guidelines. Globin mRNA was removed using Invitrogen’s Globin clear kit
(Waltham) and mRNA was isolated from total RNA using oligo-dT magnetic
beads. After fragmentation of the mRNA cDNA synthesis was performed.
Sequencing adapters were then ligated to the cDNA fragments, and the
resultant product was subject to PCR amplification. Clustering and DNA
sequencing was performed using the Illumina HiSeq 4000. All samples
were subjected to paired-end sequencing with a read length of 151
nucleotides.
Processing of the raw RNA-seq was carried out by GenomeScan using
custom in-house scripts. Read quality was assessed using an in-house QC
v1 tool and Trimmomatic v0.30 was used to filter out reads of low
quality and any contaminating adapter sequencers^[238]36. The reads
that passed the quality control steps were then aligned to the homo
sapiens reference genome (GRCh37) using Tophat v 2.0.14^[239]37. Next,
featureCounts from the Subread package was used to calculate the number
of reads that aligned to each gene producing an unnormalized gene count
matrix^[240]38. The unnormalized count matrix was then normalized for
library size using the trimmed mean of M-values (TMM) and then
converted to count per million values using edgeR^[241]39. The median
number of reads per sample that aligned to known genes was 45 (IQR:
36–74) million reads. Samples with less than 15 million reads aligning
to known genes, and one low-quality sample, were discarded.
Deconvolution of the bulk RNA-sequencing was performed using
CIBERSORTx^[242]17. The CIBERSORTx module “Cell Fractions” was used to
enumerate the proportions of distinct cell subpopulations in the bulk
RNA-Seq profile. This is performed by providing a matrix that has
signature genes derived from either single-cell transcriptomes or
sorted cell populations. For the deconvolution of the EPISTOP RNA-Seq
the publicly available LM22 signature matrix was used. LM22 is a
signature genes file consisting of 547 genes that accurately
distinguish 22 mature human hematopoietic populations and activation
states, including seven T cell types, naïve and memory B cells, plasma
cells, NK cells, and myeloid subsets.
miRNA PCR array analysis was performed on 45 miRNAs (Supplementary
Data [243]1, [244]10). The 45 miRNAs were chosen based on small RNA-seq
data from a subset of the serum samples, or possible relevance to
epilepsy, seizures or cognitive functioning according to
literature^[245]40–[246]44. RNA was isolated from serum samples using
the miRNeasy Serum/Plasma kit (Qiagen). cDNA was synthesized using the
miRCURY LNA SYBR Green PCR Kit and miRNA levels were determined using
miRCURY LNA miRNA Custom PCR panels (Qiagen). Plates were run on a
Roche LightCycler 480 thermocycler (Roche Applied Science).
Quantification of data was performed using the computer program
LinRegPCR^[247]45 in which linear regression on the Log (fluorescence)
per cycle number data is applied to determine the amplification
efficiency per sample. Inter-plate variability was normalized using the
inter-plate calibrator assay (UniSp3 IPC) according to the
manufacturers’ guidelines. The starting concentration of each product
was divided by the starting concentrations of two ‘spike-in’ miRNAs
(cel-miR39 and UniSp6) that were added during RNA isolation and cDNA
synthesis, respectively, to protect against technical variation and to
normalize expression patterns.
Targeted metabolite analysis of serum samples was performed as
described^[248]46. Briefly, 100 µL of isolated serum was stored at
−80 °C until processing; samples were thawed at 4 °C and spiked with
10 µM alanine D3 and succinic acid D4 and then centrifuged for 10 min
at 14,000x g. The supernatant was then added to −80 °C cooled methanol
to result in an 80% (vol/vol) methanol solution, gently shaken, and
incubated overnight at −80 °C. Samples were then centrifuged for 10 min
at 14,000x g (twice if separate liquid phases were observed).
Supernatant was collected and stored at −80 °C before being lyophilized
under no heat for 3–4 h. Pellets were then suspended in 20 µL HPLC
grade water, centrifuged at 14,000x g for 10 min, and 5 µL of
supernatant submitted for mass spectrometry analysis coupled to liquid
chromatography.
Samples were analyzed using a hybrid 6500 QTRAP triple quadrupole mass
spectrometer (AB/SCIEX) coupled to a Prominence UFLC HPLC system
(Shimadzu) via selected reaction monitoring (SRM) of a total of 298
endogenous water-soluble metabolites for steady-state analyses of
samples. Some metabolites were targeted in both positive and negative
ion mode for a total of 307 SRM transitions using positive/negative ion
polarity switching. ESI voltage was +4950 V in positive ion mode and
–4500 V in negative ion mode. The dwell time was 3 ms per SRM
transition and the total cycle time was 1.55 seconds. Approximately
10–14 data points were acquired per detected metabolite. Samples were
delivered to the mass spectrometer via hydrophilic interaction
chromatography (HILIC) using a 4.6 mm i.d x 10 cm Amide XBridge column
(Waters) at 400 μL/min. Gradients were run starting from 85% buffer B
(HPLC grade acetonitrile) to 42% B from 0 to 5 min; 42% B to 0% B from
5 to 16 min; 0% B was held from 16 to 24 min; 0% B to 85% B from 24 to
25 min; 85% B was held for 7 min to re-equilibrate the column. Buffer A
was composed of 20 mM ammonium hydroxide/20 mM ammonium acetate (pH =
9.0) in 95:5 water:acetonitrile. Peak areas from the total ion current
for each metabolite SRM transition were integrated using MultiQuant
v3.0 software (AB/SCIEX) and used as the basis for quantitative
analysis. All metabolites with >50% missing values across all 357
samples were dropped from analysis, leaving a total of 249 metabolites
for further consideration. Repeated measurements for some of the
control samples were performed as a quality check between acquisition
batches.
To remove observed batch and Vigabatrin effects in the metabolite data
set a correction was applied. Individual samples were annotated with
their respective batch number and whether the subject was on
vigabatrin. A Z-score correction of samples from individual batches was
performed on log2-transformed peak area values followed by un-Z-scoring
using the standard deviation and mean of all samples prior to
Z-scoring. VGB correction was then performed separately.
Age correction was applied using two different Methods: A Z-score
un-Z-score method, and a Linear Mixed Models approach. For the Z-score
un-Z-score method, samples were divided into three tertiles of age
groups: 0–10 weeks, 11–40 weeks, >40 weeks of age. Z-scoring of log2
transformed intensity/integrated peak area/count values was performed
in individual age groups and followed by un-Z-scoring using the mean
and standard deviation of all applicable samples prior to Z-scoring.
Linear Mixed Models (LMM) was also used, since there were repeated
measurements on the same subject over time. The LMM approach permits
characterization of time trends both within and between subjects. The
formula for the LMM model for the i-th subject is as follow:
[MATH: Yi=Xiβ+Ziui+εi :MATH]
Where:
Y[i] - vector of responses, here – measured intensity/peak area/counts
β - fixed effects (time-variant), constant across individuals; here –
the age at the moment of sampling,
u[i] - random effect (time-invariant), grouping factor; here –
patient’s code.
We estimated the random and fixed effects using lme4 and lmerTest
packages in R. We have treated the patient’s code as a random effect
and the age as a fixed effect with the application of the following
formula:
[MATH: lmerTest::lmer(intensity~sample_age_w+(1∣code))
:MATH]
1
By transforming the equation, we can derive a formula for intensity
correction, depending on the sampling time and accounting for
inter-patient variability:
[MATH: icorr=i−aget*a−bID :MATH]
2
where:
i - intensity
t - time of sampling
a – fixed effect for each week of age
b - model intercept (random effect for each patient)
ID - patient’s ID
The assessment of developmental patterns in analytes involved several
steps. The first step was to apply the Kruskal-Wallis Rank Sum test,
alongside Dunn’s post hoc test and Benjamini-Hochberg correction for
multiple comparisons for each analyte and age category. Then for
analytes with statistically significant differences among age groups,
normalization was performed on all values for each molecule using
Z-score standardization. To group molecules demonstrating similar
patterns, the medians of standardized values were used as input for
hierarchical clustering analysis. The process of hierarchical
clustering using the Complete linkage method was executed on a
dissimilarity matrix derived from Euclidean distance calculations.
For two-group comparisons, first, we verified the t-test assumptions:
data normality using the Shapiro-Wilk test and variance homogeneity
with F-test. If both criteria were met, we applied the t-test;
otherwise, we applied the Wilcoxon rank sum test. Both tests were set
as two-tailed. In case of three-group comparisons, we implemented the
Kruskal-Wallis Rank Sum test followed by Dunn’s post hoc test. We
applied correction for multiple comparisons inside each test hypothesis
using Benjamini-Hochberg procedure.
The median fold change between groups was calculated for each
comparison to extract the variables with the highest clinical
usability. Only variables with a median fold change greater than 1.5
(either up- or downregulated) between tested groups and p-value FDR
lower than 0.05 were considered significant and presented in the
results section. Due to differences in valid values for each individual
analyte tested, sample sizes (n) changed depending on missingness.
All calculations were performed using R (v. 4.1.0), with the packages
tidyverse (general analysis), stats (statistical tests and hierarchical
clustering), gplots (for heatmaps preparation), factoextra (for PCA
analysis), ggsci (Figure color palette), ggdist (raincloud plots).
Pathway enrichment analysis was performed in R using packages
biomaRt^[249]47,[250]48, ReactomePA^[251]49 and
clusterProfiler^[252]50,[253]51. In case of metabolites, the
relationship between metabolites and pathways was obtained from the
Reactome website^[254]52 and pathway enrichment was assessed by
performing a hypergeometric test.
Classifier analysis was performed as follows. Initial on-study subject
samples (median age: 4.7 weeks, Q1-Q3: 2.1–8.0 weeks) were analyzed by
multivariable analysis. Epilepsy status was assessed at age 2 years,
the end of the study. Patients who developed clinical or subclinical
seizures by age 2 years had status of seizures present, else their
status was seizure-free.
Seven subjects were excluded from this analysis due to occurrence of
subclinical seizures at the time of study entry. 14 additional subjects
were excluded because they had received Vigabatrin on a pre-symptomatic
basis, and never developed seizures. Additionally, for 7 patients no
comparable sample was available for inclusion. Hence, a total of 65
samples were analyzed, 54 from patients who developed seizures during
the 2 year follow-up period, and 11 who did not. Sample size fluctuated
due to missing data. Multiple data types were included in this
analysis: Proteomics, metabolomics, transcriptomics, miRNA data,
genotype information for 86 SNPs (Supplementary Data [255]1), TSC
mutation status, and clinical data.
The 86 SNPs were selected based on review of published genetic
association studies (review performed using PubMed literature database
in April to May 2019). Selected SNPs had to have an allele associated
with predisposition to epilepsy of any kind: focal or generalized
epilepsy (including absence and juvenile myoclonic epilepsy), infantile
spasms, or febrile and other types of seizures (Supplementary
Data [256]11). The association finding had to be made in: (i) at least
one genome-wide association study (GWAS), genome-wide meta-analysis
association analysis, genome-wide mega-analysis association study, or
candidate SNPs meta-analysis association study; or (ii) at least two
independent candidate SNP association studies (Supplementary
Data [257]11). SNPs were selected from studies performed in populations
of any ethnicity, with no limits on population allele frequency or
predicted risk effect, but had to be statistically significant. SNP
genotypes were extracted from the EPISTOP WGS data^[258]9, and were
classified into two categories, either (i) major allele homozygote or
(ii) heterozygote or minor allele homozygote.
For the classifier analysis each dataset was processed in the following
way: first, constant features (columns with constant values spread
across the whole dataset, or, with variance equal to zero) were
excluded. Next, each variable was standardized (transformed to common
scale), so the final values in each column had zero-mean and unit
variance. More extensive filtering was used for the RNA-Seq dataset,
since there were ~63,000 initial values. In the first filtering step,
only genes with more than three reads in at least 75% of samples were
retained. In the second step, a 2-fold median change between the
seizure and no-seizure groups was required. Together, these filters
reduced the number of RNA species under consideration to 267.
Analytes were selected for inclusion in the classifier analysis using
univariate Area-under-the curve (AUC) metric to assess the ability of
each variable to predict seizure appearance. Analytes had to meet an
AUC threshold > 0.6 to be retained. In addition, no more than 30
analytes of each type could be selected, and they were down-sampled for
inclusion based on AUC rank. This led to retention of 126 variables for
inclusion in the classification algorithm from all analyte categories.
We used a logistic regression binary classification algorithm to
generate a three-variable predictor of seizure development.
[MATH: lnp1−
p=β
mi>0+β1x1+β
2x2
mrow>+β3x3
:MATH]
3
Where:
p – expected probability,
x[1,..,3] – independent variables,
β[0,…,3] – regression coefficients.
Logistic regression models the probability of an event (here: seizure
development). If the calculated probability is higher than a defined
threshold value, then the classifier predicts that patient will develop
seizures (positive class). Otherwise the algorithm predicts that there
will be no seizures (negative class). The probability threshold between
positive and negative class was adjusted for each classifier separately
due to high size imbalance between the classes.
The 126 analyte variables were combined into 1-, 2- and 3-predictor
models. Over 330 thousand combinations were assessed for their
association with seizure occurrence.
To validate our findings, we used the Monte-Carlo cross-validation
method (repeated random subsampling). Each model was scored 100 times
with the split of 2/3 in the training set and 1/3 in the test set with
stratification to maintain the positive/negative case ratio in both
sets. The goodness of fit with observed outcome was evaluated with
Matthews Correlation Coefficient^[259]19:
[MATH: MCC=TP ⋅ TN−FP
⋅ FN(TP+FP)⋅(TP+FN)⋅(TN+FP)⋅(TN+FN), :MATH]
4
Where:
MCC - Matthews Correlation Coefficient,
TP – true positives; samples correctly classified as positive (patients
with seizures classified correctly),
TN – true negatives; samples correctly classified as negative (patients
with seizures labelled as no seizures),
FP – false positives; samples incorrectly classified as positive
(patients without seizures classified as with seizures),
FN - false negatives; samples incorrectly classified as negative
(patients without seizures classified correctly).
Matthews Correlation Coefficient ranges from −1 (complete disagreement
between prediction and observation) to 1 (complete agreement), with 0
meaning lack of correlation.
As the final metric for each classifier we chose the mean test MCC
value obtained from Monte-Carlo cross-validation technique.
Additionally, for each model, we present mean misclassification error
(MMCE), positive predictive value (PPV) and negative predictive value
(NPV). Missing values in the negative predictive value are related to
the low number of negative examples in the test dataset. Due to the
limited number of negative examples (n[max] for no seizures group =
12), the classifier sometimes incorrectly labeled all cases as
positives, which caused the lack of true/false negatives (patients
without seizures) in the formula for NPV calculation.
Due to the lack of an independent dataset for validation, we validated
our results with a permutation test. To calculate the sampling
distribution under the null hypothesis, we permuted the outcome
variable (seizure status) and repeated the whole experiment 30 times –
starting from feature selection to a model evaluation with subsampling
(subsampling was done 50 times for each model to reduce the computation
time). This resulted in about 16 million models. The test-statistic was
set to the mean test MCC. The significance level was set at 0.05. A p
value represents the fraction of models with random treatment
allocation where the mean test MCC was equal or higher than in the
models created for original data. Based on the permutation results, the
calculated test statistics for this experiment had to exceed the mean
MCC test equal to 0.73 to be considered significant.
The volume of data collected in the EPISTOP project consisted of about
33 TB. The experiment was designed in an R environment (version 4.1.0)
using tidyverse, mlr, parallel and plotly packages (among all). We used
packrat for package version control. The dedicated computing server
consisted of 40 cores, 216 GB RAM and an additional 216 TB for data
storage. Complete code for the project is stored in the Github
repository ([260]https://github.com/JagGlo/molecular_EPISTOP).
Reporting summary
Further information on research design is available in the [261]Nature
Portfolio Reporting Summary linked to this article.
Supplementary information
[262]Supplementary Information^ (6.9MB, pdf)
[263]Peer Review File^ (546.1KB, pdf)
[264]41467_2023_42855_MOESM3_ESM.pdf^ (20.1KB, pdf)
Description of Additional Supplementary Files
[265]Supplementary Data 1^ (49.2KB, xlsx)
[266]Supplementary Data 2^ (15.4KB, xlsx)
[267]Supplementary Data 3^ (1.1MB, xlsx)
[268]Supplementary Data 4^ (302.5KB, xlsx)
[269]Supplementary Data 5^ (22.2KB, xlsx)
[270]Supplementary Data 6^ (28.3KB, xlsx)
[271]Supplementary Data 7^ (42.4KB, xlsx)
[272]Supplementary Data 8^ (31.7KB, xlsx)
[273]Supplementary Data 9^ (24.2KB, xlsx)
[274]Supplementary Data 10^ (11.7KB, xlsx)
[275]Supplementary Data 11^ (15.8KB, xlsx)
[276]Reporting Summary^ (278.3KB, pdf)
Source data
[277]Source Data^ (2.9MB, xlsx)
Acknowledgements