Graphical abstract
graphic file with name fx1.jpg
[45]Open in a new tab
Highlights
* •
enONE, a computational framework of NAD-RNA, reduces technical
noise using spike-in
* •
Human peripheral blood cells contain NAD-RNA and tend to increase
with age
* •
RNA-based aging clock, integrating gene expression with NAD-RNA,
can predict age
* •
NAD-capped RNA from circulating blood can be developed as potential
biomarkers
__________________________________________________________________
Computational bioinformatics; Sequence analysis; Transcriptomics;
Methodology in biological sciences
Introduction
Nicotinamide adenine dinucleotide (NAD), an adenine nucleotide
containing metabolite, can be incorporated into the RNA 5′-terminus to
result in NAD-capped RNA (NAD-RNA),[46]^1^,[47]^2 which is different
from eukaryotic canonical cap structure formed by 7-methylguanosine
(m^7G) via a 5′-to-5′-triphosphate bridge (m^7G-RNA).[48]^3^,[49]^4 It
has been estimated that NAD-capped forms constitute approximately 0.6%
and 1.3% of the transcripts expressed in the mouse liver and kidney,
respectively.[50]^5 To capture such low-level capping events, the
recently developed NAD-RNA identification methods involve multi-steps
of chemo-enzymatic reaction, followed by affinity-based
enrichment.[51]^6^,[52]^7^,[53]^8 As a consequence, the resulting
high-throughput sequencing data can be hampered by the effect of
capture procedures and other unwanted variations, e.g., the batch
effect. Given these limitations, current computational methods cannot
be directly applied to the omic-level quantitative assessment of
NAD-capped RNAs.
Proper normalization may serve to remove unwanted variations from
enrichment-based sequencing methods. N^6-methyladenosine (m^6A), a
known epitranscriptomic modification in RNA, has been extensively
characterized in virus and eukaryotic organisms by antibody-based
immunoprecipitation and sequencing.[54]^9 Computational tools for
m^6A-seq, e.g., RADAR[55]^10 and m^6A-express,[56]^11 employ a split
scaling strategy that calculates scale factors for input and
enrichment, respectively, to adjust the variations from enrichment
procedures and sequencing depth. However, these analytical methods
cannot properly account for the unwanted variation between samples with
and without enrichment, thus challenging the identification of
enrichment signals. More generally, current analyses of
epitranscriptomic data are mostly based on normalization designed for
bulk RNA-seq, e.g., scaling-based methods, such as total count (TC),
trimmed mean of M values (TMM),[57]^12 and DESeq.[58]^13 The implicit
assumption underlying scaling-based methods is that all the gene-level
counts are proportional to scale factors and that the between-sample
variations can be adequately adjusted by scale factors. Unfortunately,
this assumption is inevitably violated when affinity-based enrichment
selectively amplifies the signal of genes, e.g., m^6A and NAD capping,
which leads to disproportional gene counts between input and
enrichment. Another regression-based method, remove unwanted variation
(RUV),[59]^14 regresses gene count measurements on unwanted factors,
thus computing corrected expression values from the residuals. The
implicit assumption underlying this method is that a set of negative
controls, which are not affected by covariates of interest, is
available, such as the spike-in from the External RNA Controls
Consortium (ERCC). However, the ERCC-based method suffers from
discrepancies between endogenous transcripts and spike-in, hindering
its usage in omic-level profiling. Limited by current analytical
methods, nuisance variations fail to be properly corrected for
enrichment-based epitranscriptomic sequencing data, thereby obscuring
true biological signals.
NAD, an essential metabolite and redox regent, has been found to
decline with age in crucial tissues of mammals including human.[60]^15
Given the dynamics of NAD and gene expression over the course of adult
lifespan, NAD-capped RNA, poised to integrate metabolomics and
transcriptomics, may provide novel insights into physiological and
perhaps pathological situations. Thus, it is tempting to explore how
NAD-modified epitranscriptome is modulated with age. However, analysis
of NAD-capped RNAs from heterogeneous populations, particularly those
from aging human cohort, presents an additional challenge due to the
interrelated variations from individual heterogeneity, impacts of
aging, and the technical noises inherent in affinity-based enrichment
procedures. Therefore, such complex datasets require a new analytical
strategy for disentangling the covariates of interest.
In the present study, we propose enONE computational framework for
NAD-capped RNA data analysis by efficiently removing the impact of
affinity-based enrichment variations using spike-ins. By identifying
genes that capture unwanted variations from exogenous spike-in,
referred to as the “anchor set”, enONE can transform data into a shared
space, in the presence of extensive technical variations, e.g., batch
effect. Based on the spike-in anchor set, enONE integrates global
scaling and regression-based normalizations, followed by performance
evaluation that selects the local-optimal normalization method to
remove unwanted variation. As a computational approach, enONE, combined
with NAD-RNA capture and sequencing technology, particularly our
recently developed ONE-seq method,[61]^8 can facilitate quantitative
analysis of NAD-RNA profiles. Using human aging cohort, we apply enONE
to the identification of NAD-RNAs from human peripheral blood
mononuclear cells (PBMCs), revealing critical features and dynamics of
NAD modification with age. We further combined transcriptomic and
epitranscriptomic features, to develop an RNA-based clock for the
prediction of human chronological age.
Results
The workflow of enONE
We designed a computational framework that implemented and evaluated
the performance of various spike-in-based normalization methods for the
analysis of NAD-RNA sequencing data. We thereby named our analytical
method enONE, for Epitranscriptional NAD-capped RNA analysis by
spike-in-based Omic-level Normalization and Evaluation ([62]Figure 1).
Figure 1.
[63]Figure 1
[64]Open in a new tab
The workflow of enONE
enONE starts with quality control to obtain high-quality sequence reads
by removing outliers and lowly expressed genes. enONE performs
spike-in-based normalization that integrates global scaling and
regression-based procedures to generate normalization toolsets. Next,
enONE uses eight data-driven metrics to evaluate normalization
performance. By exploiting the full space of normalizations, enONE
identifies the optimal procedure that maximally removes unwanted
variations, while minimally impacting the signals from NAD-RNA-seq
data.
enONE initiates with a quality control step, to remove outlier samples
and lowly expressed transcripts. Second, a subset of genes from total
RNA spike-in, with their expressions presumably not being influenced by
the covariates of interest (e.g., enrichment assay or biological
condition), are used as the anchor set to estimate unwanted variation
(e.g., batch effect). A generalized linear model is then applied to
regress the observed read counts from anchor set on the unknown
nuisance variables to estimate factors that are subsequently used by
the normalization tools for the adjustment of unwanted variation.
Third, a two-part normalization template is employed to define an
ensemble of the normalization procedures: (1) global scaling of read
counts to account for between-sample difference in sequencing depth and
other parameters of the read count distribution and (2)
regression-based adjustment for unwanted variations. For instance, one
can apply a robust scaling procedure, such as TMM, followed by
unsupervised procedures to estimate hidden unwanted variations and
regress them out of the data (e.g., RUV[65]^14). Fourth, enONE compares
different normalization tools to identify sets of top-performing
procedures. Specifically, enONE calculates ranks based on eight
performance metrics that represent the local-optimal trade-offs toward
removing unwanted variation, preserving biological variation of
interest, and maintaining minimal technical variability of global
expression. Combined, enONE utilizes a data-driven approach to
determine appropriate normalization procedures for the quantitative
analysis of NAD-modified epitranscriptome.
Epitranscriptomic profiling of human PBMCs
To gain insights of NAD-modified epitranscriptome during aging, we
collected human PBMCs from an aging cohort of community subjects
comprising of young (N = 23, age: 23–32), middle (N = 20, age: 40–50),
and old (N = 18, age: 54–67) individuals for epitranscriptome-wide
profiling of NAD-RNAs ([66]Figure 2A), according to the inclusion
criteria approved by the Ethics Committee. Clinical characteristics of
the participants were evaluated and listed in [67]Table S1.
Figure 2.
[68]Figure 2
[69]Open in a new tab
The feasibility of enONE
(A) Aging Cohort and experiment outline. A total of 61 participants
(female/male = 31:30, age 23–67) were enrolled for NAD-modified
epitranscriptome profiling. Total RNAs from PBMCs were mixed with
Drosophila spike-in RNA, and two synthetic spike-in, one with 5%
NAD-capped forms and another with 100% m^7G-capped forms. The mixture
was subjected to NAD-RNA-seq, followed by enONE computational analysis
of NAD-RNA profiles (highlighted in red).
(B) Scatterplots of first two PCs for the 1,000 least significantly
enriched genes (denoted as anchor set) from Drosophila spike-in.
(C) Boxplot showing the normalization performance based on anchor set
of different sizes from Drosophila spike-ins.
(D) Fraction of the first two expression PCs variance explained values
(taken cumulatively) for linear regression model using enrichment
variable as the explanatory variable. PCs were computed from the
variance stabilizing and transformed matrix of spike-in counts.
(E) Scatterplots of first two PCs for Drosophila spike-in counts from
different procedures.
(F) The R^2 of linear regression between batch effect and up to the
first six PCs (taken cumulatively).
(G) Spearman’s correlation coefficients between the raw/normalized
counts and batch effect.
(H) The number of genes varied among bathes as inferred by ANOVA
(p < 0.01).
(I) Boxplot showing the relative expression levels of synthetic
spike-ins with 5% NAD-caps and that with 100% m^7G-caps from different
methods.
Since NAD capping is an evolutionary conserved RNA modification,[70]^16
we reasoned that exogenous total RNA from an invertebrate species can
be used as common standards to mitigate unwanted variations from human
samples. We deliberately included three types of spike-in RNAs: (1)
total RNA from Drosophila melanogaster, an invertebrate model organism
with well-annotated genome sequence; (2) a synthetic RNA consisting of
5% NAD relative to m^7G-capped forms, used to determine the capture
sensitivity; and 3) a synthetic RNA with 100% m^7G-capped form, used to
determine the capture specificity ([71]Figure 2A). Notably, spike-ins 2
and 3 were synthesized with templates from different sequences.
Combined, we subjected total RNAs from human PBMCs, Drosophila,
spike-in 2, and spike-in 3 to ONE-seq-based capture and sequencing,
followed by enONE computational analysis of NAD-RNA profiles.
After quality control, we obtained an average about 49.2 million
high-quality and uniquely mapped sequencing reads from each library
([72]Figure S1A). Assessment of datasets corroborated that sequencing
saturation has been reached ([73]Figure S1B). Spike-in 2, which
contained 5% NAD-capped form, was significantly enriched, whereas no
enrichment was found for spike-in 3 made up with 100% m^7G-RNA
([74]Figure S1C). Above evidence highlighted the sensitivity and
specificity of the enrichment experiment, as reflected by the
enrichment of NAD-capped, but not m^7G-capped, transcripts.
The feasibility of enONE
Since all samples were added with equal amounts of Drosophila total RNA
as spike-in, its disconcordance, if present, can be used to pinpoint
the nuisance technical variation in an epitranscriptome-wide manner,
and its concordance, on the other hand, can be used to validate the
effect of normalization. To capture unwanted variation, i.e., batch
effect, we used a set of genes (n = 1,000) whose expression patterns
should be highly reproducible and now become differed among batches as
the anchor set ([75]Figure 2B). In addition, we showed that
normalization procedures were robust when the enrichment effect
accounted for a small fraction of the anchor set variance, e.g., anchor
set size ranged from 500 to 2,500 ([76]Figures 2C and 2D). With anchor
set from Drosophila spike-in RNA, we implemented enONE normalization
procedures with five scaling toolsets, including TC, UQ, TMM, DESeq,
and PoissonSeq,[77]^17 as well as three regression-based procedures,
namely RUVg, RUVs, and RUVse. By integrating two normalization modules,
we generated a total of 96 combinatorial procedures for the current
data ([78]Figure S2A). By inspecting the full space of normalization
performance metrics, we found that the top-ranked procedure involved
DESeq scaling followed by RUVg adjustment for the first 4 factors of
unwanted variation ([79]Figures S2B and S2C). Additionally, this
integrated procedure showed that the combination of global scaling and
regression-based procedures could improve the normalization
performance.
To compare the effect of normalization, we applied three different
methods—RUVg (k = 4), a regression-based method; RADAR, a scaling-based
method; and enONE—on all three types of spike-in RNAs. Compared to
other procedures, enONE normalization dramatically mitigated the batch
effect of Drosophila spike-in in both input and enrichment libraries,
while preserving the enrichment signals ([80]Figure 2E). In Drosophila
spike-in, linear regression analysis between the first six PCs
cumulatively and batch effect showed that batch effect explained over
83% of the variations in raw count data. This variation was reduced to
approximately 24% and 52% with the RUVg and RADAR methods,
respectively. Significantly, enONE normalization further mitigated the
variance explained by batch effect to approximately 17%
([81]Figure 2F). Analysis of correlation between individual gene
expression values from Drosophila spike-in and batch variation revealed
a large proportion of genes showing strong correlations with batch
effect in raw, RUVg, and RADAR datasets, whereas this correlation was
dramatically diminished by enONE ([82]Figure 2G). ANOVA was performed
on Drosophila spike-in datasets from different batches to evaluate the
impact of batch effect on gene expression measurements. Ideally, there
should be minimal difference in gene expression across batches;
however, over 5,800 genes were found to be differentially expressed in
the raw dataset, reflecting strong batch effect. enONE was able to
abolish the batch effect by reducing 96% of such differentially
expressed genes, whereas RUVg and RADAR could reduce 90% and 62% of
these genes, respectively ([83]Figure 2H). Since synthetic spike-ins
were not used to estimate unwanted variation factors, their concordance
could serve as an independent validation of normalization performance.
The boxplots of raw relative log-expression showed large differences
among synthetic spike-ins with equal input amounts ([84]Figure 2I).
Compared to RUVg and RADAR, enONE clearly reduced the overall
variations of synthetic spike-ins as illustrated by the lower
interquartile range. Additionally, enONE normalization preserved the
distribution differences between input and enrichment samples of
NAD-capped, but not m^7G-capped, spike-ins, indicating that enONE could
retain enrichment signals from transcripts with NAD-cap, but not
m^7G-cap ([85]Figure 2I).
We next compared the performance of enONE with other normalization
approaches using human PBMCs dataset. Although the RUVg and RADAR
reduced the variations, both methods had limitations; specifically,
RUVg failed to preserve the enrichment signal and RADAR overcorrected
the variation among enrichment samples ([86]Figure 3A). Principal
component analysis plots illustrated that enONE improved upon the RUVg
and RADAR normalizations in removing the variation of batch effect and
preserving the enrichment signals ([87]Figure 3A). To examine batch
effects in raw and normalized data, we performed linear regression
analysis and ANOVA, which showed that enONE exhibited improved
performance than RUVg and RADAR ([88]Figure 3B). Furthermore, we
evaluated the preservation of covariate of interests, i.e., enrichment
effect. By performing vector correlation analysis between the first six
cumulative PCs and the enrichment covariate, we noted that over 97% of
the enrichment variations were well preserved by enONE ([89]Figure 3B).
Besides, the silhouette coefficient and adjusted Rand index analysis
was used to assess the separation of input and enrichment samples.
Consistently, this result showed that enONE outperformed other
procedures in retaining the enrichment signals ([90]Figure 3C).
Together, these results demonstrate the capacity of enONE in removing
unwanted variation while retaining the covariates of interest.
Figure 3.
[91]Figure 3
[92]Open in a new tab
Performance assessment of different normalizations on human PBMCs
dataset
(A) Top row: scatterplots of first two PCs for human PBMCs counts from
different procedures colored by batch. Bottom row: same as the top row
colored by the enrichment assay.
(B) Left: the R^2 of linear regression between batch effect and up to
the first six PCs (taken cumulatively). Right: the number of genes
varied among bathes as inferred by ANOVA (p < 0.01).
(C) Left: line chart showing the vector correlation coefficient between
enrichment effect and the first six PCs (taken cumulatively). Right:
scatterplot showing silhouette coefficient and adjusted Rand index
(ARI) for measuring the separation of enrichment groups.
Characterization of NAD-RNAs from human PBMCs
We proceeded to set 2-fold enrichment of read counts as the cutoff,
which led us to identify a total of 809 NAD-RNAs from human PBMCs
([93]Figure 4A and [94]Tables S2, [95]S3, and [96]S4). We then
characterized these newly identified NAD-RNAs. In human PBMCs, NAD
capping mostly occurred on protein-encoding genes, but also extended to
pseudogenes and non-coding RNAs, including lincRNA, snRNA, snoRNA, and
miscRNA ([97]Figure 4A). NAD-RNAs were shown to be derived from genes
localized on autosomes, X chromosomes, and the mitochondrion genome,
but not from the Y chromosome ([98]Figure 4B). By dividing NAD-RNAs
into 5 deciles based on enrichment, we observed that genes with short
length and fewer introns tended to have increased modification of NAD
([99]Figure 4C), a pattern consistent with our recent study in mouse
livers.[100]^8 Since previous studies have shown that RNA capping
events can be influenced by the proximal RNA secondary
structure,[101]^18 we examined the sequence feature using the 100 bp
region downstream of the transcription start site. By leveraging the
minimum free energy (MFE),[102]^19 we observed that transcripts with
NAD-cap tended to adopt predicted structures with higher MFE level than
those with m^7G-cap ([103]Figure 4D). Furthermore, transcripts with
increased MFE level positively correlated with the extent of NAD
modification ([104]Figure 4E), suggesting that RNA transcripts with a
stable cap-proximal structure might promote NAD capping. To inspect NAD
modification of genes associated with biological functions, we
performed pathway enrichment analysis, which revealed that NAD-RNAs
from human PBMCs were mainly involved in RNA metabolism, translation,
transcription, energy metabolism, and immune system ([105]Figure 4F and
[106]Table S5).
Figure 4.
[107]Figure 4
[108]Open in a new tab
enONE facilitates NAD-capped RNAs identification
(A) Barplot showing the gene types of identified NAD-RNAs (n = 809)
from human PBMCs.
(B) Circular bar plot showing the chromosomes distribution of
identified NAD-RNAs.
(C) From five deciles based on enrichment, genes with short length and
with fewer introns tended to have increased modification of NAD. p
values were obtained with ANOVA.
(D) Left: for transcripts with 5′ untranslated region (UTR), the 100 bp
region downstream of transcription start site (TSS) is used to compute
the MFE. Using bootstrapping, the background distribution of MFE was
generated based on the transcripts that did not produce NAD-cap. Right:
transcripts that produced NAD-cap have predicted structures with higher
MFE levels than those did not. p value was estimated with bootstrapping
(n = 1,000).
(E) From five deciles based on enrichment, genes with higher MFE levels
tend to have increased NAD capping. p value was obtained with ANOVA.
(F) Pathway analysis reveals the biological processes of NAD-capped
RNAs that are mainly involved in RNA metabolism, transcription,
translation, and immune system. Gray dashed line in the bar plot
indicates the 0.05 FDR cutoff.
Age alters NAD-modified epitranscriptome
To assess the heterogeneity of individual subjects, we utilized the
Jensen-Shannon distance (JSD) metric to evaluate the variation in
transcriptomic or epitranscriptomic distributions relative to their
reference distribution, i.e., centroid.[109]^20^,[110]^21 A JSD value
of 0 indicates an equal distribution between the reference and sample,
while a JSD of 1 represents the greatest difference between the
reference and sample. The transcriptome-derived JSD metrics had an
average value of 0.10 and exhibited no significant correlation with
ages (Pearson’s R = −0.036, p = 0.78). On the other hand, the JSD
derived from epitranscriptome profiles had an average value of 0.12 and
demonstrated a significantly negative correlation with age (Pearson’s
R = −0.29, p = 0.025). These findings suggested a greater
inter-individual heterogeneity in the distribution of epitranscriptome
profiles, compared to transcriptome, and this heterogeneity tended to
decrease with age ([111]Figure 5A). To further explore how NAD-RNAs
were modulated with age and its consequent impact on the progression of
aging, we analyzed NAD-RNA profiles from all age groups. Interestingly,
even though NAD levels in circulating blood exhibited substantial
inter-individual variations with advancing age,[112]^22 we found that
the number of NAD capping events tended to increase in aged human
subjects ([113]Figure 5B). To dissect this observation, we grouped
age-associated trajectories into two major clusters using hierarchical
clustering ([114]Figure 5C and [115]Table S6). For genes in each
cluster, the age-associated dynamic of their transcript abundances and
NAD modification levels did not exhibit positive correlations,
suggesting that the transcriptome and NAD-modified epitranscriptome
entailed distinct facets of the aging process ([116]Figures S3A and
S3B). Furthermore, increased NAD modification was found for genes in
cluster 1, with their function being involved in basic cellular events
and adaptive immune response. NAD capping of genes associated with
aminoacyl-tRNA biosynthesis and cellular respiration, which were
ascribed as cluster 2, remained generally constant with age
([117]Figure 5C and [118]Table S7).
Figure 5.
[119]Figure 5
[120]Open in a new tab
enONE reveals dynamics of NAD-capped RNAs during aging
(A) Left: diagram illustrating the Jensen-Shannon distance. The
Jensen-Shannon distance was used to quantify the dissimilarity between
an individual’s profile and their corresponding reference profile,
represented by the centroid. Right: scatterplot showing the
relationship between age and the Jensen-Shannon distance from
transcriptomic or epitranscriptomic profiles, respectively. The R
represented the Pearson’s correlation coefficients and the p values
were obtained from two-sided Pearson correlation test.
(B) Heatmap showing epitranscriptomic profiles from individuals across
age groups (N = 61). Top bar represented the enrichment signals
(fold-change ≥2) of each sample.
(C) Unsupervised hierarchical clustering was used to group NAD-RNAs
with similar trajectory. Two major clusters were identified, and the
top five non-redundant pathways were listed. The solid line and shaded
region represent the smoothed trajectory and its 95% confidence
intervals, respectively.
(D) Heatmap showing a set of NAD-RNAs (n = 96) that highly associated
with age.
(E) Z-transformed, smoothed trajectories of NAD capping levels (in red)
and expression levels (in black) of selected genes were shown. The
solid line and shaded region represent the smoothed trajectory and its
95% confidence intervals, respectively.
(F) Assessment of gene-specific NAD capping by qRT-PCR. Selected
NAD-RNAs, including age-increased NAD-RNAs (PDIA3 and HSF2), as well as
age-decreased NAD-RNAs (UPF2 and NRGN), were examined. Data were shown
in mean ± s.e.m. (one-sided Student’s t test).
By inspecting the correlation between NAD modification and age, we
identified a set of NAD-RNAs that highly associated with age (n = 96)
([121]Figure 5D). Specifically, select NAD-RNAs, such as those involved
in protein folding (PDIA3), and heat shock response (HSF2) had
increased capping with age ([122]Table S6), but the abundance at their
RNA transcript levels was not increased ([123]Figure 5E). In addition,
NAD capping of genes linked to mRNA decay (UPF2), and calmodulin
binding (NRGN) were decreased during aging ([124]Figure 5E and
[125]Table S6). The expression levels of UPF2 and NRGN also decreased
during aging ([126]Figure 5E).
To validate age-associated changes of NAD-RNA, we utilized a
quantitative PCR strategy slightly modified from the CapZyme-Seq
method.[127]^23 We introduced three types of spike-in RNAs: 1) a
synthetic NAD-capped RNA as the positive control, 2) a synthetic
m^7G-RNA as the negative control, and 3) a synthetic non-capped ppp-RNA
as the baseline negative control. The RNA mixture was subjected to
calf-intestinal alkaline phosphatase treatment to remove phosphate
groups from the 5′ end of non-capped RNAs. Subsequently, NudC
pyrophosphatase was introduced to hydrolyzed NAD-capped RNA, producing
a ligatable 5′ monophosphate on these RNAs, followed by adapter
ligation, adapter-specific cDNA synthesis, and PCR amplification
([128]Figure S4A). Control experiment showed that NAD-capped spike-in
RNA, but not ppp- and m^7G-RNA, was selectively and significantly
enriched ([129]Figure S4B). We proceeded to validate the NAD capping of
four endogenous transcripts, including two transcripts with
age-increased NAD modification (PDIA3 and HSF2), and two with
age-decreased NAD capping (UPF2 and NRGN). In support of our findings
at the epitranscriptome level, we found that NAD capping of PDIA3 and
HSF2 was dramatically higher in aged compared to young individuals. On
the other hand, UPF2 and NRGN had a significant decline of NAD capping
in aged individuals ([130]Figure 5F).
RNA-based prediction of chronological age
Aging is a complex process that entails both functional and molecular
changes. Currently, RNA-based aging clock solely relies on the
longitudinal change of select RNA transcripts, which limits the
accuracy of prediction.[131]^24 Given the fact that NAD-RNAs
intrinsically link metabolite with gene expression, we reasoned that
RNA-based clock, if combined with an additional layer of NAD-modified
epitranscriptome, might substantially improve the performance of
prediction. To do this, we randomly split the aging cohort into two
groups: the discovery cohort (N = 35), which served as the training
set, and the validation cohort (N = 26), which was used for model
validation. Both cohorts were balanced for ages and gender distribution
([132]Table S1). The performance of each model was subsequently
evaluated by the median absolute error (MAE; unit, years) with
leave-one-out cross-validation ([133]Figure 6A). We found that the
transcriptome-based model (MAE = 7.1; Pearson’s R = 0.81 and p =
5.5e-9) and epitranscriptome-based model (MAE = 5.2; Pearson’s R = 0.81
and p = 3.9e-9) could generate prediction of chronological age. We
further combined signature from transcriptome and epitranscriptome,
yielding an improved prediction model (MAE = 4.2; Pearson’s R = 0.90
and p = 1.4e-13) ([134]Figure 6B and [135]Tables S8, [136]S9, and
[137]S10). To assess whether the trained model presented a
dataset-specific bias, the combined model was applied to an independent
validation cohort (N = 26). We obtained a comparable result from the
validation cohort (MAE = 5.7; Pearson’s R = 0.81 and p = 4.5e-7),
suggesting the model can be generalizable to the independent dataset
([138]Figure 6C). By employing machine learning approaches, we
confirmed that our aging clock, which combined both transcriptomic and
epitranscriptomic signatures, exhibited better performance compared to
the models solely based on the transcriptome or epitranscriptome
([139]Figures S5A–S5F). These findings suggested that RNA-based aging
clock with an additional epitranscriptomic layer could lead to improved
age prediction accuracy. We assessed our model using elastic net
regression ([140]Figures 6B and 6C), which showed outperformed result
than models based on ridge, lasso, and random forest regression
([141]Figures S5A–S5F). Specifically, the features of combined aging
clock were comprised of 20 transcriptomic and 11 epitranscriptomic
signatures ([142]Figure 6D). 16 of them positively correlated with the
progression of age, such as transcriptomic signatures involved in
innate immune response (ANXA1), cell cycle (ZC3H12D), and apoptosis
(VRK2). On the other hand, 15 features negatively associated with the
progression of age, such as transcriptomic signatures involved in RNA
processing (SRRM1), GPCR pathway (ARHGEF4), and epitranscriptomic
signatures involved in mRNA decay (UPF2) ([143]Figure 6E). Moreover, we
defined the age gap as subtracting real chronological age of human
subjects from predicted age based on the combined aging clock, to
investigate whether the accelerated (age gap >0) or decelerated
biological ages (age gap <0) were associated with distinct
physiological conditions. Using correlation analysis, we found that the
advanced biological age was significantly associated with lower levels
of blood cell count, such as neutrophil, white blood cell, and
basophil, while positively correlated with blood cholesterol levels,
including total cholesterol, and low-density lipoprotein
([144]Figure 6F and [145]Table S12). Together, these data highlight the
potential of RNA-based aging clock for the prediction of chronological
age and perhaps the physiological condition of individual subjects.
Figure 6.
[146]Figure 6
[147]Open in a new tab
RNA-based age prediction model
(A) Illustration of the development and validation of age prediction
model.
(B) Performance of age prediction models based on transcriptome (left),
NAD-modified epitranscriptome (middle) and the combination of
transcriptome and NAD-modified epitranscriptome (right) in the
discovery cohort.
(C) Performance of the combined age prediction model in the validation
cohort. Each dot represents an individual. The blue solid line and
shaded region represent a regression line and its 95% confidence
intervals, respectively. The black dashed line represents the identity
line. N is the number of individuals and n is the number of features.
Numbers of epitranscriptomic features were colored in red.
(D) Coefficients of features in the RNA-based aging clock that combined
signatures from transcriptome and epitranscriptome.
(E) Boxplot showing the age-associated trend of features contributing
the most variance to the combined aging clock. Epitranscriptomic
features were colored in red.
(F) Age gap (predicted age – chronological age) was strongly associated
with the blood cell count and blood cholesterol levels (|R[s]| > 0.25).
p values were obtained with two-sided Spearman’s correlation test
(∗p < 0.05, ∗∗p < 0.01).
Combined, our study revealed the first NAD-modified epitranscriptome
from human PBMCs. Using enONE, we were able to pinpoint
epitranscriptomic alteration of NAD capping during the aging process.
With a machine learning approach, we build an accurate RNA-based aging
clock combing signatures from both transcriptome and epitranscriptome,
which could predict the chronological age.
Discussion
Incorporation of NAD, the hub metabolite and redox cofactor for cells,
into RNA represents the crosstalk between metabolite and gene
expression, defining a critical layer of epitranscriptomic regulation.
NAD-RNA sequencing technologies, particularly our recently developed
ONE-seq,[148]^8 have substantially facilitated the
epitranscriptome-wide identification of novel NAD-capped RNAs across
phyla.[149]^16 However, since these methods require affinity-based
enrichment, data normalization simply borrowed from bulk RNA-seq could
not be readily applied. In this study, we devise a general approach for
epitranscriptome-wide NAD-RNA-seq data normalization and evaluation.
Previous epitranscriptome profiling approaches have added low amount of
ERCC spike-ins to account for the unwanted variation. However, ERCC
spike-ins are only composed of 92 exogenous transcripts with length
ranging from 250 to 2,000 nt, which cannot faithfully mimic the
endogenous transcripts throughout stepwise enrichment procedures and
library preparation.[150]^14^,[151]^25 Since NAD capping is an
evolutionary conserved mechanism,[152]^16 our strategy deliberately
includes Drosophila total RNA that comprises full spectrum of NAD
epitranscriptome as spike-in. Particularly, we introduce relatively
higher amount of fly RNA than that of human as surrogate for RNA loss
resulted from affinity-based enrichment procedures. Since Drosophila is
evolutionarily apart from human,[153]^26 sequencing reads can be easily
distinguished in the subsequent data analysis. Although scaling-based
approaches have been extensively used in the analysis of
epitranscriptome profiles, we demonstrate that integration of global
scaling and RUV strategy can optimize normalization performance (see
[154]Figure S2). Selection of the “best” normalization, however, might
not be feasible in practice due to the subjective definition of
optimality.[155]^27^,[156]^28 Therefore, enONE emphasizes the choice of
an appropriate rather than the “best” strategy. Collectively, enONE can
efficiently remove the impact of unwanted noises arising from
affinity-based enrichment in the heterogeneous human aging data while
preserving the age-associated signals. Also, we highlight the
application that enONE, an open-source R package, can be extended into
the analysis of other types of epitranscriptomic sequencing data that
involve stepwise enrichment procedures, e.g., m^6A-seq.
The enONE computational framework facilitates the identification of
NAD-RNA biomarkers. As a source of liquid biopsy, peripheral blood can
serve as sentinel tissue to monitor individual health in a non-invasive
manner. Identification of novel blood-derived features may open up new
avenues in detection of biomarkers for various physiological and
pathological conditions. Here, we reveal prominent features of NAD-RNAs
from human PBMCs. Large collections of NAD-RNAs are produced by
protein-encoding genes, with their biological functions mainly involved
in basic cellular events, such as translation, RNA metabolism, and
transcription. Additionally, genes with specialized function, such as
those involved in immune system, are found to be capped by NAD. Thus
far, yet the function of NAD-RNAs remains elusive, their dynamic
changes may provide insights into how individual subjects respond to
perturbations, such as the aging process. In line with the decline of
NAD metabolite,[157]^29^,[158]^30 our recent mouse liver study has
shown that the extent of NAD capping tends to decrease with age.[159]^8
However, in contrast to biological replicates from a controlled
laboratory environment, this pattern appears to be much more complex in
individual human subjects. A recent study based on a large Chinese
cohort (n = 1,518) has shown that peripheral blood NAD levels
demonstrate remarkable inter-individual heterogeneity and
gender-specific variation during aging.[160]^22 In the present study,
we propose that NAD capping in PBMCs tends to increase in elder
population (see [161]Figure 5B). This finding suggests that the
addition or removal of 5′-terminus NAD moiety in different tissues
might not be solely dependent on the cellular reserve of NAD. In
addition, NAD-modified epitranscriptome might be remodeled by other
mechanisms during aging, such as the activity of NAD-RNA decapping
enzymes[162]^31 and small RNAs that destabilize NAD-capped
transcripts.[163]^32 Moreover, our data illustrate some of the
age-associated NAD-RNAs that are involved in molecular pathways
profoundly impinging on the hallmarks of aging, such as mitochondrial
function and immune system.[164]^33 During aging, mitochondrial
function progressively deteriorates, compromising the homeostasis of
energy metabolism and redox status.[165]^33 Previous studies have
reported that mitochondrial genome can produce NAD-RNAs in various
eukaryotic organisms, including yeast,[166]^34 Arabidopsis,[167]^35 and
human.[168]^36 Here, our data reveal a substantial proportion of
transcripts, including 30 out of the 37 mitochondria-encoded
transcripts and 81 mitochondria-related genes from nucleus, can
generate NAD-capped RNAs. Among them, we note an age-associated
increase in NAD capping of transcripts associated with essential
mitochondrial functions, such as oxidative phosphorylation. These
findings suggest that NAD capping may provide novel insights into
mechanisms that modulate mitochondrial activities with age. In
addition, increased inflammation, characterized by an elevation of
circulating inflammatory factors, is another key hallmark of
aging.[169]^33 Our data highlight that the increase of NAD modification
of genes related to adaptive immune system represents a new feature in
the system-wide manifestation of inflammation during human aging.
Our preliminary study provides proof-of-concept evidence that
blood-derived, RNA-based measures can be used to predict the
chronological age of human. Additionally, our new RNA-based aging
clock, which combines an additional layer of NAD-modified
epitranscriptome, exhibits a better age prediction performance than
those solely based on the abundance of transcripts, e.g., bulk
RNA-seq[170]^37 and recently single-cell RNA-seq.[171]^38 Since there
is currently no gold standard measure of biological aging,[172]^39 it
is tempting to exploit how NAD-capped RNAs as biological implications
might signify physiological and perhaps pathological conditions.
Taken together, we propose enONE as a flexible, modular, and general
framework for the spike-in-based normalization and evaluation of
NAD-RNA sequencing data in a wide spectrum of biological contexts. To
the best of our knowledge, enONE is the first computational approach
for NAD-modified epitranscriptome profiles, which can be extended into
the analysis of other types of epitranscriptomic sequencing data that
involve stepwise enrichment procedures, e.g., m^6A-seq. Using enONE, we
identify hundreds of NAD-capped transcripts from human peripheral blood
and reveal age-associated dynamics of NAD capping that are linked to
biological processes impinging on aging hallmarks. Potentially,
blood-derived NAD-RNAs can be exploited as non-invasive biomarkers for
aging.
Limitations of the study
The present study mainly focusses on disentangling the age-associated
features of NAD-capped epitranscriptome from human peripheral blood
using enONE. However, the overall number of NAD-RNAs from individual
subjects is not simply correlated with age (see [173]Figure 5B),
raising the possibility that additional age-associated features remain
to be interrogated. For instance, gender effect has been shown to
impact the fundamental metabolism[174]^40 and gene expression
programs.[175]^41^,[176]^42 How NAD-capped RNAs, which represent the
crosstalk between the metabolome and transcriptome, vary between
different genders warrants future investigation. In addition, our human
aging cohort is a relatively small-scale population recruited from the
local community of Shanghai, which does not represent the general
population. That how this set of RNA biomarkers compares with existing
DNA-based quantification is worth of future investigation in a larger
and more diverse cohort.[177]^43
STAR★Methods
Key resources table
REAGENT or RESOURCE SOURCE IDENTIFIER
Biological samples
__________________________________________________________________
D. melanogaster: 5905 Bloomington Drosophila Stock Center RRID:
BDSC_5905
__________________________________________________________________
Chemicals, peptides, and recombinant proteins
__________________________________________________________________
yDcps New England Biolabs M0463S
Glycoblue Thermo Fisher Scientific AM9516
RRI Takara Bio 2313A
TRizol Takara Bio 9109
CIAP Thermo Fisher Scientific 18009–027
NudC Pyrophosphatase New England Biolabs M0607S
T4 RNA Ligase 1 (ssRNA Ligase) New England Biolabs M0204S
__________________________________________________________________
Critical commercial assays
__________________________________________________________________
Fast RNA-seq Lib Prep Kit V2 Abclonal RK20306
ABScript II cDNA First-Strand Synthesis Kit Abclonal RK20400
ABScript III RT Master Mix for qPCR Abclonal RK20428
MEGAscript T7 Transcription Kit Thermo Fisher Scientific AM1334
__________________________________________________________________
Deposited data
__________________________________________________________________
Raw and analyzed data Gene Expression Omnibus [178]GSE226636
__________________________________________________________________
Experimental models: Organisms/strains
__________________________________________________________________
D. melanogaster: 5905 Bloomington Drosophila Stock Center RRID:
BDSC_5905
__________________________________________________________________
Oligonucleotides
__________________________________________________________________
List of primers and other oligonucleotides, see [179]Table S13 This
paper N/A
__________________________________________________________________
Software and algorithms
__________________________________________________________________
enONE This paper [180]https://github.com/thereallda/enONE
[181]Open in a new tab
Resource availability
Lead contact
Additional information and requests for resources and reagents used in
this study should be directed to and will be fulfilled by the lead
contact, Nan Liu, (liunan@sioc.ac.cn).
Materials availability
* This study did not generate new unique reagents.
Data and code availability
* •
All high-throughput RNA sequencing data as well as transcript
quantifications have been deposited at the Gene Expression Omnibus
under accession number [182]GSE226636
.
* •
All original code has been deposited at Zenodo
([183]https://doi.org/10.5281/zenodo.10068529). enONE is available
as an R package on GitHub
([184]https://github.com/thereallda/enONE). The datasets and the
code are publicly available as of the date of publications.
* •
Any additional information required to reanalyze the data reported
in this paper is available from the [185]lead contact upon request.
Experimental model and study participant details
The project was approved by the Medical Ethics Committee of Shanghai
Changzheng Hospital (2022SL006). Informed consent was obtained from all
subjects in accordance with the local research Ethics Committee
guidelines. We conducted a cross-sectional study of natural aging in
community subjects recruited from Shanghai Chang Zheng Hospital. The
aging cohort consists of 31 females and 30 males, aged from 23 to 67
years old. Details of participant information were listed in
[186]Table S1. Inclusion criteria for cohort were: 1) above 20 years
old; 2) independently able to provide written informed consent. For
comparative analysis of age-related subgroups, the cohort was divided
into three age groups: Young (20–35 years old), Middle (36–50 years
old), and Old (51 years old and above). Height and weight are measured
by trained staff following standardized protocols. BMI (kg/m^2) is
derived from the calculation and stored for further analyses. After
5 minutes of rest in the seated position, blood pressure (mmHg) was
measured three times with an automatic sphygmomanometer and the mean of
the measurements was used for analysis. Blood samples were taken from
all patients in the morning after they had been seated for 5 minutes.
All participants had blood drawn using lithium heparin tubes (BD
Vacutainer, catalog: 367884) by phlebotomists and consented to having
their de-identified survey data made publicly available.
Method details
Isolation of peripheral blood mononuclear cells (PBMCs) from whole blood
For isolation of peripheral blood mononuclear cells (PBMCs), 5 mL whole
blood was mixed with 630 μL OptiPrep (Sigma-Aldrich, catalog: D1556)
and 500 μL solution C (0.85% (w/x) NaCl and 10 mM Tris-HCl, pH 7.4),
followed by centrifugation at 4°C and 1,300 g for 30 min. PBMCs were
collected and mixed with two volumes of solution C, followed by
centrifugation at 4°C and 500 g for 10 min. Cell pellet was washed
twice with 1 mL PBS. The suspension was used for NAD-RNA detection.
In vitro transcription of NAD-RNA, and m^7G-RNA
To assess the sensitivity of enrichment procedures, spike-in NAD-RNA
(500 nt; sequence A) and m^7G-RNA (500 nt; sequence A) with identical
sequence, oligonucleotide without adenine was synthesized (Genewiz) and
were subjected to polyadenylation for poly(A) tails elongation
([187]Table S13). To assess the specificity of enrichment procedures,
spike-in m^7G-RNA (500 nt; sequence B) oligonucleotide was synthesized
(Genewiz) and were subjected to polyadenylation for poly(A) tails
elongation ([188]Table S13). To prepare the baseline control for qPCR,
spike-in ppp-RNA (500 nt) oligonucleotide was synthesized (Genewiz) and
were subjected to polyadenylation for poly(A) tails elongation
([189]Table S13). To prepare the positive control for qPCR, spike-in
NAD-RNA (500 nt) oligonucleotide was synthesized (Genewiz) and were
subjected to polyadenylation for poly(A) tails elongation
([190]Table S13). For in vitro transcription, 10 μM of double-stranded
DNA (dsDNA) template in 100 μL transcription buffer (Promega, catalog:
P1300), along with 1 mM of each of GTP, CTP and UTP, with 4 mM NAD (for
NAD-RNA) or 4 mM m^7GpppA (New England Biolabs, catalog: S1406S) (for
m^7G-RNA), 10 μL of T7 RNA polymerase (Promega, catalog: P1300), 5%
DMSO, 5 mM DTT and 2.5-unit RNase inhibitor were added and the
transcription mixture was incubated at 37°C for 4 h. The reaction was
incubated with 11-unit DNase I (Promega, catalog: P1300) at 37°C for
30 min to remove the DNA template. RNA was then extracted using acid
phenol/chloroform and precipitated with isopropanol (with 0.3 M sodium
acetate, pH 5.5) at −80°C overnight. The RNA pellet was washed twice
with 75% ethanol, air-dried, re-dissolved in DEPC-treated H[2]O, and
stored at −80°C.
NAD-capped RNA sequencing
Total RNAs from human PBMCs and total RNAs from Drosophila (spike-in)
were prepared in accordance with the manufacturer’s instruction (Takara
Bio, catalog: 9108). Total RNAs (10 μg) from human PBMCs were mixed
with 40 μg Drosophila RNA (spike-in 1), 0.1 ng synthetic RNAs (spike-in
2: 5% NAD-RNA/95% m^7G-RNA; sequence A), and 0.1 ng synthetic RNAs
(spike-in 3: 100% m^7G-RNA; sequence B). The mixture of total RNAs and
spike-in RNAs was incubated with 100 mM HEEB (1 M stock in DMSO) with
ADPRC (25 μg/mL) in 100 μL of ADPRC reaction buffer (50 mM Na-HEPES pH
7.0, 5 mM MgCl[2]) at 37°C for 1 h. 100 μL of DEPC-treated H[2]O was
then added and acid phenol/ether extraction was performed to stop the
reaction. RNAs were precipitated by ethanol and re-dissolved in 100 μL
of DEPC-treated H[2]O. 5 μL of biotinylated RNAs were kept as input.
After HEEB reaction, biotinylated RNAs were incubated with streptavidin
bead particles (6 μL, MedChemExpress, catalog: HY-K0208) and 0.4 U/μL
of RNase Inhibitor (Takara Bio, catalog: 2313B) at 25°C for 30 min.
Beads were washed four times with streptavidin wash buffer (50 mM
Tris-HCl (pH 7.4) and 8 M urea), and three times with DEPC-treated
H[2]O. To ensure complete elution, biotin-conjugated RNAs were replaced
from streptavidin beads by incubating with 1 mM biotin buffer (20 μL
NaOH pH 8.0, Sigma-Aldrich, catalog: B4639) at 94°C for 8 min, followed
by incubation with 500 nM NudC (New England Biolabs, catalog: M0607S)
in 25 μL of NudC reaction buffer (100 mM NaCl, 50 mM Tris-HCl pH 7.9,
10 mM MgCl[2], 100 μg/ml Recombinant Albumin) at 37°C for 30 min. After
NudC treatment, biotinylated-RNAs that are resistant to NudC catalysis,
potentially derived from contaminating m^7G-RNAs, were retained on
beads by incubation with high-capacity streptavidin particle (20 μL,
Thermo Fisher Scientific, catalog: 20357) at 25°C for 30 min. Eluted
RNAs in the supernatant were used for next step. Input (see above) and
NudC-eluted RNAs were used for NGS library construction, in accordance
with the manufacturer’s instructions (mRNA-seq Lib Prep Kit for
Illumina, Abclonal, catalog: RK20302). Library quality was assessed by
Bioanalyzer 2100 (Agilent, United States), and quantification was
performed by qRT-PCR with a reference to a standard library. Libraries
were pooled together in equimolar amounts to a final 2 nM concentration
and denatured with 0.1 M NaOH (Sigma, catalog: 72068). Libraries were
sequenced on the Illumina NovaSeq 6000 system (paired end; 150 bp).
Validation of NAD-capped RNA by a modified CapZyme-Seq method
Total RNAs (3 μg) from human PBMCs of young (n = 3) and old (n = 3)
individuals were mixed with Drosophila total RNA (7 μg), as surrogate
for RNA loss, and polyadenylated spike-in RNAs generated from different
sequence templates, including 0.1 ng NAD-RNA (500 nt), 0.1 ng
m^7Gppp-RNA (500 nt), and 0.1 ng ppp-RNA (500 nt). RNAs were incubated
with 10 U CIAP (Thermo Fisher Scientific, catalog: 18009019) in the
presence of 40 U RNaseOUT at 37°C for 1 h to remove 5′-terminal
phosphate from non-capped RNAs. After purification with RNAiso Plus
(Takara Bio, catalog: 9108), CIAP-treated RNAs were incubated with
250 nM NudC (New England Biolabs, catalog: M0607S) in 40 μl of NudC
reaction buffer (100 mM NaCl, 50 mM Tris–HCl pH 7.9, 10 mM MgCl2,
100 μg/ml Recombinant Albumin, 40 U RNaseOUT) at 37°C for 30 min. This
step aims to hydrolyze the cap from NAD-RNA resulting in ligatible
monophosphate at the 5′ end. Products were purified with RNAiso Plus
(Takara Bio, catalog: 9108) and further ligated with 100 μM 5′ adaptor
oligo listed in [191]Table S13, in the presence of 15 U T4 RNA ligase 1
(New England Biolabs, catalog: M0202), 15% of PEG8000, 1 mM ATP and
40 U RNaseOUT in 40 μl of 1× T4 RNA ligase buffer at 16°C for 16 h.
RNAs were purified and re-dissolved in 20 μl of DEPC-treated H[2]O.
Reverse transcription was performed with designed primer listed in
[192]Table S13, by ABScript II cDNA First-Strand Synthesis Kit
(ABclonal, catalog: RK20400). 10 μl of products was diluted to 200 μl
and served as input. Meanwhile, 5 μl of reversely transcribed products
were further amplified by PCR using adaptor-specific primers listed in
[193]Table S13. The PCR program operated with an initial denaturation
step of 3 min at 95°C, amplification for 14 cycles (denaturation for
15 s at 95°C, annealing for 15 s at 60°C and extension for 6 min at
72°C), and a final extension for 5 min at 72°C. The ppp-RNA (500 nt)
was used as the baseline control, with NAD-RNA (500 nt) serving as the
positive control, and m^7G-RNA (500 nt) serving as the negative
control. Primers were listed in [194]Table S13. To calculate the
enrichment from qRT-PCR data, the Ct value of the target gene was first
normalized to the Ct of the ppp-RNA (baseline). Next, the normalized
PCR+ fraction value (ΔCt of the target gene normalized to the ppp-RNA)
was normalized to the background (ΔCt calculation for the gene in the
input), to yield the ΔΔCt value. The linear conversion of this ΔΔCt
resulted in the fold enrichment. Significance was assessed by Student’s
t test.
Quantification and statistical analysis
High-throughput sequencing data analysis
All sequencing reads were processed with Trim Galore (v0.6.6)[195]^44
with the parameters “--nextseq 30 --paired” to remove the adapter
sequences (AGATCGGAAGAGC) from NovaSeq-platforms, and reads longer than
20 bp were kept. Reads that passed the quality control procedure were
kept and mapped to the combined genome of Homo sapiens (GRCh38) and
Drosophila melanogaster (dmel-all-chromosome-r6.36) using STAR
(v2.7.6a)[196]^45 with default parameters. Uniquely mapped read pairs
were counted against annotations from Homo sapiens (Ensembl: GRCh38.94)
and Drosophila melanogaster (Flybase: dmel-all-r6.36) using
featureCounts (v2.0.1)[197]^46 with parameters “-p -B -C” and
summarized as gene-level counts. Sequencing saturation was assessed by
randomly subsampling the original libraries and examined the
corresponding changes in the number of genes, derived from human
genome, with more than 10 read counts.
enONE workflow
enONE is implemented in R and publicly available at
[198]https://github.com/thereallda/enONE. enONE workflow consists of
four steps: 1. Quality control; 2. Gene set selection; 3. Normalization
procedures; 4. Normalization performance assessment. By “log”
transformation, we generally refer to the log[2](x+1) function unless
otherwise stated. Below, steps are shown in detail.
Quality control
The goal of quality control was to remove problematic or noisy
observations from downstream analysis. In this study, we used sample
and gene filtering to control data quality. To assess outliers, we
applied Rosner’s outlier test on principal component 1. Principle
component analysis (PCA) was performed with prcomp function on the top
20,000 genes based on a transformed counts matrix by vst function from
R package DESeq2 (v1.36.0).[199]^13 All samples were kept for
subsequent analysis. To keep well-detected genes across samples, we
used filterByExpr function from the R package edgeR (v3.38.4)[200]^47
with parameter “min.count=20”. All ribosomal RNA encoded genes and TEC
genes were excluded.
Gene set selection
enONE defined three sets of control genes for adjustment of the
unwanted variations, evaluation of the unwanted variations, and
evaluation of the wanted variations, respectively. For adjustment of
the unwanted variation, we defined the 1,000 least significantly
enriched genes (denoted as anchor set) in Drosophila spike-ins, ranked
by likelihood ratio, as negative control genes. Since the effect of
affinity-based enrichment was the known covariate of interest, we used
these genes, that were not affected by the enrichment effect, to
compute the unwanted variation in the subsequent RUV procedure. For
evaluation of the unwanted variation, we defined the 500 least
significantly varied genes in human, ranked by likelihood ratio, as
negative evaluation genes. We performed differential analysis test
across all covariates of interest to determine genes with constant
expression levels. Since the variation of constant genes could reflect
the handling effects, those genes could be used to evaluate the removal
of unwanted variation in the subsequent normalization evaluation step.
For evaluation of the wanted variation, we defined the 500 most
significantly enriched genes in human, ranked by likelihood ratio, as
positive evaluation genes. We performed differential analysis test
between enrichment and input samples in human to determine genes
affected by enrichment procedures. Those genes were used to evaluate
the preservation of wanted variation in the subsequent normalization
evaluation step.
Normalization procedures
enONE implemented global scaling and regression-based methods for the
generation of two-part normalization schemes. For the global scaling
normalization methods, gene-level read counts were adjusted by a
sample-wise scale factor. This procedure encompassed five distinct
scaling methods, including: 1) Total Count (TC): The scale factor was
defined as the sum of the read counts across all genes within a sample.
2) Upper-Quartile (UQ): The scale factor was computed as the
upper-quartile of the gene-level count distribution for each sample. 3)
TMM[201]^12: The scale factor was determined using a robust estimation
of the overall expression fold change between a given sample and a
reference sample. TMM was performed by the calcNormFactors function
from R package edgeR with parameter: method = “TMM”. The default
criterion for reference sample selection was based on its upper
quartile being the closest to the mean upper quartile value across all
samples. 4) DESeq[202]^13: The scale factor for a given sample was
calculated as the median fold-change between the samples and a
pseudo-reference sample, whose counts were defined as the geometric
means of the counts across samples. This method was performed by the
calcNormFactors function from R package edgeR with parameter: method =
“RLE”. Note that the method discarded any gene having zero count in at
least one sample. 5) PoissonSeq: The scale factor was derived based on
an iterative estimate of sequencing depth, considering a subset of
genes characterized by Poisson goodness-of-fit statistics.[203]^17 By
default, this method discarded genes with less than 5 read counts.
For the regression-based procedures, enONE considered the following
generalized linear model, which allows adjustment for factors of
unwanted variation:
[MATH: Y=Xβ+Wα+O, :MATH]
(Equation 1)
where Y was the n x j matrix of gene-level read counts, X was an n x m
design matrix corresponding to the m covariates of interest of ‘‘wanted
variation’’ (e.g., treatment) and β its associated m x j matrix of
parameters of interest, W was an n x k matrix corresponding to unknown
factors of unwanted variation and α its associated k x j matrix of
nuisance parameters, O was an n x j matrix of offsets that can either
be set to zero or estimated with other normalization procedure (such as
TMM normalization). Here, n is the number of genes, j is the number of
samples, m is the number of covariates, and k is the chosen number of
unwanted variation factors.
The unwanted variation W was estimated via singular value decomposition
(SVD) using approaches implemented in R package RUVSeq
(v1.30.0).[204]^14 In this study, we introduced three variants of RUV,
including: 1) RUVg: This variant estimated the unwanted variation
factors based on negative control genes that were assumed to exhibit
constant expression across all samples (β = 0). 2) RUVs: This method
estimated the factors of unwanted variation based on negative control
genes derived from the replicate samples, such as those within each
treatment group, where the covariates of interest remained constant
(β = 0). 3) RUVse: This novel variant was a modified version of RUVs.
It estimated the unwanted variation factors using negative control
genes from the replicate samples within each assay group (i.e.,
enrichment and input), with the assumption that the enrichment effect
remained constant. Following the estimation of unwanted variations, for
a chosen value of k (1 ≤ k ≤ j), the adjusted read counts Y∗ can be
defined by
[MATH: Y∗=Y−Wˆ(k)αˆ(k)−O. :MATH]
Normalization performance evaluation
To evaluate the performance of normalization, we leveraged eight
normalization performance metrics that related to different aspects of
gene expression measures.[205]^27
The following four metrics evaluated normalization procedures based on
how well the samples can be clustered according to factors of wanted or
unwanted variation. Clustering by wanted factors was desirable, while
clustering by unwanted factors was undesirable. We used silhouette
widths as clustering measurements. For any clustering of n samples, the
silhouette width of sample i was defined as:
[MATH:
sil(i<
/mi>)=b(i
)−a(i)max{a
(i),b(i)}∈[−1,1], :MATH]
(Equation 2)
where a(i) denoted the average distance (by default, Euclidean distance
over the first three PCs of expression measures) between the ith sample
and all other samples in the cluster to which i was assigned and b(i)
denoted the minimum average distance between the ith sample and samples
in other clusters. The larger the silhouette widths, the better the
clustering. Thus, the average silhouette width across all n samples
provides an overall quality measure for a given clustering. Silhouette
width was calculated with the silhouette function from R package
cluster (v2.1.3). These metrics were defined as follows: 1) BIO_SIM:
Group the n samples according to the value of a categorical covariate
of interest (e.g., age group or treatment) and compute the average
silhouette width for the resulting clustering. 2) BATCH_SIM: Group the
n samples according to the batch and compute the average silhouette
width for the resulting clustering. 3) EN_SIM: Group the n samples
according to the assay (e.g., enrichment or input) and compute the
average silhouette width for the resulting clustering. 4) PAM_SIM:
Cluster the n samples using partitioning around medoids (PAM) for a
range of numbers of clusters and compute the maximum average silhouette
width for these clusters. PAM clustering was done by the pamk function
from R package fpc (v2.2-9). Large values of BIO_SIM, EN_SIM and
PAM_SIM and low values of BATCH_SIM were desirable.
The next two metrics evaluated the association of log-count principal
components (by default, the first 3 PCs) with “evaluation” principal
components of wanted or unwanted variation. 1) UV_COR: The weighted
coefficient of determination
[MATH: R¯2
:MATH]
for the regression of log-count principal components on factors of
unwanted variation (UV) derived from negative evaluation genes. The
submatrix of log-transformed unnormalized counts for negative
evaluation genes is row-centered and scaled and factors of unwanted
variation are defined as the right-singular vectors as computed by the
svd function. 2) WV_COR: The weighted coefficient of determination
[MATH: R¯2
:MATH]
for the regression of log-count principal components on factors of
wanted variation (WV) derived from positive evaluation genes. The WV
factors were computed in the same way as the UV factors above, but with
positive instead of negative evaluation genes. Large values of WV_COR
and low values of UV_COR were desirable.
The weighted coefficients of determination were computed as follows.
For each type of evaluation criterion (i.e., UV, or WV), regressed each
expression PC on all supplied evaluation PCs. Denoted SST[k] as the
total sum of squares, SSR[k] as the regression sum of squares, and
SSE[k] as the residual sum of squares for the regression for the kth
expression PC. The coefficient of determination was defined as:
[MATH: Rk2=SSR
kSST
k=1−SSEk<
/msub>SSTk, :MATH]
(Equation 3)
and weighted average coefficient of determination as:
[MATH: R¯2=∑kSSTkRk2∑<
/mo>kSST<
mi>k=∑kSSRk
mrow>∑kSS
Tk=<
mn>1−∑kSSEk∑kSS
mi>Tk, :MATH]
(Equation 4)
The next two metrics evaluated the similarity of gene expression
distributions. We defined gene-level relative log-expression (RLE)
measures, as log-ratios of read counts to median read counts across
samples, for comparing distribution of gene expression. RLE was defined
as follows:
[MATH:
RLEi
j=logYijMedianiYij, :MATH]
(Equation 5)
for gene i in sample j. For similar distributions, the RLE should be
centered around zero and have a similar distribution across samples.
Thus, the metrics were defined as follows: 1) RLE_MED: Mean squared
median RLE. 2) RLE_IQR: Variance of inter-quartile range (IQR) of RLE.
Low values of RLE_MED and RLE_IQR were desirable.
In the enONE framework, the expression measures were normalized
according to a set of methods (including raw counts) and the eight
metrics above were computed for each dataset. The performance
assessment results can be visualized using biplots and the
normalization procedures were ranked based on a function of the
performance metrics. In particular, enONE defined a performance score
by orienting the metrics (multiplying by ±1) so that large values
correspond to good performance, ranking procedures by each metric, and
averaging the ranks across metrics.
Comparison methods
For benchmarking, we compared the enONE procedure with the
regression-based method, RUVg, and the scaling-based approach, RADAR.
For RUVg, we used the anchor set from Drosophila spike-ins as negative
control genes and the first four estimated factors of unwanted
variations for adjustment. For RADAR, we normalized the input libraries
using the median-of-ratio method of DESeq based on input gene-level
counts. To normalize the enrichment libraries, we estimated sample-wise
size factors from the fold change
[MATH:
Fn,j=enn,ji
nn,j :MATH]
of the top 1% genes ranked by enrichment read counts, where
[MATH:
enn,
j :MATH]
is enrichment read count and
[MATH:
inn,
j :MATH]
is the DESeq normalized gene counts from gene n and sample j.
Linear regression for normalization performance evaluation
R^2 values of fitted linear models were used to quantify the strength
of the linear relationships between a source of unwanted variation,
i.e., batch effect, and global sample summary statistics, such as the
first k PCs (1 ≤ k ≤ 6).[206]^48 The lm R function was used for this
analysis.
Spearman correlation
Spearman correlation coefficients were used to determine the monotonic
relationships between a source of unwanted variation, i.e., batch
effect, and the raw or normalized gene-level read counts. The cor R
function was used for this analysis.
ANOVA
To evaluate the effects of the different sources of effect on the data,
ANOVA were performed. For evaluating the batch effect, ANOVA tests were
computed across batches. Genes with p value <0.01 were deemed to be
affected by the batch effect. The aov R function was used for this
analysis.
Vector correlation
The Rozeboom squared vector correlation[207]^49 was used to quantify
the strength of relationships between the first k expression PCs (1 ≤ k
≤ 6) and the enrichment covariates.
Adjusted rand index
The adjusted Rand index (ARI), a correction of the Rand Index, measures
the similarity between two clustering.[208]^50 The ARI was used to
evaluate the performance of normalization methods in terms of
separation between enrichment and input samples. The first three
expression PCs was utilized to perform ARI analysis.
Jensen-Shannon distance
To quantify differences in gene expression and NAD-RNA profiles between
individuals, we calculated the pairwise distance for all individuals
and their corresponding centroid using the Jensen-Shannon
Distance.[209]^20^,[210]^21 We first computed the centroid distribution
P[c] by averaging the transcriptomic or epitranscriptomic profiles from
all samples, respectively. For each individual’s transcriptomic or
epitranscriptomic distribution, the Jensen-Shannon Divergence (JSD) can
be calculated as:
[MATH: JSD(Pc,
Pi)=H(Pc+Pi
2)−H(Pc)+H(Pi)2,<
/mrow> :MATH]
(Equation 6)
where P[i] is the distribution for individual i and H is the Shannon
entropy function:
[MATH:
H(X)=<
/mo>−∑i
=1nP(xi)log2(P(xi)),
:MATH]
(Equation 7)
Finally, the Jensen-Shannon Distance can be calculated as the square
root of JSD.
Quantitative NAD-RNA-seq data analysis
To identify NAD-RNA from sequencing data, we performed differential
analysis using FindEnrichment function from enONE package. Significance
of logarithmic fold changes was determined by a likelihood ratio test
to approximate p values, and genes were adjusted for multiple testing
using the Benjamini-Hochberg procedure to yield a false discovery rate
(FDR). NAD-RNAs were defined as fold change of normalized transcript
counts ≥2 and FDR <0.05 in enrichment samples compared to those in
input samples. The fold change of normalized counts between pairwise
enrichment and input sample was denoted as NAD modification level. Gene
annotation information, such as chromosome, gene types, and gene
lengths were retrieved from Ensembl (GRCh38.94) annotations. The violin
plot, boxplot, bar plot, line chart and scatterplot were generated by R
package ggplot2 (v3.3.6).[211]^51
RNA secondary structure analysis
For each transcript with 5′ untranslated region (UTR), the sequence of
100 bp region downstream of transcription start site (TSS) was used to
calculate RNA minimum free energy (MFE) by ViennaRNA with default
parameter.[212]^19 To generate a background distribution of MFE,
transcripts that did not produce NAD-cap were randomly sampled 1,000
times from the whole transcriptome with a similar size (n = 600) to
transcripts producing NAD-cap (n = 589).
Pathway enrichment analysis
Pathway enrichment analysis was performed using Metascape
(v3.5).[213]^52 Pathways were defined as the molecular pathways of
Reactome, the biological process (BP) of GO, the Kyoto Encyclopedia of
Genes and Genomes (KEGG), the WikiPathways (WP), the canonical pathways
of MSigDb, and the CORUM database. Metascape clustered enriched terms
into non-redundant groups based on similarities between terms and used
the most significant term within each cluster to represent the cluster.
The resulting clusters of pathways were manually reviewed. Enrichment
was tested using the hypergeometric test. Multiple hypothesis
correction was performed with Benjamini-Hochberg procedure, and the
significance threshold was defined as α = 0.05.
Hierarchical clustering analysis
Hierarchical clustering with a Euclidean distance metric was performed
for both rows (NAD-RNAs, complete method) and columns (samples, ward’s
method) on the scaled NAD modification matrix. Clustering and
corresponding heatmaps were generated by R package Complex Heatmap
(v2.12.1).[214]^53 Using cutree function with “k = 2” parameter, we
identified 3 clusters of NAD-RNAs changing with age, ranging from 201
to 608 NAD-RNAs.
Trajectory analysis
To estimate NAD modification trajectories during aging, log NAD
modification levels were Z-scored, and LOESS regression was fitted for
each gene. Similarly, log expression levels were used to estimate the
gene expression trajectories during aging. The trajectory of clusters
was estimated using the median levels of genes in each cluster.
Pathways were queried as above to gain insights into the biological
functions of each cluster. Top five non-redundant enriched terms were
shown. To identify NAD-RNAs that correlated with age, we computed
Spearman’s rank correlation coefficients (R[s]) for each NAD-RNAs on
the basis of NAD modification and age. The resulting correlation
metrics was then filtered, such that only NAD-RNAs that were
significantly correlated with age (p < 0.05 and | R[s] | ≥ 0.3) were
considered. Similarly, correlation between age and gene expression
levels was computed.
Age prediction model
We developed different age prediction models for human PBMCs based on
NAD-modified epitranscriptome, transcriptome, and the combination of
NAD-modified epitranscriptome and transcriptome, respectively. To
construct those models, we employed four machine learning approaches,
including elastic net, lasso, ridge, and random forest regression.
Aging cohort was randomly split into the discovery cohort (N = 35) for
model training, and the validation cohort (N = 26) for independent
evaluation. Age prediction was performed by regressing chronological
age on log-transformed feature levels (i.e., NAD modification and
transcript abundances) using the discovery cohort. To build a robust
aging clock, we pre-selected features using the age-correlated NAD-RNAs
(p < 0.05 and | R[s] | ≥ 0.3) and transcripts (p < 0.05 and | R[s] | ≥
0.5) as a starting point. To begin, we randomly split the discovery
cohort into training (70%) and test (30%) sets with balanced ages.
Model optimization including hyperparameter tuning was done by a grid
search with leave-one-out cross-validation (LOOCV) based on training
sets. Model performance was assessed on test sets, using several
statistics including median absolute error (MAE), Pearson’s correlation
coefficient and its associated p-value. Furthermore, we performed a
cross-validation scheme for arriving the least biased estimates of the
accuracy of the aging clocks, consisting of leaving out a single sample
from the regression, predicting age for that sample, and iterating over
all samples on discovery cohort. The best-tuned hyperparameters of
trained models were listed in [215]Table S11. Above model training and
hyperparameter tuning are performed with R packages caret (v6.0-93),
glmnet (v4.1-4) and randomForest (v4.7–1.1).
Association of age gap index and physiological phenotypes
Subtracting chronological age from predicted age, termed as the age
gap, provided a normalized measure of the extent to which an individual
appeared older (age gap >0) or younger (age gap <0) than same-aged
peers.[216]^54 Then, we explored whether age gap was associated with
physiological phenotypes, such as blood pressure, cholesterol levels,
blood cell counts. Associations were evaluated using spearman’s
correlation coefficients, while adjusting for chronological age and
gender through linear regression. The associations of age gap and
physiological phenotypes were listed in [217]Table S12.
Statistical analyses
All p values reported herein were calculated using the non-parametric
Mann-Whitney rank test unless otherwise stated.
Acknowledgments